I have a simple job with trigger = 15 seconds, Source = Kafka and Sink = S3. Is it possible to find out how much time it took to download the messages from Kafka? Or, say I had Sink = Console, which brings the data back to the driver: is it possible to find out how much time it took to download the data from Kafka and how much time it took to bring it back to the driver?
From the driver I get the following progress report for the query while writing to S3. Is it possible to tell how much of the triggerExecution = 44 seconds was spent downloading the 99998 rows from Kafka?
Streaming query made progress: {
id : 1383g52b-8de4-4e95-a3s9-aea73qe3ea56,
runId : 1206f5tc-t503-44r0-bc0c-26ce404w6724,
name : null,
timestamp : 2017-08-25T01:42:10.000Z,
numInputRows : 99998,
inputRowsPerSecond : 1666.6333333333334,
processedRowsPerSecond : 2263.9860535669814,
durationMs : {
addBatch : 42845,
getBatch : 3,
getOffset : 68,
queryPlanning : 6,
triggerExecution : 44169,
walCommit : 1245
},
stateOperators : [ ],
sources : [ {
description : KafkaSource[Subscribe[kafka_topic]],
startOffset : {
kafka_topic : {
2 : 20119244,
4 : 20123550,
1 : 20124601,
3 : 20113622,
0 : 20114208
}
},
endOffset : {
kafka_topic : {
2 : 20139245,
4 : 20143531,
1 : 20144592,
3 : 20133663,
0 : 20134192
}
},
numInputRows : 99998,
inputRowsPerSecond : 1666.6333333333334,
processedRowsPerSecond : 2263.9860535669814
} ],
sink : {
description : FileSink[s3://s3bucket]
}
}
Thanks!
You should find the answers to your questions by reviewing StreamingQuery.lastProgress.durationMs.

In the order of their calculation, the following durations tell you:

getOffset is the time to get the offsets from all the sources.
getBatch is the time to get the streaming Datasets (aka batches) from all the sources (one by one, sequentially).
addBatch is the time to write the streaming Dataset to a sink.

With that said...
That's the addBatch duration (since that's when the Dataset gets executed as an RDD on executors).

You'd have to sum the addBatch durations from the StreamingQuery.recentProgress array.