I cannot easily simulate this, so a quick sanity check on a non-Streaming situation, i.e. regular DataFrame or RDD processing:

  • If a Spark Worker Node fails
    • and as a result one or more RDD partition computations are lost
      • and no caching, checkpointing, etc. has been applied,
        • then, for the recompute:
          • how does this play out if the data at the source has changed, which could mean that the other Nodes would need extra data due to re-partitioning?
          • what does it mean for the performance of the initial read, which may have involved a lot of data followed by a repartition?

I.e. we are talking about a non-deterministic situation here.

1 Answer


Update - if we consider a source like JDBC, the query would be re-executed[1] against the DB during the recompute. If the records have changed in the meantime, this leads to skewed (inconsistent) results, but I don't think the job itself would fail.

[1] - This is based on JdbcRDD code.
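
To make that concrete, here is a minimal sketch, assuming `sc` is an existing SparkContext (e.g. from spark-shell) and a hypothetical orders table behind an illustrative JDBC URL:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    // Illustrative connection details for a hypothetical database.
    val url = "jdbc:postgresql://dbhost:5432/sales"

    // Each partition runs its bounded query when it is computed. If a partition
    // is lost and recomputed, the same query runs again and returns whatever is
    // in the table *at that moment* - there is no snapshot of the first read.
    val orders = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, "user", "password"),
      "SELECT id, amount FROM orders WHERE id >= ? AND id <= ?",
      1L,         // lowerBound
      1000000L,   // upperBound
      10,         // numPartitions
      (rs: ResultSet) => (rs.getLong("id"), rs.getDouble("amount"))
    )

    // With no cache()/checkpoint(), a lost executor means the affected
    // partitions re-run their SQL against the live table during recovery.
    println(orders.count())

The partitions only carry the query bounds, not the data they produced, so recovery always goes back to whatever the database currently holds.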


Regarding your first question: for a file-based source, Spark's partitions are essentially Hadoop's InputSplits (they are in fact built out of the InputSplits produced by the InputFormat). Each FileSplit generally carries the following properties:

  • InputPath
  • StartOffset
  • Length (generally the block size on the cluster)

So let's consider the following cases when you say the data at the source has changed:

+--------------------------+-------------------------------------------------------------+
|         Scenario         |                        What happens                         |
+--------------------------+-------------------------------------------------------------+
| New file gets added      | The new files are not part of the input splits              |
|                          | so they're not captured as part of the partitions.          |
| Existing file is deleted | This will cause the job to fail with FileNotFoundException. |
+--------------------------+-------------------------------------------------------------+
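
A quick way to see the split-to-partition mapping (a minimal sketch; the path is illustrative and `sc` is assumed to be an existing SparkContext):

    // Each partition of a file-based RDD is backed by an input split
    // (path, start offset, length).
    val events = sc.textFile("hdfs:///data/events/2019-06-01")
    println(s"partitions = ${events.partitions.length}")

    // A recompute re-reads exactly those byte ranges, which is why:
    //  - files added to the directory afterwards are never picked up, and
    //  - a deleted file makes the recompute fail with FileNotFoundException.
    println(events.count())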

Regarding your second question, when you say repartition there are again two ways: with shuffling = true and without.

Without shuffling, it is really just bunching a list of InputSplits together into a single partition (when the new numPartitions is smaller than the existing number of partitions). In case of re-evaluation, those splits would simply be read again from the source.
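
For example (a sketch under the same assumptions as above; the path and numbers are illustrative):

    // coalesce without a shuffle only bundles existing input splits into
    // fewer partitions; no data moves between executors.
    val raw   = sc.textFile("hdfs:///data/events/2019-06-01")  // say this yields 200 splits
    val fewer = raw.coalesce(10, shuffle = false)              // 10 partitions, each bundling several splits

    // Losing one of these 10 partitions means re-reading the splits bundled
    // inside it from their recorded (path, offset, length) ranges.
    println(fewer.partitions.length)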

If you had shuffling = true during the repartition, Spark does the bookkeeping required to find the missing partitions and re-run the tasks that produce them. So the same situation as above applies when the partitions are re-read from the input.
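
Roughly, for the shuffled case (again a sketch under the same assumptions; note that repartition(n) is just coalesce(n, shuffle = true)):

    val raw        = sc.textFile("hdfs:///data/events/2019-06-01")
    val reshuffled = raw.repartition(50)   // same as coalesce(50, shuffle = true)

    // The shuffle writes map output that Spark tracks (via the MapOutputTracker).
    // If a post-shuffle partition is lost, only the tasks needed to rebuild it
    // are re-run; if the shuffle files themselves are gone (e.g. the node died),
    // the upstream read of the affected splits is re-executed against the source,
    // so changed source data can again leak into the result.
    println(reshuffled.partitions.length)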

PS: I've assumed the source is a Hadoop-compatible filesystem.