Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside a DStream.transform(), which causes the same RDD to be unioned into each micro-batch, multiplying its content in the resulting stream (once per batch).
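For illustration, here is a minimal Scala sketch of the problematic pattern (not the runner's actual code; `flattenBoundedWithUnbounded` is a hypothetical name):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical illustration: `bounded` is a bounded PCollection already
// materialized as an RDD, `unbounded` is the unbounded stream.
def flattenBoundedWithUnbounded(
    sc: SparkContext,
    bounded: RDD[String],
    unbounded: DStream[String]): DStream[String] =
  // transform() runs once per micro-batch, so the same `bounded` RDD is
  // unioned into every batch: after N batches its elements appear N times.
  unbounded.transform(batchRdd => sc.union(batchRdd, bounded))
```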
Spark does not seem to provide an out-of-the-box solution for this.
One approach I tried was to create a stream from a queue (a single-RDD stream, i.e. StreamingContext#queueStream), but this is not an option since queueStream does not support checkpointing.
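For reference, a sketch of what that attempt looks like (`boundedAsStream` is a hypothetical helper):

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: expose the bounded RDD as a single-RDD stream.
def boundedAsStream(ssc: StreamingContext, bounded: RDD[String]): DStream[String] = {
  // oneAtATime = true emits the queued RDD in exactly one micro-batch,
  // but QueueInputDStream is not serializable: with checkpointing enabled,
  // recovery fails with "queueStream doesn't support checkpointing".
  ssc.queueStream(mutable.Queue(bounded), oneAtATime = true)
}
```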
Another approach would be to create a custom InputDStream that does this.
An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.
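A minimal sketch of such a custom InputDStream, under stated assumptions (`SingleRddInputDStream` is hypothetical, and the recovery semantics are exactly the open question):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Hypothetical SingleRddInputDStream: emits one RDD in the first micro-batch
// and empty RDDs afterwards. The RDD is produced by a re-creatable factory
// rather than captured directly, since surviving checkpoint recovery means
// the data must be re-creatable after a restart, not serialized with the
// DStream graph; making that robust is the part that still needs an answer.
class SingleRddInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    createRdd: () => RDD[T])
  extends InputDStream[T](_ssc) {

  private var emitted = false

  override def start(): Unit = {}
  override def stop(): Unit = {}

  override def compute(validTime: Time): Option[RDD[T]] = {
    if (!emitted) {
      emitted = true
      Some(createRdd())
    } else {
      // Subsequent batches contribute nothing from the bounded side.
      Some(context.sparkContext.emptyRDD[T])
    }
  }
}
```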
Imported from Jira BEAM-1444. Original Jira may contain additional context.
Reported by: amitsela.