[Spark] Restore memory sensitive GBK translation (#33520) #33521

JozoVilcek · 2025-01-07T12:19:30Z

Restore priority of memory sensitive GBK translation in spark-runner to avoid OOM #33520


github-actions · 2025-01-07T13:10:37Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm added as fallback since no labels match configuration

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

je-ik

We should definitely add a test for this behavior; otherwise we risk exactly the same situation, where some future optimization breaks this one. I'd also like to work out the conditions under which the optimization is correct, and I'll be happy to assist with that. :)

runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java


damccorm · 2025-01-13T21:46:59Z

waiting on author

je-ik

Thanks for the changes! They make sense to me. There are still two issues with the PR:
a) it effectively blocks the "groupByKeyInGlobalWindow" optimization
b) we are missing a test that verifies either of the optimizations is actually applied

Issue a) is the more serious one: in its current form, this PR creates a performance regression for cases where the GBK values do fit into memory. First, it uses the more expensive sort-based version; second, it shuffles the (global) window along with the data.

I created a list thread to discuss this further.
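
The trade-off raised in this comment can be illustrated with a small, self-contained sketch. This is not the code from the PR: the class, enum, and parameter names (`GbkStrategySketch`, `preferHugeValues`, and so on) are hypothetical, the eligibility conditions are assumptions, and the real decision lives in `TransformTranslator.java`. It only shows the idea of an opt-in switch that decides whether the memory-sensitive, sort-based translation takes priority over the cheaper global-window grouping.

```java
/** Hypothetical sketch; none of these names come from the Beam code base. */
final class GbkStrategySketch {

  /** The translations discussed in this thread, plus a generic fallback. */
  public enum Strategy {
    // Memory sensitive but more expensive: sorts the data and shuffles the window with it.
    SORT_BASED_GROUP_BY_KEY_AND_WINDOW,
    // Cheaper path for the global window; assumes the grouped values fit into memory.
    GROUP_BY_KEY_IN_GLOBAL_WINDOW,
    // Fallback for everything else.
    DEFAULT_GROUP_BY_KEY
  }

  /**
   * Picks a translation. The sort-based path is only taken when the user opts in
   * (assumed flag), the windows are non-merging (assumed condition), and the
   * transform is not a CoGBK (which this PR excludes from that optimization).
   */
  public static Strategy choose(
      boolean preferHugeValues,
      boolean nonMergingWindows,
      boolean globalWindowOnly,
      boolean isCoGbk) {
    if (preferHugeValues && nonMergingWindows && !isCoGbk) {
      return Strategy.SORT_BASED_GROUP_BY_KEY_AND_WINDOW;
    }
    if (globalWindowOnly) {
      return Strategy.GROUP_BY_KEY_IN_GLOBAL_WINDOW;
    }
    return Strategy.DEFAULT_GROUP_BY_KEY;
  }

  public static void main(String[] args) {
    // Without the opt-in, the global-window optimization stays reachable.
    System.out.println(choose(false, true, true, false));
    // With the opt-in, the memory-sensitive translation takes priority.
    System.out.println(choose(true, true, true, false));
  }
}
```

Keeping the choice in one pure function like this would also make it straightforward to unit test which optimization is applied, which is the gap noted in b).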

github-actions · 2025-01-21T12:14:19Z

Reminder, please take a look at this PR: @damccorm

github-actions · 2025-01-23T12:14:36Z

Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm added as fallback since no labels match configuration


damccorm · 2025-01-23T14:53:22Z

waiting on author


JozoVilcek · 2025-01-26T16:13:06Z

Run Java PreCommit

je-ik

LGTM, thanks!

[spark] Restore memory sensitive GBK translation

30d0863

github-actions bot added runners spark labels Jan 7, 2025

JozoVilcek changed the title ~~[Spark] Restore memory sensitive GBK translation (#33520)~~ [Spark] Restore memory sensitive GBK translation #33520 Jan 7, 2025

JozoVilcek changed the title ~~[Spark] Restore memory sensitive GBK translation #33520~~ [Spark] Restore memory sensitive GBK translation (#33520) Jan 7, 2025

[spark] spotless

231aa02

je-ik self-requested a review January 7, 2025 13:05

github-actions bot added the Next Action: Reviewers label Jan 7, 2025

je-ik reviewed Jan 7, 2025

runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java

runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java

[spark] Exclude CoGBK transform from group by key and window optimisation

e74b4e7

github-actions bot added Next Action: Author Next Action: Reviewers and removed Next Action: Reviewers Next Action: Author labels Jan 13, 2025

[spark] spotless

e551b4c

je-ik reviewed Jan 14, 2025


github-actions bot added the slow-review label Jan 21, 2025

github-actions bot removed the slow-review label Jan 23, 2025

github-actions bot added Next Action: Author and removed Next Action: Reviewers labels Jan 23, 2025

[spark] Add config option for enable GBK translation to support huge values

75a880b

github-actions bot added Next Action: Reviewers and removed Next Action: Author labels Jan 26, 2025

je-ik approved these changes Jan 27, 2025


je-ik merged commit 73d254b into apache:master Jan 27, 2025
15 checks passed