[GOBBLIN-1947] Send workUnitChange event when helix task consistently fail #3832

hanghangliu · 2023-11-16T19:44:43Z

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-1947] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1947

Description

Here are some details about my PR, including screenshots (if applicable):
When YarnAutoScalingManager detects that a Helix task is consistently failing, give an option to send a WorkUnitChangeEvent so that GobblinHelixJobLauncher can handle the event and split the work unit at runtime. This can help resolve consistently failing containers (for example, OOM) at runtime instead of relying on the replanner to restart the whole pipeline.
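
To make the intended flow easier to picture, here is a minimal sketch of the escalation idea. It is not the code in this PR: `FailureEscalationSketch`, `ChangeEvent`, `FailureMonitor`, and the threshold of 3 are all hypothetical names and values chosen for illustration, and the real detection logic in YarnAutoScalingManager may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative only: none of these types are Gobblin classes.
final class FailureEscalationSketch {

  // Hypothetical stand-in for WorkUnitChangeEvent: carries only the failing Helix task IDs.
  static final class ChangeEvent {
    final List<String> oldTaskIds;

    ChangeEvent(List<String> oldTaskIds) {
      this.oldTaskIds = oldTaskIds;
    }
  }

  // Hypothetical monitor: counts consecutive failures per Helix task ID.
  static final class FailureMonitor {
    private final int threshold;
    private final Consumer<ChangeEvent> sink;
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    FailureMonitor(int threshold, Consumer<ChangeEvent> sink) {
      this.threshold = threshold;
      this.sink = sink;
    }

    void onTaskResult(String helixTaskId, boolean succeeded) {
      if (succeeded) {
        consecutiveFailures.remove(helixTaskId); // a success resets the streak
        return;
      }
      int failures = consecutiveFailures.merge(helixTaskId, 1, Integer::sum);
      if (failures >= threshold) {
        consecutiveFailures.remove(helixTaskId); // start a fresh streak after escalating
        sink.accept(new ChangeEvent(Collections.singletonList(helixTaskId)));
      }
    }
  }

  public static void main(String[] args) {
    List<ChangeEvent> received = new ArrayList<>();
    FailureMonitor monitor = new FailureMonitor(3, received::add);
    for (int i = 0; i < 3; i++) {
      monitor.onTaskResult("helix-task-7", false);
    }
    System.out.println(received.get(0).oldTaskIds);
  }
}
```

In this sketch the event carries only the old task ID; the receiver is expected to work out the replacement work units itself.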

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
Updated test cases and tested in cluster

Commits

My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

codecov-commenter · 2023-11-22T01:09:14Z

Codecov Report

Attention: 59 lines in your changes are missing coverage. Please review.

Comparison is base (dd17bed) 47.53% compared to head (978f989) 45.96%.
Report is 21 commits behind head on master.

Files	Patch %	Lines
...pache/gobblin/cluster/GobblinHelixJobLauncher.java	3.27%	59 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #3832      +/-   ##
============================================
- Coverage     47.53%   45.96%   -1.57%     
+ Complexity    11035     2181    -8854     
============================================
  Files          2156      416    -1740     
  Lines         85377    18044   -67333     
  Branches       9491     2199    -7292     
============================================
- Hits          40581     8294   -32287     
+ Misses        41099     8865   -32234     
+ Partials       3697      885    -2812

ZihanLi58 · 2023-11-22T00:56:37Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java

+    final JobState jobState = this.jobContext.getJobState();
+    List<WorkUnit> workUnits = workUnitChangeEvent.getNewWorkUnits();
+    // Use old task Id to recalculate new work units
+    if(workUnits == null || workUnits.isEmpty()) {


In which scenario will we have no new work units here?

This happens when we send the WorkUnitChangeEvent from YarnAutoScalingManager. The YARN service only has the Helix information, so it may not be easy to pre-calculate the new work units there: that process needs KafkaSource, which YARN isn't aware of.
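
A rough sketch of the fallback described above, under stated assumptions: `WorkUnitFallbackSketch`, `resolveNewWorkUnits`, and the `recalculate` function are hypothetical, and work units are represented as plain strings. In the PR the recalculation goes through the source, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative only: work units are plain strings and the splitter is a stand-in.
final class WorkUnitFallbackSketch {

  // Use the work units carried by the event when present; otherwise
  // recalculate them from the old task IDs.
  static List<String> resolveNewWorkUnits(List<String> newWorkUnits, List<String> oldTaskIds,
      Function<String, List<String>> recalculate) {
    if (newWorkUnits != null && !newWorkUnits.isEmpty()) {
      return newWorkUnits;
    }
    List<String> result = new ArrayList<>();
    for (String oldTaskId : oldTaskIds) {
      result.addAll(recalculate.apply(oldTaskId));
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical splitter: each old task becomes two smaller units.
    Function<String, List<String>> splitInTwo = id -> Arrays.asList(id + "-a", id + "-b");
    System.out.println(resolveNewWorkUnits(null, Arrays.asList("task-1"), splitInTwo));
  }
}
```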

ZihanLi58 · 2023-11-22T00:58:51Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java

+      //todo: emit some event to indicate there is an error handling this event that may cause starvation
+      log.error("Failed to process WorkUnitChangeEvent with old tasks {} and new workunits {}.",
+          workUnitChangeEvent.getOldTaskIds(), workUnits, e);
+      throw new InvocationTargetException(e);


Did we test what the behavior is when this exception is thrown? Are we able to catch it, fail the whole application, and restart directly? Or will it eventually fail silently and starve?

Also curious: why do we throw InvocationTargetException?

Throwing the error actually won't fail the application, so we rely on the Retryer and the replanner (if the Retryer also fails) here. I've run the test for a week and didn't see any error thrown, but I do agree this may be hard to debug.
I've tried to restart the whole workflow, but it's not very straightforward.

For the InvocationTargetException, it's actually inherited from the super class, which was written by you :)
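
For readers following the thread, the retry-then-wrap pattern under discussion can be sketched roughly as below. This is an assumption-laden illustration in plain Java, not the Retryer used in Gobblin: `RetryThenWrapSketch` and `callWithRetries` are hypothetical, and there is no backoff or logging.

```java
import java.lang.reflect.InvocationTargetException;
import java.util.concurrent.Callable;

// Illustrative only: a bounded retry loop that wraps the last failure.
final class RetryThenWrapSketch {

  static <T> T callWithRetries(Callable<T> action, int maxAttempts)
      throws InvocationTargetException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e; // remember the failure and try again
      }
    }
    // All attempts failed: surface the last cause to the caller.
    throw new InvocationTargetException(last);
  }

  public static void main(String[] args) throws InvocationTargetException {
    int[] calls = {0};
    String result = callWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new IllegalStateException("transient failure");
      }
      return "done after " + calls[0] + " attempts";
    }, 5);
    System.out.println(result);
  }
}
```

Whatever catches the wrapped exception decides what happens next. As noted above, in the PR the throw alone does not fail the application.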

ZihanLi58 · 2023-11-22T01:05:27Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java

                                                                                    RetryException {
    String jobName = this.jobContext.getJobId();
    try (ParallelRunner stateSerDeRunner = new ParallelRunner(this.stateSerDeRunnerThreads, this.fs)) {
-      for (String workUnitId : workUnitIdsToRemove) {
+      for (String helixTaskId : helixTaskIdsToRemove) {
+        String workUnitId = helixIdTaskConfigMap.get(helixTaskId).getConfigMap().get(ConfigurationKeys.TASK_ID_KEY);
        taskRetryer.call(new Callable<Boolean>() {
          @Override
          public Boolean call() throws Exception {
            String taskId = workUnitToHelixConfig.get(workUnitId).getId();


do you still need this?

ZihanLi58 · 2023-11-22T01:06:21Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java

@@ -514,6 +599,7 @@ private TaskConfig getTaskConfig(WorkUnit workUnit, ParallelRunner stateSerDeRun
    rawConfigMap.put(GobblinClusterConfigurationKeys.TASK_SUCCESS_OPTIONAL_KEY, "true");
    TaskConfig taskConfig = TaskConfig.Builder.from(rawConfigMap);
    workUnitToHelixConfig.put(workUnit.getId(), taskConfig);


Same here, do we still need this? I feel you want to use the Helix task ID instead of the work unit ID here.

Basically, helixIdTaskConfigMap and workUnitToHelixConfig seem similar. Can we just use one of them to reduce complexity?

removed workUnitToHelixConfig
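
A small sketch of the single-map approach, for illustration only. `SingleTaskMapSketch` and `TaskConfigStub` are hypothetical stand-ins for the launcher and Helix's TaskConfig, and the `TASK_ID_KEY` value here is a placeholder rather than a claim about ConfigurationKeys.TASK_ID_KEY.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: one map keyed by Helix task ID, with the work unit ID
// stored inside each task's config map.
final class SingleTaskMapSketch {

  static final String TASK_ID_KEY = "task.id"; // placeholder key for this sketch

  // Hypothetical stand-in for a Helix TaskConfig.
  static final class TaskConfigStub {
    final String helixTaskId;
    final Map<String, String> configMap;

    TaskConfigStub(String helixTaskId, Map<String, String> configMap) {
      this.helixTaskId = helixTaskId;
      this.configMap = configMap;
    }
  }

  private final Map<String, TaskConfigStub> helixIdTaskConfigMap = new HashMap<>();

  void register(String helixTaskId, String workUnitId) {
    Map<String, String> config = new HashMap<>();
    config.put(TASK_ID_KEY, workUnitId);
    helixIdTaskConfigMap.put(helixTaskId, new TaskConfigStub(helixTaskId, config));
  }

  // Derive the work unit ID from the Helix task ID; no second map is needed.
  String workUnitIdFor(String helixTaskId) {
    TaskConfigStub config = helixIdTaskConfigMap.get(helixTaskId);
    if (config == null) {
      throw new IllegalArgumentException("Unknown Helix task ID: " + helixTaskId);
    }
    return config.configMap.get(TASK_ID_KEY);
  }

  public static void main(String[] args) {
    SingleTaskMapSketch sketch = new SingleTaskMapSketch();
    sketch.register("helix-task-1", "work-unit-42");
    System.out.println(sketch.workUnitIdFor("helix-task-1"));
  }
}
```

The explicit null check is a choice made for this sketch: an unknown task ID produces a clear error instead of a NullPointerException from a chained lookup.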

ZihanLi58 · 2023-11-22T01:10:46Z

gobblin-runtime/src/main/java/org/apache/gobblin/runtime/AbstractJobLauncher.java

-        Boolean canCleanUp = this.canCleanStagingData(this.jobContext.getJobState());
-        workUnitStream = closer.register(new DestinationDatasetHandlerService(jobState, canCleanUp, this.eventSubmitter))
-            .executeHandlers(workUnitStream);
+        this.canCleanUpStagingData = this.canCleanStagingData(this.jobContext.getJobState());


Is this from another change? It seems unrelated to this PR.

This is actually needed and may easily get overlooked. It's a step that makes sure the work units can handle shards for the target directory. It took me a while to figure out and debug when testing in the cluster...

I'm a little confused about the issue we are trying to solve here. Can you add a comment in the code to explain that?

ZihanLi58

+1

… fail (apache#3832) * Send WorkUnitChangeEvent when helix task consistently fail * make lancher and scheduler correctly process work unit change event * change back pack config key * correctly process workunit stream before run * only use helix task map * update WorkUnitPreparator for job launcher * update log * use workunit id for state store

hanghangliu added 3 commits November 16, 2023 11:38

Send WorkUnitChangeEvent when helix task consistently fail

a9de744

make lancher and scheduler correctly process work unit change event

9c1e510

change back pack config key

46145fb

ZihanLi58 reviewed Nov 22, 2023

hanghangliu added 5 commits December 1, 2023 09:44

correctly process workunit stream before run

d31d3e0

only use helix task map

744dafb

update WorkUnitPreparator for job launcher

dfd0d76

update log

9ad98bd

use workunit id for state store

978f989

ZihanLi58 approved these changes Jan 4, 2024

ZihanLi58 merged commit f8880ed into apache:master Jan 4, 2024
6 checks passed
