Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-1947] Send workUnitChange event when helix task consistently fail #3832

Conversation

hanghangliu
Copy link
Contributor

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    When YarnAutoScalingManager detect helix task consistently fail, give an option to send WorkUnitChangeEvent to let GobblinHelixJobLauncher handle the event and split the work unit during runtime. This can help resolving consistent failing containers issue(like OOM) during runtime instead of relying on replaner to restart the whole pipeline

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Updated test cases and tested in cluster

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link

codecov-commenter commented Nov 22, 2023

Codecov Report

Attention: 59 lines in your changes are missing coverage. Please review.

Comparison is base (dd17bed) 47.53% compared to head (978f989) 45.96%.
Report is 21 commits behind head on master.

Files Patch % Lines
...pache/gobblin/cluster/GobblinHelixJobLauncher.java 3.27% 59 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3832      +/-   ##
============================================
- Coverage     47.53%   45.96%   -1.57%     
+ Complexity    11035     2181    -8854     
============================================
  Files          2156      416    -1740     
  Lines         85377    18044   -67333     
  Branches       9491     2199    -7292     
============================================
- Hits          40581     8294   -32287     
+ Misses        41099     8865   -32234     
+ Partials       3697      885    -2812     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

final JobState jobState = this.jobContext.getJobState();
List<WorkUnit> workUnits = workUnitChangeEvent.getNewWorkUnits();
// Use old task Id to recalculate new work units
if(workUnits == null || workUnits.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which scenario, we will have no new work units here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we send workUnitChangeEvent from YarnAutoScalingManager. The yarn service only has the information of helix, so may not easy to pre-calculate the new workUnit as this process need KafkaSource, which yarn isn't aware of.

//todo: emit some event to indicate there is an error handling this event that may cause starvation
log.error("Failed to process WorkUnitChangeEvent with old tasks {} and new workunits {}.",
workUnitChangeEvent.getOldTaskIds(), workUnits, e);
throw new InvocationTargetException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we test what's the behavior for throw this exception? are we able to catch it and fail the whole application and restart directly? Or it will finally fail silently and starve?

Also curious why do we throw InvocationTargetException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing error actually won't fail the application, so we rely on the Retryer and replanner(if Retryer also failed) here. I've run the test for a week and didn't see any error throwing, but I do agree this may be hard to debug.
I've tried to restart the whole workflow but it's not very straightforward

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the InvocationTargetException, it's actually inherited from the super class which written by you :)

RetryException {
String jobName = this.jobContext.getJobId();
try (ParallelRunner stateSerDeRunner = new ParallelRunner(this.stateSerDeRunnerThreads, this.fs)) {
for (String workUnitId : workUnitIdsToRemove) {
for (String helixTaskId : helixTaskIdsToRemove) {
String workUnitId = helixIdTaskConfigMap.get(helixTaskId).getConfigMap().get(ConfigurationKeys.TASK_ID_KEY);
taskRetryer.call(new Callable<Boolean>() {
@Override
public Boolean call() throws Exception {
String taskId = workUnitToHelixConfig.get(workUnitId).getId();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you still need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -514,6 +599,7 @@ private TaskConfig getTaskConfig(WorkUnit workUnit, ParallelRunner stateSerDeRun
rawConfigMap.put(GobblinClusterConfigurationKeys.TASK_SUCCESS_OPTIONAL_KEY, "true");
TaskConfig taskConfig = TaskConfig.Builder.from(rawConfigMap);
workUnitToHelixConfig.put(workUnit.getId(), taskConfig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, do we still need this? I feel you want to use helix TaskId instead of work unit Id here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically seems helixIdTaskConfigMap and workUnitToHelixConfig are similar, can we just use one of them to reduce complexity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed workUnitToHelixConfig

Boolean canCleanUp = this.canCleanStagingData(this.jobContext.getJobState());
workUnitStream = closer.register(new DestinationDatasetHandlerService(jobState, canCleanUp, this.eventSubmitter))
.executeHandlers(workUnitStream);
this.canCleanUpStagingData = this.canCleanStagingData(this.jobContext.getJobState());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this from other change? seems not related to this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually needed and may easily get overlooked. It's a process that make sure the workUnit can handle shards for target directory. Took me a while the figure out and debug when testing in the cluster...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused for the issue we are trying to solve here. Can you add comment in code to explain that?

Copy link
Contributor

@ZihanLi58 ZihanLi58 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ZihanLi58 ZihanLi58 merged commit f8880ed into apache:master Jan 4, 2024
6 checks passed
arjun4084346 pushed a commit to arjun4084346/gobblin that referenced this pull request Jan 10, 2024
… fail (apache#3832)

* Send WorkUnitChangeEvent when helix task consistently fail

* make lancher and scheduler correctly process work unit change event

* change back pack config key

* correctly process workunit stream before run

* only use helix task map

* update WorkUnitPreparator for job launcher

* update log

* use workunit id for state store
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants