
[GOBBLIN-1979] Pare down TaskStateCollectorService failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+ #3850

Merged

Conversation

phet
Contributor

@phet phet commented Dec 21, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Logging task-state collector failures at the granularity of every task is impractical when tasks number in the 100k's. Moreover, the stack trace from a retrieval failure provides no meaningful info, even more so when faced with thousands of repetitive logs like this (pre-change):

2023/12/20 22:08:17.894 +0000 WARN [ParallelRunner] [Azkaban] Task failed: Deserialize state for task_name_of_job_task_goes_here_0_1703105111101_51.tst
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at java.util.ArrayList.rangeCheck(ArrayList.java:657)
	at java.util.ArrayList.get(ArrayList.java:433)
	at org.apache.gobblin.runtime.TaskStateCollectorService$2.call(TaskStateCollectorService.java:216)
	at org.apache.gobblin.runtime.TaskStateCollectorService$2.call(TaskStateCollectorService.java:213)
	at org.apache.gobblin.util.executors.MDCPropagatingCallable.call(MDCPropagatingCallable.java:42)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Therefore, log only sparsely (but present a running total) and omit the stack trace.
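The sparse-logging idea above could be sketched as follows. This is a minimal illustration, not the PR's actual code: the interval constant, class name, and method names here are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SparseFailureLogging {
  // Hypothetical cadence; the real change picks its own interval.
  private static final int LOG_EVERY_N = 1000;
  private static final AtomicInteger missingTaskStateCount = new AtomicInteger();

  /** Log the first failure, then only every Nth, always with the running total. */
  static boolean shouldLog(int runningTotal) {
    return runningTotal == 1 || runningTotal % LOG_EVERY_N == 0;
  }

  static void onMissingTaskState(String taskStateName) {
    int total = missingTaskStateCount.incrementAndGet();
    if (shouldLog(total)) {
      // One sparse line with a running total; no stack trace attached.
      System.err.println("Missing task state (running total: " + total + "): " + taskStateName);
    }
  }

  public static void main(String[] args) {
    for (int i = 0; i < 2500; i++) {
      onMissingTaskState("task_" + i);
    }
    // 2500 failures produce only 3 log lines (at totals 1, 1000, 2000).
    System.out.println("total=" + missingTaskStateCount.get());
  }
}
```

The `AtomicInteger` keeps the running total correct even when failures are reported from the collector's parallel worker threads.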

Details

This arose because the destination-side volume enforced its namespace quota, which left 100k+ work units (WUs) failing. So while not an everyday event, this is a normal occurrence and therefore deserves graceful handling.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    existing unit tests

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

```diff
- TaskState taskState = taskStateStore.getAll(taskStateTableName, taskStateName).get(0);
- taskStateQueue.add(taskState);
+ List<TaskState> matchingTaskStates = taskStateStore.getAll(taskStateTableName, taskStateName);
+ if (matchingTaskStates.isEmpty()) {
```
Contributor
This PR only reduces logs in the case where task states are missing; if multiple tasks fail for the same reason, are those logs still not pared down?

Contributor Author

@phet phet Dec 22, 2023

Correct: this solely addresses cases where the state store fails to retrieve the task state but otherwise exits normally. Perhaps in another sort of failure, a state store impl might throw; this consolidation still permits such failures to pass through uninterrupted.

Since the state store already gave us the list of task state names on line 244, I'd expect any other such failure to be ephemeral (else an abject logical bug in the state store). Either way, I've avoided over-engineering the solution, precisely because, as you point out, we'd lose valuable debugging info by conflating dissimilar errors.

If a future failure scenario should arise from which we gain a concrete grasp of what kinds of errors these might be, I'd suggest extending this solution at that time.

@codecov-commenter

codecov-commenter commented Dec 21, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (18fba9e) 47.60% compared to head (ea0cd42) 47.59%.

Files Patch % Lines
...che/gobblin/runtime/TaskStateCollectorService.java 40.00% 5 Missing and 1 partial ⚠️
...n/exception/RuntimeExceptionWithoutStackTrace.java 0.00% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3850      +/-   ##
============================================
- Coverage     47.60%   47.59%   -0.01%     
- Complexity    11065    11067       +2     
============================================
  Files          2159     2160       +1     
  Lines         85464    85476      +12     
  Branches       9497     9499       +2     
============================================
+ Hits          40682    40684       +2     
- Misses        41080    41090      +10     
  Partials       3702     3702              

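The Codecov report lists a new `RuntimeExceptionWithoutStackTrace` class among the changed files. A common Java idiom for such an exception — and only a sketch here, as the actual Gobblin class may differ — is to suppress stack capture via `Throwable`'s four-argument protected constructor:

```java
public class RuntimeExceptionWithoutStackTrace extends RuntimeException {
  public RuntimeExceptionWithoutStackTrace(String message) {
    // enableSuppression=false, writableStackTrace=false: no stack is captured.
    super(message, null, false, false);
  }

  @Override
  public synchronized Throwable fillInStackTrace() {
    // Redundant given writableStackTrace=false above, but makes the intent
    // explicit: never record the call stack for this exception.
    return this;
  }

  public static void main(String[] args) {
    try {
      throw new RuntimeExceptionWithoutStackTrace("namespace quota exceeded");
    } catch (RuntimeExceptionWithoutStackTrace e) {
      // getStackTrace() is empty, so printStackTrace() emits only the message line.
      System.out.println("frames recorded: " + e.getStackTrace().length);
    }
  }
}
```

Skipping stack capture also avoids its cost, which matters when the same failure fires O(100k) times.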

@phet phet changed the title Pare down TaskStateCollectorService failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+ [GOBBLIN-1979] Pare down TaskStateCollectorService failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+ Dec 22, 2023
@Will-Lo Will-Lo merged commit 3302569 into apache:master Dec 24, 2023
6 checks passed
arjun4084346 pushed a commit to arjun4084346/gobblin that referenced this pull request Jan 10, 2024
… to avoid flooding logs during widespread failure, e.g. O(1k)+ (apache#3850)

* Pare down `TaskStateCollectorService` failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+

* Log final total of missing files