
[GOBBLIN-1979] Pare down TaskStateCollectorService failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+ #3850

Merged

Conversation

phet
Contributor

@phet phet commented Dec 21, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Logging task-state collector failures at the granularity of every task is impractical when tasks number in the 100k's. Moreover, the stack trace from a retrieval failure provides no meaningful info, even more so when faced with thousands of repetitive logs like this (pre-change):

2023/12/20 22:08:17.894 +0000 WARN [ParallelRunner] [Azkaban] Task failed: Deserialize state for task_name_of_job_task_goes_here_0_1703105111101_51.tst
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at java.util.ArrayList.rangeCheck(ArrayList.java:657)
	at java.util.ArrayList.get(ArrayList.java:433)
	at org.apache.gobblin.runtime.TaskStateCollectorService$2.call(TaskStateCollectorService.java:216)
	at org.apache.gobblin.runtime.TaskStateCollectorService$2.call(TaskStateCollectorService.java:213)
	at org.apache.gobblin.util.executors.MDCPropagatingCallable.call(MDCPropagatingCallable.java:42)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Therefore, log only sparsely (but present a running total) and omit the stack trace.
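The sparse-logging idea above could be sketched as follows. This is a minimal illustration, not the PR's actual code: the interval constant, class name, and method names here are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SparseFailureLogging {
  // Hypothetical cadence; the real change picks its own interval.
  private static final int LOG_EVERY_N = 1000;
  private static final AtomicInteger missingTaskStateCount = new AtomicInteger();

  /** Log the first failure, then only every Nth, always with the running total. */
  static boolean shouldLog(int runningTotal) {
    return runningTotal == 1 || runningTotal % LOG_EVERY_N == 0;
  }

  static void onMissingTaskState(String taskStateName) {
    int total = missingTaskStateCount.incrementAndGet();
    if (shouldLog(total)) {
      // One sparse line with a running total; no stack trace attached.
      System.err.println("Missing task state (running total: " + total + "): " + taskStateName);
    }
  }

  public static void main(String[] args) {
    for (int i = 0; i < 2500; i++) {
      onMissingTaskState("task_" + i);
    }
    // 2500 failures produce only 3 log lines (at totals 1, 1000, 2000).
    System.out.println("total=" + missingTaskStateCount.get());
  }
}
```

The `AtomicInteger` keeps the running total correct even when failures are reported from the collector's parallel worker threads.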

Details

This arose because the destination-side volume enforced its namespace quota, which left 100k+ work units (WUs) failing. So while not an everyday event, this is a normal occurrence and therefore deserves graceful handling.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    existing unit tests

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

```diff
- TaskState taskState = taskStateStore.getAll(taskStateTableName, taskStateName).get(0);
- taskStateQueue.add(taskState);
+ List<TaskState> matchingTaskStates = taskStateStore.getAll(taskStateTableName, taskStateName);
+ if (matchingTaskStates.isEmpty()) {
```
Contributor
This PR only reduces logs in the case where task states are missing; if multiple tasks fail for the same reason, are those logs still not pared down?

Contributor Author

@phet phet Dec 22, 2023

Correct: this solely addresses cases where the state store fails to retrieve the task state but otherwise exits normally. Perhaps in another sort of failure, a state store impl might throw; this consolidation still permits such failures to pass through uninterrupted.

Since the state store already gave us the list of task state names on line 244, I'd expect any other such failure to be ephemeral (else an abject logical bug in the state store). Either way, I've avoided over-engineering the solution, precisely because, as you point out, we'd lose valuable debugging info by conflating dissimilar errors.

If a future failure scenario should arise from which we gain a concrete grasp of what kinds of errors these might be, I'd suggest extending this solution at that time.

@codecov-commenter

codecov-commenter commented Dec 21, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (18fba9e) 47.60% compared to head (ea0cd42) 47.59%.

Files Patch % Lines
...che/gobblin/runtime/TaskStateCollectorService.java 40.00% 5 Missing and 1 partial ⚠️
...n/exception/RuntimeExceptionWithoutStackTrace.java 0.00% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3850      +/-   ##
============================================
- Coverage     47.60%   47.59%   -0.01%     
- Complexity    11065    11067       +2     
============================================
  Files          2159     2160       +1     
  Lines         85464    85476      +12     
  Branches       9497     9499       +2     
============================================
+ Hits          40682    40684       +2     
- Misses        41080    41090      +10     
  Partials       3702     3702              

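The Codecov report lists a new `RuntimeExceptionWithoutStackTrace` class among the changed files. A common Java idiom for such an exception — and only a sketch here, as the actual Gobblin class may differ — is to suppress stack capture via `Throwable`'s four-argument protected constructor:

```java
public class RuntimeExceptionWithoutStackTrace extends RuntimeException {
  public RuntimeExceptionWithoutStackTrace(String message) {
    // enableSuppression=false, writableStackTrace=false: no stack is captured.
    super(message, null, false, false);
  }

  @Override
  public synchronized Throwable fillInStackTrace() {
    // Redundant given writableStackTrace=false above, but makes the intent
    // explicit: never record the call stack for this exception.
    return this;
  }

  public static void main(String[] args) {
    try {
      throw new RuntimeExceptionWithoutStackTrace("namespace quota exceeded");
    } catch (RuntimeExceptionWithoutStackTrace e) {
      // getStackTrace() is empty, so printStackTrace() emits only the message line.
      System.out.println("frames recorded: " + e.getStackTrace().length);
    }
  }
}
```

Skipping stack capture also avoids its cost, which matters when the same failure fires O(100k) times.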

@phet phet changed the title Pare down TaskStateCollectorService failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+ [GOBBLIN-1979] Pare down TaskStateCollectorService failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+ Dec 22, 2023
@Will-Lo Will-Lo merged commit 3302569 into apache:master Dec 24, 2023
6 checks passed
arjun4084346 pushed a commit to arjun4084346/gobblin that referenced this pull request Jan 10, 2024
… to avoid flooding logs during widespread failure, e.g. O(1k)+ (apache#3850)

* Pare down `TaskStateCollectorService` failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+

* Log final total of missing files