[GOBBLIN-1979] Pare down `TaskStateCollectorService` failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+ #3850

Conversation
```diff
- TaskState taskState = taskStateStore.getAll(taskStateTableName, taskStateName).get(0);
- taskStateQueue.add(taskState);
+ List<TaskState> matchingTaskStates = taskStateStore.getAll(taskStateTableName, taskStateName);
+ if (matchingTaskStates.isEmpty()) {
```
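To make the change concrete, here is a minimal self-contained sketch of the pattern the new guard moves toward: count missing states instead of throwing on `.get(0)`, log only the first few, and report a final total. All names and the structure below are illustrative assumptions, not the actual `TaskStateCollectorService` code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative sketch only; names and structure are assumptions, not Gobblin source. */
class SparseMissingStateCollector {
  private static final Logger log = LoggerFactory.getLogger(SparseMissingStateCollector.class);
  private static final int MAX_INDIVIDUALLY_LOGGED = 10; // hypothetical cap on per-state log lines

  /** Collects one state per name, counting missing ones rather than failing or flooding logs. */
  static <T> List<T> collect(List<String> names, Function<String, List<T>> lookup) {
    List<T> collected = new ArrayList<>();
    int numMissing = 0;
    for (String name : names) {
      List<T> matching = lookup.apply(name); // analogous to taskStateStore.getAll(table, name)
      if (matching.isEmpty()) {
        if (++numMissing <= MAX_INDIVIDUALLY_LOGGED) {
          log.warn("Missing task state: '{}'", name); // single line, no stack trace
        }
        continue; // skip this name but keep collecting the rest
      }
      collected.add(matching.get(0)); // `.get(0)` is now guarded, unlike the pre-change code
    }
    if (numMissing > 0) {
      log.warn("Missing {} of {} task states in total", numMissing, names.size()); // final total
    }
    return collected;
  }
}
```

At O(100k) failing tasks, this turns hundreds of thousands of stack-trace entries into a handful of one-line warnings plus one summary line.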
This PR only reduces logs in the case where task states are missing, but if multiple tasks fail for the same reason, those logs are still not pared down?
Correct: this solely addresses cases where the state store does not retrieve the task state but otherwise exits normally. In some other sort of failure, a state store impl might throw; this consolidation still permits such failures to pass through uninterrupted.
Since the state store already gave us the list of task state names on line 244, I'd expect any other such failure to be ephemeral (or else an abject logical bug in the state store). Either way, I've avoided over-engineering the solution, precisely because, as you point out, we'd lose valuable debugging info by conflating dissimilar errors.
Should a future failure scenario give us a concrete grasp of what kinds of errors these might be, I'd suggest extending this solution at that time.
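For contrast, a sketch of the over-engineered alternative being argued against here, purely illustrative (the `IOException`-throwing store interface is an assumption, not the actual Gobblin `StateStore` API):

```java
import java.io.IOException;
import java.util.List;

/** Anti-pattern sketch: NOT what this PR does; all names here are illustrative. */
class ConflatingCollector {
  /** Hypothetical store whose lookups may throw, standing in for a throwing state store impl. */
  interface ThrowingStore<T> {
    List<T> getAll(String name) throws IOException;
  }

  static <T> int countSwallowingEverything(List<String> names, ThrowingStore<T> store, List<T> sink) {
    int numMissing = 0;
    for (String name : names) {
      try {
        List<T> matching = store.getAll(name);
        if (matching.isEmpty()) {
          numMissing++; // the only case this PR actually consolidates
        } else {
          sink.add(matching.get(0));
        }
      } catch (IOException e) {
        numMissing++; // dissimilar error folded into the same counter: debugging info lost
      }
    }
    return numMissing;
  }
}
```

By omitting the catch block, the PR's version instead lets a genuine state store failure propagate with its full stack trace intact.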
Pare down `TaskStateCollectorService` failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+ (apache#3850)
* Pare down `TaskStateCollectorService` failure logging, to avoid flooding logs during widespread failure, e.g. O(1k)+
* Log final total of missing files
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Logging task state collector failures at the granularity of every task is impractical when tasks number in the 100k's. Moreover, the stack trace from a retrieval failure provides no meaningful info, all the more so when repeated across thousands of near-identical log entries (the pre-change behavior).
Therefore, log only sparsely (while presenting a running total) and omit the stack trace.
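On omitting the stack trace specifically: with slf4j-style logging, whether a stack trace is printed depends on how the throwable is passed. A general-purpose sketch, not this PR's exact lines (the message text is illustrative, not an actual Gobblin log line):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Demo of general slf4j behavior; not code from this PR. */
class StackTraceLoggingDemo {
  private static final Logger log = LoggerFactory.getLogger(StackTraceLoggingDemo.class);

  static void demo(String taskStateName, Exception e) {
    // A Throwable passed as the final argument is rendered with its full, multi-line
    // stack trace: one such entry per failed task floods the logs at O(1k)+ scale.
    log.error("Could not retrieve task state '{}'", taskStateName, e);
    // Formatting the exception into the message itself keeps the entry to a single line.
    log.warn("Could not retrieve task state '{}' ({})", taskStateName, e.toString());
  }
}
```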
Details
This arose because the destination-side volume enforced its namespace quota, which left 100k+ WUs failing. So while not an everyday occurrence, it is a normal one and therefore deserves graceful handling.
Tests
Existing unit tests.
Commits