Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EDU-3819: docs: tried to clarify the notes on failures #3274

Merged
merged 6 commits into from
Jan 22, 2025

Conversation

GSmithApps
Copy link
Contributor

@GSmithApps GSmithApps commented Jan 6, 2025

What does this PR do?

Tried to clarify some parts of the Failures page. Specifically regarding the Application Failures

@fairlydurable fairlydurable added the work-in-progress This PR is not yet ready for technical or team review. Ideally it should be converted to a draft. label Jan 6, 2025
@GSmithApps GSmithApps force-pushed the fix-temporal-failures-clarification branch from 6255f9f to 39a9802 Compare January 21, 2025 17:38
@GSmithApps GSmithApps marked this pull request as ready for review January 21, 2025 17:40
@GSmithApps GSmithApps requested a review from a team as a code owner January 21, 2025 17:40
@GSmithApps
Copy link
Contributor Author

Hi @fairlydurable thanks for marking this as in-progress a few weeks ago. It is ready for review whenever time permits. Thank you!!

Side note: I see that vercel says the deployment failed -- is that just flaky CICD, or has something gone wrong? I don't think I've changed anything that would cause deployment to fail

Lastly, please let me know if I can help out at all - cheers!

@fairlydurable
Copy link
Contributor

Vercel is failing because of a broken anchor: linking to /workers#activity-task-execution

The new link is /tasks#activity-task-execution

Hope that helps

@fairlydurable fairlydurable added the cross-team This issue or PR was submitted from within Temporal label Jan 21, 2025
@fairlydurable fairlydurable changed the title docs: tried to clarify the notes on failures EDU-3819: docs: tried to clarify the notes on failures Jan 21, 2025
@fairlydurable fairlydurable added the tracking-internally A JIRA ticket has been generated and the EDU number is attached to the title of this issue or PR label Jan 21, 2025
@@ -18,7 +18,7 @@ import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/rela
A Retry Policy works in cooperation with the timeouts to provide fine controls to optimize the execution experience.

A Retry Policy is a collection of attributes that instructs the Temporal Server how to retry a failure of a [Workflow Execution](/workflows#workflow-execution) or an [Activity Task Execution](/tasks#activity-task-execution).
(Retry Policies do not apply to [Workflow Task Executions](/tasks#workflow-task-execution), which always retry indefinitely.)
(Retry Policies do not apply to [Workflow Task Executions](/tasks#workflow-task-execution), which retry until the Workflow Execution Timeout, which is unlimited by default.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retry Policies do not apply to [Workflow Task Executions](/tasks#workflow-task-execution), which retry until the Workflow Execution Timeout.
The Execution Timeout for Workflow Task Executions is unlimited by default.

Won't they apply if the default is changed? What happens then? Can we stick in a "normally", get rid of the parentheses, and break this down into a couple of sentences?

Suggested change
(Retry Policies do not apply to [Workflow Task Executions](/tasks#workflow-task-execution), which retry until the Workflow Execution Timeout, which is unlimited by default.)
Retry Policies normally do not apply to [Workflow Task Executions](/tasks#workflow-task-execution) unless their Workflow Execution Timeout is changed.
Workflow Task Executions will retry until reaching their Workflow Execution Timeout.
The default value for this timeout is unlimited, so a Workflow Task Execution will normally retry forever.

I'm not sure if I got the semantics correct here, so please confirm.

Also, does this need a tech reviewer to confirm your changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workflow task executions are different from activities and workflow executions. I'm not sure precisely what their retry policy is, so I didn't change any of that in this sentence. I just added the part I was quite sure of, which is that they don't always retry forever -- they retry until the workflow execution timeout, which is by default forever

@@ -52,21 +52,22 @@ This is the only type of Temporal Failure created and thrown by user code.
An error in a Workflow can cause either a **Workflow Task Failure** (the Task will be retried) or a **Workflow Execution Failure** (the Workflow is marked as failed).

Only Workflow exceptions that are Temporal Failures cause the Workflow Execution to fail; all other exceptions cause the Workflow Task to fail and be retried (in Go, any error returned from the Workflow fails the Workflow Execution, and a panic fails the Workflow Task).
Most types of Temporal Failures occur automatically, like a [Cancelled Failure](#cancelled-failure) when the Workflow is Cancelled or an [Activity Failure](#activity-failure) when an Activity Fails.
You can also explicitly fail the Workflow Execution by throwing an Application Failure (returning any error in Go).
Most types of Temporal Failures are raised by the Temporal Service, like a [Cancelled Failure](#cancelled-failure) when the Workflow is Cancelled or an [Activity Failure](#activity-failure) when an Activity Fails.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider agreement between the articles for Workflow and Activity ("when a Workflow is Cancelled", "when an Activity fails").

Should F be capitalized here in "Fails"? I assume it is correct for "Cancelled Failure" and "Activity Failure", although Failure isn't in my Temporal caps list.

If you want to get rid of passive voice, you can move Temporal Service ("The Temporal Service raises most types of Temporal failures, like a...")

Copy link
Contributor Author

@GSmithApps GSmithApps Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on not capitalizing the F!

Regarding the argeement on articles, yeah same here -- I felt like something was clunky when I read it too. I couldn't quite tell what it was. But as far as the article agreement goes ("when a Workflow is Cancelled", "when an Activity fails"), I actually think this is intentional. From the reader/user's perspective, the cancellation is probably a manual intervention, so it will feel like it was cancelled, whereas with an Activity Failure, it wouldn't happen because of manual intervention, so that would feel more passive and more distant or out of their control. Does that align with how the prose reads to you? (Also, I didn't change that part, but I figure you're mentioning it because we should fix it while we're here, right?)

Still thinking on the passive voice thing. I don't think I really tried to change the voice on that -- I just expanded more, but maybe that changes the voice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We srsly are not going to block on passive voice at this time. Jwahir did a lovely ticket over Docscember clearing out passive voice but we just want to improve things right now and not throw obstacles and we are pretty low-friction at this point with a great new manager.

This explanation you've added is an obvious win that servers the user. I'll let you know what can be improved, what should be improved or fixed, and what I find confusing. Beyond that? ¯_(ツ)_/¯ We'll get to it. The guiding philosophy right now as I see it is keep moving things forward to serve our users better and incremental improvements are better than blocked tickets.

Most types of Temporal Failures occur automatically, like a [Cancelled Failure](#cancelled-failure) when the Workflow is Cancelled or an [Activity Failure](#activity-failure) when an Activity Fails.
You can also explicitly fail the Workflow Execution by throwing an Application Failure (returning any error in Go).
Most types of Temporal Failures are raised by the Temporal Service, like a [Cancelled Failure](#cancelled-failure) when the Workflow is Cancelled or an [Activity Failure](#activity-failure) when an Activity Fails.
In contrast, the application developer can explicitly fail the Workflow Execution by throwing an Application Failure (returning any error in Go).
Copy link
Contributor

@fairlydurable fairlydurable Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "you can explicitly fail a Workflow Execution from code by throwing an Application Failure. In Go, returning any error (?error instance? ?error type?) will fail a Workflow Execution." to get rid of the parenthesized Go special case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on switching "application developer" to "you"!

Yeah I agree that the parenthesis felt a little odd. I did it like this to mirror the description two sentences ago, because it's a similar situation in which Go behaves differently from how the other SDKs do.

Lol I had the same question about what exactly constitutes an error in Go, and I'm pretty darn sure the precise behavior is that any non-nil value as the last return value will cause the failure, but I think every Go programmer would describe that as returning an error. It doesn't have to be any specific instance or type... I know, it's weird


#### Workflow Task Failures

A **Workflow Task Failure** is an unexpected situation failing to process a Workflow Task.
This could be triggered by raising an exception in your Workflow code.
This could be triggered by a non-Temporal exception being raised (panicking in Go) in your Workflow code.
Copy link
Contributor

@fairlydurable fairlydurable Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "a non-Temporal exception being raised, or panicking in Go, in your Workflow code" to get rid of the parenthesized Go special case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, I was following the existing pattern for how Go one-offs were called out. Happy to do whatever on these, but I do think it's worth trying to draw the reader's attention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine. Let it be.

Any exception that does not extend Temporal's `FailureError` exception is considered a Workflow Task Failure.
These types of failures will cause the Workflow Task to be retried.
These types of failures will cause the Workflow Task to be retried until the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest rewording (assuming I got the logic right above). "These types of failures cause a Workflow Task to be retried indefinitely unless the Workflow Execution Timeout was changed from its default of "unlimited" time." Or something like that.

Copy link
Contributor Author

@GSmithApps GSmithApps Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your style on making the sentence flow more easily! On my read, I slightly prefer emphasizing that the retry will be until the Workflow Execution Timeout, then calling out the default almost as an afterthought -- rather than stating that they're retried indefinitely then including the things about Workflow Execution Timeout as the afterthought. And how we have it in this PR is how Max described it in an error handling explanation in an engineering bootcamp video. Idk though I think you're the writing expert

And yep the logic I see in your note and my understanding match!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the best of my understanding, Workflow Executions are not retried unless a retry policy is set. So even my rewording is still confusing if I got that part right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh I see the confusion -- this is talking about workflow tasks, not workflow executions, which are different. workflow executions are not retried, but workflow tasks are

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, gotcha! Thanks


#### Workflow Execution Failures

An `ApplicationError`, an extension of `FailureError`, can be raised in a Workflow to fail the Workflow Execution.
Workflow Execution Failures put the Workflow Execution into the "Failed" state and no more attempts will be made in progressing this execution.
If you are creating custom exceptions you would either need to extend the `ApplicationError` class—a child class of `FailureError`— or explicitly state that this exception is a Workflow Execution Failure by raising a new `ApplicationError`.
If you are creating custom exceptions you would need to extend the `ApplicationError` class—a child class of `FailureError`.
Copy link
Contributor

@fairlydurable fairlydurable Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it help to link to the API references for these two?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea! I think links are the way to go too. In this specific case, maybe it's a little tricky because this is a general article that isn't specific to any one language. And the most language-neutral description of these two things is above in this article. So we could do an internal link if we want. Or maybe we could bullet out all of the languages and link to their API docs? Idk I figure we won't do that.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for clarifying! Sounds to me like it's an Event History entry name. I see that it's code-faced on the failures reference page, although I'm having "thoughts" about that unless it is a globally defined constant across SDKs.

Would these links work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree - I'm not totally clear on our language about errors vs failures. But I'm pretty sure that this is a keyword across SDKs. My understanding is that whatever the SDK calls it (application error, application failure), it gets converted by the SDK into the language-neutral format of application failure.

Yeah I like the links -- I'm making the fix now! I think the first one should be https://docs.temporal.io/references/failures#temporal-failure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay yeah I added them. I don't think it's perfect, but I think it's probably the best we can do given that creating custom exceptions is slightly SDK-specific and this is an SDK-agnostic article. I'm happy with it if you are!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good enough is a great thing.

@@ -82,7 +83,7 @@ During conversion, the following Application Failure fields are set:
- `next_retry_delay` is left unset.
- call stack is copied.

When an [Activity Execution](/activities#activity-execution) fails, the Application Failure from the last Activity Task is the `cause` field of the [ActivityFailure](#activity-failure) thrown in the Workflow.
When an [Activity Execution](/activities#activity-execution) (an Activity Execution is the full chain of [Activity Task Executions](/tasks#activity-task-execution)) fails, the Application Failure from the last Activity Task is the `cause` field of the [ActivityFailure](#activity-failure). This Activity Failure is thrown by the Workflow's call to execute the Activity, and this call site is where Activity Failure can be handled by the application developer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very long and combines sentences on a single line. (Let me know if you want the copy-pasta of why we use our one sentence per line rules.) Suggest rewording to simplify the concepts because I found it confusing to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That copy-pasta would be awesome!! I'm learning about technical writing, so anything like that would be awesome. I went ahead and fixed that sentence-break

Yeah it's wordy and hard to read. I can work on it a little more if we want. Honestly, the original of how we had it was correct and fine, so if we want, we can go back to that too. I just expanded a little bit, but I may have been overzealous and it might not be a necessary expansion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why we put items on separate lines, nominally sentences:

  • Improves version control diffs for easier line-by-line inspection.
  • Focuses feedback and review discussions on single points, preventing overlap with other content that may need changes.
  • Reduces overlap and merge conflicts with other tickets.
  • Enhances the precision of our automated lint tooling, such as Vale (vale /path/to/file.mdx), allowing you to more easily locate and address issues by line number.

@fairlydurable
Copy link
Contributor

Thank you for submitting the PR. This is great stuff. There were parts that confused me and I tried to tease out the meanings but I'm not sure I did that correctly or not. Really appreciate your time. Let me know if any of the feedback (I made it non-blocking) makes sense to you or if you can approach my confusion points in your own way.

Comment on lines -55 to +56
Most types of Temporal Failures occur automatically, like a [Cancelled Failure](#cancelled-failure) when the Workflow is Cancelled or an [Activity Failure](#activity-failure) when an Activity Fails.
You can also explicitly fail the Workflow Execution by throwing an Application Failure (returning any error in Go).
Most types of Temporal Failures are raised by the Temporal Service, like a [Cancelled Failure](#cancelled-failure) when the Workflow is Cancelled or an [Activity Failure](#activity-failure) when an Activity fails.
In contrast, you can explicitly fail the Workflow Execution by throwing an Application Failure (returning any error in Go) in Workflow Definition code.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned up some things in response to your comment, and that seems to have wiped out your comment, so I'm just putting it back in case we want to continue discussing :)

Erica:
Consider agreement between the articles for Workflow and Activity ("when a Workflow is Cancelled", "when an Activity fails").

Should F be capitalized here in "Fails"? I assume it is correct for "Cancelled Failure" and "Activity Failure", although Failure isn't in my Temporal caps list.

If you want to get rid of passive voice, you can move Temporal Service ("The Temporal Service raises most types of Temporal failures, like a...")

Grant:
Agreed on not capitalizing the F!

Regarding the argeement on articles, yeah same here -- I felt like something was clunky when I read it too. I couldn't quite tell what it was. But as far as the article agreement goes ("when a Workflow is Cancelled", "when an Activity fails"), I actually think this is intentional. From the reader/user's perspective, the cancellation is probably a manual intervention, so it will feel like it was cancelled, whereas with an Activity Failure, it wouldn't happen because of manual intervention, so that would feel more passive and more distant or out of their control. Does that align with how the prose reads to you? (Also, I didn't change that part, but I figure you're mentioning it because we should fix it while we're here, right?)

Still thinking on the passive voice thing. I don't think I really tried to change the voice on that -- I just expanded more, but maybe that changes the voice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just me being writerly. If you feel strongly, no worries. That's why I didn't make the feedback blocking. I'll need to go through and do a pass.

In case you didn't know, we're between "regimes", so I don't have a style book to go by other than my experience. Because of that I'm not going to make a fuss about things like parentheticals or m-dashes because I'll just go with the flow of whatever gets decided later. (I even held myself back from complaining about modal auxiliaries!)

BTW, the easiest way to fix passive is to find the actor (typically at the end of the sentence) and flip: "The Temporal Service raises most of the Temporal Failure types you encounter. This includes Canceled Failures when the..."

Speaking of which, is Canceled Failure a command? A Task History entry? Does it need code facing for consistency? What role does it play and should it be capped (I expect so based on the context)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. I'm going to with your gut on the articles because this isn't a hill to die on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol yeah I am behind on m-dashes and modal auxiliaries. Not a clue! I could look them up though if you want a fellow discussion buddy :)

I agree on the question of what specifically a Canceled Failure is. I think the way I'd describe it is our language-neutral term for what in each SDK we'd call a CancelledError, CancelledFailure, or CancelledException. I left it not codefaced because Application Failure was also not code faced, and those two behave the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

@fairlydurable
Copy link
Contributor

Let me know when everything is updated and I'll do a pass for egregious issues. Thank you for your work on this ticket. Very appreciated.

Copy link
Contributor

@fairlydurable fairlydurable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All blocking issues addressed.

Approving PR

@fairlydurable fairlydurable merged commit 9c1e3fe into main Jan 22, 2025
4 checks passed
@fairlydurable fairlydurable deleted the fix-temporal-failures-clarification branch January 22, 2025 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cross-team This issue or PR was submitted from within Temporal tracking-internally A JIRA ticket has been generated and the EDU number is attached to the title of this issue or PR work-in-progress This PR is not yet ready for technical or team review. Ideally it should be converted to a draft.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants