-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature] Expose argo synchronization options #6553
Comments
@Bobgy @capri-xiyue: if this request makes sense, let me know if there's a good place to add these options in the dsl, and I can write up a patch. |
Hello @jmcarp , what is the CUJ for setting limitation of synchronization on 1)workflow level, 2) step level? |
Hi @zijianjoy, the typical use case in either case is to prevent concurrent execution of tasks that could otherwise change some shared resource. For example, if there's a race condition between tasks, or between multiple runs of the same task, it's helpful to use argo synchronization to prevent those tasks from running concurrently. This is a useful argo feature, and it would be helpful to have it in kfp as well. |
Hi @jmcarp , in your use case, can you add synchronization logic for the shared resources? I think when it comes to shared resources, it also makes sense that the synchronization logic is added to the shared resources instead of the workflows. |
I can implement my own synchronization, but that's the kind of feature I expect the workflow engine to provide--it's easy to get wrong, so ideally it's implemented once at the workflow layer and not independently by each user of the workflow tool. That's why tools like argo workflows and apache airflow include synchronization primitives, for example. |
I see a 👍 from @Bobgy. Would you accept a patch for this? |
I think that sounds like a fair argument, but regarding project status, we are transitioning to v2, so it's better reconsider this feature request after other critical features are ready. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Is there any update on that feature? We are currently using kfp 1.8 and are evaluating possibilities of migration towards v2.x. However, we rely heavily on Argo synchronization features, which we were able to utilize by patching Argo YAML that Kubeflow compiler was outputting. With IR YAML this will become impossible, since we won't have direct control over Argo YAML |
@Bobgy hi. Since 2.0 is officially released for quite some time, should this feature that allows semaphore usage get more attention? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
gonna pick this up / not stale |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
working on this |
/assign @gregsheremeta |
I've been working on this, and I don't think template/component level is going to be trivial to get working. Turns out all Components share a single template in the generated Workflow, and it looks roughly like this:
The Argo Workflows synchronization block for templates goes at that top level, alongside We do have a dynamic way to set the things that change per each Component (command, image, volumes, etc.) -- Possible paths forward that I can think of:
|
Not sure if any interested parties are still following this issue, but if so, I have some design questions. Per my previous comment, I'm focusing on Pipeline (Workflow) level synchronization (locking) for now. I have a prototype working, but I have some questions about how we should move forward. The pipeline code looks like this (note
The semaphore configmap looks like this:
When I almost-simultaneously start 3 runs of this pipeline, I expect that the first one will start, and the other two will wait until the lock frees up; then the second one will start and complete; and then the last one will go. And this is what happens. This flow renders quite nicely in the Argo Workflows UI (reverse chronological order): When everything is fully complete, you can see how the start and end times were staggered. However, in KFP, we don't know anything about what's holding things up in Argo Workflows. So KFP thinks all 3 started at the same time. It thinks run 1 takes ~40 seconds. It thinks run 2 takes ~80 seconds. It thinks run 3 takes ~120 seconds. That renders like this: Another thing that's making me ask this question is what happens when the semaphore isn't on the cluster. The Workflow fails very quickly when this happens. In the Argo UI, we get a very nice message about the reason. In the KFP UI, there's just a generic pipeline failed message -- nothing about the lock configmap missing. Do folks think it's acceptable to treat locks as "just another random thing to set on the workflow" and thereby have KFP be basically ignorant of what's going on, or would it be better to make KFP more aware of locking and add first-class support for it to the apiserver and the UI? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
/remove-lifecycle stale |
Feature Area
/area sdk
What feature would you like to see?
Argo includes useful synchronization features for workflows and templates: https://argoproj.github.io/argo-workflows/synchronization/. It would be helpful to expose these in kfp without having to manipulate manifests as yaml.
I'd be happy to work on this if the request makes sense. It seems like workflow and template level synchronization options would live on the pipeline config and container op classes respectively, but let me know if there's a better place to put them.
What is the use case or pain point?
Allow users to set a semaphore on a workflow or a step.
Is there a workaround currently?
Edit generated yaml.
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.
The text was updated successfully, but these errors were encountered: