[GSoC] Project6: Push-based Metrics Collection for Katib #2340
Goal
The project aims to provide a Python SDK API that lets users push metrics to the Katib DB directly.
The current Metrics Collector implementation is pull-based, which raises several problems: design questions such as how frequently to scrape the metrics, performance issues such as the overhead caused by too many sidecar containers, and environment restrictions, since the development environment must support sidecar containers and admission webhooks.
Thus, we decided to implement a new API in the Katib Python SDK that offers users a push-based way to store metrics directly in the Katib DB and resolves the issues raised by pull-based metrics collection.
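To make the push-based idea concrete, here is a minimal sketch in plain Python. It does not use the Katib SDK: `InMemoryMetricsStore`, `MetricLog`, `push_metrics`, and `train` are hypothetical names invented for illustration, and the in-memory store only stands in for the Katib DB. The point it illustrates is that the training code itself decides when to report, so no scrape interval has to be chosen and no sidecar container is needed.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MetricLog:
    """One metric observation for one trial."""
    trial_name: str
    metric_name: str
    value: str
    timestamp: float


@dataclass
class InMemoryMetricsStore:
    """Hypothetical stand-in for the Katib DB (not a Katib class)."""
    logs: List[MetricLog] = field(default_factory=list)

    def add(self, log: MetricLog) -> None:
        self.logs.append(log)

    def latest(self, trial_name: str, metric_name: str) -> str:
        """Return the most recently pushed value of a metric for a trial."""
        matches = [
            log for log in self.logs
            if log.trial_name == trial_name and log.metric_name == metric_name
        ]
        if not matches:
            raise KeyError(f"no metric {metric_name!r} for trial {trial_name!r}")
        return matches[-1].value


def push_metrics(store: InMemoryMetricsStore, trial_name: str,
                 metrics: Dict[str, Any]) -> None:
    """Push metrics from inside the training code (hypothetical helper).

    The caller chooses when to report, so there is nothing to scrape.
    """
    now = time.time()
    for name, value in metrics.items():
        store.add(MetricLog(trial_name, name, str(value), now))


def train(store: InMemoryMetricsStore, trial_name: str, epochs: int = 3) -> None:
    """Toy training loop that reports once per epoch."""
    for epoch in range(1, epochs + 1):
        loss = 1.0 / epoch  # placeholder for a real loss value
        push_metrics(store, trial_name, {"loss": loss, "epoch": epoch})


store = InMemoryMetricsStore()
train(store, "trial-a")
print(store.latest("trial-a", "loss"))
```

In the real project the push goes to the Katib DB through the `report_metrics` interface in the Python SDK; the exact signature and transport are not shown here and should be taken from the SDK itself.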
What I Did in the GSoC Project & Ongoing Work
This issue tracks the progress of developing push-based metrics collection for Katib during the GSoC coding phase.
I raised numerous PRs for the Katib and Training-Operator projects. Some are related to my GSoC project; others improve the completeness of unit tests (UTs), simplify dependency packages, and improve the compatibility of UI components.
I also raised issues, not only to describe the problems and bugs I met during the coding period, but also to suggest future enhancement directions for Katib and Training-Operator.
PRs concerned with the project:
- `tune` function: [GSoC] Add New Parameter in `tune` #2369
- `report_metrics` in Python SDK: [GSoC] New Interface `report_metrics` in Python SDK #2371

Other PRs:
- `protocmp` in `google.golang.org/protobuf/testing/protocmp`. #2391
- `inject_webhook_test.go` according to the Developer Guide #2401
- `wait_for_job_conditions` training-operator#2196

Issues I raised:
- `/pkg/webhook/v1beta1/pod/inject_webhook_test.go` according to Developer Guide. #2388
- `google.golang.org/protobuf/testing/protocmp` #2389
- `tune` #2402
- `metrics` Field in `report_metrics()` Interface #2421
- `git+https` #2422

Please let me know if you have any suggestions, @kubeflow/wg-automl-leads!
Lessons I Learned During the Project
Think Twice, Code Once: @andreyvelich taught me that we should think through the API specification and all the related details before coding. This can significantly reduce the workload of the coding period and avoid big refactors of the project. Meanwhile, my understanding of Katib gradually became clearer over repeated rounds of rethinking and redesigning the architecture.
Dive into the Source Code: Engineering projects nowadays are extremely complex and take much effort to understand. The best way to get familiar with a project is to dive into the source code and run several examples.
Communication: Communication is the most important thing when we collaborate with others. Expressing your ideas precisely and making yourself easy to understand are significant skills, not only in the open source community but also in other settings such as companies and group work.
In the End
Special Thanks:
I hold a firm belief that every small step counts, and that everybody in the community is unique and of great significance. There is no doubt that our joint efforts will contribute to the flourishing of our Kubeflow Community, make it the world's best community for managing the AI lifecycle on Kubernetes, and attract much more attention from the industry. Then, more and more newcomers will pour in and work along with us.
Again, I'll continue to contribute to Kubeflow.