feat(nvidia/xid, hw slowdown, fuse, PCI): use common events store, simplify nvml poller initialization #336

Closed
wants to merge 1 commit into from

gyuho (Collaborator) commented Jan 26, 2025

  • Use components/db to persist NVML-originated Xid events
  • Switch to components/db to persist HW slowdown events

Should be merged after #321.

gyuho added this to the v0.4.0 milestone Jan 26, 2025
gyuho self-assigned this Jan 26, 2025
gyuho added the wip - do not merge working in progress label Jan 26, 2025

codecov bot commented Jan 26, 2025

Codecov Report

Attention: Patch coverage is 1.10497% with 179 lines in your changes missing coverage. Please review.

Project coverage is 20.68%. Comparing base (760ef3b) to head (2ff3017).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/server/server.go | 0.00% | 37 Missing ⚠️ |
| components/fuse/component_output.go | 0.00% | 32 Missing ⚠️ |
| components/accelerator/nvidia/query/nvml/xid.go | 0.00% | 24 Missing ⚠️ |
| components/diagnose/scan.go | 0.00% | 19 Missing ⚠️ |
| components/fuse/component.go | 0.00% | 18 Missing ⚠️ |
| components/accelerator/nvidia/query/nvml/nvml.go | 0.00% | 14 Missing ⚠️ |
| ...nents/accelerator/nvidia/query/nvidia_smi_query.go | 0.00% | 11 Missing ⚠️ |
| components/accelerator/nvidia/query/query.go | 0.00% | 8 Missing ⚠️ |
| ...omponents/accelerator/nvidia/query/nvml/options.go | 0.00% | 4 Missing ⚠️ |
| components/accelerator/nvidia/query/options.go | 0.00% | 4 Missing ⚠️ |

... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #336      +/-   ##
==========================================
- Coverage   21.47%   20.68%   -0.79%     
==========================================
  Files         300      296       -4     
  Lines       27186    26530     -656     
==========================================
- Hits         5837     5489     -348     
+ Misses      20710    20424     -286     
+ Partials      639      617      -22     


gyuho changed the title from "feat(nvidia/xid, hw slowdown): use common events store" to "feat(nvidia/xid, hw slowdown): use common events store, simplify nvml poller initialization" Jan 26, 2025
gyuho force-pushed the xid-from-nvml-to-events-db branch 2 times, most recently from 071b7d5 to b2d5e17 Jan 27, 2025 01:26
gyuho removed the wip - do not merge working in progress label Jan 27, 2025
gyuho force-pushed the xid-from-nvml-to-events-db branch 3 times, most recently from 5baab10 to 35c443f Jan 27, 2025 15:57
gyuho changed the title from "feat(nvidia/xid, hw slowdown): use common events store, simplify nvml poller initialization" to "feat(nvidia/xid, hw slowdown, PCI): use common events store, simplify nvml poller initialization" Jan 28, 2025
xiang90 (Contributor) commented Jan 28, 2025

Can we somehow split this into multiple PRs?

gyuho changed the title from "feat(nvidia/xid, hw slowdown, PCI): use common events store, simplify nvml poller initialization" to "feat(nvidia/xid, hw slowdown, fuse, PCI): use common events store, simplify nvml poller initialization" Jan 28, 2025
…mplify nvml poller initialization

Signed-off-by: Gyuho Lee <[email protected]>
gyuho closed this Jan 28, 2025
gyuho deleted the xid-from-nvml-to-events-db branch January 28, 2025 13:29
gyuho removed the wip - do not merge working in progress label Jan 28, 2025
gyuho removed this from the v0.4.0 milestone Jan 28, 2025
gyuho added a commit that referenced this pull request Feb 3, 2025
c.f., #336

---------

Signed-off-by: Gyuho Lee <[email protected]>
gyuho added a commit that referenced this pull request Feb 3, 2025
gyuho added a commit that referenced this pull request Feb 5, 2025
…sable NVML Xid event watcher in favor of "dmesg" watcher, deprecate redundant "error-xid-sxid" component (#343)

- use common events DB for NVML-based xid watcher
- disable NVML Xid event watcher in favor of "dmesg" watcher
- deprecate redundant "error-xid-sxid" component

c.f., #336

---------

Signed-off-by: Gyuho Lee <[email protected]>
gyuho added a commit that referenced this pull request Feb 6, 2025
c.f., #336

Requires #341.

Tested

<img width="1102" alt="Screenshot 2025-01-28 at 11 01 33 PM"
src="https://github.com/user-attachments/assets/78c12072-42df-4c17-8789-2eb900577eb5"
/>

<img width="1537" alt="Screenshot 2025-01-28 at 11 01 45 PM"
src="https://github.com/user-attachments/assets/9a520798-9aab-499f-9f03-433f1c8a9295"
/>

Signed-off-by: Gyuho Lee <[email protected]>
2 participants