Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(infiniband): use events db to persist ibstat check thresholds #370

Closed
wants to merge 2 commits into from

Conversation

gyuho
Copy link
Collaborator

@gyuho gyuho commented Feb 7, 2025

  • test(events): test purge disable
  • feat(infiniband): use events db to persist ibstat check thresholds

gyuho added 2 commits February 8, 2025 01:05
This addresses the problem where GPUd forgets the last known ibstat
check threshold during GPUd restarts

Signed-off-by: Gyuho Lee <[email protected]>
@gyuho gyuho self-assigned this Feb 7, 2025
@gyuho gyuho changed the title events db for recording cp events feat(infiniband): use events db to persist ibstat check thresholds Feb 7, 2025
@gyuho gyuho added the wip - do not merge working in progress label Feb 7, 2025
Copy link

codecov bot commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 80.32787% with 12 lines in your changes missing coverage. Please review.

Project coverage is 23.01%. Comparing base (ab3785a) to head (1a5499a).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...celerator/nvidia/infiniband/threshold/threshold.go 80.32% 8 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #370      +/-   ##
==========================================
+ Coverage   22.94%   23.01%   +0.07%     
==========================================
  Files         289      290       +1     
  Lines       25883    25944      +61     
==========================================
+ Hits         5938     5971      +33     
- Misses      19335    19364      +29     
+ Partials      610      609       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


// Updates the current threshold, if and only if the current threshold is not found
// or the new threshold is different from the current threshold.
func (s *Store) CompareAndSet(ctx context.Context, threshold Threshold) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use the write lock for the entire function.

)

// Defines the minimum number of ports and the expected rate in Gb/sec.
type Threshold struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we support syncing node hardware spec like ib threshold from server? /cc @cardyok

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

case "updateConfig":
if payload.UpdateConfig != nil {
for componentName, value := range payload.UpdateConfig {
log.Logger.Infow("Update config received for component", "component", componentName, "config", value)
switch componentName {
case nvidia_infiniband_id.Name:
var updateCfg nvidia_infiniband.ExpectedPortStates
if err := json.Unmarshal([]byte(value), &updateCfg); err != nil {
log.Logger.Warnw("failed to unmarshal update config", "error", err)
} else {
nvidia_infiniband.SetDefaultExpectedPortStates(updateCfg)
}
default:

(Current issue is this update will reset on GPUd restart)

return &th, nil
}

func (th Threshold) Event(time time.Time) (components.Event, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious that why not store these data to a separate file which stores all node hardware spec or node related thresholds?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can do that as well (separate table might be easier, since we already have all the persistence via sqlite file)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but it would be better to use a text file to store this configuration, as it would be more maintainable.

@gyuho
Copy link
Collaborator Author

gyuho commented Feb 9, 2025

Closing for now

@gyuho gyuho closed this Feb 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wip - do not merge working in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants