feat(infiniband): use events db to persist ibstat check thresholds #370
gyuho commented Feb 7, 2025
- test(events): test purge disable
- feat(infiniband): use events db to persist ibstat check thresholds
Signed-off-by: Gyuho Lee <[email protected]>
This addresses the problem where GPUd forgets the last known ibstat check threshold when GPUd restarts.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #370      +/-   ##
==========================================
+ Coverage   22.94%   23.01%   +0.07%
==========================================
  Files         289      290       +1
  Lines       25883    25944      +61
==========================================
+ Hits         5938     5971      +33
- Misses      19335    19364      +29
+ Partials      610      609       -1

// Updates the current threshold, if and only if the current threshold is not found
// or the new threshold is different from the current threshold.
func (s *Store) CompareAndSet(ctx context.Context, threshold Threshold) error {
It is better to use the write lock for the entire function.
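A minimal sketch of what this suggestion could look like, not the PR's actual implementation. It uses a hypothetical in-memory `memStore` in place of the database-backed `Store`, assumes the field names `AtLeastPorts` and `AtLeastRate` for `Threshold`, and drops the `ctx` parameter because nothing here blocks. The point is that the read of the current value and the write of the new one happen under a single write lock, so two concurrent callers cannot both see "not found" and both write.

```go
package main

import (
	"fmt"
	"sync"
)

// Threshold holds a minimum port count and an expected rate in Gb/sec.
// The field names are assumptions made for this sketch.
type Threshold struct {
	AtLeastPorts int
	AtLeastRate  int
}

// memStore is a hypothetical in-memory stand-in for the PR's Store.
type memStore struct {
	mu      sync.RWMutex
	current *Threshold
}

// CompareAndSet stores th only if no threshold is set yet or th differs
// from the current one. The write lock is held for the whole function, so
// the compare and the set are a single atomic step.
// It reports whether a write happened.
func (s *memStore) CompareAndSet(th Threshold) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.current != nil && *s.current == th {
		return false // unchanged, nothing to write
	}
	s.current = &th
	return true
}

// Latest returns the current threshold, if any. Readers only need the
// read lock.
func (s *memStore) Latest() (Threshold, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()

	if s.current == nil {
		return Threshold{}, false
	}
	return *s.current, true
}

func main() {
	s := &memStore{}
	fmt.Println(s.CompareAndSet(Threshold{AtLeastPorts: 8, AtLeastRate: 400})) // true: nothing stored yet
	fmt.Println(s.CompareAndSet(Threshold{AtLeastPorts: 8, AtLeastRate: 400})) // false: same value
	fmt.Println(s.CompareAndSet(Threshold{AtLeastPorts: 8, AtLeastRate: 200})) // true: value changed
}
```

Taking a read lock for the comparison and upgrading to a write lock afterwards would leave a window between the two in which another caller could change the value, which is the case a function-wide write lock avoids.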
)

// Defines the minimum number of ports and the expected rate in Gb/sec.
type Threshold struct {
Do we support syncing node hardware specs, such as the IB threshold, from the server? /cc @cardyok
Yes
gpud/internal/session/serve.go
Lines 170 to 183 in 9fe57e6
case "updateConfig":
	if payload.UpdateConfig != nil {
		for componentName, value := range payload.UpdateConfig {
			log.Logger.Infow("Update config received for component", "component", componentName, "config", value)
			switch componentName {
			case nvidia_infiniband_id.Name:
				var updateCfg nvidia_infiniband.ExpectedPortStates
				if err := json.Unmarshal([]byte(value), &updateCfg); err != nil {
					log.Logger.Warnw("failed to unmarshal update config", "error", err)
				} else {
					nvidia_infiniband.SetDefaultExpectedPortStates(updateCfg)
				}
			default:
(The current issue is that this update is reset when GPUd restarts.)
	return &th, nil
}

func (th Threshold) Event(time time.Time) (components.Event, error) {
I'm curious: why not store this data in a separate file that holds all node hardware specs and node-related thresholds?
Yes, we can do that as well. A separate table might be easier, since all persistence already goes through the SQLite file.
OK, but it would be better to use a text file to store this configuration, as it would be more maintainable.
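For illustration only, the text-file option could be sketched as below. Nothing here comes from the PR: the `saveThreshold` and `loadThreshold` helpers, the JSON layout, and the `Threshold` field names are all assumptions. The sketch writes to a temporary file and renames it into place, so a crash mid-write does not leave a half-written configuration behind.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// Threshold holds a minimum port count and an expected rate in Gb/sec.
// Field names and JSON keys are assumptions made for this sketch.
type Threshold struct {
	AtLeastPorts int `json:"at_least_ports"`
	AtLeastRate  int `json:"at_least_rate"`
}

// saveThreshold (hypothetical helper) writes th as JSON to path. It writes
// a temp file in the same directory first and then renames it over path.
func saveThreshold(path string, th Threshold) error {
	data, err := json.MarshalIndent(th, "", "  ")
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".threshold-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// loadThreshold (hypothetical helper) reads the threshold back. The bool is
// false when the file does not exist yet, e.g. on the first start.
func loadThreshold(path string) (Threshold, bool, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return Threshold{}, false, nil
	}
	if err != nil {
		return Threshold{}, false, err
	}
	var th Threshold
	if err := json.Unmarshal(data, &th); err != nil {
		return Threshold{}, false, err
	}
	return th, true, nil
}

func main() {
	dir, err := os.MkdirTemp("", "threshold-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "ib-threshold.json")
	if err := saveThreshold(path, Threshold{AtLeastPorts: 8, AtLeastRate: 400}); err != nil {
		panic(err)
	}
	th, found, err := loadThreshold(path)
	fmt.Println(th, found, err)
}
```

A file like this is easy to inspect and edit by hand, which is the maintainability argument above. The trade-off is a second persistence mechanism alongside the existing SQLite file.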
Closing for now