Managed Infrastructure Maintenance Operator - Milestone 1 #3571
Conversation
Please rebase the pull request.
I've started a review, and reached my ingestion limit. I'll keep reviewing later.
if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict {
	err.StatusCode = http.StatusPreconditionFailed
}
Why are we overwriting the HTTP status code?
It looks to me like it's because of line 143. We're saying that in case of a conflict we want to change the status to one that will make the cosmosdb Retry function retry the request. If that's the case, I think a comment here would be helpful in case the behaviour of the functions that use the cosmosdb Retry function changes in the future.
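For instance, a minimal sketch of the suggested comment on that block; the reasoning in the comment is an assumption drawn from this thread, not confirmed by the author:

if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict {
	// Rewrite 409 Conflict as 412 Precondition Failed so that the generic cosmosdb
	// retry helper treats the error as retryable instead of surfacing the conflict.
	// (Assumption based on the review discussion above.)
	err.StatusCode = http.StatusPreconditionFailed
}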
A lot of the code here is copy-pasted; I think we should open follow-up work to improve this consistently.
pkg/mimo/actuator/manager.go (outdated)
		docs, err := i.Next(ctx, -1)
		if err != nil {
			return false, err
		}
		if docs == nil {
			break
		}

		docList = append(docList, docs.MaintenanceManifestDocuments...)
	}

	manifestsToAction := make([]*api.MaintenanceManifestDocument, 0)

	sort.SliceStable(docList, func(i, j int) bool {
		if docList[i].MaintenanceManifest.RunAfter != docList[j].MaintenanceManifest.RunAfter {
			return docList[i].MaintenanceManifest.Priority < docList[j].MaintenanceManifest.Priority
		}

		return docList[i].MaintenanceManifest.RunAfter < docList[j].MaintenanceManifest.RunAfter
	})

	evaluationTime := a.now()

	// Check for manifests that have timed out first
	for _, doc := range docList {
		if evaluationTime.After(time.Unix(int64(doc.MaintenanceManifest.RunBefore), 0)) {
			// timed out, mark as such
			a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC())

			_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
				d.MaintenanceManifest.State = api.MaintenanceManifestStateTimedOut
				d.MaintenanceManifest.StatusText = fmt.Sprintf("timed out at %s", evaluationTime.UTC())
				return nil
			})
			if err != nil {
				a.log.Error(err)
			}
		} else {
			// not timed out, do something about it
			manifestsToAction = append(manifestsToAction, doc)
		}
	}

	// Nothing to do, don't dequeue
	if len(manifestsToAction) == 0 {
		return false, nil
	}

	// Dequeue the document
	oc, err := a.oc.Get(ctx, a.clusterID)
	if err != nil {
		return false, err
	}

	oc, err = a.oc.DoDequeue(ctx, oc)
	if err != nil {
		return false, err // This will include StatusPreconditionFaileds
	}

	taskContext := newTaskContext(a.env, a.log, oc)

	// Execute on the manifests we want to action
	for _, doc := range manifestsToAction {
		// here
		f, ok := a.tasks[doc.MaintenanceManifest.MaintenanceSetID]
		if !ok {
			a.log.Infof("not found %v", doc.MaintenanceManifest.MaintenanceSetID)
			continue
		}

		// Attempt a dequeue
		doc, err = a.mmf.Lease(ctx, a.clusterID, doc.ID)
		if err != nil {
			// log and continue if it doesn't work
			a.log.Error(err)
			continue
		}

		// if we've tried too many times, give up
		if doc.Dequeues > maxDequeueCount {
			err := fmt.Errorf("dequeued %d times, failing", doc.Dequeues)
			_, leaseErr := a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, api.MaintenanceManifestStateTimedOut, to.StringPtr(err.Error()))
			if leaseErr != nil {
				a.log.Error(err)
			}
			continue
		}

		// Perform the task
		state, msg := f(ctx, taskContext, doc, oc)
		_, err = a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, state, &msg)
		if err != nil {
			a.log.Error(err)
		}
	}

	// release the OpenShiftCluster
	_, err = a.oc.EndLease(ctx, a.clusterID, oc.OpenShiftCluster.Properties.ProvisioningState, api.ProvisioningStateMaintenance, nil)
	return true, err
}
Suggest splitting the logic into private funcs to improve readability, something like:
func (a *actuator) Process(ctx context.Context) (bool, error) {
	// Fetch manifests
	manifests, err := a.fetchManifests(ctx)
	if err != nil {
		return false, err
	}

	// Evaluate and segregate manifests
	expiredManifests, actionableManifests := a.evaluateManifests(manifests)

	// Handle expired manifests
	a.handleExpiredManifests(ctx, expiredManifests)

	// If no actionable manifests, return
	if len(actionableManifests) == 0 {
		return false, nil
	}

	// Dequeue the cluster document
	oc, err := a.oc.DequeueCluster(ctx, a.clusterID)
	if err != nil {
		return false, err
	}

	// Execute tasks
	taskContext := newTaskContext(a.env, a.log, oc)
	a.executeTasks(ctx, taskContext, actionableManifests)

	// Release the cluster lease
	return true, a.oc.EndClusterLease(ctx, a.clusterID, oc)
}

func (a *actuator) fetchManifests(ctx context.Context) ([]*api.MaintenanceManifestDocument, error) {
	// Fetch manifests logic here
}

func (a *actuator) evaluateManifests(manifests []*api.MaintenanceManifestDocument) ([]*api.MaintenanceManifestDocument, []*api.MaintenanceManifestDocument) {
	// Evaluation logic here
}

func (a *actuator) handleExpiredManifests(ctx context.Context, expiredManifests []*api.MaintenanceManifestDocument) {
	// Handling expired manifests logic here
}

func (a *actuator) executeTasks(ctx context.Context, taskContext tasks.TaskContext, manifests []*api.MaintenanceManifestDocument) {
	// Task execution logic here
}
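For illustration, a minimal sketch of how one of those stubs could be filled in from the existing timeout loop in Process; the expired/actionable split and the function name follow the suggestion above, and the details are an assumption rather than the author's implementation:

// evaluateManifests splits the fetched manifests into those whose RunBefore deadline
// has already passed (expired) and those that are still actionable.
// Sketch based on the timeout loop in the current Process implementation.
func (a *actuator) evaluateManifests(manifests []*api.MaintenanceManifestDocument) (expired, actionable []*api.MaintenanceManifestDocument) {
	evaluationTime := a.now()
	for _, doc := range manifests {
		if evaluationTime.After(time.Unix(int64(doc.MaintenanceManifest.RunBefore), 0)) {
			expired = append(expired, doc)
		} else {
			actionable = append(actionable, doc)
		}
	}
	return expired, actionable
}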
pkg/mimo/actuator/manager.go (outdated)
// timed out, mark as such
a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC())

_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
Shall we implement retry logic here, just to make the patch action more robust?
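A minimal sketch of what that could look like, assuming a simple bounded retry with backoff around the Patch call; the helper name, attempt count, and backoff are illustrative, not from the PR:

// patchWithRetry is a hypothetical wrapper that retries the manifest patch a few
// times before giving up. The attempt count and backoff are illustrative only.
func (a *actuator) patchWithRetry(ctx context.Context, docID string, mutate func(*api.MaintenanceManifestDocument) error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if _, err = a.mmf.Patch(ctx, a.clusterID, docID, mutate); err == nil {
			return nil
		}
		a.log.Warnf("patch attempt %d failed: %v", attempt+1, err)
		time.Sleep(time.Second << attempt) // simple exponential backoff: 1s, 2s, 4s
	}
	return err
}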
LGTM overall, left some comments for potential improvements, please have a check.
6ab5f5d to e5c05b6
Co-authored-by: Kipp Morris <[email protected]>
e7b9b5e to e93a3f1
LGTM. I had a few comments, but I don't think they're blockers.
pkg/api/mimo.go (outdated)
	MaintenanceManifestStatePending    MaintenanceManifestState = "Pending"
	MaintenanceManifestStateInProgress MaintenanceManifestState = "InProgress"
	MaintenanceManifestStateCompleted  MaintenanceManifestState = "Completed"
	MaintenanceManifestStateFailed     MaintenanceManifestState = "Failed"
	MaintenanceManifestStateTimedOut   MaintenanceManifestState = "TimedOut"
	MaintenanceManifestStateCancelled  MaintenanceManifestState = "Cancelled"
We've got these consts in 2 places, can we shrink that down to 1?
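One hedged way to get down to a single source of truth, assuming the second copy lives in a package that is allowed to import pkg/api (the placement and whether this fits the repo's API-versioning conventions are assumptions):

// Hypothetical: in the package that currently redeclares these states, alias the
// canonical pkg/api type and re-export its constants instead of duplicating the strings.
type MaintenanceManifestState = api.MaintenanceManifestState

const (
	MaintenanceManifestStatePending    = api.MaintenanceManifestStatePending
	MaintenanceManifestStateInProgress = api.MaintenanceManifestStateInProgress
	MaintenanceManifestStateCompleted  = api.MaintenanceManifestStateCompleted
	MaintenanceManifestStateFailed     = api.MaintenanceManifestStateFailed
	MaintenanceManifestStateTimedOut   = api.MaintenanceManifestStateTimedOut
	MaintenanceManifestStateCancelled  = api.MaintenanceManifestStateCancelled
)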
pkg/api/mimo.go (outdated)
	StatusText string `json:"statusText,omitempty"`

	MaintenanceSetID string `json:"maintenanceSetID,omitempty"`
	Priority         int    `json:"priority,omitempty"`
We probably need some documentation on what ordering or meaning Priority has. For example, are negative values allowed? Does a higher value have more priority than a lower one? I know there will be code for the priority somewhere, but having it documented alongside the definition would be helpful, I think.
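For example, a doc comment along these lines could sit next to the field; the ordering described is inferred from the sort in the actuator and should be treated as an assumption until the author confirms it:

	// Priority orders manifests that are ready to run: a lower value sorts earlier
	// in the actuator's work list and is therefore actioned first. Whether negative
	// values are meaningful should be stated here as well.
	// (Inferred from the actuator's sort; treat as an assumption.)
	Priority int `json:"priority,omitempty"`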
	LSN      int                    `json:"_lsn,omitempty"`
	Metadata map[string]interface{} `json:"_metadata,omitempty"`

	ClusterResourceID string `json:"clusterResourceID,omitempty"`
What's the difference between ClusterResourceID and ResourceID?
	LeaseOwner   string `json:"leaseOwner,omitempty" deep:"-"`
	LeaseExpires int    `json:"leaseExpires,omitempty" deep:"-"`
	Dequeues     int    `json:"dequeues,omitempty"`
What does Dequeues mean? A count of times the manifest has not been executed?
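Judging from the actuator code above (Dequeues is checked against maxDequeueCount after each Lease), a doc comment along these lines might answer this; the wording is an assumption, not the author's definition:

	// Dequeues counts how many times this manifest has been leased (dequeued) for
	// execution; once it exceeds maxDequeueCount the actuator stops retrying and
	// marks the manifest as timed out.
	// (Inferred from the actuator's lease loop; treat as an assumption.)
	Dequeues int `json:"dequeues,omitempty"`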
Adds the MIMO milestone 1 ready for work in deployment testing
Which issue this PR addresses:
Part of https://issues.redhat.com/browse/ARO-4895.
What this PR does / why we need it:
This PR is the initial feature branch for the MIMO M1 milestone.
Is there any documentation that needs to be updated for this PR?
Yes, see https://issues.redhat.com/browse/ARO-4895.
How do you know this will function as expected in production?
Telemetry, monitoring, and documentation will need to be fleshed out. See https://issues.redhat.com/browse/ARO-4895 for details.