Managed Infrastructure Maintenance Operator - Milestone 1 #3571
Conversation
Please rebase the pull request.
I've started a review, and reached my ingestion limit. I'll keep reviewing later.
if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict {
	err.StatusCode = http.StatusPreconditionFailed
}
Why are we overwriting the HTTP status code?
It looks to me like it's because of line 143. We're saying that in case of a conflict we want to change the status to one that will make the cosmosdb Retry function retry the request. If that's the case, I think a comment here would be helpful in case the behaviour of the functions that use the cosmosdb Retry function changes in the future.
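For instance, a minimal sketch of the suggested comment on that block; the reasoning in the comment is an assumption drawn from this thread, not confirmed by the author:

if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict {
	// Rewrite 409 Conflict as 412 Precondition Failed so that the generic cosmosdb
	// retry helper treats the error as retryable instead of surfacing the conflict.
	// (Assumption based on the review discussion above.)
	err.StatusCode = http.StatusPreconditionFailed
}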
A lot of the code here is copy-pasted; I think we should open follow-up work to improve this consistently.
pkg/mimo/actuator/manager.go (outdated)
		docs, err := i.Next(ctx, -1)
		if err != nil {
			return false, err
		}
		if docs == nil {
			break
		}

		docList = append(docList, docs.MaintenanceManifestDocuments...)
	}

	manifestsToAction := make([]*api.MaintenanceManifestDocument, 0)

	sort.SliceStable(docList, func(i, j int) bool {
		if docList[i].MaintenanceManifest.RunAfter != docList[j].MaintenanceManifest.RunAfter {
			return docList[i].MaintenanceManifest.Priority < docList[j].MaintenanceManifest.Priority
		}

		return docList[i].MaintenanceManifest.RunAfter < docList[j].MaintenanceManifest.RunAfter
	})

	evaluationTime := a.now()

	// Check for manifests that have timed out first
	for _, doc := range docList {
		if evaluationTime.After(time.Unix(int64(doc.MaintenanceManifest.RunBefore), 0)) {
			// timed out, mark as such
			a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC())

			_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
				d.MaintenanceManifest.State = api.MaintenanceManifestStateTimedOut
				d.MaintenanceManifest.StatusText = fmt.Sprintf("timed out at %s", evaluationTime.UTC())
				return nil
			})
			if err != nil {
				a.log.Error(err)
			}
		} else {
			// not timed out, do something about it
			manifestsToAction = append(manifestsToAction, doc)
		}
	}

	// Nothing to do, don't dequeue
	if len(manifestsToAction) == 0 {
		return false, nil
	}

	// Dequeue the document
	oc, err := a.oc.Get(ctx, a.clusterID)
	if err != nil {
		return false, err
	}

	oc, err = a.oc.DoDequeue(ctx, oc)
	if err != nil {
		return false, err // This will include StatusPreconditionFaileds
	}

	taskContext := newTaskContext(a.env, a.log, oc)

	// Execute on the manifests we want to action
	for _, doc := range manifestsToAction {
		// here
		f, ok := a.tasks[doc.MaintenanceManifest.MaintenanceSetID]
		if !ok {
			a.log.Infof("not found %v", doc.MaintenanceManifest.MaintenanceSetID)
			continue
		}

		// Attempt a dequeue
		doc, err = a.mmf.Lease(ctx, a.clusterID, doc.ID)
		if err != nil {
			// log and continue if it doesn't work
			a.log.Error(err)
			continue
		}

		// if we've tried too many times, give up
		if doc.Dequeues > maxDequeueCount {
			err := fmt.Errorf("dequeued %d times, failing", doc.Dequeues)
			_, leaseErr := a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, api.MaintenanceManifestStateTimedOut, to.StringPtr(err.Error()))
			if leaseErr != nil {
				a.log.Error(err)
			}
			continue
		}

		// Perform the task
		state, msg := f(ctx, taskContext, doc, oc)
		_, err = a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, state, &msg)
		if err != nil {
			a.log.Error(err)
		}
	}

	// release the OpenShiftCluster
	_, err = a.oc.EndLease(ctx, a.clusterID, oc.OpenShiftCluster.Properties.ProvisioningState, api.ProvisioningStateMaintenance, nil)
	return true, err
}
Suggest splitting the logic into private funcs to improve readability, something like:
func (a *actuator) Process(ctx context.Context) (bool, error) {
	// Fetch manifests
	manifests, err := a.fetchManifests(ctx)
	if err != nil {
		return false, err
	}

	// Evaluate and segregate manifests
	expiredManifests, actionableManifests := a.evaluateManifests(manifests)

	// Handle expired manifests
	a.handleExpiredManifests(ctx, expiredManifests)

	// If no actionable manifests, return
	if len(actionableManifests) == 0 {
		return false, nil
	}

	// Dequeue the cluster document
	oc, err := a.oc.DequeueCluster(ctx, a.clusterID)
	if err != nil {
		return false, err
	}

	// Execute tasks
	taskContext := newTaskContext(a.env, a.log, oc)
	a.executeTasks(ctx, taskContext, actionableManifests)

	// Release the cluster lease
	return true, a.oc.EndClusterLease(ctx, a.clusterID, oc)
}

func (a *actuator) fetchManifests(ctx context.Context) ([]*api.MaintenanceManifestDocument, error) {
	// Fetch manifests logic here
}

func (a *actuator) evaluateManifests(manifests []*api.MaintenanceManifestDocument) ([]*api.MaintenanceManifestDocument, []*api.MaintenanceManifestDocument) {
	// Evaluation logic here
}

func (a *actuator) handleExpiredManifests(ctx context.Context, expiredManifests []*api.MaintenanceManifestDocument) {
	// Handling expired manifests logic here
}

func (a *actuator) executeTasks(ctx context.Context, taskContext tasks.TaskContext, manifests []*api.MaintenanceManifestDocument) {
	// Task execution logic here
}
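For illustration, a minimal sketch of how one of those stubs could be filled in from the existing timeout loop in Process; the expired/actionable split and the function name follow the suggestion above, and the details are an assumption rather than the author's implementation:

// evaluateManifests splits the fetched manifests into those whose RunBefore deadline
// has already passed (expired) and those that are still actionable.
// Sketch based on the timeout loop in the current Process implementation.
func (a *actuator) evaluateManifests(manifests []*api.MaintenanceManifestDocument) (expired, actionable []*api.MaintenanceManifestDocument) {
	evaluationTime := a.now()
	for _, doc := range manifests {
		if evaluationTime.After(time.Unix(int64(doc.MaintenanceManifest.RunBefore), 0)) {
			expired = append(expired, doc)
		} else {
			actionable = append(actionable, doc)
		}
	}
	return expired, actionable
}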
pkg/mimo/actuator/manager.go (outdated)
// timed out, mark as such
a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC())

_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
Shall we implement retry logic here, just to make the patch action more robust?
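A minimal sketch of what that could look like, assuming a simple bounded retry with backoff around the Patch call; the helper name, attempt count, and backoff are illustrative, not from the PR:

// patchWithRetry is a hypothetical wrapper that retries the manifest patch a few
// times before giving up. The attempt count and backoff are illustrative only.
func (a *actuator) patchWithRetry(ctx context.Context, docID string, mutate func(*api.MaintenanceManifestDocument) error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if _, err = a.mmf.Patch(ctx, a.clusterID, docID, mutate); err == nil {
			return nil
		}
		a.log.Warnf("patch attempt %d failed: %v", attempt+1, err)
		time.Sleep(time.Second << attempt) // simple exponential backoff: 1s, 2s, 4s
	}
	return err
}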
LGTM overall, left some comments for potential improvements, please have a check.
6ab5f5d to e5c05b6
Co-authored-by: Kipp Morris <[email protected]>
e7b9b5e to e93a3f1
LGTM. I had a few comments, but I don't think they're blockers.
pkg/api/mimo.go (outdated)
	MaintenanceManifestStatePending    MaintenanceManifestState = "Pending"
	MaintenanceManifestStateInProgress MaintenanceManifestState = "InProgress"
	MaintenanceManifestStateCompleted  MaintenanceManifestState = "Completed"
	MaintenanceManifestStateFailed     MaintenanceManifestState = "Failed"
	MaintenanceManifestStateTimedOut   MaintenanceManifestState = "TimedOut"
	MaintenanceManifestStateCancelled  MaintenanceManifestState = "Cancelled"
We've got these consts in 2 places, can we shrink that down to 1?
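One hedged way to get down to a single source of truth, assuming the second copy lives in a package that is allowed to import pkg/api (the placement and whether this fits the repo's API-versioning conventions are assumptions):

// Hypothetical: in the package that currently redeclares these states, alias the
// canonical pkg/api type and re-export its constants instead of duplicating the strings.
type MaintenanceManifestState = api.MaintenanceManifestState

const (
	MaintenanceManifestStatePending    = api.MaintenanceManifestStatePending
	MaintenanceManifestStateInProgress = api.MaintenanceManifestStateInProgress
	MaintenanceManifestStateCompleted  = api.MaintenanceManifestStateCompleted
	MaintenanceManifestStateFailed     = api.MaintenanceManifestStateFailed
	MaintenanceManifestStateTimedOut   = api.MaintenanceManifestStateTimedOut
	MaintenanceManifestStateCancelled  = api.MaintenanceManifestStateCancelled
)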
pkg/api/mimo.go (outdated)
	StatusText string `json:"statusText,omitempty"`

	MaintenanceSetID string `json:"maintenanceSetID,omitempty"`
	Priority         int    `json:"priority,omitempty"`
We probably need some documentation on what ordering or meaning Priority has. For example, are negative values allowed? Does a higher value have more priority than a lower one? I know there will be code for the priority somewhere, but having it documented alongside the definition would be helpful, I think.
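For example, a doc comment along these lines could sit next to the field; the ordering described is inferred from the sort in the actuator and should be treated as an assumption until the author confirms it:

	// Priority orders manifests that are ready to run: a lower value sorts earlier
	// in the actuator's work list and is therefore actioned first. Whether negative
	// values are meaningful should be stated here as well.
	// (Inferred from the actuator's sort; treat as an assumption.)
	Priority int `json:"priority,omitempty"`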
	LSN      int                    `json:"_lsn,omitempty"`
	Metadata map[string]interface{} `json:"_metadata,omitempty"`

	ClusterResourceID string `json:"clusterResourceID,omitempty"`
What's the difference between ClusterResourceID and ResourceID?
	LeaseOwner   string `json:"leaseOwner,omitempty" deep:"-"`
	LeaseExpires int    `json:"leaseExpires,omitempty" deep:"-"`
	Dequeues     int    `json:"dequeues,omitempty"`
What does Dequeues mean? A count of times the manifest has not been executed?
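Judging from the actuator code above (Dequeues is checked against maxDequeueCount after each Lease), a doc comment along these lines might answer this; the wording is an assumption, not the author's definition:

	// Dequeues counts how many times this manifest has been leased (dequeued) for
	// execution; once it exceeds maxDequeueCount the actuator stops retrying and
	// marks the manifest as timed out.
	// (Inferred from the actuator's lease loop; treat as an assumption.)
	Dequeues int `json:"dequeues,omitempty"`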
Adds the MIMO milestone 1 ready for work in deployment testing
Which issue this PR addresses:
Part of https://issues.redhat.com/browse/ARO-4895.
What this PR does / why we need it:
This PR is the initial feature branch for the MIMO M1 milestone.
Is there any documentation that needs to be updated for this PR?
Yes, see https://issues.redhat.com/browse/ARO-4895.
How do you know this will function as expected in production?
Telemetry, monitoring, and documentation will need to be fleshed out. See https://issues.redhat.com/browse/ARO-4895 for details.