Dynamic Notification Policy Routes #1800

Open
wants to merge 33 commits into
base: master
from

Conversation

msvechla

See #1789 for the related feature proposal.

This implementation is still a work in progress and is intended to support the feature proposal.

Updates on GrafanaNotificationPolicyRoutes are tracked via watches. I first tried to accomplish this via ownerReferences (see commits), but this is not supported cross-namespace.


Example of how the sample policies in hack/kind/resources/default/grafana-notification-policy.yaml are rendered:

[Screenshot: CleanShot 2024-12-18 at 11 58 53]

Status updates on the GrafanaNotificationPolicy

❯ kubectl get grafananotificationpolicy grafananotificationpolicy-sample -o jsonpath="{.status}" |jq
{
  "conditions": [
    {
      "lastTransitionTime": "2024-12-17T15:52:31Z",
      "message": "Notification Policy was successfully applied to 1 instances",
      "observedGeneration": 9,
      "reason": "ApplySuccessful",
      "status": "True",
      "type": "NotificationPolicySynchronized"
    }
  ],
  "discoveredRoutes": [
    "grafana-crds/dynamic-c (priority: 1)",
    "default/dynamic-d (priority: 2)",
    "default/dynamic-e (priority: nil)"
  ]
}

Events emitted for the merged routes:

❯ kubectl get events

LAST SEEN   TYPE     REASON   OBJECT                                     MESSAGE
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-c   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
3s          Normal   Merged   grafananotificationpolicyroute/dynamic-d   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
3s          Normal   Merged   grafananotificationpolicyroute/dynamic-e   Route merged into NotificationPolicy default/grafananotificationpolicy-sample

Adapted PROJECT to new go module version path and ran:

```
./bin/operator-sdk create api --group grafana --version v1beta1 --kind GrafanaNotificationPolicyRoute --controller false
```
- during the reconcile loop in notificationpolicy_controller.go, we have
  to fetch all matching GrafanaNotificationPolicyRoutes for the currently reconciled GrafanaNotificationPolicy
- this can be very easily achieved with a routeSelector, which will be a Kubernetes LabelSelector
- if we went with instanceSelector, we would have to fetch all available
  GrafanaNotificationPolicyRoutes and then filter them afterwards
  to see if the instanceSelector matches, which would be both less efficient and more complex
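
The matching that a routeSelector performs can be sketched as follows. This is a minimal illustration, not the operator's code: the `Route` type, `matchLabels`, and `selectRoutes` are hypothetical names, only the `matchLabels` part of a Kubernetes LabelSelector is modelled, and the label `dynamic: "true"` is made up for the example. The real controller would presumably pass the selector to the list call instead of filtering in memory.

```go
package main

import "fmt"

// Route is a hypothetical, simplified stand-in for a GrafanaNotificationPolicyRoute.
type Route struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// matchLabels reports whether every key/value pair of the selector is present
// in labels. An empty selector matches every label set.
func matchLabels(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// selectRoutes returns the routes whose labels satisfy the selector.
func selectRoutes(selector map[string]string, routes []Route) []Route {
	var out []Route
	for _, r := range routes {
		if matchLabels(selector, r.Labels) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	routes := []Route{
		{Namespace: "grafana-crds", Name: "dynamic-c", Labels: map[string]string{"dynamic": "true"}},
		{Namespace: "default", Name: "static", Labels: map[string]string{"dynamic": "false"}},
	}
	for _, r := range selectRoutes(map[string]string{"dynamic": "true"}, routes) {
		fmt.Printf("%s/%s\n", r.Namespace, r.Name)
	}
}
```

Because the selector lives on the policy side, the lookup is a single label query per reconciled policy, which is the efficiency argument made in the list above.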

The GrafanaNotificationPolicy controller now watches
GrafanaNotificationPolicyRoutes instead of using ownerReferences, as
ownerReferences do not support cross-namespace references.

We now also emit an event on the GrafanaNotificationPolicyRoute to
indicate that it has been merged into a specific policy.
@msvechla msvechla changed the title Notification policy routes impl Dynamic Notification Policy Routes Dec 18, 2024

This PR hasn't been updated for a while, marking as stale

@github-actions github-actions bot added the stale label Jan 20, 2025
@github-actions github-actions bot added the documentation Issues relating to documentation, missing, non-clear etc. label Jan 21, 2025
@msvechla
Author

I have updated the draft PR with the latest discussions from #1789:

  • RouteSelector has been moved to the Route object, so routes can now be injected dynamically at every level
  • Logic to detect reference loops has been implemented
  • Tests have been added accordingly
  • Priority logic has been removed

So far there are no checks that RouteSelector and Routes are mutually exclusive.
Unfortunately, I don't think we can implement this validation at the OpenAPI level, because the Routes attribute uses // +kubebuilder:validation:Schemaless, which disables schema validation for that field.

Another idea would be to solve this by implementing a ValidationWebhook. @theSuess do you have any thoughts on this?

Results of the updated samples

Assembled hack/kind/resources/default/grafana-notification-policy.yaml:

[Screenshot: CleanShot 2025-01-21 at 11 06 27]

❮ kubectl get grafananotificationpolicy grafananotificationpolicy-sample -o jsonpath="{.status}" |jq
{
  "conditions": [
    {
      "lastTransitionTime": "2025-01-21T09:48:19Z",
      "message": "Notification Policy was successfully applied to 1 instances",
      "observedGeneration": 1,
      "reason": "ApplySuccessful",
      "status": "True",
      "type": "NotificationPolicySynchronized"
    }
  ],
  "discoveredRoutes": [
    "default/dynamic-e",
    "grafana-crds/dynamic-c",
    "default/dynamic-d"
  ]
}
❯ kubectl get events

LAST SEEN   TYPE     REASON   OBJECT                                     MESSAGE
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-d   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-d   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-e   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-e   Route merged into NotificationPolicy default/grafananotificationpolicy-sample

@msvechla msvechla force-pushed the notification_policy_routes_impl branch from 3da8a88 to 95a1fd8 Compare January 21, 2025 10:11
@msvechla
Author

I rebased on the latest master branch.

Additionally, I looked into the ValidationWebhook for ensuring that routeSelector and routes are mutually exclusive.

While implementing the logic is quite straightforward, setting up webhooks is a bit more complex, especially bringing this to the Helm chart.

I can certainly do this if you think it is the way forward. Alternatively, we can find another way to do the validation, or skip it for now.

Any thoughts?

@github-actions github-actions bot removed the stale label Jan 22, 2025
@theSuess
Member

Sorry for the deleted comment, missed the part about api level validation rules in your previous note.

We had some discussions around Validation Webhooks before and came to the conclusion that these are too finicky to get right.

If the validation rules don't work, I'd just go for a status condition on the respective object when both values are set, telling the user that the dynamic matching takes priority and the routes field is ignored.

@msvechla
Author

Unfortunately, as far as I can see, the validation expressions will not work due to the +kubebuilder:validation:Schemaless on the Routes object. I can see if I can find a workaround.

Alternatively, I will go with the hint in the status condition as you suggested 👍

@theSuess
Member

The reason why I added the schemaless validation is to support recursive objects. If there is a cleaner way, I'm all for it!

- adds validation for ensuring routes and routeSelector are mutually
  exclusive
- updates both GrafanaNotificationPolicy and
  GrafanaNotificationPolicyRoute status conditions accordingly
@msvechla
Author

I updated the PR and implemented the validation by updating the status condition on both GrafanaNotificationPolicy and GrafanaNotificationPolicyRoute.

Status:
  Conditions:
    Last Transition Time:  2025-01-22T10:08:13Z
    Message:               Dynamically matched definitions from routeSelector will take precedence and the routes field is ignored
    Observed Generation:   8
    Reason:                BothRoutesAndRouteSelectorSpecified
    Status:                True
    Type:                  RoutesIgnoredDueToRouteSelector
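
A rough sketch of how such a check could produce the condition shown above. The `Route` and `Condition` types are reduced, hypothetical stand-ins, and whether the real check descends into nested routes is an assumption of this sketch; only the condition type, reason, and message strings are taken from the status output.

```go
package main

import "fmt"

// Route is a hypothetical, reduced form of the route spec.
type Route struct {
	RouteSelector map[string]string // nil when no selector is set
	Routes        []Route
}

// Condition is a hypothetical, reduced form of a Kubernetes status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// isMutuallyExclusive reports whether, at every level, at most one of
// RouteSelector and Routes is set. Recursing into children is an assumption.
func isMutuallyExclusive(r Route) bool {
	if r.RouteSelector != nil && len(r.Routes) > 0 {
		return false
	}
	for _, child := range r.Routes {
		if !isMutuallyExclusive(child) {
			return false
		}
	}
	return true
}

// routesIgnoredCondition returns the warning condition when both fields are
// set somewhere in the tree, and nil otherwise.
func routesIgnoredCondition(r Route) *Condition {
	if isMutuallyExclusive(r) {
		return nil
	}
	return &Condition{
		Type:    "RoutesIgnoredDueToRouteSelector",
		Status:  "True",
		Reason:  "BothRoutesAndRouteSelectorSpecified",
		Message: "Dynamically matched definitions from routeSelector will take precedence and the routes field is ignored",
	}
}

func main() {
	both := Route{
		RouteSelector: map[string]string{"dynamic": "true"},
		Routes:        []Route{{}},
	}
	if c := routesIgnoredCondition(both); c != nil {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```

Unlike a validation webhook, this approach accepts the object and reports the conflict afterwards, so users have to inspect the status to notice that their routes field is being ignored.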

}
assembledNotificationPolicy, mergedRoutes, err = assembleNotificationPolicyRoutes(ctx, r.Client, namespace, assembledNotificationPolicy)
r.Log.Info("assembled notification policy routes", "mergedRoutes", mergedRoutes)
if err != nil {
Author

How should we handle errors here, e.g. due to a loop in the matched routes? Is it fine as is, simply returning an error? Alternatively, we could update the status condition, similar to applyErrors := make(map[string]string) below, and stop reconciling.

Looking at the code below, it looks like the status condition is never updated when errors occur. I'm not sure whether I am misreading the code or this is on purpose:

	if len(applyErrors) > 0 {
		return ctrl.Result{}, fmt.Errorf("failed to apply to all instances: %v", applyErrors)
	}

	// ...
	condition := buildSynchronizedCondition("Notification Policy", conditionNotificationPolicySynchronized, notificationPolicy.Generation, applyErrors, len(instances))
	meta.SetStatusCondition(&notificationPolicy.Status.Conditions, condition)

Looks like this has been adapted here: #1815

Any thoughts @theSuess ?

Member

Returning an error doesn't make sense here, as it would cause the operator to retry the same resource. Assuming nothing changed, it'll just fail again. My preferred outcome is to note the error in the conditions.

Status conditions are applied through the defer function here

Author

Maybe I'm missing something, but the errors are never set on the condition, because the controller always returns when there are errors and never reaches the code to set the error condition:

see:

if len(applyErrors) > 0 {
    return ctrl.Result{}, fmt.Errorf("failed to apply to all instances: %v", applyErrors)
}

Author

I made some modifications here, hopefully that resolves the issue: f924763

Member

@theSuess theSuess left a comment

This looks very good! I've tested it locally and it works as expected. I left some minor comments, but nothing that should require large scale refactoring.

Thanks for also taking care of the documentation right away 💯

Comment on lines +169 to +173
if notificationPolicy.Spec.Route.IsRouteSelectorMutuallyExclusive() {
	meta.RemoveStatusCondition(&notificationPolicy.Status.Conditions, conditionRoutesIgnoredDueToRouteSelector)
} else {
	setRoutesIgnoredDueToRouteSelectorCondition(&notificationPolicy.Status.Conditions, notificationPolicy.Generation)
}
Member

Is there anything preventing this check from occurring before applying the notification policy? I'd like to avoid half-applied notification policies

Author

We said that in case both are specified, we simply overwrite what is specified in Routes, so I think it's not an issue if we update the condition at the end. There should be no half-applied notification policies as far as I can see.

@msvechla msvechla force-pushed the notification_policy_routes_impl branch from 733f358 to 942a0b5 Compare January 24, 2025 14:47
@msvechla msvechla marked this pull request as ready for review January 24, 2025 14:51
@msvechla msvechla requested a review from theSuess January 24, 2025 14:51