From 0157c097ceec275e77405c6df8712cbdf238d86d Mon Sep 17 00:00:00 2001 From: Adam Gardner Date: Fri, 27 Sep 2024 09:21:48 +1000 Subject: [PATCH] Deployed e82445b with MkDocs version: 1.6.0 --- search/search_index.json | 2 +- view-acceptance-test-results/index.html | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/search/search_index.json b/search/search_index.json index 1ac1d2f..4efb1dc 100755 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Release Validation for DevOps Engineers with Site Reliability Guardian","text":"

In this demo, you take on the role of a Product Manager or DevOps engineer. You are running an application, and wish to enable a new feature.

The application is already instrumented to emit tracing data, using the OpenTelemetry standard. The demo system will be automatically configured to transport that data to Dynatrace for storage and processing.

Your job is to:

To achieve these objectives, you will:

"},{"location":"#a-new-release","title":"A New Release","text":"

Your company utilises feature flags to enable new features. A product manager informs you that they wish to release a new feature.

It is your job to:

"},{"location":"#logical-architecture","title":"Logical Architecture","text":"

Below is the \"flow\" of information and actors during this demo.

This architecture also holds true for other load testing tools (eg. JMeter).

  1. A load test is executed. The HTTP requests are annotated with the standard header values.

  2. Metrics are streamed during the load test (if the load testing tool supports this), or the metrics are sent at the end of the load test.

  3. The load testing tool is responsible for sending an event to signal \"test is finished\". Integrators are responsible for crafting this event to contain any important information required by Dynatrace such as the test duration.

  4. A workflow is triggered on receipt of this event. The workflow triggers the Site Reliability Guardian.

  5. The Site Reliability Guardian processes the load testing metrics to provide an automated load testing report. This can be used for information only, or as an automated \"go / no go\" decision point.

  6. Dynatrace users can view the results in a dashboard, notebook or use the result as a trigger for further automated workflows.

  7. Integrators have the choice to send (emit) the results to an external tool. This external tool can then use this result. One example would be sending the SRG result to Jenkins to progress or prevent a deployment.

"},{"location":"#compatibility","title":"Compatibility","text":"Deployment Tutorial Compatible Dynatrace Managed \u274c Dynatrace SaaS \u2714\ufe0f "},{"location":"automate-srg/","title":"Automate the Site Reliability Guardian","text":"

Site reliability guardians can be automated so they happen whenever you prefer (on demand / on schedule / event based). A Dynatrace workflow is used to achieve this.

In this demo:

Let's plumb that together now.

Sample k6 teardown test finished event

For information only, no action is required.

This is already coded into the demo load test script.

export function teardown() {\n    // Send event at the end of the test\n    let payload = {\n      \"entitySelector\": \"type(SERVICE),entityName.equals(checkoutservice)\",\n      \"eventType\": \"CUSTOM_INFO\",\n      \"properties\": {\n        \"tool\": \"k6\",\n        \"action\": \"test\",\n        \"state\": \"finished\",\n        \"purpose\": `${__ENV.LOAD_TEST_PURPOSE}`,\n        \"duration\": test_duration\n      },\n      \"title\": \"k6 load test finished\"\n    }\n\n    let res = http.post(`${__ENV.K6_DYNATRACE_URL}/api/v2/events/ingest`, JSON.stringify(payload), post_params);\n}\n
"},{"location":"automate-srg/#create-a-workflow-to-trigger-guardian","title":"Create a Workflow to Trigger Guardian","text":"

Ensure you are still on the Three golden signals (checkoutservice) screen.

event.type == \"CUSTOM_INFO\" and\ndt.entity.service.name == \"checkoutservice\" and\ntool == \"k6\" and\naction == \"test\" and\nstate == \"finished\"\n
now-{{ event()['duration'] }}\n

The UI will change this to now-event.duration.
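For reference, this trigger matches the CUSTOM_INFO event emitted by the k6 teardown code shown earlier. A minimal, illustrative sketch of such an event as ingested (field layout is an assumption; dt.entity.service.name is resolved by Dynatrace from the event's entitySelector, and the duration value is a placeholder):

```json
{
  "eventType": "CUSTOM_INFO",
  "title": "k6 load test finished",
  "properties": {
    "dt.entity.service.name": "checkoutservice",
    "tool": "k6",
    "action": "test",
    "state": "finished",
    "duration": "3m"
  }
}
```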

"},{"location":"automate-srg/#workflow-created","title":"Workflow Created","text":"

The workflow is now created and connected to the guardian. It will be triggered whenever the platform receives an event like below.

The workflow is now live and listening for events.

"},{"location":"cleanup/","title":"Cleanup","text":"

Go to https://github.com/codespaces and delete the codespace, which will also delete the demo environment.

You may also wish to delete the API token.

"},{"location":"create-srg/","title":"Create Site Reliability Guardian","text":"

Site reliability guardians are a mechanism to automate analysis when changes are made. They can be used in production (on a CRON schedule) or as deployment checks (eg. pre and post deployment health checks, security checks, infrastructure health checks).

We will create a guardian to check the checkoutservice microservice which is used during the purchase journey.

Automate at scale

This process can be automated for at-scale usage using Monaco or Terraform.

"},{"location":"enable-auto-baselines/","title":"Enable Automatic Baselining for Site Reliability Guardian","text":"

Objectives that are set to \"auto baseline\" in Dynatrace Site Reliability Guardians require 5 runs in order to enable the baselines.

In a real scenario, these test runs would likely be spread over hours, days or weeks. This provides Dynatrace with ample time to gather sufficient usage data.

For demo purposes, 5 separate \"load tests\" will be triggered in quick succession to enable the baselining.

First, open a new terminal window and apply the load test script:

kubectl apply -f .devcontainer/k6/k6-load-test-script.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-first-load-test","title":"Trigger the First Load Test","text":"
kubectl apply -f .devcontainer/k6/k6-srg-training-run1.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-second-load-test","title":"Trigger the Second Load Test","text":"

Wait a few seconds and trigger the second load test:

kubectl apply -f .devcontainer/k6/k6-srg-training-run2.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-third-load-test","title":"Trigger the Third Load Test","text":"

Wait a few seconds and trigger the third load test:

kubectl apply -f .devcontainer/k6/k6-srg-training-run3.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-fourth-load-test","title":"Trigger the Fourth Load Test","text":"

Wait a few seconds and trigger the fourth load test:

kubectl apply -f .devcontainer/k6/k6-srg-training-run4.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-final-training-load-test","title":"Trigger the Final Training Load Test","text":"

Wait a few seconds and trigger the final (fifth) load test:

kubectl apply -f .devcontainer/k6/k6-srg-training-run5.yaml\n
"},{"location":"enable-auto-baselines/#wait-for-completion","title":"Wait for Completion","text":"

Each load test runs for 1 minute. Run this command to wait for all jobs to complete.

This command will appear to hang until the jobs are done. Be patient. It should take about 2 minutes:

kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n
\u279c /workspaces/obslab-release-validation (main) $ kubectl get jobs\nNAME               STATUS     COMPLETIONS   DURATION   AGE\nk6-training-run1   Complete   1/1           95s        2m2s\nk6-training-run2   Complete   1/1           93s        115s\nk6-training-run3   Complete   1/1           93s        108s\nk6-training-run4   Complete   1/1           90s        100s\nk6-training-run5   Complete   1/1           84s        94s\n
"},{"location":"enable-auto-baselines/#view-completed-training-runs","title":"View Completed Training Runs","text":"

In Dynatrace, go to workflows and select Executions. You should see 5 successful workflow executions:

"},{"location":"enable-auto-baselines/#view-srg-status-using-dql","title":"View SRG Status using DQL","text":"

You can also use this DQL to see the Site Reliability Guardian results in a notebook:

fetch bizevents\n| filter event.provider == \"dynatrace.site.reliability.guardian\"\n| filter event.type == \"guardian.validation.finished\"\n| fieldsKeep guardian.id, validation.id, timestamp, guardian.name, validation.status, validation.summary, validation.from, validation.to\n

"},{"location":"enable-auto-baselines/#view-srg-status-in-the-site-reliability-guardian-app","title":"View SRG Status in the Site Reliability Guardian App","text":"

The SRG results are also available in the Site Reliability Guardian app:

You should see the 5 runs listed:

Training Complete

The automatic baselines for the guardian are now enabled.

You can proceed to use the guardian for \"real\" evaluations.

"},{"location":"enable-change/","title":"8. Make a Change","text":"

A product manager informs you that they're ready to release their new feature. They ask you to enable the feature and run the load test in a dev environment.

They tell you that the new feature is behind a flag called paymentServiceFailure (yes, an obvious name for this demo) and they tell you to change the defaultValue from off to on.

"},{"location":"enable-change/#update-the-feature-flag-and-inform-dynatrce","title":"Update the Feature Flag and Inform Dynatrace","text":"

Run the following script, which notifies Dynatrace of the change (including the new value) using a CUSTOM_INFO event:

./runtimeChange.sh paymentServiceFailure on\n
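For illustration only, the payload such a script might post to the Dynatrace events ingest API could look like the sketch below. This is a hypothetical reconstruction (the real runtimeChange.sh ships with the demo repository); the payload shape mirrors the k6 teardown example earlier in this demo, and the property names beyond tool/action are assumptions:

```python
import json

def build_flag_change_event(flag, value):
    # Hypothetical sketch of a CUSTOM_INFO configuration-change event,
    # targeting the affected service, for POST /api/v2/events/ingest.
    # Property names other than tool/action are illustrative, not the
    # actual fields sent by runtimeChange.sh.
    return {
        "entitySelector": "type(SERVICE),entityName.equals(paymentservice)",
        "eventType": "CUSTOM_INFO",
        "title": f"Feature flag {flag} set to {value}",
        "properties": {
            "tool": "runtimeChange.sh",
            "action": "configuration-change",
            "flag": flag,
            "value": value,
        },
    }

payload = build_flag_change_event("paymentServiceFailure", "on")
print(json.dumps(payload, indent=2))
```

In the real script, a payload like this would be posted to the environment's /api/v2/events/ingest endpoint using the API token created during setup.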
"},{"location":"enable-change/#change-flag-value","title":"Change Flag Value","text":"

Locate the flags.yaml file. Change the defaultValue of the paymentServiceFailure flag from \"off\" to \"on\" (line 84).

Apply those changes:

kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml\n

You should see:

configmap/my-otel-demo-flagd-config configured\n
"},{"location":"enable-change/#run-acceptance-load-test","title":"Run Acceptance Load Test","text":"

It is time to run an acceptance load test to see if the new feature has caused a regression.

This load test will run for 3 minutes and then trigger the site reliability guardian again:

kubectl apply -f .devcontainer/k6/k6-after-change.yaml\n
"},{"location":"enable-change/#configuration-change-events","title":"Configuration Change Events","text":"

While you are waiting for the load test to complete, it is worth noting that each time a feature flag is changed, you should execute the runtimeChange.sh shell script to send an event to the affected service.

The feature flag changes the behaviour of the paymentservice (which the checkoutservice depends on).

Look at the paymentservice and notice the configuration changed events.

Tip

You can send events for anything you like: deployments, load tests, security scans, configuration changes and more.

"},{"location":"getting-started/","title":"Getting Started","text":"

You must have the following to use this hands-on demo.

"},{"location":"getting-started/#format-dynatrace-environment-url","title":"Format Dynatrace Environment URL","text":"

Save the Dynatrace environment URL:

The generic format is:

https://<EnvironmentID>.<Environment>.<URL>\n

For example:

https://abc12345.live.dynatrace.com\n

"},{"location":"getting-started/#create-api-token","title":"Create API Token","text":"

In Dynatrace:

"},{"location":"getting-started/#start-demo","title":"Start Demo","text":"

Click this button to open the demo environment. This will open in a new tab.

"},{"location":"resources/","title":"Resources","text":" "},{"location":"run-production-srg/","title":"7. Run a Production SRG","text":"

Preparation Complete

The preparation phase is now complete. Everything before now is a one-off task.

In day-to-day operations, you would begin from here.

"},{"location":"run-production-srg/#run-an-evaluation","title":"Run an Evaluation","text":"

Now that the Site Reliability Guardian is trained, run another evaluation by triggering a load test.

Tip

Remember, the workflow is currently configured to listen for test finished events, but you could easily create additional workflows with different triggers, such as on-demand or time-based CRON triggers.

This provides the ability to continuously test your service (eg. in production).

Run another load test to trigger a sixth evaluation.

kubectl apply -f .devcontainer/k6/k6.yaml\n

Again, wait for all jobs to complete. This run will take longer: approximately 2 minutes.

kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n

When the above command returns, you should see:

NAME               STATUS     COMPLETIONS   DURATION   AGE\nk6-training-run1   Complete   1/1           102s       9m41s\nk6-training-run2   Complete   1/1           100s       9m33s\nk6-training-run3   Complete   1/1           101s       9m23s\nk6-training-run4   Complete   1/1           93s        9m17s\nk6-training-run5   Complete   1/1           91s        9m11s\nrun-k6             Complete   1/1           79s        81s\n

When this evaluation is completed, click the Refresh button in the Validation history panel of the Site Reliability Guardian app (when viewing an individual guardian) and the heatmap should look like the image below.

Your results may vary

Your results may vary. In this example below, the Traffic objective failed because the auto-adaptive thresholds detected that a traffic level below 1171 requests is too low and the actual traffic level was 1158.

Because one objective failed, the guardian failed.

5 training runs and 1 \"real\" run:

Information Only Objectives

It is possible to add objectives that are \"informational only\" and do not contribute to the pass / fail decisions.

This is useful for new services where you are trying to \"get a feel for\" the real-world data values of your metrics.

To set an objective as \"information only\": * Select the objective to open the side panel * Scroll down to Define thresholds * Select the No thresholds option

"},{"location":"validate-telemetry/","title":"Start The Demo","text":"

After the codespace has started, the post creation script should begin. This will install everything and will take a few moments.

When the script has completed, a success message will briefly be displayed (it is so quick you'll probably miss it) and an empty terminal window will be shown.

"},{"location":"validate-telemetry/#wait-for-demo-to-start","title":"Wait For Demo to Start","text":"

Wait for the demo application pods to start:

kubectl -n default wait --for=condition=Ready --all --timeout 300s pod\n
"},{"location":"validate-telemetry/#access-demo-user-interface","title":"Access Demo User Interface","text":"

Start port forwarding to access the user interface:

kubectl -n default port-forward svc/my-otel-demo-frontendproxy 8080\n

Leave this command running. Open a new terminal window to run any other commands.

Go to the Ports tab, right click the demo app entry and choose Open in browser.

You should see the OpenTelemetry demo:

"},{"location":"validate-telemetry/#validate-telemetry","title":"Validate Telemetry","text":"

It is time to ensure telemetry is flowing correctly into Dynatrace.

In Dynatrace, follow these steps:

"},{"location":"validate-telemetry/#validate-services","title":"Validate Services","text":""},{"location":"validate-telemetry/#validate-traces","title":"Validate Traces","text":""},{"location":"validate-telemetry/#validate-metrics","title":"Validate Metrics","text":""},{"location":"validate-telemetry/#validate-logs","title":"Validate Logs","text":"
fetch logs, scanLimitGBytes: 1\n| filter contains(content, \"conversion\")\n
"},{"location":"validate-telemetry/#telemetry-flowing","title":"Telemetry Flowing?","text":"

If these four things are OK, your telemetry is flowing correctly into Dynatrace.

If not, please search for similar problems and / or raise an issue here.

"},{"location":"view-acceptance-test-results/","title":"9. View Acceptance Test Results","text":""},{"location":"view-acceptance-test-results/#view-data","title":"View Data","text":"

Wait for all jobs to complete:

kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n

All jobs (including the acceptance-load-test) should now be Complete.

Refresh the Site Reliability Guardian results heatmap again and notice that the guardian has failed.

The guardian has failed due to the error rate being too high.

Navigating to the checkoutservice (ctrl + k > services > checkoutservice), you can see the increase in failure rate.

Scroll down the services screen until you see the OpenTelemetry traces list. Notice lots of failed requests:

"},{"location":"view-acceptance-test-results/#analyse-a-failed-request","title":"Analyse a Failed Request","text":"

Drill into one of the failed requests and notice lots of failures.

These failures are bubbling up through the request chain back towards the checkoutservice.

Ultimately though, the failure comes from the final span in the trace: The call to PaymentService/Charge.

Investigating the span events, the cause of the failure becomes clear: the payment service caused an exception. The exception message and stack trace are given:

exception.message   PaymentService Fail Feature Flag Enabled\nexception.stacktrace    Error: PaymentService Fail Feature Flag Enabled at module.exports.charge\n  (/usr/src/app/charge.js:21:11) at process.processTicksAndRejections\n  (node:internal/process/task_queues:95:5) at async Object.chargeServiceHandler\n  [as charge] (/usr/src/app/index.js:21:22)\nexception.type  Error\n

"},{"location":"view-acceptance-test-results/#roll-back-change","title":"Roll Back Change","text":"

Inform Dynatrace that a change in configuration is coming: the paymentServiceFailure flag will be set to off.

./runtimeChange.sh paymentServiceFailure off\n

Again, edit flags.yaml and set the defaultValue of paymentServiceFailure from \"on\" to \"off\" (line 84).

Apply the changes:

kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml\n
"},{"location":"view-acceptance-test-results/#summary","title":"Summary","text":"

Looking back at the initial brief, it was your job to:

So how did things turn out?

Works with any metric

The techniques described here work with any metric, from any source.

You are encouraged to use metrics from other devices and sources (such as business related metrics like revenue).

Success

The Dynatrace Platform, Site Reliability Guardian and Workflows have provided visibility and automated change analysis.

"},{"location":"whats-next/","title":"What's Next?","text":"

Content about how the user progresses after this demo.

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Release Validation for DevOps Engineers with Site Reliability Guardian","text":"

In this demo, you take on the role of a Product Manager or DevOps engineer. You are running an application, and wish to enable a new feature.

The application is already instrumented to emit tracing data, using the OpenTelemetry standard. The demo system will be automatically configured to transport that data to Dynatrace for storage and processing.

Your job is to:

To achieve these objectives, you will:

"},{"location":"#a-new-release","title":"A New Release","text":"

Your company utilises feature flags to enable new features. A product manager informs you that they wish to release a new feature.

It is your job to:

"},{"location":"#logical-architecture","title":"Logical Architecture","text":"

Below is the \"flow\" of information and actors during this demo.

This architecture also holds true for other load testing tools (eg. JMeter).

  1. A load test is executed. The HTTP requests are annotated with the standard header values.

  2. Metrics are streamed during the load test (if the load testing tool supports this), or the metrics are sent at the end of the load test.

  3. The load testing tool is responsible for sending an event to signal \"test is finished\". Integrators are responsible for crafting this event to contain any important information required by Dynatrace such as the test duration.

  4. A workflow is triggered on receipt of this event. The workflow triggers the Site Reliability Guardian.

  5. The Site Reliability Guardian processes the load testing metrics to provide an automated load testing report. This can be used for information only, or as an automated \"go / no go\" decision point.

  6. Dynatrace users can view the results in a dashboard, notebook or use the result as a trigger for further automated workflows.

  7. Integrators have the choice to send (emit) the results to an external tool. This external tool can then use this result. One example would be sending the SRG result to Jenkins to progress or prevent a deployment.

"},{"location":"#compatibility","title":"Compatibility","text":"Deployment Tutorial Compatible Dynatrace Managed \u274c Dynatrace SaaS \u2714\ufe0f "},{"location":"automate-srg/","title":"Automate the Site Reliability Guardian","text":"

Site reliability guardians can be automated so they happen whenever you prefer (on demand / on schedule / event based). A Dynatrace workflow is used to achieve this.

In this demo:

Let's plumb that together now.

Sample k6 teardown test finished event

For information only, no action is required.

This is already coded into the demo load test script.

export function teardown() {\n    // Send event at the end of the test\n    let payload = {\n      \"entitySelector\": \"type(SERVICE),entityName.equals(checkoutservice)\",\n      \"eventType\": \"CUSTOM_INFO\",\n      \"properties\": {\n        \"tool\": \"k6\",\n        \"action\": \"test\",\n        \"state\": \"finished\",\n        \"purpose\": `${__ENV.LOAD_TEST_PURPOSE}`,\n        \"duration\": test_duration\n      },\n      \"title\": \"k6 load test finished\"\n    }\n\n    let res = http.post(`${__ENV.K6_DYNATRACE_URL}/api/v2/events/ingest`, JSON.stringify(payload), post_params);\n}\n
"},{"location":"automate-srg/#create-a-workflow-to-trigger-guardian","title":"Create a Workflow to Trigger Guardian","text":"

Ensure you are still on the Three golden signals (checkoutservice) screen.

event.type == \"CUSTOM_INFO\" and\ndt.entity.service.name == \"checkoutservice\" and\ntool == \"k6\" and\naction == \"test\" and\nstate == \"finished\"\n
now-{{ event()['duration'] }}\n

The UI will change this to now-event.duration.
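For reference, this trigger matches the CUSTOM_INFO event emitted by the k6 teardown code shown earlier. A minimal, illustrative sketch of such an event as ingested (field layout is an assumption; dt.entity.service.name is resolved by Dynatrace from the event's entitySelector, and the duration value is a placeholder):

```json
{
  "eventType": "CUSTOM_INFO",
  "title": "k6 load test finished",
  "properties": {
    "dt.entity.service.name": "checkoutservice",
    "tool": "k6",
    "action": "test",
    "state": "finished",
    "duration": "3m"
  }
}
```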

"},{"location":"automate-srg/#workflow-created","title":"Workflow Created","text":"

The workflow is now created and connected to the guardian. It will be triggered whenever the platform receives an event like below.

The workflow is now live and listening for events.

"},{"location":"cleanup/","title":"Cleanup","text":"

Go to https://github.com/codespaces and delete the codespace, which will also delete the demo environment.

You may also wish to delete the API token.

"},{"location":"create-srg/","title":"Create Site Reliability Guardian","text":"

Site reliability guardians are a mechanism to automate analysis when changes are made. They can be used in production (on a CRON schedule) or as deployment checks (eg. pre and post deployment health checks, security checks, infrastructure health checks).

We will create a guardian to check the checkoutservice microservice which is used during the purchase journey.

Automate at scale

This process can be automated for at-scale usage using Monaco or Terraform.

"},{"location":"enable-auto-baselines/","title":"Enable Automatic Baselining for Site Reliability Guardian","text":"

Objectives that are set to \"auto baseline\" in Dynatrace Site Reliability Guardians require 5 runs in order to enable the baselines.

In a real scenario, these test runs would likely be spread over hours, days or weeks. This provides Dynatrace with ample time to gather sufficient usage data.

For demo purposes, 5 separate \"load tests\" will be triggered in quick succession to enable the baselining.

First, open a new terminal window and apply the load test script:

kubectl apply -f .devcontainer/k6/k6-load-test-script.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-first-load-test","title":"Trigger the First Load Test","text":"
kubectl apply -f .devcontainer/k6/k6-srg-training-run1.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-second-load-test","title":"Trigger the Second Load Test","text":"

Wait a few seconds and trigger the second load test:

kubectl apply -f .devcontainer/k6/k6-srg-training-run2.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-third-load-test","title":"Trigger the Third Load Test","text":"

Wait a few seconds and trigger the third load test:

kubectl apply -f .devcontainer/k6/k6-srg-training-run3.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-fourth-load-test","title":"Trigger the Fourth Load Test","text":"

Wait a few seconds and trigger the fourth load test:

kubectl apply -f .devcontainer/k6/k6-srg-training-run4.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-final-training-load-test","title":"Trigger the Final Training Load Test","text":"

Wait a few seconds and trigger the final (fifth) load test:

kubectl apply -f .devcontainer/k6/k6-srg-training-run5.yaml\n
"},{"location":"enable-auto-baselines/#wait-for-completion","title":"Wait for Completion","text":"

Each load test runs for 1 minute. Run this command to wait for all jobs to complete.

This command will appear to hang until the jobs are done. Be patient. It should take about 2 minutes:

kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n
\u279c /workspaces/obslab-release-validation (main) $ kubectl get jobs\nNAME               STATUS     COMPLETIONS   DURATION   AGE\nk6-training-run1   Complete   1/1           95s        2m2s\nk6-training-run2   Complete   1/1           93s        115s\nk6-training-run3   Complete   1/1           93s        108s\nk6-training-run4   Complete   1/1           90s        100s\nk6-training-run5   Complete   1/1           84s        94s\n
"},{"location":"enable-auto-baselines/#view-completed-training-runs","title":"View Completed Training Runs","text":"

In Dynatrace, go to workflows and select Executions. You should see 5 successful workflow executions:

"},{"location":"enable-auto-baselines/#view-srg-status-using-dql","title":"View SRG Status using DQL","text":"

You can also use this DQL to see the Site Reliability Guardian results in a notebook:

fetch bizevents\n| filter event.provider == \"dynatrace.site.reliability.guardian\"\n| filter event.type == \"guardian.validation.finished\"\n| fieldsKeep guardian.id, validation.id, timestamp, guardian.name, validation.status, validation.summary, validation.from, validation.to\n

"},{"location":"enable-auto-baselines/#view-srg-status-in-the-site-reliability-guardian-app","title":"View SRG Status in the Site Reliability Guardian App","text":"

The SRG results are also available in the Site Reliability Guardian app:

You should see the 5 runs listed:

Training Complete

The automatic baselines for the guardian are now enabled.

You can proceed to use the guardian for \"real\" evaluations.

"},{"location":"enable-change/","title":"8. Make a Change","text":"

A product manager informs you that they're ready to release their new feature. They ask you to enable the feature and run the load test in a dev environment.

They tell you that the new feature is behind a flag called paymentServiceFailure (yes, an obvious name for this demo) and they tell you to change the defaultValue from off to on.

"},{"location":"enable-change/#update-the-feature-flag-and-inform-dynatrce","title":"Update the Feature Flag and Inform Dynatrace","text":"

Run the following script, which notifies Dynatrace of the change (including the new value) using a CUSTOM_INFO event:

./runtimeChange.sh paymentServiceFailure on\n
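For illustration only, the payload such a script might post to the Dynatrace events ingest API could look like the sketch below. This is a hypothetical reconstruction (the real runtimeChange.sh ships with the demo repository); the payload shape mirrors the k6 teardown example earlier in this demo, and the property names beyond tool/action are assumptions:

```python
import json

def build_flag_change_event(flag, value):
    # Hypothetical sketch of a CUSTOM_INFO configuration-change event,
    # targeting the affected service, for POST /api/v2/events/ingest.
    # Property names other than tool/action are illustrative, not the
    # actual fields sent by runtimeChange.sh.
    return {
        "entitySelector": "type(SERVICE),entityName.equals(paymentservice)",
        "eventType": "CUSTOM_INFO",
        "title": f"Feature flag {flag} set to {value}",
        "properties": {
            "tool": "runtimeChange.sh",
            "action": "configuration-change",
            "flag": flag,
            "value": value,
        },
    }

payload = build_flag_change_event("paymentServiceFailure", "on")
print(json.dumps(payload, indent=2))
```

In the real script, a payload like this would be posted to the environment's /api/v2/events/ingest endpoint using the API token created during setup.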
"},{"location":"enable-change/#change-flag-value","title":"Change Flag Value","text":"

Locate the flags.yaml file. Change the defaultValue of the paymentServiceFailure flag from \"off\" to \"on\" (line 84).

Apply those changes:

kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml\n

You should see:

configmap/my-otel-demo-flagd-config configured\n
"},{"location":"enable-change/#run-acceptance-load-test","title":"Run Acceptance Load Test","text":"

It is time to run an acceptance load test to see if the new feature has caused a regression.

This load test will run for 3 minutes and then trigger the site reliability guardian again:

kubectl apply -f .devcontainer/k6/k6-after-change.yaml\n
"},{"location":"enable-change/#configuration-change-events","title":"Configuration Change Events","text":"

While you are waiting for the load test to complete, it is worth noting that each time a feature flag is changed, you should execute the runtimeChange.sh shell script to send an event to the affected service.

The feature flag changes the behaviour of the paymentservice (which the checkoutservice depends on).

Look at the paymentservice and notice the configuration changed events.

Tip

You can send events for anything you like: deployments, load tests, security scans, configuration changes and more.

"},{"location":"getting-started/","title":"Getting Started","text":"

You must have the following to use this hands-on demo.

"},{"location":"getting-started/#format-dynatrace-environment-url","title":"Format Dynatrace Environment URL","text":"

Save the Dynatrace environment URL:

The generic format is:

https://<EnvironmentID>.<Environment>.<URL>\n

For example:

https://abc12345.live.dynatrace.com\n

"},{"location":"getting-started/#create-api-token","title":"Create API Token","text":"

In Dynatrace:

"},{"location":"getting-started/#start-demo","title":"Start Demo","text":"

Click this button to open the demo environment. This will open in a new tab.

"},{"location":"resources/","title":"Resources","text":" "},{"location":"run-production-srg/","title":"7. Run a Production SRG","text":"

Preparation Complete

The preparation phase is now complete. Everything before now is a one-off task.

In day-to-day operations, you would begin from here.

"},{"location":"run-production-srg/#run-an-evaluation","title":"Run an Evaluation","text":"

Now that the Site Reliability Guardian is trained, run another evaluation by triggering a load test.

Tip

Remember, the workflow is currently configured to listen for test finished events, but you could easily create additional workflows with different triggers, such as on-demand or time-based CRON triggers.

This provides the ability to continuously test your service (eg. in production).

Run another load test to trigger a sixth evaluation.

kubectl apply -f .devcontainer/k6/k6.yaml\n

Again, wait for all jobs to complete. This run will take longer: approximately 2 minutes.

kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n

When the above command returns, you should see:

NAME               STATUS     COMPLETIONS   DURATION   AGE\nk6-training-run1   Complete   1/1           102s       9m41s\nk6-training-run2   Complete   1/1           100s       9m33s\nk6-training-run3   Complete   1/1           101s       9m23s\nk6-training-run4   Complete   1/1           93s        9m17s\nk6-training-run5   Complete   1/1           91s        9m11s\nrun-k6             Complete   1/1           79s        81s\n

When this evaluation is completed, click the Refresh button in the Validation history panel of the Site Reliability Guardian app (when viewing an individual guardian) and the heatmap should look like the image below.

Your results may vary

Your results may vary. In this example below, the Traffic objective failed because the auto-adaptive thresholds detected that a traffic level below 1171 requests is too low and the actual traffic level was 1158.

Because one objective failed, the guardian failed.

5 training runs and 1 \"real\" run:

Information Only Objectives

It is possible to add objectives that are \"informational only\" and do not contribute to the pass/fail decision.

This is useful for new services where you are trying to \"get a feel for\" the real-world data values of your metrics.

To set an objective as \"information only\":

* Select the objective to open the side panel
* Scroll down to Define thresholds
* Select the No thresholds option

"},{"location":"validate-telemetry/","title":"Start The Demo","text":"

After the codespace has started, the post-creation script should begin. This will install everything and will take a few moments.

When the script has completed, a success message will be displayed briefly (it is so quick you will probably miss it) and an empty terminal window will be shown.

"},{"location":"validate-telemetry/#wait-for-demo-to-start","title":"Wait For Demo to Start","text":"

Wait for the demo application pods to start:

kubectl -n default wait --for=condition=Ready --all --timeout 300s pod\n
"},{"location":"validate-telemetry/#access-demo-user-interface","title":"Access Demo User Interface","text":"

Start port forwarding to access the user interface:

kubectl -n default port-forward svc/my-otel-demo-frontendproxy 8080\n

Leave this command running. Open a new terminal window to run any other commands.

Go to the Ports tab, right-click the demo app entry and choose Open in browser.

You should see the OpenTelemetry demo:

"},{"location":"validate-telemetry/#validate-telemetry","title":"Validate Telemetry","text":"

It is time to ensure telemetry is flowing correctly into Dynatrace.

In Dynatrace, follow these steps:

"},{"location":"validate-telemetry/#validate-services","title":"Validate Services","text":""},{"location":"validate-telemetry/#validate-traces","title":"Validate Traces","text":""},{"location":"validate-telemetry/#validate-metrics","title":"Validate Metrics","text":""},{"location":"validate-telemetry/#validate-logs","title":"Validate Logs","text":"
fetch logs, scanLimitGBytes: 1\n| filter contains(content, \"conversion\")\n
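If the query above returns nothing, it can help to first confirm that any logs at all are arriving from the demo namespace. The following DQL variant is a sketch: `k8s.namespace.name` is assumed here based on Dynatrace's standard Kubernetes log attributes, so adjust the field name if your environment differs.

```
fetch logs, scanLimitGBytes: 1
| filter k8s.namespace.name == "default"
| sort timestamp desc
| limit 20
```

If this returns records but the `conversion` filter does not, the pipeline is healthy and the specific log line simply has not been emitted yet.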
"},{"location":"validate-telemetry/#telemetry-flowing","title":"Telemetry Flowing?","text":"

If these four things are OK, your telemetry is flowing correctly into Dynatrace.

If not, please search for similar problems and/or raise an issue here.

"},{"location":"view-acceptance-test-results/","title":"9. View Acceptance Test Results","text":""},{"location":"view-acceptance-test-results/#view-data","title":"View Data","text":"

Wait for all jobs to complete:

kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n

All jobs (including the acceptance-load-test) should now be Complete.

Refresh the Site Reliability Guardian results heatmap again and notice that the guardian has failed.

The guardian has failed due to the error rate being too high.

Navigate to the checkoutservice (ctrl + k > services > checkoutservice) to see the increase in failure rate.

Scroll down the service screen until you see the OpenTelemetry traces list. Notice the many failed requests:

"},{"location":"view-acceptance-test-results/#analyse-a-failed-request","title":"Analyse a Failed Request","text":"

Drill into one of the failed requests and notice the many failed spans.

These failures are bubbling up through the request chain back towards the checkoutservice.

Ultimately, though, the failure originates in the final span of the trace: the call to PaymentService/Charge.

Investigating the span events, the cause of the failure becomes clear: the payment service raised an exception. The exception message and stacktrace are given:

exception.message   PaymentService Fail Feature Flag Enabled\nexception.stacktrace    Error: PaymentService Fail Feature Flag Enabled at module.exports.charge\n  (/usr/src/app/charge.js:21:11) at process.processTicksAndRejections\n  (node:internal/process/task_queues:95:5) at async Object.chargeServiceHandler\n  [as charge] (/usr/src/app/index.js:21:22)\nexception.type  Error\n

"},{"location":"view-acceptance-test-results/#roll-back-change","title":"Roll Back Change","text":"

Inform Dynatrace that a configuration change is coming: the paymentServiceFailure flag will be set to off.

./runtimeChange.sh paymentServiceFailure off\n

Again, edit flags.yaml and set the defaultValue of paymentServiceFailure from \"on\" to \"off\" (line 84).

Apply the changes:

kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml\n
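If you prefer not to edit the file by hand, a sed substitution can make the same change. This is a hedged sketch run against a stand-in copy of the file, since the real flags.yaml has many more entries; note the simple pattern below would also match any other flag whose defaultValue is \"on\", so check the result before applying it to the real file.

```shell
# Create a stand-in flags file for illustration only (the real flags.yaml
# lives in the repository and contains many flags; only the flag name below
# matches the demo).
cat > /tmp/flags-demo.yaml <<'EOF'
paymentServiceFailure:
  defaultValue: "on"
EOF

# Flip defaultValue from "on" to "off".
# Caution: this pattern is not anchored to paymentServiceFailure.
sed -i 's/defaultValue: "on"/defaultValue: "off"/' /tmp/flags-demo.yaml
grep defaultValue /tmp/flags-demo.yaml
```

After editing the real file, the `kubectl apply` step above pushes the change to the cluster as usual.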
"},{"location":"view-acceptance-test-results/#summary","title":"Summary","text":"

Looking back at the initial brief, it was your job to:

So how did things turn out?

Works with any metric

The techniques described here work with any metric, from any source.

You are encouraged to use metrics from other devices and sources (such as business-related metrics like revenue).

Success

The Dynatrace Platform, Site Reliability Guardian and Workflows have provided visibility and automated change analysis.

"},{"location":"whats-next/","title":"What's Next?","text":"

Content about how the user progresses after this demo.

"}]} \ No newline at end of file diff --git a/view-acceptance-test-results/index.html b/view-acceptance-test-results/index.html index 1c16226..05c6627 100755 --- a/view-acceptance-test-results/index.html +++ b/view-acceptance-test-results/index.html @@ -681,7 +681,7 @@

Roll Back Change
./runtimeChange.sh paymentServiceFailure off
 

Again edit flags.yaml and set the defaultValue of paymentServiceFailure from "on" to "off" (line 84)

-

Apply the chnages:

+

Apply the changes:

kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml
 

Summary#