From 0157c097ceec275e77405c6df8712cbdf238d86d Mon Sep 17 00:00:00 2001
From: Adam Gardner
"},{"location":"#compatibility","title":"Compatibility","text":"Deployment Tutorial Compatible Dynatrace Managed \u274c Dynatrace SaaS \u2714\ufe0f
"},{"location":"automate-srg/","title":"Automate the Site Reliability Guardian","text":"
"},{"location":"automate-srg/#create-a-workflow-to-trigger-guardian","title":"Create a Workflow to Trigger Guardian","text":"export function teardown() {\n // Send event at the end of the test\n let payload = {\n \"entitySelector\": \"type(SERVICE),entityName.equals(checkoutservice)\",\n \"eventType\": \"CUSTOM_INFO\",\n \"properties\": {\n \"tool\": \"k6\",\n \"action\": \"test\",\n \"state\": \"finished\",\n \"purpose\": `${__ENV.LOAD_TEST_PURPOSE}`,\n \"duration\": test_duration\n },\n \"title\": \"k6 load test finished\"\n }\n\n let res = http.post(`${__ENV.K6_DYNATRACE_URL}/api/v2/events/ingest`, JSON.stringify(payload), post_params);\n }\n}\n
Three golden signals (checkoutservice)
screen.
Automate
button. This will create a template workflow.event type
from bizevents
to events
.Filter query
to:event.type == \"CUSTOM_INFO\" and\ndt.entity.service.name == \"checkoutservice\" and\ntool == \"k6\" and\naction == \"test\" and\nstate == \"finished\"\n
run_validation
node.event.timeframe.from
and replace with:now-{{ event()['duration'] }}\n
now-event.duration
.
event.timeframe.to
and replace with: now\n
Click the Save
button.
The workflow is now created and connected to the guardian. It will be triggered whenever the platform receives an event like below.
The workflow is now live and listening for events.
Go to https://github.com/codespaces and delete the codespace which will delete the demo environment.
You may also wish to delete the API token.
Site reliability guardians are a mechanism to automate analysis when changes are made. They can be used in production (on a CRON) or as deployment checks (eg. pre and post deployment health checks, security checks, infrastructure health checks).
We will create a guardian to check the checkoutservice
microservice which is used during the purchase journey.
ctrl + k
search for Site Reliability Guardian
and select the app.+ Guardian
to add a new guardian.Four Golden Signals
choose Use template
.Run query
and toggle 50
rows per page to see more services.checkoutservice
. Click Apply to template (1)
.Saturation
objective and delete it (there are no resource statistics from OpenTelemetry available so this objective cannot be evaluated).Three golden signals (checkoutservice)
.Save
Automate at scale
This process can be automated for at-scale usage using Monaco or Terraform.
Objectives that are set to \"auto baseline\" in Dynatrace Site Reliability Guardians require 5
runs in order to enable the baselines.
In a real scenario, these test runs would likely be spread over hours, days or weeks. This provides Dynatrace with ample time to gather sufficient usage data.
For demo purposes, 5 seperate \"load tests\" will be triggered in quick succession to enable the baselining.
First, open a new terminal window and apply the load test script:
kubectl apply -f .devcontainer/k6/k6-load-test-script.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-first-load-test","title":"Trigger the First Load Test","text":"kubectl apply -f .devcontainer/k6/k6-srg-training-run1.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-second-load-test","title":"Trigger the Second Load Test","text":"Wait a few seconds and trigger the second load test:
kubectl apply -f .devcontainer/k6/k6-srg-training-run2.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-third-load-test","title":"Trigger the Third Load Test","text":"Wait a few seconds and trigger the third load test:
kubectl apply -f .devcontainer/k6/k6-srg-training-run3.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-fourth-load-test","title":"Trigger the Fourth Load Test","text":"Wait a few seconds and trigger the fourth load test:
kubectl apply -f .devcontainer/k6/k6-srg-training-run4.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-final-training-load-test","title":"Trigger the Final Training Load Test","text":"Wait a few seconds and trigger the final (fifth) load test:
kubectl apply -f .devcontainer/k6/k6-srg-training-run5.yaml\n
"},{"location":"enable-auto-baselines/#wait-for-completion","title":"Wait for Completion","text":"Each load test runs for 1 minute. Run this command to wait for all jobs to complete.
This command will appear to hang until the jobs are done. Be patient. It should take about 2mins:
kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n
\u279c /workspaces/obslab-release-validation (main) $ kubectl get jobs\nNAME STATUS COMPLETIONS DURATION AGE\nk6-training-run1 Complete 1/1 95s 2m2s\nk6-training-run2 Complete 1/1 93s 115s\nk6-training-run3 Complete 1/1 93s 108s\nk6-training-run4 Complete 1/1 90s 100s\nk6-training-run5 Complete 1/1 84s 94s\n
"},{"location":"enable-auto-baselines/#view-completed-training-runs","title":"View Completed Training Runs","text":"In Dynatrace, go to workflows
and select Executions
. You should see 5 successful workflow executions:
You can also use this DQL to see the Site Reliability Guardian results in a notebook:
fetch bizevents\n| filter event.provider == \"dynatrace.site.reliability.guardian\"\n| filter event.type == \"guardian.validation.finished\"\n| fieldsKeep guardian.id, validation.id, timestamp, guardian.name, validation.status, validation.summary, validation.from, validation.to\n
"},{"location":"enable-auto-baselines/#view-srg-status-in-the-site-reliability-guardian-app","title":"View SRG Status in the Site Reliability Guardian App","text":"The SRG results are also available in the Site Reliabiltiy Guardian app:
ctrl + k
site reliability guardian
or srg
Open
on your guardianYou should see the 5
runs listed:
Training Complete
The automatic baselines for the guardian are now enabled.
You can proceed to use the guardian for \"real\" evaluations.
A product manager informs you that they're ready to release their new feature. They ask you to enable the feature and run the load test in a dev environment.
They tell you that the new feature is behind a flag called paymentServiceFailure
(yes, an obvious name for this demo) and they tell you to change the defaultValue
from off
to on
.
Run the following script which notifies Dynatrace using a CUSTOM_INFO
event of the change inc. the new value.
./runtimeChange.sh paymentServiceFailure on\n
"},{"location":"enable-change/#change-flag-value","title":"Change Flag Value","text":"Locate the flags.yaml
file. Change the defaultValue
of the paymentServiceFailure
flag from \"off\"
to \"on\"
(line 84
).
Apply those changes:
kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml\n
You should see:
configmap/my-otel-demo-flagd-config configured\n
"},{"location":"enable-change/#run-acceptance-load-test","title":"Run Acceptance Load Test","text":"It is time to run an acceptance load test to see if the new feature has caused a regression.
This load test will run for 3 minutes and then trigger the site reliability guardian again:
kubectl apply -f .devcontainer/k6/k6-after-change.yaml\n
"},{"location":"enable-change/#configuration-change-events","title":"Configuration Change Events","text":"While you are waiting for the load test to complete, it is worth noting that each time a feature flag is changed, you should execute runtimeChange.sh
shell script to send an event to the service that is affected.
The feature flag changes the behaviour of the paymentservice
(which the checkoutservice
depends on).
Look at the paymentservice
and notice the configuration changed events.
Tip
You can send event for anything you like: deployments, load tests, security scans, configuration changes and more.
You must have the following to use this hands on demo.
Save the Dynatrace environment URL:
.apps.
in the URLThe generic format is:
https://<EnvironmentID>.<Environment>.<URL>\n
For example:
https://abc12345.live.dynatrace.com\n
"},{"location":"getting-started/#create-api-token","title":"Create API Token","text":"In Dynatrace:
ctrl + k
. Search for access tokens
.metrics.ingest
logs.ingest
events.ingest
openTelemetryTrace.ingest
Click this button to open the demo environment. This will open in a new tab.
Preparation Complete
The preparation phase is now complete. Everything before now is a one-off task.
In day-to-day operations, you would begin from here.
"},{"location":"run-production-srg/#run-an-evaluation","title":"Run an Evaluation","text":"Now that the Site Reliability Guardian is trained, run another evaluation by triggering a load test.
Tip
Remember, the workflow is currently configured to listen for test finished
events but you could easily create additional workflows with different triggers such as on-demand on time-based CRON triggers.
This provides an ability to continuously test your service (eg. in production).
Run another load test to trigger a sixth evaluation.
kubectl apply -f .devcontainer/k6/k6.yaml\n
Again, wait for all jobs to complete. This run will take longer. Approximately 2mins.
kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n
When the above command returns, you should see:
NAME STATUS COMPLETIONS DURATION AGE\nk6-training-run1 Complete 1/1 102s 9m41s\nk6-training-run2 Complete 1/1 100s 9m33s\nk6-training-run3 Complete 1/1 101s 9m23s\nk6-training-run4 Complete 1/1 93s 9m17s\nk6-training-run5 Complete 1/1 91s 9m11s\nrun-k6 Complete 1/1 79s 81s\n
When this evaluation is completed, click the Refresh
button in the Validation history
panel of the site reliability guardian app (when viewing an individual guardian) and the heatmap should look like the image below
Your results may vary
Your results may vary. In this example below, the Traffic
objective failed because the auto-adaptive thresholds detected that a traffic level below 1171
requests is too low and the actual traffic level was 1158
.
Because one objective failed, the guardian failed.
5 training runs and 1 \"real\" run:
Information Only Objectives
It is possible to add objectives that are \"informational only\" and do not contribute to the pass / fail decisions.
This is useful for new services where you are trying to \"get a feel for\" the real-world data values of your metrics.
To set an objective as \"information only\": * Select the objective to open the side panel * Scroll down to Define thresholds
* Select the No thresholds
option
After the codespaces has started, the post creation script should begin. This will install everything and will take a few moments.
When the script has completed, a success message will briefly be displayed (it is so quick you'll probably miss it) and an empty terminal window will be shown.
"},{"location":"validate-telemetry/#wait-for-demo-to-start","title":"Wait For Demo to Start","text":"Wait for the demo application pods to start:
kubectl -n default wait --for=condition=Ready --all --timeout 300s pod\n
"},{"location":"validate-telemetry/#access-demo-user-interface","title":"Access Demo User Interface","text":"Start port forwarding to access the user interface:
kubectl -n default port-forward svc/my-otel-demo-frontendproxy 8080\n
Leave this command running. Open a new terminal window to run any other commands.
Go to ports tab, right click the demo app
entry and choose Open in browser
.
You should see the OpenTelemetry demo:
"},{"location":"validate-telemetry/#validate-telemetry","title":"Validate Telemetry","text":"It is time to ensure telemetry is flowing correctly into Dynatrace.
In Dynatrace, follow these steps:
"},{"location":"validate-telemetry/#validate-services","title":"Validate Services","text":"ctrl + k
. Search for services
. Go to services screen and validate you can see services.SERVICE-****
.CUSTOM_DEVICE-****
:ctrl + k
and search for settings
.Service Detection > Unified services for OpenTelemetry
and ensure the toggle is on.ctrl + k
. Search for distributed traces
.ctrl + k
. Search for metrics
.app.
and validate you can see some metrics.ctrl + k
. Search for notebooks
.+
to add a new DQL
section.fetch logs, scanLimitGBytes: 1\n| filter contains(content, \"conversion\")\n
"},{"location":"validate-telemetry/#telemetry-flowing","title":"Telemetry Flowing?","text":"If these four things are OK, your telemetry is flowing correctly into Dynatrace.
If not, please search for similar problems and / or raise an issue here.
Wait for all jobs to complete:
kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n
All jobs (including the acceptance-load-test
) should now be Complete
.
Refresh the Site Reliability Guardian results heatmap again and notice that the guardian has failed.
The guardian has failed due to the error rate being too high.
Navigating to the checkoutservice
(ctrl + k
> services
> checkoutservice
), you can see the increase in failure rate.
Scroll down the services screen until you see the OpenTelemetry traces list. Notice lots of failed requests:
"},{"location":"view-acceptance-test-results/#analyse-a-failed-request","title":"Analyse a Failed Request","text":"Drill into one of the failed requests and notice lots of failures.
These failures are bubbling up through the request chain back towards the checkoutservice.
Ultimately though, the failure comes from the final span in the trace: The call to PaymentService/Charge
.
Investigating the span events the cause of the failure becomes clear: The payment service cuase an exception. The exception message and stacktrace is given:
exception.message PaymentService Fail Feature Flag Enabled\nexception.stacktrace Error: PaymentService Fail Feature Flag Enabled at module.exports.charge\n (/usr/src/app/charge.js:21:11) at process.processTicksAndRejections\n (node:internal/process/task_queues:95:5) at async Object.chargeServiceHandler\n [as charge] (/usr/src/app/index.js:21:22)\nexception.type Error\n
"},{"location":"view-acceptance-test-results/#roll-back-change","title":"Roll Back Change","text":"Inform Dynatrace that a change in configuration is coming. The paymentServiceFailure
flag will be set to off
./runtimeChange.sh paymentServiceFailure off\n
Again edit flags.yaml
and set the defaultValue
of paymentServiceFailure
from \"on\"
to \"off\"
(line 84
)
Apply the chnages:
kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml\n
"},{"location":"view-acceptance-test-results/#summary","title":"Summary","text":"Looking back at the initial brief, it was your job to:
So how did things turn out?
no go
decision based on evidence provided by OpenTelemetry and the Dynatrace Site Reliability Guardian.Works with any metric
The techniques described here work with any metric, from any source.
You are encouraged to use metrics from other devices and sources (such as business related metrics like revenue).
Success
The Dynatrace Platform, Site Reliability Guardian and Workflows have provided visibility and automated change analysis.
Content about how the user progresses after this demo.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Release Validation for DevOps Engineers with Site Reliability Guardian","text":"In this demo, you take on the role of a Product Manager or DevOps engineer. You are running an application, and wish to enable a new feature.
The application is already instrumented to emit tracing data, using the OpenTelemetry standard. The demo system will be automatically configured to transport that data to Dynatrace for storage and processing.
Your job is to:
To achieve these objectives, you will:
checkoutservice
)Your company utilises feature flags to enable new features. A product manager informs you that they wish to release a new feature.
It is your job to:
Below is the \"flow\" of information and actors during this demo.
This architecture also holds true for other load testing tools (eg. JMeter).
A load test is executed. The HTTP requests are annotated with the standard header values.
Metrics are streamed during the load test (if the load testing tool supports this) or sent at the end of the load test.
The load testing tool is responsible for sending an event to signal \"test is finished\". Integrators are responsible for crafting this event to contain any important information required by Dynatrace such as the test duration.
A workflow is triggered on receipt of this event. The workflow triggers the Site Reliability Guardian.
The Site Reliability Guardian processes the load testing metrics to provide an automated load testing report. This can be used for information only or as an automated \"go / no go\" decision point.
Dynatrace users can view the results in a dashboard, notebook or use the result as a trigger for further automated workflows.
Integrators have the choice to send (emit) the results to an external tool. This external tool can then use this result. One example would be sending the SRG result to Jenkins to progress or prevent a deployment.
Site reliability guardians can be automated so they happen whenever you prefer (on demand / on schedule / event based). A Dynatrace workflow is used to achieve this.
In this demo:
Let's plumb that together now.
Sample k6 teardown test finished event
For information only, no action is required.
This is already coded into the demo load test script.
export function teardown() {\n // Send event at the end of the test\n let payload = {\n \"entitySelector\": \"type(SERVICE),entityName.equals(checkoutservice)\",\n \"eventType\": \"CUSTOM_INFO\",\n \"properties\": {\n \"tool\": \"k6\",\n \"action\": \"test\",\n \"state\": \"finished\",\n \"purpose\": `${__ENV.LOAD_TEST_PURPOSE}`,\n \"duration\": test_duration\n },\n \"title\": \"k6 load test finished\"\n }\n\n let res = http.post(`${__ENV.K6_DYNATRACE_URL}/api/v2/events/ingest`, JSON.stringify(payload), post_params);\n}\n
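For reference, the same \"test finished\" event can be assembled and sent outside of k6 with plain shell. This is a hedged sketch, not part of the demo: `DT_URL` and `DT_API_TOKEN` are placeholder variables standing in for your environment URL and an API token with the `events.ingest` scope.

```shell
# Sketch: build the same "test finished" event payload without k6.
# DT_URL and DT_API_TOKEN are placeholders -- set them before using
# the commented curl line at the bottom.
build_test_finished_event() {
  local purpose="$1" duration="$2"
  cat <<EOF
{
  "entitySelector": "type(SERVICE),entityName.equals(checkoutservice)",
  "eventType": "CUSTOM_INFO",
  "properties": {
    "tool": "k6",
    "action": "test",
    "state": "finished",
    "purpose": "${purpose}",
    "duration": "${duration}"
  },
  "title": "k6 load test finished"
}
EOF
}

build_test_finished_event "demo" "3m"

# To actually send it (requires real credentials):
# build_test_finished_event "demo" "3m" | curl -X POST \
#   -H "Authorization: Api-Token ${DT_API_TOKEN}" \
#   -H "Content-Type: application/json" \
#   -d @- "${DT_URL}/api/v2/events/ingest"
```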
"},{"location":"automate-srg/#create-a-workflow-to-trigger-guardian","title":"Create a Workflow to Trigger Guardian","text":"Ensure you are still on the Three golden signals (checkoutservice)
screen.
Automate
button. This will create a template workflow.event type
from bizevents
to events
.Filter query
to:event.type == \"CUSTOM_INFO\" and\ndt.entity.service.name == \"checkoutservice\" and\ntool == \"k6\" and\naction == \"test\" and\nstate == \"finished\"\n
run_validation
node.event.timeframe.from
and replace with:now-{{ event()['duration'] }}\n
The UI will change this to now-event.duration
.
Remove event.timeframe.to
and replace with:
now\n
Click the Save
button.
The workflow is now created and connected to the guardian. It will be triggered whenever the platform receives an event like below.
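An illustrative sketch of such an event's fields (the values are examples only, chosen to match the filter query configured earlier):

```json
{
  "event.type": "CUSTOM_INFO",
  "dt.entity.service.name": "checkoutservice",
  "tool": "k6",
  "action": "test",
  "state": "finished",
  "purpose": "acceptance test",
  "duration": "3m"
}
```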
The workflow is now live and listening for events.
Go to https://github.com/codespaces and delete the codespace which will delete the demo environment.
You may also wish to delete the API token.
Site reliability guardians are a mechanism to automate analysis when changes are made. They can be used in production (on a CRON) or as deployment checks (eg. pre and post deployment health checks, security checks, infrastructure health checks).
We will create a guardian to check the checkoutservice
microservice which is used during the purchase journey.
ctrl + k
search for Site Reliability Guardian
and select the app.+ Guardian
to add a new guardian.Four Golden Signals
choose Use template
.Run query
and toggle 50
rows per page to see more services.checkoutservice
. Click Apply to template (1)
.Saturation
objective and delete it (there are no resource statistics from OpenTelemetry available so this objective cannot be evaluated).Three golden signals (checkoutservice)
.Save
Automate at scale
This process can be automated for at-scale usage using Monaco or Terraform.
Objectives that are set to \"auto baseline\" in Dynatrace Site Reliability Guardians require 5
runs in order to enable the baselines.
In a real scenario, these test runs would likely be spread over hours, days or weeks. This provides Dynatrace with ample time to gather sufficient usage data.
For demo purposes, 5 separate \"load tests\" will be triggered in quick succession to enable the baselining.
First, open a new terminal window and apply the load test script:
kubectl apply -f .devcontainer/k6/k6-load-test-script.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-first-load-test","title":"Trigger the First Load Test","text":"kubectl apply -f .devcontainer/k6/k6-srg-training-run1.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-second-load-test","title":"Trigger the Second Load Test","text":"Wait a few seconds and trigger the second load test:
kubectl apply -f .devcontainer/k6/k6-srg-training-run2.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-third-load-test","title":"Trigger the Third Load Test","text":"Wait a few seconds and trigger the third load test:
kubectl apply -f .devcontainer/k6/k6-srg-training-run3.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-fourth-load-test","title":"Trigger the Fourth Load Test","text":"Wait a few seconds and trigger the fourth load test:
kubectl apply -f .devcontainer/k6/k6-srg-training-run4.yaml\n
"},{"location":"enable-auto-baselines/#trigger-the-final-training-load-test","title":"Trigger the Final Training Load Test","text":"Wait a few seconds and trigger the final (fifth) load test:
kubectl apply -f .devcontainer/k6/k6-srg-training-run5.yaml\n
"},{"location":"enable-auto-baselines/#wait-for-completion","title":"Wait for Completion","text":"Each load test runs for 1 minute. Run this command to wait for all jobs to complete.
This command will appear to hang until the jobs are done. Be patient. It should take about 2 minutes:
kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n
\u279c /workspaces/obslab-release-validation (main) $ kubectl get jobs\nNAME STATUS COMPLETIONS DURATION AGE\nk6-training-run1 Complete 1/1 95s 2m2s\nk6-training-run2 Complete 1/1 93s 115s\nk6-training-run3 Complete 1/1 93s 108s\nk6-training-run4 Complete 1/1 90s 100s\nk6-training-run5 Complete 1/1 84s 94s\n
"},{"location":"enable-auto-baselines/#view-completed-training-runs","title":"View Completed Training Runs","text":"In Dynatrace, go to workflows
and select Executions
. You should see 5 successful workflow executions:
You can also use this DQL to see the Site Reliability Guardian results in a notebook:
fetch bizevents\n| filter event.provider == \"dynatrace.site.reliability.guardian\"\n| filter event.type == \"guardian.validation.finished\"\n| fieldsKeep guardian.id, validation.id, timestamp, guardian.name, validation.status, validation.summary, validation.from, validation.to\n
"},{"location":"enable-auto-baselines/#view-srg-status-in-the-site-reliability-guardian-app","title":"View SRG Status in the Site Reliability Guardian App","text":"The SRG results are also available in the Site Reliabiltiy Guardian app:
ctrl + k
site reliability guardian
or srg
Open
on your guardianYou should see the 5
runs listed:
Training Complete
The automatic baselines for the guardian are now enabled.
You can proceed to use the guardian for \"real\" evaluations.
A product manager informs you that they're ready to release their new feature. They ask you to enable the feature and run the load test in a dev environment.
They tell you that the new feature is behind a flag called paymentServiceFailure
(yes, an obvious name for this demo) and they tell you to change the defaultValue
from off
to on
.
Run the following script which notifies Dynatrace using a CUSTOM_INFO
event of the change, including the new value.
./runtimeChange.sh paymentServiceFailure on\n
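The contents of runtimeChange.sh are not reproduced in this guide; the following is a hypothetical sketch of the kind of payload such a script could build from its two arguments (the entity selector and property names here are assumptions, not the script's actual contents):

```shell
# Hypothetical payload builder for a feature-flag change notification.
# $1 = flag name, $2 = new value.
build_flag_change_event() {
  local flag="$1" value="$2"
  cat <<EOF
{
  "entitySelector": "type(SERVICE),entityName.equals(paymentservice)",
  "eventType": "CUSTOM_INFO",
  "properties": { "flag": "${flag}", "new.value": "${value}" },
  "title": "Feature flag ${flag} set to ${value}"
}
EOF
}

build_flag_change_event "paymentServiceFailure" "on"
```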
"},{"location":"enable-change/#change-flag-value","title":"Change Flag Value","text":"Locate the flags.yaml
file. Change the defaultValue
of the paymentServiceFailure
flag from \"off\"
to \"on\"
(line 84
).
Apply those changes:
kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml\n
You should see:
configmap/my-otel-demo-flagd-config configured\n
"},{"location":"enable-change/#run-acceptance-load-test","title":"Run Acceptance Load Test","text":"It is time to run an acceptance load test to see if the new feature has caused a regression.
This load test will run for 3 minutes and then trigger the site reliability guardian again:
kubectl apply -f .devcontainer/k6/k6-after-change.yaml\n
"},{"location":"enable-change/#configuration-change-events","title":"Configuration Change Events","text":"While you are waiting for the load test to complete, it is worth noting that each time a feature flag is changed, you should execute runtimeChange.sh
shell script to send an event to the service that is affected.
The feature flag changes the behaviour of the paymentservice
(which the checkoutservice
depends on).
Look at the paymentservice
and notice the configuration changed events.
Tip
You can send events for anything you like: deployments, load tests, security scans, configuration changes and more.
You must have the following to use this hands-on demo.
Save the Dynatrace environment URL:
.apps.
in the URLThe generic format is:
https://<EnvironmentID>.<Environment>.<URL>\n
For example:
https://abc12345.live.dynatrace.com\n
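As a quick sanity check, the environment ID is simply the first label of the hostname. A small sketch using the example URL above:

```shell
# Extract the environment ID from a Dynatrace SaaS URL.
environment_id() {
  local url="${1#https://}"  # strip the scheme
  echo "${url%%.*}"          # keep everything before the first dot
}

environment_id "https://abc12345.live.dynatrace.com"  # prints abc12345
```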
"},{"location":"getting-started/#create-api-token","title":"Create API Token","text":"In Dynatrace:
ctrl + k
. Search for access tokens
.metrics.ingest
logs.ingest
events.ingest
openTelemetryTrace.ingest
Click this button to open the demo environment. This will open in a new tab.
Preparation Complete
The preparation phase is now complete. Everything before now is a one-off task.
In day-to-day operations, you would begin from here.
"},{"location":"run-production-srg/#run-an-evaluation","title":"Run an Evaluation","text":"Now that the Site Reliability Guardian is trained, run another evaluation by triggering a load test.
Tip
Remember, the workflow is currently configured to listen for test finished
events but you could easily create additional workflows with different triggers such as on-demand or time-based CRON triggers.
This provides the ability to continuously test your service (eg. in production).
Run another load test to trigger a sixth evaluation.
kubectl apply -f .devcontainer/k6/k6.yaml\n
Again, wait for all jobs to complete. This run will take longer, approximately 2 minutes.
kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n
When the above command returns, you should see:
NAME STATUS COMPLETIONS DURATION AGE\nk6-training-run1 Complete 1/1 102s 9m41s\nk6-training-run2 Complete 1/1 100s 9m33s\nk6-training-run3 Complete 1/1 101s 9m23s\nk6-training-run4 Complete 1/1 93s 9m17s\nk6-training-run5 Complete 1/1 91s 9m11s\nrun-k6 Complete 1/1 79s 81s\n
When this evaluation is completed, click the Refresh
button in the Validation history
panel of the Site Reliability Guardian app (when viewing an individual guardian) and the heatmap should look like the image below.
Your results may vary
Your results may vary. In this example below, the Traffic
objective failed because the auto-adaptive thresholds detected that a traffic level below 1171
requests is too low and the actual traffic level was 1158
.
Because one objective failed, the guardian failed.
5 training runs and 1 \"real\" run:
Information Only Objectives
It is possible to add objectives that are \"informational only\" and do not contribute to the pass / fail decisions.
This is useful for new services where you are trying to \"get a feel for\" the real-world data values of your metrics.
To set an objective as \"information only\": * Select the objective to open the side panel * Scroll down to Define thresholds
* Select the No thresholds
option
After the codespace has started, the post creation script should begin. This will install everything and will take a few moments.
When the script has completed, a success message will briefly be displayed (it is so quick you'll probably miss it) and an empty terminal window will be shown.
"},{"location":"validate-telemetry/#wait-for-demo-to-start","title":"Wait For Demo to Start","text":"Wait for the demo application pods to start:
kubectl -n default wait --for=condition=Ready --all --timeout 300s pod\n
"},{"location":"validate-telemetry/#access-demo-user-interface","title":"Access Demo User Interface","text":"Start port forwarding to access the user interface:
kubectl -n default port-forward svc/my-otel-demo-frontendproxy 8080\n
Leave this command running. Open a new terminal window to run any other commands.
Go to ports tab, right click the demo app
entry and choose Open in browser
.
You should see the OpenTelemetry demo:
"},{"location":"validate-telemetry/#validate-telemetry","title":"Validate Telemetry","text":"It is time to ensure telemetry is flowing correctly into Dynatrace.
In Dynatrace, follow these steps:
"},{"location":"validate-telemetry/#validate-services","title":"Validate Services","text":"ctrl + k
. Search for services
. Go to services screen and validate you can see services.SERVICE-****
.CUSTOM_DEVICE-****
:ctrl + k
and search for settings
.Service Detection > Unified services for OpenTelemetry
and ensure the toggle is on.ctrl + k
. Search for distributed traces
.ctrl + k
. Search for metrics
.app.
and validate you can see some metrics.ctrl + k
. Search for notebooks
.+
to add a new DQL
section.fetch logs, scanLimitGBytes: 1\n| filter contains(content, \"conversion\")\n
"},{"location":"validate-telemetry/#telemetry-flowing","title":"Telemetry Flowing?","text":"If these four things are OK, your telemetry is flowing correctly into Dynatrace.
If not, please search for similar problems and / or raise an issue here.
Wait for all jobs to complete:
kubectl -n default wait --for=condition=Complete --all --timeout 120s jobs\n
All jobs (including the acceptance-load-test
) should now be Complete
.
Refresh the Site Reliability Guardian results heatmap again and notice that the guardian has failed.
The guardian has failed due to the error rate being too high.
Navigating to the checkoutservice
(ctrl + k
> services
> checkoutservice
), you can see the increase in failure rate.
Scroll down the services screen until you see the OpenTelemetry traces list. Notice lots of failed requests:
"},{"location":"view-acceptance-test-results/#analyse-a-failed-request","title":"Analyse a Failed Request","text":"Drill into one of the failed requests and notice lots of failures.
These failures are bubbling up through the request chain back towards the checkoutservice.
Ultimately though, the failure comes from the final span in the trace: The call to PaymentService/Charge
.
Investigating the span events, the cause of the failure becomes clear: the payment service caused an exception. The exception message and stack trace are given:
exception.message PaymentService Fail Feature Flag Enabled\nexception.stacktrace Error: PaymentService Fail Feature Flag Enabled at module.exports.charge\n (/usr/src/app/charge.js:21:11) at process.processTicksAndRejections\n (node:internal/process/task_queues:95:5) at async Object.chargeServiceHandler\n [as charge] (/usr/src/app/index.js:21:22)\nexception.type Error\n
"},{"location":"view-acceptance-test-results/#roll-back-change","title":"Roll Back Change","text":"Inform Dynatrace that a change in configuration is coming. The paymentServiceFailure
flag will be set to off:
./runtimeChange.sh paymentServiceFailure off\n
Again edit flags.yaml
and set the defaultValue
of paymentServiceFailure
from \"on\"
to \"off\"
(line 84
).
Apply the changes:
kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml\n
"},{"location":"view-acceptance-test-results/#summary","title":"Summary","text":"Looking back at the initial brief, it was your job to:
So how did things turn out?
no go
decision based on evidence provided by OpenTelemetry and the Dynatrace Site Reliability Guardian.Works with any metric
The techniques described here work with any metric, from any source.
You are encouraged to use metrics from other devices and sources (such as business related metrics like revenue).
Success
The Dynatrace Platform, Site Reliability Guardian and Workflows have provided visibility and automated change analysis.
Content about how the user progresses after this demo.
"}]} \ No newline at end of file diff --git a/view-acceptance-test-results/index.html b/view-acceptance-test-results/index.html index 1c16226..05c6627 100755 --- a/view-acceptance-test-results/index.html +++ b/view-acceptance-test-results/index.html @@ -681,7 +681,7 @@./runtimeChange.sh paymentServiceFailure off
Again edit flags.yaml
and set the defaultValue
of paymentServiceFailure
from "on"
to "off"
(line 84
)
Apply the chnages:
+Apply the changes:
kubectl apply -f $CODESPACE_VSCODE_FOLDER/flags.yaml