Node timing #747

Merged
merged 1 commit into from
Jan 31, 2025

Conversation

paigerube14
Collaborator

@paigerube14 paigerube14 commented Jan 14, 2025

This PR adds the ability to record AffectedNode timing in a list that we can track in the telemetry and other output.

It tracks two things: how long the cloud provider takes to stop or start the node, and, once the node is stopped or started on the cloud side, how long the node then spends in the not ready or ready state.
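
To make the two measurements concrete, here is a minimal sketch of how one phase could be timed. It is not the code in this PR or in krkn-lib: `NodeTiming` and `timed_wait` are hypothetical names, and only the field names are taken from the telemetry output shown in this description.

```python
import time
from dataclasses import dataclass, asdict


@dataclass
class NodeTiming:
    """Hypothetical record; field names copied from the telemetry output."""
    node_name: str
    not_ready_time: float = 0.0
    ready_time: float = 0.0
    unknown_time: float = 0.0
    stopped_time: float = 0.0
    running_time: float = 0.0
    terminating_time: float = 0.0


def timed_wait(condition, timeout=300.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True and return the elapsed seconds.

    Raises TimeoutError if the condition is still false after `timeout` seconds.
    """
    start = clock()
    while not condition():
        if clock() - start >= timeout:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)
    return clock() - start
```

A caller could then store the result on the record, for example `timing.stopped_time = timed_wait(lambda: instance_is_stopped(node))`, where `instance_is_stopped` stands in for whatever cloud provider check is available, and serialize the record with `asdict(timing)`.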

Needs to go in after krkn-chaos/krkn-lib#143

Any suggestions on how to make this not touch as many files?

The new telemetry section will look like this for a stop/start scenario:

    "affected_nodes": [
        {
            "node_name": "ip-*.us-east-2.compute.internal",
            "not_ready_time": 0.0,
            "ready_time": 24.439035892486572,
            "unknown_time": 0.15461111068725586,
            "stopped_time": 136.35135626792908,
            "running_time": 15.249454021453857,
            "terminating_time": 0.0
        },
        {
            "node_name": "ip-*.us-east-2.compute.internal",
            "not_ready_time": 0.0,
            "ready_time": 23.62007188796997,
            "unknown_time": 0.15206694602966309,
            "stopped_time": 166.52288508415222,
            "running_time": 15.330392122268677,
            "terminating_time": 0.0
        }
    ],
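
For anyone consuming this section downstream, the following is a small sketch of how the `affected_nodes` list could be summarized per node. The grouping of fields into "cloud" and "kube" buckets is an assumption based on the description above (in particular, where `unknown_time` and `terminating_time` belong is a guess), and `summarize` is a hypothetical helper, not part of krkn or krkn-lib.

```python
import json

# Assumed grouping: cloud-side phases vs. Kubernetes node-condition phases.
CLOUD_FIELDS = ("stopped_time", "running_time", "terminating_time")
KUBE_FIELDS = ("not_ready_time", "ready_time", "unknown_time")


def summarize(affected_nodes):
    """Return one summary dict per node with per-bucket and total seconds."""
    rows = []
    for node in affected_nodes:
        cloud = sum(node.get(field, 0.0) for field in CLOUD_FIELDS)
        kube = sum(node.get(field, 0.0) for field in KUBE_FIELDS)
        rows.append({
            "node_name": node["node_name"],
            "cloud_seconds": cloud,
            "kube_seconds": kube,
            "total_seconds": cloud + kube,
        })
    return rows


# First entry of the sample telemetry from this PR description.
sample = json.loads("""
{
  "affected_nodes": [
    {
      "node_name": "ip-*.us-east-2.compute.internal",
      "not_ready_time": 0.0,
      "ready_time": 24.439035892486572,
      "unknown_time": 0.15461111068725586,
      "stopped_time": 136.35135626792908,
      "running_time": 15.249454021453857,
      "terminating_time": 0.0
    }
  ]
}
""")

rows = summarize(sample["affected_nodes"])
for row in rows:
    print(f'{row["node_name"]}: cloud={row["cloud_seconds"]:.1f}s '
          f'kube={row["kube_seconds"]:.1f}s total={row["total_seconds"]:.1f}s')
```

Using `node.get(field, 0.0)` keeps the helper tolerant of entries that omit a phase, which may or may not happen in real output.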

Collaborator

@chaitanyaenr chaitanyaenr left a comment

I tested different combinations and the reporting works as expected. Working node scenario configs:

  • one action: node_stop_start_scenario with an instance count of one
  • one action: node_stop_start_scenario with an instance count greater than one

A scenario with multiple actions fails to report the metrics because of a bug in the node-scenarios code base that is outside this PR: #749

@paigerube14 force-pushed the node_timing branch 2 times, most recently from 0c804a1 to 8a13cd1, on January 23, 2025 13:54
Collaborator

@chaitanyaenr chaitanyaenr left a comment

The PR is good to merge once the krkn-lib version is bumped to include the required enhancements.

@paigerube14 force-pushed the node_timing branch 2 times, most recently from ba96538 to 82b4cea, on January 31, 2025 18:23
* added new native hog scenario

* removed arcaflow dependency + legacy hog scenarios

* config update

* changed hog configuration structure + added average samples

* fix on cpu count

* removes tripledes warning

* changed selector format

* changed selector syntax

* number of nodes option

* documentation

* functional tests

* exception handling on hog deployment thread

Signed-off-by: Paige Patton <[email protected]>
@chaitanyaenr chaitanyaenr merged commit b024cfd into krkn-chaos:main Jan 31, 2025
6 of 7 checks passed