From f2584e47dd4c3f19e0ce3f9599c02cb1494288ee Mon Sep 17 00:00:00 2001 From: Keith Cantrell Date: Mon, 30 Sep 2024 15:21:09 -0500 Subject: [PATCH 1/6] Added a CloudFormation template to auto-add-cw-alarms. --- ...udformationTemplate-auto-add-cw-alarms.yml | 37 + Monitoring/auto-add-cw-alarms/README.md | 49 +- .../auto-add-cw-alarms/auto_add_cw_alarms.py | 28 +- .../auto-add-cw-alarms/cloudformation.yaml | 753 ++++++++++++++++++ .../update-auto-add-cw-alarms-CF-Template | 46 ++ .../updateMonOntapServiceCFTemplate | 2 +- 6 files changed, 892 insertions(+), 23 deletions(-) create mode 100644 .github/workflows/update-CloudformationTemplate-auto-add-cw-alarms.yml create mode 100644 Monitoring/auto-add-cw-alarms/cloudformation.yaml create mode 100755 Monitoring/auto-add-cw-alarms/update-auto-add-cw-alarms-CF-Template diff --git a/.github/workflows/update-CloudformationTemplate-auto-add-cw-alarms.yml b/.github/workflows/update-CloudformationTemplate-auto-add-cw-alarms.yml new file mode 100644 index 0000000..1138aee --- /dev/null +++ b/.github/workflows/update-CloudformationTemplate-auto-add-cw-alarms.yml @@ -0,0 +1,37 @@ +--- +# Copyright (c) NetApp, Inc. +# SPDX-License-Identifier: Apache-2.0 + +name: "Update Cloudformation Template" + +on: + pull_request: + paths: + - 'Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py' + push: + paths: + - 'Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py' + branches: + - main + +jobs: + update-Cloudformation-Template: + runs-on: ubuntu-latest + permissions: + # Give the default GITHUB_TOKEN write permission to commit and push the + # added or changed files to the repository. + contents: write + + steps: + - name: Checkout pull request + uses: actions/checkout@v4 + with: + ref: ${{ github.event.pull_request.head.ref }} + + - name: Update the Cloudformation Template + shell: bash + working-directory: Monitoring/auto-add-cw-alarms + run: ./update-auto-add-cw-alarms-CF-Template + + - name: Commit the changes + uses: stefanzweifel/git-auto-commit-action@v5 diff --git a/Monitoring/auto-add-cw-alarms/README.md b/Monitoring/auto-add-cw-alarms/README.md index 798a0d6..ec3e7f9 100644 --- a/Monitoring/auto-add-cw-alarms/README.md +++ b/Monitoring/auto-add-cw-alarms/README.md @@ -10,19 +10,32 @@ to monitor the CPU utilization of the file system. And if a volume or file syste To implement this, you might think to just create EventTail filters to trigger on the creation or deletion of an FSx Volume. This would kind of work, but since you have command line access to the FSx for ONTAP file system, you can create -and delete volumes without creating CloudTrail events. So, this method would not be reliable. Therefore, instead +and delete volumes without generating any CloudTrail events. So, this method would not be reliable. Therefore, instead of relying on those events, this script will scan all the file systems and volumes in all the regions then create and delete alarms as needed. ## Invocation -There are two ways you can invoke this script (Python program). Either from a computer that has Python installed, or you could upload it -as a Lambda function. +There are two ways you can invoke this script (Python program). Either from a computer that has Python installed, or you could install it +as a Lambda function. If you want to run it as a Lambda function, a CloudFormation template is included in the repo that will: +- Create a role that will allow the Lambda function to: + - List AWS regions. So it can scan all regions for FSx for ONTAP file systems and volumes. 
+ - List the FSx for ONTAP file systems. + - List the FSx volume. + - List the CloudWatch alarms. + - List tags for the resources. This is so you can customize the thresholds for the alarms. + - Create CloudWatch alarms. + - Delete CloudWatch alarms that it has created (based on alarm names). +- Create a Lambda function with the Python program. +- Create a EventBridge schedule that will run the Lambda function on a user defined basis. +- Create a role that will allow the EventBridge schedule to trigger the Lambda function. ### Configuring the program Before you can run the program you will need to configure it. You can configure it a few ways: * By editing the top part of the program itself where there are the following variable definitions. -* By setting environment variables. +* By setting environment variables with the same names as the variables in the program. * If running it as a standalone program, via some command line options. +:bulb: **NOTE:** The CloudFormation template will prompt for these values when you create the stack and will set the appropriate environment variables for you. + Here is the list of variables, and what they define: | Variable | Description |Command Line Option| @@ -78,19 +91,20 @@ You can run the program in "Dry Run" mode by specifying the `-d` (or `--dryRun`) messages showing what it would have done, and not really create or delete any CloudWatch alarms. ### Running as a Lambda function -If you run the program as a Lambda function, you will want to set the timeout to at least two minutes since some of the API calls +A CloudFormation template is included in the repo that will do the steps below. Otherwise, here are the steps required to install the program as a Lambda function. +Create a Lambda function and upload the program as the function code. Set the set the timeout to at least five minutes since some of the API calls can take a significant amount of "clock time" to run, especially in distant regions. Once you have installed the Lambda function it is recommended to set up a scheduled type EventBridge rule so the function will run on a regular basis. The appropriate permissions will need to be assigned to the Lambda function in order for it to run correctly. It doesn't need many permissions. It just needs to be able to: -* List the FSx for ONTAP file systems -* List the FSx volume names -* List the CloudWatch alarms -* Create CloudWatch alarms -* Delete CloudWatch alarms -* Create CloudWatch Log Groups and Log Streams in case you need to diagnose an issue +* List the FSx for ONTAP file systems. +* List the FSx volume names. +* List the CloudWatch alarms. +* Create CloudWatch alarms. +* Delete CloudWatch alarms. You can set resource to "arn:aws:cloudwatch:*:${AWS::AccountId}:alarm:FSx-ONTAP-Auto*" to limit the deletion to only the alarms that it created. +* Create CloudWatch Log Groups and Log Streams in case you need to diagnose an issue. The following permissions are required to run the script (although you could narrow the "Resource" specification to suit your needs.) 
```JSON @@ -105,7 +119,6 @@ The following permissions are required to run the script (although you could nar "fsx:ListTagsForResource", "fsx:DescribeVolumes", "fsx:DescribeFilesystems", - "cloudwatch:DeleteAlarms", "cloudwatch:DescribeAlarmsForMetric", "ec2:DescribeRegions", "cloudwatch:DescribeAlarms" @@ -115,6 +128,14 @@ The following permissions are required to run the script (although you could nar { "Sid": "VisualEditor1", "Effect": "Allow", + "Action": [ + "cloudwatch:DeleteAlarms" + ], + "Resource": "arn:aws:cloudwatch:*:*:alarm:FSx-ONTAP-Auto*" + }, + { + "Sid": "VisualEditor2", + "Effect": "Allow", "Action": [ "logs:CreateLogStream", "logs:PutLogEvents" @@ -122,7 +143,7 @@ The following permissions are required to run the script (although you could nar "Resource": "arn:aws:logs:*:*:log-group:*:log-stream:*" }, { - "Sid": "VisualEditor2", + "Sid": "VisualEditor3", "Effect": "Allow", "Action": "logs:CreateLogGroup", "Resource": "arn:aws:logs:*:*:log-group:*" @@ -133,7 +154,7 @@ The following permissions are required to run the script (although you could nar ### Expected Action Once the script has been configured and invoked, it will: -* Scan for every FSx for ONTAP file systems in every region. For every file system it finds it will: +* Scan for every FSx for ONTAP file systems in every region. For every file system that it finds it will: * Create a CPU utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. * Create a SSD utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. * Scan for every FSx for ONTAP volume in every region. For every volume it finds it will: diff --git a/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py b/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py index 5886be8..7accc7c 100755 --- a/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py +++ b/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py @@ -4,15 +4,18 @@ # ONTAP volumes, that don't already have one, that will trigger when the # utilization of the volume gets above the threshold defined below. It will # also create an alarm that will trigger when the file system reach -# an average CPU utilization greater than what is specified below. +# an average CPU utilization greater than what is specified below as well +# an alarm that will trigger when the SSD utilization is greater than what +# is specified below. # # It can either be run as a standalone script, or uploaded as a Lambda # function with the thought being that you will create a EventBridge schedule # to invoke it periodically. # -# It will scan all regions looking for FSxN volumes, and since CloudWatch -# can't send SNS messages across regions, it assumes that the specified -# SNS topic exist in each region for the specified account ID. +# It will scan all regions looking for FSxN volumes and file systems +# and since CloudWatch can't send SNS messages across regions, it assumes +# that the specified SNS topic exist in each region for the specified +# account ID. # # Finally, a default volume threshold is defined below. It sets the volume # utilization threshold that will cause CloudWatch to send the alarm event @@ -24,6 +27,9 @@ # Lastly, you can create an override for the SSD alarm, by creating a tag # with the name "SSD_Alarm_Threshold" on the file system resource. # +# Version: %%VERSION%% +# Date: %%DATE%% +# ################################################################################ # # The following variables effect the behavior of the script. 
They can be @@ -64,14 +70,20 @@ # what you are doing. ################################################################################ # +# The following is put in front of all alarms so an IAM policy can be create +# that will allow this script to only be able to delete the alarms it creates. +# If you change this, you must also change the IAM policy. Note that the +# Cloudfomration template also assume the value of this variable. +basePrefix="FSx-ONTAP-Auto" +# # Define the prefix for the volume utilization alarm name for the CloudWatch alarms. -alarmPrefixVolume="Volume_Utilization_for_volume_" +alarmPrefixVolume=f"{basePrefix}-Volume_Utilization_for_volume_" # # Define the prefix for the CPU utilization alarm name for the CloudWatch alarms. -alarmPrefixCPU="CPU_Utilization_for_fs_" +alarmPrefixCPU=f"{basePrefix}-CPU_Utilization_for_fs_" # # Define the prefix for the SSD utilization alarm name for the CloudWatch alarms. -alarmPrefixSSD="SSD_Utilization_for_fs_" +alarmPrefixSSD=f"{basePrefix}-SSD_Utilization_for_fs_" ################################################################################ # You shouldn't have to modify anything below here. @@ -531,7 +543,7 @@ def lambda_handler(event, context): # This function is used to print out the usage of the script. ################################################################################ def usage(): - print('Usage: add_cw_alarm [-h|--help] [-d|--dryRun] [[-c|--customerID customerID] [[-a|--accountID aws_account_id] [[-s|--SNSTopic SNS_Topic_Name] [[-r|--region region] [[-C|--CPUThreshold threshold] [[-S|--SSDThreshold threshold] [[-V|--VolumeThreshold threshold] [-F|--FileSystemID FileSystemID]') + print('Usage: auto_add_cw_alarms [-h|--help] [-d|--dryRun] [[-c|--customerID customerID] [[-a|--accountID aws_account_id] [[-s|--SNSTopic SNS_Topic_Name] [[-r|--region region] [[-C|--CPUThreshold threshold] [[-S|--SSDThreshold threshold] [[-V|--VolumeThreshold threshold] [-F|--FileSystemID FileSystemID]') ################################################################################ # Main logic starts here. diff --git a/Monitoring/auto-add-cw-alarms/cloudformation.yaml b/Monitoring/auto-add-cw-alarms/cloudformation.yaml new file mode 100644 index 0000000..1d9e307 --- /dev/null +++ b/Monitoring/auto-add-cw-alarms/cloudformation.yaml @@ -0,0 +1,753 @@ +Description: "Deploy auto-add-cw-alarms." +# +# This just formats the page that prompts for the parameters when using the AWS Console to deploy your stack. +Metadata: + AWS::CloudFormation::Interface: + ParameterGroups: + - Label: + default: "Configuration Parameters" + Parameters: + - SNStopic + - accountId + - customerId + - defaultCPUThreshold + - defaultSSDThreshold + - defaultVolumeThreshold + - checkInterval + +Parameters: + SNStopic: + Description: "The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created." + Type: String + + accountId: + Description: "The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic." + Type: String + + customerId: + Description: "This is really just a comment that will be added to the alarm description." + Type: String + Default: "" + + defaultCPUThreshold: + Description: "This will define the default CPU utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information." 
+ Type: Number + MinValue: 0 + MaxValue: 100 + Default: 80 + + defaultSSDThreshold: + Description: "This will define the default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information." + Type: Number + MinValue: 0 + MaxValue: 100 + Default: 80 + + defaultVolumeThreshold: + Description: "This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume. See below for more information." + Type: Number + MinValue: 0 + MaxValue: 100 + Default: 80 + + checkInterval: + Description: "This is how often you want the Lambda function to run to look for new file systems and/or volumes (minutes)." + Type: Number + MinValue: 5 + Default: 15 + +Resources: + LambdaRole: + Type: "AWS::IAM::Role" + Properties: + RoleName: !Sub "Lambda-Role-for-${AWS::StackName}" + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: "Allow" + Principal: + Service: "lambda.amazonaws.com" + Action: "sts:AssumeRole" + + ManagedPolicyArns: + - "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole" + + Policies: + - PolicyName: "LambdaPolicy_for_Auto_Add_CW_Alarms" + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: "Allow" + Action: + - "fsx:DescribeFileSystems" + - "fsx:DescribeVolumes" + - "fsx:ListTagsForResource" + - "ec2:DescribeRegions" + - "cloudwatch:DescribeAlarms" + - "cloudwatch:DescribeAlarmsForMetric" + - "cloudwatch:PutMetricAlarm" + Resource: "*" + + - Effect: "Allow" + Action: + - "cloudwatch:DeleteAlarms" + Resource: !Sub "arn:aws:cloudwatch:*:${AWS::AccountId}:alarm:FSx-ONTAP-Auto*" + + SchedulerRole: + Type: "AWS::IAM::Role" + Properties: + RoleName: !Sub "SchedulerRole-for-${AWS::StackName}" + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: "Allow" + Principal: + Service: "scheduler.amazonaws.com" + Action: "sts:AssumeRole" + + Policies: + - PolicyName: "SchedulerPolicy_for_Auto_Add_CW_Alarms" + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: "Allow" + Action: + - "lambda:InvokeFunction" + Resource: !GetAtt LambdaFunction.Arn + + LambdaScheduler: + Type: "AWS::Scheduler::Schedule" + Properties: + Description: "Schedule the auto_add_cw_alarms Lambda function." + Name: !Sub "Schedule-for-${AWS::StackName}" + FlexibleTimeWindow: + Mode: "OFF" + ScheduleExpression: !Sub "rate(${checkInterval} minutes)" + Target: + Arn: !GetAtt LambdaFunction.Arn + RoleArn: !GetAtt SchedulerRole.Arn + + LambdaFunction: + Type: "AWS::Lambda::Function" + Properties: + FunctionName: !Sub "Lambda-for-${AWS::StackName}" + Role: !GetAtt LambdaRole.Arn + PackageType: "Zip" + Runtime: "python3.12" + Handler: "index.lambda_handler" + Timeout: 300 + Environment: + Variables: + SNStopic: !Ref SNStopic + accountId: !Ref accountId + customerId: !Ref customerId + defaultCPUThreshold: !Ref defaultCPUThreshold + defaultSSDThreshold: !Ref defaultSSDThreshold + defaultVolumeThreshold: !Ref defaultVolumeThreshold + + Code: + ZipFile: | + #!/usr/bin/python3 + # + # This script is used to add CloudWatch alarms for all the FSx for NetApp + # ONTAP volumes, that don't already have one, that will trigger when the + # utilization of the volume gets above the threshold defined below. 
It will + # also create an alarm that will trigger when the file system reach + # an average CPU utilization greater than what is specified below as well + # an alarm that will trigger when the SSD utilization is greater than what + # is specified below. + # + # It can either be run as a standalone script, or uploaded as a Lambda + # function with the thought being that you will create a EventBridge schedule + # to invoke it periodically. + # + # It will scan all regions looking for FSxN volumes and file systems + # and since CloudWatch can't send SNS messages across regions, it assumes + # that the specified SNS topic exist in each region for the specified + # account ID. + # + # Finally, a default volume threshold is defined below. It sets the volume + # utilization threshold that will cause CloudWatch to send the alarm event + # to the SNS topic. It can be overridden on a per volume basis by having a + # tag with the name of "alarm_threshold" set to the desired threshold. + # If the tag is set to 100, then no alarm will be created. You can also + # set an override to the filesystem CPU utilization alarm, but setting + # a tag with the name of 'CPU_Alarm_Threshold' on the file system resouce. + # Lastly, you can create an override for the SSD alarm, by creating a tag + # with the name "SSD_Alarm_Threshold" on the file system resource. + # + # Version: v2.11 + # Date: 2024-09-27-16:03:19 + # + ################################################################################ + # + # The following variables effect the behavior of the script. They can be + # either be set here, overridden via the command line options, or + # overridden by environment variables. + # + # Define which SNS topic you want "volume full" message to be sent to. + SNStopic='' + # + # Provide the account id the SNS topic resides under: + # MUST be a string. + accountId='' + # + # Set the customer ID associated with the AWS account. This is used to + # as part of the alarm name prefix so a customer ID can be associated + # with the alarm. If it is left as an empty string, no extra prefix + # will be added. + customerId='' + # + # Define the default CPU utilization threshold before sending the alarm. + # Setting it to 100 will disable the creation of the alarm. + defaultCPUThreshold=80 + # + # Define the default SSD utilization threshold before sending the alarm. + # Setting it to 100 will disable the creation of the alarm. + defaultSSDThreshold=90 + # + # Define the default volume utilization threshold before sending the alarm. + # Setting it to 100 will disable the creation of the alarm. + defaultVolumeThreshold=80 + # + # + ################################################################################ + # You can't change the following variables from the command line or environment + # variables since changing them after the program has run once would cause + # all existing CloudWatch alarms to be abandoned, and all new alarms to be + # created. So, it is not recommended to change these variables unless you know + # what you are doing. + ################################################################################ + # + # The following is put in front of all alarms so an IAM policy can be create + # that will allow this script to only be able to delete the alarms it creates. + # If you change this, you must also change the IAM policy. Note that the + # Cloudfomration template also assume the value of this variable. 
+ basePrefix="FSx-ONTAP-Auto" + # + # Define the prefix for the volume utilization alarm name for the CloudWatch alarms. + alarmPrefixVolume=f"{basePrefix}-Volume_Utilization_for_volume_" + # + # Define the prefix for the CPU utilization alarm name for the CloudWatch alarms. + alarmPrefixCPU=f"{basePrefix}-CPU_Utilization_for_fs_" + # + # Define the prefix for the SSD utilization alarm name for the CloudWatch alarms. + alarmPrefixSSD=f"{basePrefix}-SSD_Utilization_for_fs_" + + ################################################################################ + # You shouldn't have to modify anything below here. + ################################################################################ + + import botocore + from botocore.config import Config + import boto3 + import os + import getopt + import sys + import time + import json + + ################################################################################ + # This function adds the SSD Utilization CloudWatch alarm. + ################################################################################ + def add_ssd_alarm(cw, fsId, alarmName, alarmDescription, threshold, region): + action = 'arn:aws:sns:' + region + ':' + accountId + ':' + SNStopic + if not dryRun: + cw.put_metric_alarm( + AlarmName=alarmName, + ActionsEnabled=True, + AlarmActions=[action], + AlarmDescription=alarmDescription, + EvaluationPeriods=1, + DatapointsToAlarm=1, + Threshold=threshold, + ComparisonOperator='GreaterThanThreshold', + MetricName="StorageCapacityUtilization", + Period=300, + Statistic="Average", + Namespace="AWS/FSx", + Dimensions=[{'Name': 'FileSystemId', 'Value': fsId}, {'Name': 'StorageTier', 'Value': 'SSD'}, {'Name': 'DataType', 'Value': 'All'}] + ) + else: + print(f'Would have added SSD alarm for {fsId} with name {alarmName} with thresold of {threshold} in {region} with action {action}') + + ################################################################################ + # This function adds the CPU Utilization CloudWatch alarm. + ################################################################################ + def add_cpu_alarm(cw, fsId, alarmName, alarmDescription, threshold, region): + action = 'arn:aws:sns:' + region + ':' + accountId + ':' + SNStopic + if not dryRun: + cw.put_metric_alarm( + AlarmName=alarmName, + ActionsEnabled=True, + AlarmActions=[action], + AlarmDescription=alarmDescription, + EvaluationPeriods=1, + DatapointsToAlarm=1, + Threshold=threshold, + ComparisonOperator='GreaterThanThreshold', + MetricName="CPUUtilization", + Period=300, + Statistic="Average", + Namespace="AWS/FSx", + Dimensions=[{'Name': 'FileSystemId', 'Value': fsId}] + ) + else: + print(f'Would have added CPU alarm for {fsId} with name {alarmName} with thresold of {threshold} in {region} with action {action}.') + + ################################################################################ + # This function adds the Volume utilization CloudWatch alarm. 
+ ################################################################################ + def add_volume_alarm(cw, volumeId, alarmName, alarmDescription, fsId, threshold, region): + action = 'arn:aws:sns:' + region + ':' + accountId + ':' + SNStopic + if not dryRun: + cw.put_metric_alarm( + ActionsEnabled=True, + AlarmName=alarmName, + AlarmActions=[action], + AlarmDescription=alarmDescription, + EvaluationPeriods=1, + DatapointsToAlarm=1, + Threshold=threshold, + ComparisonOperator='GreaterThanThreshold', + Metrics=[{"Id":"e1","Label":"Utilization","ReturnData":True,"Expression":"m2/m1*100"},\ + {"Id":"m2","ReturnData":False,"MetricStat":{"Metric":{"Namespace":"AWS/FSx","MetricName":"StorageUsed","Dimensions":[{"Name":"VolumeId","Value": volumeId},{"Name":"FileSystemId","Value":fsId}]},"Period":300,"Stat":"Average"}},\ + {"Id":"m1","ReturnData":False,"MetricStat":{"Metric":{"Namespace":"AWS/FSx","MetricName":"StorageCapacity","Dimensions":[{"Name":"VolumeId","Value": volumeId},{"Name":"FileSystemId","Value":fsId}]},"Period":300,"Stat":"Average"}}] + ) + else: + print(f'Would have added volume alarm for {volumeId} {fsId} with name {alarmName} with thresold of {threshold} in {region} with action {action}.') + + + ################################################################################ + # This function deletes a CloudWatch alarm. + ################################################################################ + def delete_alarm(cw, alarmName): + if not dryRun: + cw.delete_alarms(AlarmNames=[alarmName]) + else: + print(f'Would have deleted alarm {alarmName}.') + + ################################################################################ + # This function checks to see if the alarm already exists. + ################################################################################ + def contains_alarm(alarmName, alarms): + for alarm in alarms: + if(alarm['AlarmName'] == alarmName): + return True + return False + + ################################################################################ + # This function checks to see if a volume exists. + ################################################################################ + def contains_volume(volumeId, volumes): + for volume in volumes: + if(volume['VolumeId'] == volumeId): + return True + return False + + ################################################################################ + # This function checks to see if a file system exists. + ################################################################################ + def contains_fs(fsId, fss): + for fs in fss: + if(fs['FileSystemId'] == fsId): + return True + return False + + ################################################################################ + # This function returns the value assigned to the "alarm_threshold" tag + # associated with the arn passed in. If none is found, it returns the default + # threshold set above. + ################################################################################ + def getAlarmThresholdTagValue(fsx, arn): + # + # If there are a lot of volumes, we could get hit by the AWS rate limit, + # so we will sleep for a short period of time and then retry. We will + # double the sleep time each time we get a rate limit exception until + # we get to 5 seconds, then we will just raise the exception. + sleep=.125 + # + # This is put into a try block because it is possible that the volume + # is deleted between the time we get the list of volumes and the time + # we try to get the tags for the volume. 
+ while True: + try: + tags = fsx.list_tags_for_resource(ResourceARN=arn) + for tag in tags['Tags']: + if(tag['Key'].lower() == "alarm_threshold"): + return(tag['Value']) + return(defaultVolumeThreshold) + except botocore.exceptions.ClientError as e: + if e.response['Error']['Code'] == 'ResourceNotFound': + return(100) # Return 100 so we don't try to create an alarm. + + if e.response['Error']['Code'] == 'TooManyRequestsException' or e.response['Error']['Code'] == 'ThrottlingException': + sleep = sleep * 2 + if sleep > 5: + raise e + print(f"Warning: Rate Limit fault while getting tags. Sleeping for {sleep} seconds.") + time.sleep(sleep) + else: + print(f"boto3 client error: {json.dumps(e.response)}") + raise e + + ################################################################################ + # This function returns the value assigned to the "CPU_alarm_threshold" tag + # that is in the array of tags passed in. if it doesn't find that tag it + # returns the default threshold set above. + ################################################################################ + def getCPUAlarmThresholdTagValue(tags): + for tag in tags: + if(tag['Key'].lower() == "cpu_alarm_threshold"): + return(tag['Value']) + return(defaultCPUThreshold) + + ################################################################################ + # This function returns the value assigned to the "CPU_alarm_threshold" tag + # that is in the array of tags passed in. if it doesn't find that tag it + # returns the default threshold set above. + ################################################################################ + def getSSDAlarmThresholdTagValue(tags): + for tag in tags: + if(tag['Key'].lower() == "ssd_alarm_threshold"): + return(tag['Value']) + return(defaultSSDThreshold) + + ################################################################################ + # This function returns the file system id that the passed in alarm is + # associated with. + ################################################################################ + def getFileSystemId(alarm): + + for metric in alarm['Metrics']: + if metric["Id"] == "m1": + for dim in metric['MetricStat']['Metric']['Dimensions']: + if dim['Name'] == 'FileSystemId': + return dim['Value'] + return None + + ################################################################################ + # This function will return all the file systems in the region. It will handle the + # case where there are more file systms than can be returned in a single call. + # It will also handle the case where we get a rate limit exception. + ################################################################################ + def getFss(fsx): + + # The initial amount of time to sleep if there is a rate limit exception. + sleep=.125 + while True: + try: + response = fsx.describe_file_systems() + fss = response['FileSystems'] + nextToken = response.get('NextToken') + sleep=.125 + break + except botocore.exceptions.ClientError as e: + if e.response['Error']['Code'] == 'TooManyRequestsException' or e.response['Error']['Code'] == 'ThrottlingException': + sleep = sleep * 2 # Exponential backoff. + if sleep > 5: + raise e + print(f"Warning: Rate Limit fault while getting initial file system list. 
Sleeping for {sleep} seconds.") + time.sleep(sleep) + else: + print(f"boto3 client error: {json.dumps(e.response)}") + raise e + + while nextToken: + try: + response = fsx.describe_file_systems(NextToken=nextToken) + fss += response['FileSystems'] + nextToken = response.get('NextToken') + sleep=.125 + except botocore.exceptions.ClientError as e: + if e.response['Error']['Code'] == 'TooManyRequestsException' or e.response['Error']['Code'] == 'ThrottlingException': + sleep = sleep * 2 # Exponential backoff. + if sleep > 5: + raise e + print(f"Warning: Rate Limit fault while getting additional file systems. Sleeping for {sleep} seconds.") + time.sleep(sleep) + else: + print(f"boto3 client error: {json.dumps(e.response)}") + raise e + return fss + + ################################################################################ + # This function will return all the volumes in the region. It will handle the + # case where there are more volumes than can be returned in a single call. + # It will also handle the case where we get a rate limit exception. + ################################################################################ + def getVolumes(fsx): + # + # The initial amount of time to sleep if there is a rate limit exception. + sleep=.125 + while True: + try: + response = fsx.describe_volumes() + volumes = response['Volumes'] + nextToken = response.get('NextToken') + sleep=.125 + break + except botocore.exceptions.ClientError as e: + if e.response['Error']['Code'] == 'TooManyRequestsException' or e.response['Error']['Code'] == 'ThrottlingException': + sleep = sleep * 2 # Exponential backoff. + if sleep > 5: + raise e + print(f"Warning: Rate Limit fault while getting the initial list of volumes. Sleeping for {sleep} seconds.") + time.sleep(sleep) + else: + print(f"boto3 client error: {json.dumps(e.response)}") + raise e + + while nextToken: + try: + response = fsx.describe_volumes(NextToken=nextToken) + volumes += response['Volumes'] + nextToken = response.get('NextToken') + sleep=.125 + except botocore.exceptions.ClientError as e: + if e.response['Error']['Code'] == 'TooManyRequestsException' or e.response['Error']['Code'] == 'ThrottlingException': + sleep = sleep * 2 # Exponential backoff. + if sleep > 5: + raise e + print(f"Warning: Rate Limit fault while getting additional volumes. Sleeping for {sleep} seconds.") + time.sleep(sleep) + else: + print(f"boto3 client error: {json.dumps(e.response)}") + raise e + + return volumes + + ################################################################################ + # This function will return all the alarms in the region. It will handle the + # case where there are more alarms than can be returned in a single call. + # It will also handle the case where we get a rate limit exception. + ################################################################################ + def getAlarms(cw): + + # The initial amount of time to sleep if there is a rate limit exception. + sleep=.125 + while True: + try: + response = cw.describe_alarms() + alarms = response['MetricAlarms'] + nextToken = response.get('NextToken') + sleep=.125 + break + except botocore.exceptions.ClientError as e: + if e.response['Error']['Code'] == 'TooManyRequestsException' or e.response['Error']['Code'] == 'ThrottlingException': + sleep = sleep * 2 + if sleep > 5: + raise e + print(f"Warning: Rate Limit fault while getting the initial list of alarms. 
Sleeping for {sleep} seconds.") + time.sleep(sleep) + else: + print(f"boto3 client error: {json.dumps(e.response)}") + raise e + + while nextToken: + try: + response = cw.describe_alarms(NextToken=nextToken) + alarms += response['MetricAlarms'] + nextToken = response.get('NextToken') + sleep=.125 + except botocore.exceptions.ClientError as e: + if e.response['Error']['Code'] == 'TooManyRequestsException' or e.response['Error']['Code'] == 'ThrottlingException': + sleep = sleep * 2 # Exponential backoff. + if sleep > 5: + raise e + print(f"Warning: Rate Limit fault while getting additional alarms. Sleeping for {sleep} seconds.") + time.sleep(sleep) + else: + print(f"boto3 client error: {json.dumps(e.response)}") + raise e + + return alarms + + ################################################################################ + # This is the main logic of the program. It loops on all the regions then all + # the fsx volumes within the region, checking to see if any of them already + # have a CloudWatch alarm, and if not, add one. + ################################################################################ + def lambda_handler(event, context): + global customerId, regions, SNStopic, accountId, onlyFilesystemId + # + # If the customer ID is set, reformat it to be used in the alarm description. + if customerId != '': + customerId = f", CustomerID: {customerId}" + + if len(SNStopic) == 0: + raise Exception("You must specify a SNS topic to send the alarm messages to.") + + if len(accountId) == 0: + raise Exception("You must specify an accountId to run this program.") + # + # Configure boto3 to use the more advanced "adaptive" retry method. + boto3Config = Config( + retries = { + 'max_attempts': 5, + 'mode': 'adaptive' + } + ) + + if len(regions) == 0: # pylint: disable=E0601 + ec2Client = boto3.client('ec2', config=boto3Config) + ec2Regions = ec2Client.describe_regions()['Regions'] + for region in ec2Regions: + regions += [region['RegionName']] + + fsxRegions = boto3.Session().get_available_regions('fsx') + for region in regions: + if region in fsxRegions: + print(f'Scanning {region}') + fsx = boto3.client('fsx', region_name=region, config=boto3Config) + cw = boto3.client('cloudwatch', region_name=region, config=boto3Config) + # + # Get all the file systems, volumes and alarm in the region. + fss = getFss(fsx) + volumes = getVolumes(fsx) + alarms = getAlarms(cw) + # + # Scan for filesystems without CPU Utilization Alarm. + for fs in fss: + if(fs['FileSystemType'] == "ONTAP"): + threshold = int(getCPUAlarmThresholdTagValue(fs['Tags'])) + if(threshold != 100): + fsId = fs['FileSystemId'] + fsName = fsId.replace('fs-', 'FsxId') + alarmName = alarmPrefixCPU + fsId + alarmDescription = f"CPU utilization alarm for file system {fsName}{customerId} in region {region}." + + if(not contains_alarm(alarmName, alarms) and onlyFilesystemId == None or + not contains_alarm(alarmName, alarms) and onlyFilesystemId != None and onlyFilesystemId == fsId): + print(f'Adding CPU Alarm for {fs["FileSystemId"]}') + add_cpu_alarm(cw, fsId, alarmName, alarmDescription, threshold, region) + # + # Scan for CPU alarms without a FSxN filesystem. 
+ for alarm in alarms: + alarmName = alarm['AlarmName'] + if(alarmName[:len(alarmPrefixCPU)] == alarmPrefixCPU): + fsId = alarmName[len(alarmPrefixCPU):] + if(not contains_fs(fsId, fss) and onlyFilesystemId == None or + not contains_fs(fsId, fss) and onlyFilesystemId != None and onlyFilesystemId == fsId): + print("Deleting alarm: " + alarmName + " in region " + region) + delete_alarm(cw, alarmName) + # + # Scan for filesystems without SSD Utilization Alarm. + for fs in fss: + if(fs['FileSystemType'] == "ONTAP"): + threshold = int(getSSDAlarmThresholdTagValue(fs['Tags'])) + if(threshold != 100): + fsId = fs['FileSystemId'] + fsName = fsId.replace('fs-', 'FsxId') + alarmName = alarmPrefixSSD + fsId + alarmDescription = f"SSD utilization alarm for file system {fsName}{customerId} in region {region}." + + if(not contains_alarm(alarmName, alarms) and onlyFilesystemId == None or + not contains_alarm(alarmName, alarms) and onlyFilesystemId != None and onlyFilesystemId == fsId): + print(f'Adding SSD Alarm for {fsId}') + add_ssd_alarm(cw, fs['FileSystemId'], alarmName, alarmDescription, threshold, region) + # + # Scan for SSD alarms without a FSxN filesystem. + for alarm in alarms: + alarmName = alarm['AlarmName'] + if(alarmName[:len(alarmPrefixSSD)] == alarmPrefixSSD): + fsId = alarmName[len(alarmPrefixSSD):] + if(not contains_fs(fsId, fss) and onlyFilesystemId == None or + not contains_fs(fsId, fss) and onlyFilesystemId != None and onlyFilesystemId == fsId): + print("Deleteing alarm: " + alarmName + " in region " + region) + delete_alarm(cw, alarmName) + # + # Scan for volumes without alarms. + for volume in volumes: + if(volume['VolumeType'] == "ONTAP"): + volumeId = volume['VolumeId'] + volumeName = volume['Name'] + volumeARN = volume['ResourceARN'] + fsId = volume['FileSystemId'] + + threshold = int(getAlarmThresholdTagValue(fsx, volumeARN)) + + if(threshold != 100): # No alarm if the value is set to 100. + alarmName = alarmPrefixVolume + volumeId + fsName = fsId.replace('fs-', 'FsxId') + alarmDescription = f"Volume utilization alarm for volumeId {volumeId}{customerId}, File System Name: {fsName}, Volume Name: {volumeName} in region {region}." + if(not contains_alarm(alarmName, alarms) and onlyFilesystemId == None or + not contains_alarm(alarmName, alarms) and onlyFilesystemId != None and onlyFilesystemId == fsId): + print(f'Adding volume utilization alarm for {volumeName} in region {region}.') + add_volume_alarm(cw, volumeId, alarmName, alarmDescription, fsId, threshold, region) + # + # Scan for volume alarms without volumes. + for alarm in alarms: + alarmName = alarm['AlarmName'] + if(alarmName[:len(alarmPrefixVolume)] == alarmPrefixVolume): + volumeId = alarmName[len(alarmPrefixVolume):] + if(not contains_volume(volumeId, volumes) and onlyFilesystemId == None or + not contains_volume(volumeId, volumes) and onlyFilesystemId != None and onlyFilesystemId == getFileSystemId(alarm)): + print("Deleteing alarm: " + alarmName + " in region " + region) + delete_alarm(cw, alarmName) + + return + + ################################################################################ + # This function is used to print out the usage of the script. 
+ ################################################################################ + def usage(): + print('Usage: auto_add_cw_alarms [-h|--help] [-d|--dryRun] [[-c|--customerID customerID] [[-a|--accountID aws_account_id] [[-s|--SNSTopic SNS_Topic_Name] [[-r|--region region] [[-C|--CPUThreshold threshold] [[-S|--SSDThreshold threshold] [[-V|--VolumeThreshold threshold] [-F|--FileSystemID FileSystemID]') + + ################################################################################ + # Main logic starts here. + ################################################################################ + # + # Set some default values. + regions = [] + dryRun = False + # + # Check to see if there any any environment variables set. + customerId = os.environ.get('customerId', '') + accountId = os.environ.get('accountId', '') + SNStopic = os.environ.get('SNStopic', '') + onlyFilesystemId = None + defaultCPUThreshold = int(os.environ.get('defaultCPUThreshold', defaultCPUThreshold)) + defaultSSDThreshold = int(os.environ.get('defaultSSDThreshold', defaultSSDThreshold)) + defaultVolumeThreshold = int(os.environ.get('defaultVolumeThreshold', defaultVolumeThreshold)) + # + # Check to see if we are bring run from a command line or a Lmabda function. + if os.environ.get('AWS_LAMBDA_FUNCTION_NAME') == None: + argumentList = sys.argv[1:] + options = "hc:a:s:dr:C:S:V:F:" + + longOptions = ["help", "customerID=", "accountID=", "SNSTopic=", "dryRun", "region=", "CPUThreshold=", "SSDThreshold=", "VolumeThreshold=", "FileSystemID="] + skip = False + try: + arguments, values = getopt.getopt(argumentList, options, longOptions) + + for currentArgument, currentValue in arguments: + if currentArgument in ("-h", "--help"): + usage() + skip = True + elif currentArgument in ("-c", "--customerID"): + customerId = currentValue + elif currentArgument in ("-a", "--accountID"): + accountId = currentValue + elif currentArgument in ("-s", "--SNSTopic"): + SNStopic = currentValue + elif currentArgument in ("-C", "--CPUThreshold"): + defaultCPUThreshold = int(currentValue) + elif currentArgument in ("-S", "--SSDThreshold"): + defaultSSDThreshold = int(currentValue) + elif currentArgument in ("-V", "--VolumeThreshold"): + defaultVolumeThreshold = int(currentValue) + elif currentArgument in ("-d", "--dryRun"): + dryRun = True + elif currentArgument in ("-r", "--region"): + regions += [currentValue] + elif currentArgument in ("-F", "--FileSystemID"): + onlyFilesystemId = currentValue + + except getopt.error as err: + print(str(err)) + usage() + skip = True + + if not skip: + lambda_handler(None, None) diff --git a/Monitoring/auto-add-cw-alarms/update-auto-add-cw-alarms-CF-Template b/Monitoring/auto-add-cw-alarms/update-auto-add-cw-alarms-CF-Template new file mode 100755 index 0000000..8c838cd --- /dev/null +++ b/Monitoring/auto-add-cw-alarms/update-auto-add-cw-alarms-CF-Template @@ -0,0 +1,46 @@ +#!/bin/bash +# +# This script is used to update the CloudFormation template with the latest +# version of the Lambda function. It will also update the version number in +# the template. +################################################################################# + +majorVersionNum=2 +file="auto_add_cw_alarms.py" + +tmpfile1=$(mktemp /tmp/tmpfile1.XXXXXX) +tmpfile2=$(mktemp /tmp/tmpfile2.XXXXXX) +trap "rm -f $tmpfile1 $tmpfile2" exit +# +# First get the monitoring code out of the CF template. +sed -e '1,/ZipFile/d' cloudformation.yaml > cloudformation.yaml.tmp +# +# Now get the Date and Version lines out of both files. 
+egrep -v '^ # Date:|^ # Version' cloudformation.yaml.tmp > $tmpfile1 +egrep -v '^# Date:|^# Version:' $file > $tmpfile2 + +if diff -w $tmpfile1 $tmpfile2 > /dev/null; then + echo "No changes to the monitor code." + rm -f cloudformation.yaml.tmp + rm -f $tmpfile1 $tmpfile2 + exit 0 +fi +# +# Get the number of commits in the git history for the file to calculate the minor version number. +minorVersionNum="$(git log "$file" | egrep '^commit' | wc -l)" +if [ -z "$minorVersionNum" ]; then + echo "Failed to calculate version number." 1>&2 + exit 1 +fi + +version="v${majorVersionNum}.${minorVersionNum}" +# +# Strip out the monitoring code. +sed -e '/ZipFile/,$d' cloudformation.yaml > cloudformation.yaml.tmp +echo " ZipFile: |" >> cloudformation.yaml.tmp +# +# Add the monitoring code to the CF template while updating the version and date. +cat "$file" | sed -e 's/^/ /' -e "s/%%VERSION%%/${version}/" -e "s/%%DATE%%/$(date +%Y-%m-%d-%H:%M:%S)/" >> cloudformation.yaml.tmp + +echo "Updating cloudformation.yaml" +mv cloudformation.yaml.tmp cloudformation.yaml diff --git a/Monitoring/monitor-ontap-services/updateMonOntapServiceCFTemplate b/Monitoring/monitor-ontap-services/updateMonOntapServiceCFTemplate index 3da3bbb..219a002 100755 --- a/Monitoring/monitor-ontap-services/updateMonOntapServiceCFTemplate +++ b/Monitoring/monitor-ontap-services/updateMonOntapServiceCFTemplate @@ -2,7 +2,7 @@ # # This script is used to update the CloudFormation template with the latest # version of the Lambda function. It will also update the version number in -# the template as well as create a git tag for the version. +# the template. ################################################################################# majorVersionNum=2 From aff0fcf146d8ebb5356dcb4f6fc7d7138dd931ce Mon Sep 17 00:00:00 2001 From: Keith Cantrell Date: Mon, 30 Sep 2024 15:36:30 -0500 Subject: [PATCH 2/6] Added a CloudFormation template to auto-add-cw-alarms. --- Monitoring/auto-add-cw-alarms/README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/Monitoring/auto-add-cw-alarms/README.md b/Monitoring/auto-add-cw-alarms/README.md index ec3e7f9..8e9de95 100644 --- a/Monitoring/auto-add-cw-alarms/README.md +++ b/Monitoring/auto-add-cw-alarms/README.md @@ -17,7 +17,7 @@ of relying on those events, this script will scan all the file systems and volum There are two ways you can invoke this script (Python program). Either from a computer that has Python installed, or you could install it as a Lambda function. If you want to run it as a Lambda function, a CloudFormation template is included in the repo that will: - Create a role that will allow the Lambda function to: - - List AWS regions. So it can scan all regions for FSx for ONTAP file systems and volumes. + - List AWS regions. This is so it can scan all regions for FSx for ONTAP file systems and volumes. - List the FSx for ONTAP file systems. - List the FSx volume. - List the CloudWatch alarms. @@ -92,7 +92,8 @@ messages showing what it would have done, and not really create or delete any Cl ### Running as a Lambda function A CloudFormation template is included in the repo that will do the steps below. Otherwise, here are the steps required to install the program as a Lambda function. -Create a Lambda function and upload the program as the function code. Set the set the timeout to at least five minutes since some of the API calls + +Create a Lambda function and upload the program as the function code. 
Set the timeout to at least five minutes since some of the API calls can take a significant amount of "clock time" to run, especially in distant regions. Once you have installed the Lambda function it is recommended to set up a scheduled type EventBridge rule so the function will run on a regular basis. @@ -156,7 +157,7 @@ The following permissions are required to run the script (although you could nar Once the script has been configured and invoked, it will: * Scan for every FSx for ONTAP file systems in every region. For every file system that it finds it will: * Create a CPU utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. - * Create a SSD utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. + * Create an SSD utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. * Scan for every FSx for ONTAP volume in every region. For every volume it finds it will: * Create a Volume Utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. * Scan for the CloudWatch alarms and remove any alarms that the associated resource doesn't exist anymore. From e46565e2b8b239224e7966589ce2348910db2348 Mon Sep 17 00:00:00 2001 From: Keith Cantrell Date: Tue, 1 Oct 2024 15:00:23 -0500 Subject: [PATCH 3/6] Allow users to set the alarm prefix string from the CloudFormation template. --- .../auto-add-cw-alarms/auto_add_cw_alarms.py | 10 ++++---- .../auto-add-cw-alarms/cloudformation.yaml | 23 +++++++++++++------ 2 files changed, 22 insertions(+), 11 deletions(-) diff --git a/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py b/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py index 7accc7c..077de19 100755 --- a/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py +++ b/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py @@ -72,9 +72,12 @@ # # The following is put in front of all alarms so an IAM policy can be create # that will allow this script to only be able to delete the alarms it creates. -# If you change this, you must also change the IAM policy. Note that the -# Cloudfomration template also assume the value of this variable. -basePrefix="FSx-ONTAP-Auto" +# If you change this, you must also change the IAM policy. It can be +# set via an environment variable, this is so that the CloudFormation template +# can pass the value to the Lambda function. To change the value, change +# the "FSx-ONTAP-Auto" string to your desired value. +import os +basePrefix = os.environ.get('basePrefix', "FSx-ONTAP-Auto") # # Define the prefix for the volume utilization alarm name for the CloudWatch alarms. alarmPrefixVolume=f"{basePrefix}-Volume_Utilization_for_volume_" @@ -92,7 +95,6 @@ import botocore from botocore.config import Config import boto3 -import os import getopt import sys import time diff --git a/Monitoring/auto-add-cw-alarms/cloudformation.yaml b/Monitoring/auto-add-cw-alarms/cloudformation.yaml index 1d9e307..64a1ff2 100644 --- a/Monitoring/auto-add-cw-alarms/cloudformation.yaml +++ b/Monitoring/auto-add-cw-alarms/cloudformation.yaml @@ -14,6 +14,7 @@ Metadata: - defaultSSDThreshold - defaultVolumeThreshold - checkInterval + - alarmPrefixString Parameters: SNStopic: @@ -56,6 +57,11 @@ Parameters: MinValue: 5 Default: 15 + alarmPrefixString: + Description: "This is the string that will be prepended to all CloudWatch alarms created by this script." 
+ Type: String + Default: "FSx-ONTAP-Auto" + Resources: LambdaRole: Type: "AWS::IAM::Role" @@ -91,7 +97,7 @@ Resources: - Effect: "Allow" Action: - "cloudwatch:DeleteAlarms" - Resource: !Sub "arn:aws:cloudwatch:*:${AWS::AccountId}:alarm:FSx-ONTAP-Auto*" + Resource: !Sub "arn:aws:cloudwatch:*:${AWS::AccountId}:alarm:${alarmPrefixString}*" SchedulerRole: Type: "AWS::IAM::Role" @@ -144,6 +150,7 @@ Resources: defaultCPUThreshold: !Ref defaultCPUThreshold defaultSSDThreshold: !Ref defaultSSDThreshold defaultVolumeThreshold: !Ref defaultVolumeThreshold + basePrefix: !Ref alarmPrefixString Code: ZipFile: | @@ -176,8 +183,8 @@ Resources: # Lastly, you can create an override for the SSD alarm, by creating a tag # with the name "SSD_Alarm_Threshold" on the file system resource. # - # Version: v2.11 - # Date: 2024-09-27-16:03:19 + # Version: v2.12 + # Date: 2024-10-01-14:58:15 # ################################################################################ # @@ -221,9 +228,12 @@ Resources: # # The following is put in front of all alarms so an IAM policy can be create # that will allow this script to only be able to delete the alarms it creates. - # If you change this, you must also change the IAM policy. Note that the - # Cloudfomration template also assume the value of this variable. - basePrefix="FSx-ONTAP-Auto" + # If you change this, you must also change the IAM policy. It can be + # set via an environment variable, this is so that the CloudFormation template + # can pass the value to the Lambda function. To change the value, change + # the "FSx-ONTAP-Auto" string to your desired value. + import os + basePrefix = os.environ.get('basePrefix', "FSx-ONTAP-Auto") # # Define the prefix for the volume utilization alarm name for the CloudWatch alarms. alarmPrefixVolume=f"{basePrefix}-Volume_Utilization_for_volume_" @@ -241,7 +251,6 @@ Resources: import botocore from botocore.config import Config import boto3 - import os import getopt import sys import time From a82d59cbcc0fba741502336862b8f178be99c94f Mon Sep 17 00:00:00 2001 From: Keith Cantrell Date: Thu, 3 Oct 2024 12:31:18 -0500 Subject: [PATCH 4/6] Added the ability to limit the scanning to specified regions. --- Monitoring/auto-add-cw-alarms/README.md | 125 ++++++++++++++---- .../auto-add-cw-alarms/auto_add_cw_alarms.py | 3 + .../auto-add-cw-alarms/cloudformation.yaml | 14 +- 3 files changed, 111 insertions(+), 31 deletions(-) diff --git a/Monitoring/auto-add-cw-alarms/README.md b/Monitoring/auto-add-cw-alarms/README.md index 8e9de95..321130f 100644 --- a/Monitoring/auto-add-cw-alarms/README.md +++ b/Monitoring/auto-add-cw-alarms/README.md @@ -8,14 +8,16 @@ delete alarms. This can be tedious, and error prone. This script will automate t AWS CloudWatch alarms that monitor the utilization of the file system and its volumes. It will also create alarms to monitor the CPU utilization of the file system. And if a volume or file system is removed, it will remove the associated alarms. -To implement this, you might think to just create EventTail filters to trigger on the creation or deletion of an FSx Volume. +To implement this, you might think to just create EventBridge rules to trigger on the creation or deletion of an FSx Volume. This would kind of work, but since you have command line access to the FSx for ONTAP file system, you can create and delete volumes without generating any CloudTrail events. So, this method would not be reliable. 
Therefore, instead
 of relying on those events, this script will scan all the file systems and volumes in all the regions then create and delete alarms as needed.
 
 ## Invocation
-There are two ways you can invoke this script (Python program). Either from a computer that has Python installed, or you could install it
-as a Lambda function. If you want to run it as a Lambda function, a CloudFormation template is included in the repo that will:
+The preferred way to run this script is as a Lambda function. That is because it is very inexpensive to run without having
+to maintain compute resources. You can use an `EventBridge Schedule` to run it on a regular basis to
+ensure that all the CloudWatch alarms are kept up to date. Since there are several steps involved in setting up a Lambda function,
+a CloudFormation template is included in the repo, named `cloudformation.yaml`, that will do the following steps for you:
 - Create a role that will allow the Lambda function to:
   - List AWS regions. This is so it can scan all regions for FSx for ONTAP file systems and volumes.
   - List the FSx for ONTAP file systems.
   - List the FSx volume.
   - List the CloudWatch alarms.
   - List tags for the resources. This is so you can customize the thresholds for the alarms.
   - Create CloudWatch alarms.
   - Delete CloudWatch alarms that it has created (based on alarm names).
 - Create a Lambda function with the Python program.
 - Create a EventBridge schedule that will run the Lambda function on a user defined basis.
 - Create a role that will allow the EventBridge schedule to trigger the Lambda function.
 
+To use the CloudFormation template, perform the following steps:
+
+1. Download the `cloudformation.yaml` file from this repo.
+2. Go to the `CloudFormation` services page in the AWS console and select `Create Stack -> With new resources (standard)`.
+3. Select `Choose an existing template` and `Upload a template file`.
+4. Click `Choose file` and select the `cloudformation.yaml` file you downloaded in step 1.
+5. Click `Next` and fill in the parameters presented on the next page. The parameters are:
+    - `Stack name` - The name of the CloudFormation stack. Note that this name is also used as a base name for some of the resources that are created, to make them unique, so you must keep this string under 25 characters so the resource names don't exceed their name length limit.
+    - `SNStopic` - The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created. Neither this CloudFormation template nor the Lambda function will create these SNS topics for you.
+    - `accountId` - The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic set above.
+    - `customerId` - This is optional. If provided, the string entered is included in the description of every alarm created.
+    - `defaultCPUThreshold` - This will define a default CPU utilization threshold. You can override the default by having a specific tag associated with the file system (see below).
+    - `defaultSSDThreshold` - This will define a default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system (see below).
+    - `defaultVolumeThreshold` - This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume (see below).
+    - `checkInterval` - This is the interval, in minutes, at which the program will run.
+    - `alarmPrefixString` - This defines the string that will be prepended to every CloudWatch alarm name that the program creates. 
Having a known prefix is how it knows it is the one maintaining the alarm. + - `regions` - This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan on all regions that support an FSx for ONTAP file system. Note that no checking is preformed to ensure that the regions you provide are valid. +6. Click `Next`. There aren't any recommended changes to make to any of the proceeding pages, so just click `Next` again. +7. On the final page, check the box that says `I acknowledge that AWS CloudFormation might create IAM resources with custom names.` and click `Submit`. + +If you prefer, you can run this Python program on any UNIX based computer that has Python installed. See the "Running on a computer" section below for more information. + ### Configuring the program -Before you can run the program you will need to configure it. You can configure it a few ways: +If you use the CloudFormation template to deploy the program, it will create the appropriate environment variables for you. +However, if you didn't use the CloudFormation template, you will need to configure the program yourself. Here are the +various ways you can do so: * By editing the top part of the program itself where there are the following variable definitions. * By setting environment variables with the same names as the variables in the program. * If running it as a standalone program, via some command line options. @@ -40,20 +66,20 @@ Here is the list of variables, and what they define: | Variable | Description |Command Line Option| |:---------|:------------|:--------------------------------| -|SNStopic | The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created.|-s SNS_Topic_Name| -|accountId | The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic.|-a Account_number| -|customerId| This is really just a comment that will be added to the alarm description.|-c Customer_String| +|SNStopic | The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created.|-s SNS\_Topic\_Name| +|accountId | The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic.|-a Account\_number| +|customerId| This is really just a comment that will be added to the alarm description.|-c Customer\_String| |defaultCPUThreshold | This will define the default CPU utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information.|-C number| |defaultSSDThreshold | This will define the default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information.|-S number| |defaultVolumeThreshold | This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume. See below for more information.|-V number| -|alarmPrefixCPU | This defines the string that will be put in front of the name of every CPU utilization CloudWatch alarm that the program creates. 
Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| -|alarmPrefixSSD | This defines the string that will be put in front of the name of every SSD utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| +|alarmPrefixCPU | This defines the string that will be put in front of the name of every CPU utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| +|alarmPrefixSSD | This defines the string that will be put in front of the name of every SSD utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| |alarmPrefixVolume | This defines the string that will be put in front of the name of every volume utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| +|regions | This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan on all regions that support an FSx for ONTAP file system. Note that no checking is preformed to ensure that the regions you provide are valid.|-r region -r region ...| There are a few command line options that don't have a corresponding variables: |Option|Description| |:-----|:----------| -|-r region| This option can be specified multiple times to limit the regions that the program will act on. If not specified, the program will act on all regions.| |-d| This option will cause the program to run in "Dry Run" mode. In this mode, the program will only display messages showing what it would have done, and not really create or delete any CloudWatch alarms.| |-F filesystem\_ID| This option will cause the program to only add or remove alarms that are associated with the filesystem\_ID.| @@ -81,8 +107,9 @@ Once you have Python and boto3 installed, you can run the program by executing t python3 auto_add_cw_alarms.py ``` This will run the program based on all the variables set at the top. If you want to change the behavior without -having to edit the program, you can use the Command Line Option specified in the table above. Note that you can give a `-h` (or `--help`) -option and the program will display a list of all the available options. +having to edit the program, you can either use the Command Line Option specified in the table above or you can +set the appropriate environment variable. Note that you can give a `-h` (or `--help`) command line option +and the program will display a list of all the available options. You can limit the regions that the program will act on by using the `-r region` option. You can specify that option multiple times to act on multiple regions. @@ -91,23 +118,39 @@ You can run the program in "Dry Run" mode by specifying the `-d` (or `--dryRun`) messages showing what it would have done, and not really create or delete any CloudWatch alarms. ### Running as a Lambda function -A CloudFormation template is included in the repo that will do the steps below. Otherwise, here are the steps required to install the program as a Lambda function. - -Create a Lambda function and upload the program as the function code. Set the timeout to at least five minutes since some of the API calls -can take a significant amount of "clock time" to run, especially in distant regions. 
- -Once you have installed the Lambda function it is recommended to set up a scheduled type EventBridge rule so the function will run on a regular basis. - -The appropriate permissions will need to be assigned to the Lambda function in order for it to run correctly. -It doesn't need many permissions. It just needs to be able to: +A CloudFormation template is included in the repo that will do the steps below. If you don't want to use that, here are +the detailed steps required to install the program as a Lambda function. + +#### Create a Lambda function +1. Download the `auto_add_cw_alarms.py` file from this repo. +2. Create a new Lambda function in the AWS console by going to the Lambda services page and clicking on the `Create function` button. +3. Choose `Author from scratch` and give the function a name. For example `auto_add_cw_alarms`. +4. Choose the latest version of Python (currently Python 3.11) as the runtime and click on `Create function`. +5. In the function code section, copy and paste the contents of the `auto_add_cw_alarms.py` file into the code editor. +6. Click on the `Deploy` button to save the function. +7. Click on the Configuration tag and then the "General configuration" sub tab and set the "Timeout" to be at least 3 minutes. +8. Click on the "Environment variables" tab and add the following environment variables: + - `SNStopic` - The SNS Topic name where CloudWatch will send alerts to. + - `accountId` - The AWS account ID associated with the SNStopic. + - `customerId` - This is optional. If provided the string entered is included in the description of every alarm created. + - `defaultCPUThreshold` - This will define a default CPU utilization threshold. + - `defaultSSDThreshold` - This will define a default SSD (aggregate) utilization threshold. + - `defaultVolumeThreshold` - This will define the default Volume utilization threshold. + - `alarmPrefixString` - This defines the string that will be prepended to every CloudWatch alarm name that the program creates. + - `regions` - This is an optional comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan on all regions that support an FSx for ONTAP file system. + +You will also need to set up the appropriate permissions for the Lambda function to run. It doesn't need many permissions. It just needs to be able to: * List the FSx for ONTAP file systems. * List the FSx volume names. +* List tags associated with an FSx file system or volume. * List the CloudWatch alarms. +* List all the AWS regions. * Create CloudWatch alarms. -* Delete CloudWatch alarms. You can set resource to "arn:aws:cloudwatch:*:${AWS::AccountId}:alarm:FSx-ONTAP-Auto*" to limit the deletion to only the alarms that it created. +* Delete CloudWatch alarms. You can set resource to `arn:aws:cloudwatch:*:`*AccountId*`:alarm:`*alarmPrefixString*`*` to limit the deletion to only the alarms that it creates. * Create CloudWatch Log Groups and Log Streams in case you need to diagnose an issue. -The following permissions are required to run the script (although you could narrow the "Resource" specification to suit your needs.) +The following is an example AWS policy that has all the required permissions to run the script (although you could narrow the "Resource" specification to suit your needs.) +Note it assumes that the alarmPrefixString is set to "FSx-ONTAP-Auto". 
```JSON { "Version": "2012-10-17", @@ -116,13 +159,13 @@ The following permissions are required to run the script (although you could nar "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ - "cloudwatch:PutMetricAlarm", - "fsx:ListTagsForResource", - "fsx:DescribeVolumes", "fsx:DescribeFilesystems", + "fsx:DescribeVolumes", + "fsx:ListTagsForResource", + "cloudwatch:DescribeAlarms" "cloudwatch:DescribeAlarmsForMetric", "ec2:DescribeRegions", - "cloudwatch:DescribeAlarms" + "cloudwatch:PutMetricAlarm", ], "Resource": "*" }, @@ -153,15 +196,39 @@ The following permissions are required to run the script (although you could nar } ``` +Once you have deployed the Lambda function it is recommended to set up a scheduled to run it on a regular basis. +The easiest way to do that is: +1. Click on the `Add trigger` button from the Lambda function page. +2. Select `EventBridge (CloudWatch Events)` as the trigger type. +3. Click on the `Create a new rule` button. +4. Give the rule a name and a description. +5. Set the `Schedule expression` to be the interval you want the function to run. For example, if you want it to run every 15 minutes, you would set the expression to `rate(15 minutes)`. +6. Click on the `Add` button + ### Expected Action Once the script has been configured and invoked, it will: -* Scan for every FSx for ONTAP file systems in every region. For every file system that it finds it will: +* Scan for every FSx for ONTAP file systems in every region, unless you have specified a specific list of regions to scan. For every file system that it finds it will: * Create a CPU utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. * Create an SSD utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. -* Scan for every FSx for ONTAP volume in every region. For every volume it finds it will: +* Scan for every FSx for ONTAP volume in every region, unless you have specified a specific list of regions to scan. For every volume it finds it will: * Create a Volume Utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. * Scan for the CloudWatch alarms and remove any alarms that the associated resource doesn't exist anymore. +### Cleaning up +If you decide you don't want to use this program anymore, you can delete the CloudFormation stack that you created. +This will remove the Lambda function, the EventBridge schedule, and the roles that were created for you. If you did +not use the CloudFormation template, you will have to do these steps yourself. + +Once you have removed the program, you can remove all the CloudWatch alarms that were created by the program by running +the following command: + +```bash +region=us-west-2 +aws cloudwatch describe-alarms --region=$region --alarm-name-prefix "FSx-ONTAP-Auto" --query "MetricAlarms[*].AlarmName" --output text | xargs -n 50 aws cloudwatch delete-alarms --region $region --alarm-names +``` +This command will remove all the alarms that have an alarm name that starts with "FSx-ONTAP-Auto" in the us-west-2 region. +Make sure to adjust the alarm-name-prefix to match the AlarmPrefix you set when you deployed the program. +You will also need to adjust the region variable and run the `aws` command again for each region where you have alarms in. 
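+
+If you have alarms in many regions, a small wrapper loop can save some typing. The following is only a sketch — it assumes the default "FSx-ONTAP-Auto" prefix and that your credentials are allowed to call `ec2:DescribeRegions` — so adjust it to match your deployment:
+```bash
+#!/bin/bash
+# Delete every CloudWatch alarm created by this program, in every enabled region.
+# Change the prefix if you deployed with a different alarmPrefixString.
+prefix="FSx-ONTAP-Auto"
+for region in $(aws ec2 describe-regions --query "Regions[].RegionName" --output text); do
+    alarms=$(aws cloudwatch describe-alarms --region $region --alarm-name-prefix "$prefix" --query "MetricAlarms[*].AlarmName" --output text)
+    if [ -n "$alarms" ]; then
+        echo $alarms | xargs -n 50 aws cloudwatch delete-alarms --region $region --alarm-names
+    fi
+done
+```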
## Author Information diff --git a/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py b/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py index 077de19..fb9697f 100755 --- a/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py +++ b/Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py @@ -563,6 +563,9 @@ def usage(): defaultCPUThreshold = int(os.environ.get('defaultCPUThreshold', defaultCPUThreshold)) defaultSSDThreshold = int(os.environ.get('defaultSSDThreshold', defaultSSDThreshold)) defaultVolumeThreshold = int(os.environ.get('defaultVolumeThreshold', defaultVolumeThreshold)) +regionsEnv = os.environ.get('regions', '') +if regionsEnv != '': + regions = regionsEnv.split(',') # # Check to see if we are bring run from a command line or a Lmabda function. if os.environ.get('AWS_LAMBDA_FUNCTION_NAME') == None: diff --git a/Monitoring/auto-add-cw-alarms/cloudformation.yaml b/Monitoring/auto-add-cw-alarms/cloudformation.yaml index 64a1ff2..686ce57 100644 --- a/Monitoring/auto-add-cw-alarms/cloudformation.yaml +++ b/Monitoring/auto-add-cw-alarms/cloudformation.yaml @@ -15,6 +15,7 @@ Metadata: - defaultVolumeThreshold - checkInterval - alarmPrefixString + - regions Parameters: SNStopic: @@ -62,6 +63,11 @@ Parameters: Type: String Default: "FSx-ONTAP-Auto" + regions: + Description: "This is a list of AWS regions that you want the Lambda function to run in. If left blank, it will run in all regions." + Type: CommaDelimitedList + Default: "" + Resources: LambdaRole: Type: "AWS::IAM::Role" @@ -151,6 +157,7 @@ Resources: defaultSSDThreshold: !Ref defaultSSDThreshold defaultVolumeThreshold: !Ref defaultVolumeThreshold basePrefix: !Ref alarmPrefixString + regions: !Join [",", !Ref regions] Code: ZipFile: | @@ -183,8 +190,8 @@ Resources: # Lastly, you can create an override for the SSD alarm, by creating a tag # with the name "SSD_Alarm_Threshold" on the file system resource. # - # Version: v2.12 - # Date: 2024-10-01-14:58:15 + # Version: v2.13 + # Date: 2024-10-03-12:30:21 # ################################################################################ # @@ -719,6 +726,9 @@ Resources: defaultCPUThreshold = int(os.environ.get('defaultCPUThreshold', defaultCPUThreshold)) defaultSSDThreshold = int(os.environ.get('defaultSSDThreshold', defaultSSDThreshold)) defaultVolumeThreshold = int(os.environ.get('defaultVolumeThreshold', defaultVolumeThreshold)) + regionsEnv = os.environ.get('regions', '') + if regionsEnv != '': + regions = regionsEnv.split(',') # # Check to see if we are bring run from a command line or a Lmabda function. if os.environ.get('AWS_LAMBDA_FUNCTION_NAME') == None: From 977c9e8edc8a18629a80cf3f5ec7a1a7f2daf487 Mon Sep 17 00:00:00 2001 From: Keith Cantrell Date: Thu, 3 Oct 2024 15:10:22 -0500 Subject: [PATCH 5/6] Made minor adjustments to the descriptions. --- Monitoring/auto-add-cw-alarms/README.md | 26 ++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/Monitoring/auto-add-cw-alarms/README.md b/Monitoring/auto-add-cw-alarms/README.md index 321130f..0b2c30e 100644 --- a/Monitoring/auto-add-cw-alarms/README.md +++ b/Monitoring/auto-add-cw-alarms/README.md @@ -8,9 +8,9 @@ delete alarms. This can be tedious, and error prone. This script will automate t AWS CloudWatch alarms that monitor the utilization of the file system and its volumes. It will also create alarms to monitor the CPU utilization of the file system. And if a volume or file system is removed, it will remove the associated alarms. 
-To implement this, you might think to just create EventBridge rules to trigger on the creation or deletion of an FSx Volume. +To implement this, you might think to just create EventBridge rules that trigger on the creation or deletion of an FSx Volume. This would kind of work, but since you have command line access to the FSx for ONTAP file system, you can create -and delete volumes without generating any CloudTrail events. So, this method would not be reliable. Therefore, instead +and delete volumes without generating any CloudTrail events. So, depending on CloudTrail events would not be reliable. Therefore, instead of relying on those events, this script will scan all the file systems and volumes in all the regions then create and delete alarms as needed. ## Invocation @@ -38,15 +38,15 @@ To use the CloudFormation template perform the following steps: 4. Click `Choose file` and select the `cloudformation.yaml` file you downloaded in step 1. 5. Click `Next` and fill in the parameters presented on the next page. The parameters are: - `Stack name` - The name of the CloudFormation stack. Note this name is also used as a base name for some of the resources that are created, to make them unique, so you must keep this string under 25 characters so the resource names don't exceed their name length limit. - - `SNStopic` - The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created. This CloudFormation template, nor the Lambda function, will not create these SNS topics for you. - - `accountId` - The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic set above. + - `SNStopic` - The SNS Topic name where CloudWatch will send alerts to. Note that since CloudWatch can't send messages to an SNS topic residing in a different region, it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created. + - `accountId` - The AWS account ID associated with the SNS topic. This is only used to compute the ARN to the SNS Topic set above. - `customerId` - This is optional. If provided the string entered is included in the description of every alarm created. - - `defaultCPUThreshold` - This will define a default CPU utilization threshold. You can override the default by having a specific tag associated with the file system (see below). - - `defaultSSDThreshold` - This will define a default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system (see below). - - `defaultVolumeThreshold` - This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume (see below). + - `defaultCPUThreshold` - This will define a default CPU utilization threshold. You can override the default by having a specific tag associated with the file system (see below for more information). + - `defaultSSDThreshold` - This will define a default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system (see below for more information). + - `defaultVolumeThreshold` - This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume (see below for more information). - `checkInterval` - This is the interval in minutes that the program will run. 
- `alarmPrefixString` - This defines the string that will be prepended to every CloudWatch alarm name that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm. - - `regions` - This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan on all regions that support an FSx for ONTAP file system. Note that no checking is preformed to ensure that the regions you provide are valid. + - `regions` - This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan on all regions that support an FSx for ONTAP file system. Note that no checking is performed to ensure that the regions you provide are valid. 6. Click `Next`. There aren't any recommended changes to make to any of the proceeding pages, so just click `Next` again. 7. On the final page, check the box that says `I acknowledge that AWS CloudFormation might create IAM resources with custom names.` and click `Submit`. @@ -67,17 +67,17 @@ Here is the list of variables, and what they define: | Variable | Description |Command Line Option| |:---------|:------------|:--------------------------------| |SNStopic | The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created.|-s SNS\_Topic\_Name| -|accountId | The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic.|-a Account\_number| -|customerId| This is really just a comment that will be added to the alarm description.|-c Customer\_String| +|accountId | The AWS account ID associated with the SNS topic. This is only used to compute the ARN to the SNS Topic.|-a Account\_number| +|customerId| This is just an optional string that will be added to the alarm description.|-c Customer\_String| |defaultCPUThreshold | This will define the default CPU utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information.|-C number| |defaultSSDThreshold | This will define the default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information.|-S number| |defaultVolumeThreshold | This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume. See below for more information.|-V number| |alarmPrefixCPU | This defines the string that will be put in front of the name of every CPU utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| |alarmPrefixSSD | This defines the string that will be put in front of the name of every SSD utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| |alarmPrefixVolume | This defines the string that will be put in front of the name of every volume utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| -|regions | This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan on all regions that support an FSx for ONTAP file system. 
Note that no checking is preformed to ensure that the regions you provide are valid.|-r region -r region ...| +|regions | This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan on all regions that support an FSx for ONTAP file system. Note that no checking is performed to ensure that the regions you provide are valid.|-r region -r region ...| -There are a few command line options that don't have a corresponding variables: +There are a few command line options that don't have a corresponding variable: |Option|Description| |:-----|:----------| |-d| This option will cause the program to run in "Dry Run" mode. In this mode, the program will only display messages showing what it would have done, and not really create or delete any CloudWatch alarms.| @@ -131,7 +131,7 @@ the detailed steps required to install the program as a Lambda function. 7. Click on the Configuration tag and then the "General configuration" sub tab and set the "Timeout" to be at least 3 minutes. 8. Click on the "Environment variables" tab and add the following environment variables: - `SNStopic` - The SNS Topic name where CloudWatch will send alerts to. - - `accountId` - The AWS account ID associated with the SNStopic. + - `accountId` - The AWS account ID associated with the SNS topic. - `customerId` - This is optional. If provided the string entered is included in the description of every alarm created. - `defaultCPUThreshold` - This will define a default CPU utilization threshold. - `defaultSSDThreshold` - This will define a default SSD (aggregate) utilization threshold. From 39e4af0b45485f2742915b2a5fdb8653a8a4c7de Mon Sep 17 00:00:00 2001 From: Keith Cantrell Date: Thu, 10 Oct 2024 14:21:08 -0500 Subject: [PATCH 6/6] Made minor adjustments to the README.md file. --- Monitoring/auto-add-cw-alarms/README.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/Monitoring/auto-add-cw-alarms/README.md b/Monitoring/auto-add-cw-alarms/README.md index 0b2c30e..fd22a2c 100644 --- a/Monitoring/auto-add-cw-alarms/README.md +++ b/Monitoring/auto-add-cw-alarms/README.md @@ -15,19 +15,19 @@ of relying on those events, this script will scan all the file systems and volum ## Invocation The preferred way to run this script is as a Lambda function. That is because it is very inexpensive to run without having -to maintain compute resources. You can use an `EventBridge Schedule` to run it on a regular basis to -ensure that all the CloudWatch alarms are kept up to date. Since there are several steps involved in setting up a Lambda function -a CloudFormation script is included in the repo, named `cloudlformation.yaml`, that will do the following steps for you: +to maintain any compute resources. You can use an `EventBridge Schedule` to run it on a regular basis to +ensure that all the CloudWatch alarms are kept up to date. Since there are several steps involved in setting up a Lambda function, +a CloudFormation template is included in the repo, named `cloudlformation.yaml`, that will do the following steps for you: - Create a role that will allow the Lambda function to: - - List AWS regions. This is so it can scan all regions for FSx for ONTAP file systems and volumes. + - List AWS regions. This is so it can get a list of all the regions, so it can know which regions to scan for FSx for ONTAP file systems and volumes. - List the FSx for ONTAP file systems. - - List the FSx volume. + - List the FSx volumes. 
- List the CloudWatch alarms. - - List tags for the resources. This is so you can customize the thresholds for the alarms. + - List tags for the resources. This is so you can customize the thresholds for the alarms on a per instance basis. More on that below. - Create CloudWatch alarms. - Delete CloudWatch alarms that it has created (based on alarm names). - Create a Lambda function with the Python program. -- Create a EventBridge schedule that will run the Lambda function on a user defined basis. +- Create an EventBridge schedule that will run the Lambda function on a user defined basis. - Create a role that will allow the EventBridge schedule to trigger the Lambda function. To use the CloudFormation template perform the following steps: @@ -37,7 +37,7 @@ To use the CloudFormation template perform the following steps: 3. Select `Choose an existing template` and `Upload a template file`. 4. Click `Choose file` and select the `cloudformation.yaml` file you downloaded in step 1. 5. Click `Next` and fill in the parameters presented on the next page. The parameters are: - - `Stack name` - The name of the CloudFormation stack. Note this name is also used as a base name for some of the resources that are created, to make them unique, so you must keep this string under 25 characters so the resource names don't exceed their name length limit. + - `Stack name` - The name of the CloudFormation stack. Note this name is also used as a base name for some of the resources that are created, to make them unique, so you must keep this string under 25 characters, so the resource names don't exceed their name length limit. - `SNStopic` - The SNS Topic name where CloudWatch will send alerts to. Note that since CloudWatch can't send messages to an SNS topic residing in a different region, it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created. - `accountId` - The AWS account ID associated with the SNS topic. This is only used to compute the ARN to the SNS Topic set above. - `customerId` - This is optional. If provided the string entered is included in the description of every alarm created. @@ -48,7 +48,7 @@ To use the CloudFormation template perform the following steps: - `alarmPrefixString` - This defines the string that will be prepended to every CloudWatch alarm name that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm. - `regions` - This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan on all regions that support an FSx for ONTAP file system. Note that no checking is performed to ensure that the regions you provide are valid. 6. Click `Next`. There aren't any recommended changes to make to any of the proceeding pages, so just click `Next` again. -7. On the final page, check the box that says `I acknowledge that AWS CloudFormation might create IAM resources with custom names.` and click `Submit`. +7. On the final page, check the box that says `I acknowledge that AWS CloudFormation might create IAM resources with custom names.` and then click `Submit`. If you prefer, you can run this Python program on any UNIX based computer that has Python installed. See the "Running on a computer" section below for more information. @@ -196,7 +196,7 @@ Note it assumes that the alarmPrefixString is set to "FSx-ONTAP-Auto". 
} ``` -Once you have deployed the Lambda function it is recommended to set up a scheduled to run it on a regular basis. +Once you have deployed the Lambda function it is recommended to set up a schedule to run it on a regular basis. The easiest way to do that is: 1. Click on the `Add trigger` button from the Lambda function page. 2. Select `EventBridge (CloudWatch Events)` as the trigger type.
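
If you prefer to script the schedule instead of clicking through the console, the following is a rough sketch of equivalent AWS CLI calls using an EventBridge (CloudWatch Events) rule. The function name `auto_add_cw_alarms`, the rule name, and the 15 minute rate are only example values; substitute the names you used and the check interval you want:
```bash
#!/bin/bash
# Example only: create a rule that fires every 15 minutes and point it at the
# Lambda function. Both names below are assumptions; use your own.
FUNCTION_NAME="auto_add_cw_alarms"
RULE_NAME="auto-add-cw-alarms-schedule"

# Create (or update) the scheduled EventBridge rule and capture its ARN.
RULE_ARN=$(aws events put-rule --name "$RULE_NAME" \
    --schedule-expression "rate(15 minutes)" --query 'RuleArn' --output text)

# Allow EventBridge to invoke the Lambda function.
aws lambda add-permission --function-name "$FUNCTION_NAME" --statement-id "$RULE_NAME" \
    --action lambda:InvokeFunction --principal events.amazonaws.com --source-arn "$RULE_ARN"

# Attach the Lambda function as the rule's target.
FUNCTION_ARN=$(aws lambda get-function --function-name "$FUNCTION_NAME" \
    --query 'Configuration.FunctionArn' --output text)
aws events put-targets --rule "$RULE_NAME" --targets "Id"="1","Arn"="$FUNCTION_ARN"
```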