Build an enterprise-grade MLOps platform on AWS using GitHub and Terraform

Introduction

As enterprise businesses embrace Machine Learning (ML) across their organisations, manual workflows for building, training, and deploying ML models tend to become bottlenecks to innovation. To overcome this, enterprises need to shape a clear operating model defining how multiple personas, such as Data Scientists, Data Engineers, ML Engineers, IT, and Business stakeholders, should collaborate and interact, how to separate the concerns, responsibilities and skills, and how to leverage AWS services optimally. This combination of ML and Operations, so-called MLOps, is helping companies streamline their end-to-end ML lifecycle and boost productivity of data scientists while maintaining high model accuracy and enhancing security and compliance.

High level architecture

In this repository, we show how to use Terraform with GitHub and GitHub Actions to build a baseline infrastructure for secure MLOps. The solution can be broken down into three parts:

Base Infrastructure

The necessary infrastructure components for your accounts including SageMaker Studio, Networking, Permissions and SSM Parameters.

Shared Template Repositories

GitHub template repositories that are cloned when a custom SageMaker Project is deployed by a Data Scientist or ML Engineer.

User Experience

This is how the end-users (Data Scientists or ML Engineers) use SageMaker projects.

Typically, when a SageMaker project is deployed:

  • Private GitHub repos are created from templates, which Data Scientists then customize for their use case.
  • These templates show best practices such as testing, approvals, and dashboards. They can be fully customized once deployed.
  • Depending on the chosen SageMaker project, other project-specific resources might also be created, such as a dedicated S3 bucket for the project and automation to trigger ML deployment from the model registry.

The architecture of the model building, training, and deployment project is shown below.

[Figure: architecture of the model building, training, and deployment project]

Currently, four example project templates are available.

  1. MLOps Template for Model Building, Training, and Deployment: MLOps pattern to train models using SageMaker Pipelines and to deploy the trained model into preproduction and production accounts. This template supports real-time inference, batch inference pipelines, and Bring-Your-Own-Container (BYOC).

  2. MLOps Template for promoting the full ML pipeline across environments: MLOps pattern that shows how to take the same SageMaker pipeline across environments from dev to prod.

  3. MLOps Template for Model Building and Training: MLOps pattern that shows a simple one-account SageMaker Pipeline setup.

  4. MLOps Template for LLM Model Building, Training and Evaluation: MLOps pattern that shows a simple one-account SageMaker Pipeline setup for LLM models.

Based on the selected project and its settings, SageMaker Projects creates GitHub repos from the templates. It also sets the secrets, environment variables, and deployment environments.

Prerequisites

The instructions here assume the following prerequisites. Make a note of these details for use in the following sections.

  1. AWS account(s) with sufficient permissions to deploy the base infrastructure. We recommend using at least three AWS accounts (Dev, Preprod, and Prod) for one business unit. However, you can deploy the infrastructure using one account for testing purposes.
  2. A GitHub Organization.
  3. A Personal Access Token (PAT) for the GitHub Organization. We recommend creating a service/platform account and using its PAT.

How to use:

Bootstrap your AWS Accounts

This section explains the steps required to bootstrap your accounts for GitHub and Terraform.

NOTE: To avoid manual bootstrapping, you can skip directly to the CloudFormation template section (Option 1) or the Bash script section (Option 2).

GitHub Actions using OpenID Connect

To avoid using long-term AWS Identity and Access Management (IAM) user access keys, we can configure an OpenID Connect (OIDC) identity provider (IdP) inside an AWS account, which allows GitHub Actions to assume IAM roles and use short-term credentials. Follow the detailed instructions at Use IAM roles to connect GitHub Actions to actions in AWS or use the CloudFormation template provided below.
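
As an illustration of what the OIDC setup implies for the IAM role, the following is a minimal sketch of a trust policy that lets workflows from one GitHub Organization assume a role through the identity provider. The function name, the account ID, and the organization name are placeholders introduced here, and the policy that bootstrap.yaml actually creates may differ (for example, it may restrict the `sub` claim to specific repositories or branches).

```shell
#!/usr/bin/env bash
# Sketch only: print an IAM trust policy for GitHub Actions OIDC.
# github_oidc_trust_policy is a hypothetical helper, not part of this repo.

github_oidc_trust_policy() {
  local account_id="$1" github_org="$2"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:${github_org}/*"
        }
      }
    }
  ]
}
EOF
}

# Example with placeholder values (assumptions): print the policy for review.
github_oidc_trust_policy "123456789012" "my-org"
```

The `repo:my-org/*` pattern allows every repository in the organization; narrowing it to individual repositories is a common hardening step.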

Terraform S3 and DynamoDB Backend

The Terraform S3 backend stores state files in Amazon S3 and uses a DynamoDB table for state locking and consistency checking.

Create the following resources in each AWS account or use the CloudFormation template provided below.

  1. S3 Bucket: ${Prefix}-${Environment}-${AWS::Region}-${AWS::AccountId}
  2. DynamoDB Table: ${Prefix}-${Environment}

Where:

  • Prefix: Common name for resources, e.g. mlops
  • Environment: dev, preprod, or prod
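
The naming convention above can be sketched as two small shell helpers. The function names and the example values are assumptions made for illustration; only the name patterns come from this README.

```shell
#!/usr/bin/env bash
# Sketch only: derive the Terraform backend resource names from the convention.
# state_bucket_name and lock_table_name are hypothetical helper names.

state_bucket_name() {
  # Pattern: ${Prefix}-${Environment}-${AWS::Region}-${AWS::AccountId}
  local prefix="$1" environment="$2" region="$3" account_id="$4"
  printf '%s-%s-%s-%s\n' "$prefix" "$environment" "$region" "$account_id"
}

lock_table_name() {
  # Pattern: ${Prefix}-${Environment}
  local prefix="$1" environment="$2"
  printf '%s-%s\n' "$prefix" "$environment"
}

# Example with placeholder values. In practice the account ID could come from:
#   aws sts get-caller-identity --query Account --output text
state_bucket_name "mlops" "dev" "eu-west-1" "123456789012"  # prints mlops-dev-eu-west-1-123456789012
lock_table_name "mlops" "dev"                               # prints mlops-dev
```

These names could then be passed to `terraform init` with `-backend-config="bucket=..."` and `-backend-config="dynamodb_table=..."`, assuming the backend block leaves those values unset.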

(Option 1) CloudFormation Template for bootstrapping

A bootstrap.yaml CloudFormation template has been provided. This can be deployed to each AWS account. Later, bootstrapping can be standardised and automated via CloudFormation StackSets for an AWS Organization.

You can get started with one account but we recommend creating at least 3 AWS accounts: a dev, preprod, and prod account.

Deploy the provided bootstrap.yaml CloudFormation template in your account(s) either using the AWS console or using AWS CLI as shown below, from the root of the repo.

  1. Ensure the AWS CLI is installed and credentials are loaded for the target account that you want to bootstrap.

  2. Identify the following:
     a. Environment type of the account: dev, preprod, or prod
     b. Name of your GitHub Organization
     c. (Optional) A custom prefix for the S3 bucket that stores Terraform state files
     d. (Optional) A custom name for the DynamoDB table used for state locking

  3. Run the following command, updating the details from step 2.

# Required: set these to match your account and GitHub Organization
export ENV=xxx          # dev, preprod, or prod
export GITHUB_ORG=xxx
# Optional: override the Terraform state bucket prefix and lock table name
export TerraformStateBucketPrefix=terraform-state
export TerraformStateLockTableName=terraform-state-locks

aws cloudformation create-stack \
  --stack-name YourStackName \
  --template-body file://bootstrap.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
  --parameters ParameterKey=Environment,ParameterValue="$ENV" \
               ParameterKey=GitHubOrg,ParameterValue="$GITHUB_ORG" \
               ParameterKey=OIDCProviderArn,ParameterValue="" \
               ParameterKey=TerraformStateBucketPrefix,ParameterValue="$TerraformStateBucketPrefix" \
               ParameterKey=TerraformStateLockTableName,ParameterValue="$TerraformStateLockTableName"
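
Because `create-stack` returns before the stack has finished deploying, it can help to wait for completion and then inspect the result. The following is a minimal sketch using standard AWS CLI commands; the function name is hypothetical, and whether the stack exposes any outputs depends on bootstrap.yaml.

```shell
#!/usr/bin/env bash
# Sketch only: wait for the bootstrap stack and print its outputs.
# show_bootstrap_stack is a hypothetical helper name.

show_bootstrap_stack() {
  local stack_name="$1"
  # Blocks until creation succeeds; exits non-zero if the stack fails or rolls back.
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" || return 1
  # Prints the stack outputs, if the template defines any.
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query 'Stacks[0].Outputs' \
    --output table
}

# Usage (requires credentials for the target account):
#   show_bootstrap_stack YourStackName
```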

(Option 2) Bash script for bootstrapping

A bootstrap.sh script has been provided. This can be run against each AWS account.

You can get started with one account but we recommend creating at least 3 AWS accounts: a dev, preprod, and prod account.

  1. Ensure the AWS CLI is installed and credentials are loaded for the target account that you want to bootstrap.

  2. Identify the following:
     a. Environment type of the account: dev, preprod, or prod
     b. Name of your GitHub Organization
     c. (Optional) A custom prefix for the S3 bucket that stores Terraform state files
     d. (Optional) A custom name for the DynamoDB table used for state locking

  3. Run the script (bash ./bootstrap.sh) and input the details from step 2 when prompted. You can leave most of these options at their defaults.

NOTE: If you change the TerraformStateBucketPrefix or TerraformStateLockTableName parameters, you must update the environment variables (S3_PREFIX, DYNAMODB_PREFIX) in deploy.yml to match.

This one-time deployment creates the following resources in your AWS account:

  • For Terraform Backend:
    • S3 Bucket to store state files.
    • DynamoDB table for state locking.
  • AWS identity provider for GitHub Actions using OIDC (as explained above)
  • IAM Role to assume from GitHub Actions using the identity provider.
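
To confirm that these resources exist after bootstrapping, a quick check along the following lines may be useful. This is a sketch under assumptions: the function name is hypothetical, and the bucket and table names must be supplied by you, since they depend on the parameters chosen during bootstrapping.

```shell
#!/usr/bin/env bash
# Sketch only: check that the bootstrap resources exist in the current account.
# verify_bootstrap is a hypothetical helper name.

verify_bootstrap() {
  local bucket="$1" table="$2"
  # head-bucket exits non-zero if the bucket is missing or not accessible.
  aws s3api head-bucket --bucket "$bucket" || return 1
  # describe-table exits non-zero if the table does not exist.
  aws dynamodb describe-table --table-name "$table" \
    --query 'Table.TableStatus' --output text || return 1
  # Lists OIDC provider ARNs; the GitHub one ends in token.actions.githubusercontent.com.
  aws iam list-open-id-connect-providers \
    --query 'OpenIDConnectProviderList[].Arn' --output text
}

# Usage (requires credentials for the target account):
#   verify_bootstrap <state-bucket-name> <lock-table-name>
```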

Once this is deployed, you're ready to move on to the next step.

Set up repositories in your GitHub Organization

We will move the code from this example to your GitHub Organization.

  1. base-infrastructure: An internal repository for Base Infrastructure which will contain all code from the ./sagemaker-mlops-terraform folder.
  2. template-repos: GitHub template repositories with code from ./template-repos/**. Name each repository after its folder.
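
As a rough sketch of step 2, the following derives one repository name per sub-folder of ./template-repos and prints the GitHub CLI commands that would create each repository and mark it as a template. The helper names and the `--private` visibility are assumptions; the commands are only printed, not executed, and the code still has to be pushed to each repository afterwards.

```shell
#!/usr/bin/env bash
# Sketch only: plan the template repositories from the folder names.
# template_repo_names and plan_template_repos are hypothetical helper names.

# Print one repository name per sub-folder of the given directory.
template_repo_names() {
  local dir="$1" path
  for path in "$dir"/*/; do
    [ -d "$path" ] || continue
    basename "$path"
  done
}

# Dry run: print (do not execute) the GitHub CLI commands for each repo.
plan_template_repos() {
  local org="$1" dir="$2" name
  template_repo_names "$dir" | while IFS= read -r name; do
    echo "gh repo create ${org}/${name} --private"
    echo "gh repo edit ${org}/${name} --template"
  done
}

# Usage, from the root of this repo (organization name is a placeholder):
#   plan_template_repos my-org ./template-repos
```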

Note: This is an important step to be able to deploy infrastructure. All further steps should be performed directly in your GitHub Organization.

You are now ready!

Follow the instructions in the base-infrastructure repository to deploy the MLOps infrastructure to your bootstrapped AWS accounts.

Contacts

If you have any comments or questions, please contact:

The team who created the repo:

AWS MLOps accelerators