This is a CDK Python project that deploys multiple foundation models (FMs) to the same SageMaker instance.
In this demo, we create an inference component-based endpoint and deploy a copy of the Dolly v2 7B model and a copy of the FLAN-T5 XXL model from the Hugging Face model hub on a SageMaker real-time endpoint.
An inference component (IC) abstracts your ML model and enables you to assign CPUs, GPUs, or AWS Neuron accelerators, and scaling policies per model. Inference components offer the following benefits (see the API sketch after this list):
- SageMaker will optimally place and pack models onto ML instances to maximize utilization, leading to cost savings.
- SageMaker will scale each model up and down based on your configuration to meet your ML application requirements.
- SageMaker will scale to add and remove instances dynamically to ensure capacity is available while keeping idle compute to a minimum.
- You can scale down to zero copies of a model to free up resources for other models. You can also specify that important models should be kept loaded and ready to serve traffic.
(Image Source: AWS Blog)
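For reference, here is a minimal boto3 sketch of the underlying `CreateInferenceComponent` API call that an inference component corresponds to. The endpoint, model, and component names match the `cdk.context.json` example below; the variant name `AllTraffic` is an assumption, and in this project the CDK stack creates these resources for you.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Attach one copy of an existing SageMaker Model to an IC-enabled endpoint,
# reserving 2 accelerator devices, 2 vCPUs, and 1 GiB of memory for it.
sagemaker.create_inference_component(
    InferenceComponentName="ic-dolly-v2-7b",
    EndpointName="ic-endpoint",
    VariantName="AllTraffic",  # assumption: the endpoint's variant name
    Specification={
        "ModelName": "dolly-v2-7b",
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "NumberOfCpuCoresRequired": 2,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
```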
The `cdk.json` file tells the CDK Toolkit how to execute your app.
This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the `.venv` directory. To create the virtualenv it assumes that there is a `python3` (or `python` for Windows) executable in your path with access to the `venv` package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually.
To manually create a virtualenv on macOS and Linux:
$ python3 -m venv .venv
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.
$ source .venv/bin/activate
If you are on a Windows platform, you would activate the virtualenv like this:
% .venv\Scripts\activate.bat
Once the virtualenv is activated, you can install the required dependencies.
(.venv) $ pip install -r requirements.txt
To add additional dependencies, for example other CDK libraries, just add them to your `setup.py` file and rerun the `pip install -r requirements.txt` command.
Then, set up the CDK context configuration file, `cdk.context.json`, appropriately. For example:
{ "sagemaker_endpoint_name": "ic-endpoint", "sagemaker_endpoint_config": { "instance_type": "ml.g5.12xlarge", "managed_instance_scaling": { "min_instance_count": 1, "max_instance_count": 2, "status": "ENABLED" }, "routing_config": { "routing_strategy": "LEAST_OUTSTANDING_REQUESTS" } }, "deep_learning_container_image_uri": { "repository_name": "huggingface-pytorch-tgi-inference", "tag": "2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04" }, "models": { "dolly-v2-7b": { "HF_MODEL_ID": "databricks/dolly-v2-7b", "HF_TASK": "text-generation" }, "flan-t5-xxl": { "HF_MODEL_ID": "google/flan-t5-xxl", "HF_TASK": "text-generation" } }, "inference_components": { "ic-dolly-v2-7b": { "model_name": "dolly-v2-7b", "compute_resource_requirements": { "number_of_accelerator_devices_required": 2, "number_of_cpu_cores_required": 2, "min_memory_required_in_mb": 1024 }, "runtime_config": { "copy_count": 1 } }, "ic-flan-t5-xxl": { "model_name": "flan-t5-xxl", "compute_resource_requirements": { "number_of_accelerator_devices_required": 2, "number_of_cpu_cores_required": 2, "min_memory_required_in_mb": 1024 }, "runtime_config": { "copy_count": 1 } } } }
ℹ️ The available Deep Learning Container (DLC) images (`deep_learning_container_image_uri`) can be found here.
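If you prefer to resolve the image URI programmatically instead of hard-coding the repository and tag, the SageMaker Python SDK offers a lookup helper. This is a sketch under the assumption that the `sagemaker` package is installed and that the requested TGI version is available in your region:

```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# Resolves a region-specific Hugging Face TGI DLC image URI, e.g. one from
# the "huggingface-pytorch-tgi-inference" repository referenced above.
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.9.3")
print(image_uri)
```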
At this point you can now synthesize the CloudFormation template for this code.
(.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
(.venv) $ cdk synth --all
Use the `cdk deploy` command to create the stack shown above.
(.venv) $ cdk deploy --require-approval never --all
If you want to run inference, check out this example notebook.
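For a quick smoke test without the notebook, you can route a request to a specific model by passing `InferenceComponentName` to the SageMaker runtime client. A minimal sketch, assuming the endpoint and component names from the `cdk.context.json` example above:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Target one inference component on the shared real-time endpoint.
response = runtime.invoke_endpoint(
    EndpointName="ic-endpoint",
    InferenceComponentName="ic-flan-t5-xxl",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What is the capital of France?"}),
)
print(response["Body"].read().decode("utf-8"))
```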
Delete the CloudFormation stack by running the command below.
(.venv) $ cdk destroy --force --all
Useful commands:
- `cdk ls` list all stacks in the app
- `cdk synth` emits the synthesized CloudFormation template
- `cdk deploy` deploy this stack to your default AWS account/region
- `cdk diff` compare deployed stack with current state
- `cdk docs` open CDK documentation
Enjoy!
- (AWS Blog) Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency (2023-11-29)
- (AWS Blog) Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker (2023-11-30)
- Amazon SageMaker API Reference - CreateInferenceComponent
- Amazon SageMaker Deploy models for real-time inference
- Docker Registry Paths and Example Code for Pre-built SageMaker Docker images
- Available Deep Learning Containers Images page
- 🛠️ sagemaker-huggingface-inference-toolkit - SageMaker Hugging Face Inference Toolkit is an open-source library for serving 🤗 Transformers and Diffusers models on Amazon SageMaker.
- 🛠️ sagemaker-inference-toolkit - The SageMaker Inference Toolkit implements a model serving stack and can be easily added to any Docker container, making it deployable to SageMaker.