Training a deep learning model on a large dataset often takes a long time, especially when the training dataset runs to hundreds of GBs. Running such a machine learning model at scale in the cloud demands a sophisticated mechanism.
In this section, we shall learn how to leverage Kubeflow, an open source machine learning toolkit, to deploy machine learning workflows on Kubernetes. We are going to use Amazon Elastic Kubernetes Service (EKS) for our Kubernetes cluster and Amazon Elastic File System (EFS) as persistent storage in the backend, which will be used for staging the training dataset and hosting Jupyter notebooks and the machine learning model.
Before we go ahead, make sure you have already set up your workspace as per this tutorial and completed the following:
- Initial Cloud9 setup
- Installation of a few Kubernetes tools
- Creation of the Kubernetes cluster using Amazon EKS
Once that is done, make sure you have everything installed and configured properly on your Cloud9 workspace.
$ kubectl version
$ aws --version
$ jq --version
$ yq --version
Check the cluster nodes
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
Ready <none> 8m43s v1.21.5-eks-9017834
Ready <none> 8m44s v1.21.5-eks-9017834
Ready <none> 8m44s v1.21.5-eks-9017834
Ready <none> 8m46s v1.21.5-eks-9017834
Ready <none> 8m44s v1.21.5-eks-9017834
Next, we need to install Kubeflow in our cluster, and we are going to use kustomize for this. kustomize is a command-line tool for customizing Kubernetes objects through a kustomization file. Let's first install it and verify the installation.
$ wget -O kustomize
$ chmod +x kustomize
$ sudo mv -v kustomize /usr/local/bin
$ kustomize version
Version: {KustomizeVersion:3.2.0 GitCommit:a3103f1e62ddb5b696daa3fd359bb6f2e8333b49 BuildDate:2019-09-18T16:26:36Z GoOs:linux GoArch:amd64}
Let's clone the amazon-efs-developer-zone repo and install Kubeflow using kustomize. This installation might take a couple of minutes, as it deploys many different Pods in your EKS cluster.
$ git clone
Cloning into 'amazon-efs-developer-zone'...
remote: Enumerating objects: 4309, done.
remote: Counting objects: 100% (4309/4309), done.
Resolving deltas: 100% (1725/1725), done.
$ cd amazon-efs-developer-zone/application-integration/container/eks/kubeflow/manifests
$ while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
After the installation, it will take some time for all Pods to become ready. Make sure all Pods are ready before we try to connect, otherwise we might get unexpected errors. To check that all Kubeflow-related Pods are ready, we can use the following commands (make sure the STATUS of each Pod is Running before moving ahead):
$ kubectl get pods -n cert-manager
$ kubectl get pods -n istio-system
$ kubectl get pods -n auth
$ kubectl get pods -n knative-eventing
$ kubectl get pods -n knative-serving
$ kubectl get pods -n kubeflow
$ kubectl get pods -n kubeflow-user-example-com
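Checking seven namespaces by eye is error-prone, so the readiness check can be scripted. The following is a minimal Python sketch, not part of the original workshop. It assumes you feed it the JSON printed by `kubectl get pods -n <namespace> -o json`; the helper name `pods_not_ready` and the sample Pod names are hypothetical. It treats a Pod as healthy when its phase is Running or Succeeded (the latter covers completed jobs).

```python
import json

# Phases treated as healthy; "Succeeded" covers Pods of completed jobs.
HEALTHY_PHASES = {"Running", "Succeeded"}


def pods_not_ready(pod_list_json: str) -> list[str]:
    """Return the names of Pods whose phase is not in HEALTHY_PHASES.

    `pod_list_json` is the text printed by
    `kubectl get pods -n <namespace> -o json`.
    """
    pod_list = json.loads(pod_list_json)
    pending = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            pending.append(pod["metadata"]["name"])
    return pending


# Hypothetical sample input, shaped like kubectl's JSON output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "pod-a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "pod-b"}, "status": {"phase": "Pending"}},
    ]
})
print(pods_not_ready(sample))
```

Note that a Running phase does not guarantee that every container has passed its readiness probe, so the READY column of `kubectl get pods` remains the final check.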
Now that our Amazon EKS cluster is up and running with Kubeflow installed, the final step is to create an Amazon EFS file system and connect it to our EKS cluster.
- Create an OIDC provider for the cluster
$ export CLUSTER_NAME=efsworkshop-eksctl
$ eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
Now, we are going to set up EFS using this script. The script automates all the manual steps and supports only the Dynamic Provisioning option. It applies default values for the file system name, performance mode, etc., and performs the following:
- Installs the EFS CSI Driver
- Creates the IAM Policy for the CSI Driver
- Creates an EFS file system
- Creates a Storage Class for the cluster
Before we run this script, let's look at the existing Storage Class:
$ kubectl get sc
gp2 (default) Delete WaitForFirstConsumer false 148m
- Now, we can run the script:
$ cd ml/efs
$ pip install -r requirements.txt
$ python --region $AWS_REGION --cluster $CLUSTER_NAME --efs_file_system_name myEFS1
EFS Setup
Prerequisites Verification
Verifying OIDC provider...
OIDC provider found
Verifying eksctl is installed...
eksctl found!
Setting up dynamic provisioning...
Editing storage class with appropriate values...
Creating storage class... created
Storage class created!
Dynamic provisioning setup done!
EFS Setup Complete
- Now we can verify the Storage Class in the Kubernetes cluster:
$ kubectl get sc
efs-sc Delete WaitForFirstConsumer true 96s
gp2 (default) Delete WaitForFirstConsumer false 148m
- Now, we can validate in the AWS Console that the EFS file system was created.
- As shown above, we have a new Storage Class, efs-sc, which Kubeflow can use as persistent storage. gp2 is still the default storage class, but we can change the default storage class from gp2 to efs-sc:
$ kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"":"false"}}}'
patched
$ kubectl patch storageclass efs-sc -p '{"metadata": {"annotations":{"":"true"}}}'
patched
$ kubectl get sc
efs-sc (default) Delete WaitForFirstConsumer true 9m48s
gp2 Delete WaitForFirstConsumer false 86m
With the default storage class changed, any workspace volume we now create for a notebook will use the EFS storage class automatically.
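The two patch commands flip a single annotation on each Storage Class. The annotation key is empty in the commands above, so the sketch below rests on an assumption: that the key is the standard Kubernetes annotation for the default StorageClass, `storageclass.kubernetes.io/is-default-class`. Under that assumption, the patch bodies could be generated like this (the function name `default_class_patch` is hypothetical):

```python
import json

# Assumption: the standard Kubernetes annotation that marks the default
# StorageClass. The key is not legible in the commands above.
DEFAULT_CLASS_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_class_patch(is_default: bool) -> str:
    """Build a JSON body for `kubectl patch storageclass <name> -p '<body>'`."""
    # Annotation values are always strings, hence "true"/"false".
    value = "true" if is_default else "false"
    patch = {"metadata": {"annotations": {DEFAULT_CLASS_ANNOTATION: value}}}
    return json.dumps(patch)


print(default_class_patch(False))  # body that unsets gp2 as the default
print(default_class_patch(True))   # body that sets efs-sc as the default
```

Only one Storage Class should carry the `"true"` value at a time, which is why gp2 is patched to `"false"` before efs-sc is patched to `"true"`.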
- Run the following to port-forward Istio's Ingress-Gateway to local port 8080:
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Forwarding from -> 8080
Forwarding from [::1]:8080 -> 8080
- In your Cloud9 environment, click Tools > Preview > Preview Running Application to access the dashboard. You can click the Pop out window button to open the browser in a new tab.
- Leave the current terminal running, because if you kill the process you will lose access to the dashboard. Open a new terminal to follow the rest of the demo.
- Log in to the Kubeflow dashboard with the default user's credentials. The default email address is [email protected] and the default password is 12341234.
- Now, let’s create a Jupyter notebook. Click on Notebook → New Server.
- Enter “notebook1” as the notebook name, keep everything else at its default, scroll down, and click LAUNCH.
At this point the EFS CSI driver should create an Access Point: the new notebook internally creates a Pod, which creates a PVC, which requests storage from the storage class efs-sc (the default storage class we selected for this EKS cluster). Let’s wait for the notebook to reach the ready state.
- Now we can use kubectl to check the PV and PVC that were created under the hood:
$ kubectl get pv
pvc-3d1806bc-984c-404d-9c2a-489408279bad 20Gi RWO Delete Bound kubeflow/minio-pvc gp2 52m
pvc-8f638f2c-7493-461c-aee8-984760e233c2 10Gi RWO Delete Bound kubeflow-user-example-com/workspace-nootbook1 efs-sc 5m16s
pvc-940c8ebf-5632-4024-a413-284d5d288592 10Gi RWO Delete Bound kubeflow/katib-mysql gp2 52m
pvc-a8f5e29f-d29d-4d61-90a8-02beeb2c638c 20Gi RWO Delete Bound kubeflow/mysql-pv-claim gp2 52m
pvc-af81feba-6fd6-43ad-90e4-270727e6113e 10Gi RWO Delete Bound istio-system/authservice-pvc gp2 52m
$ kubectl get pvc -n kubeflow-user-example-com
workspace-nootbook1 Bound pvc-8f638f2c-7493-461c-aee8-984760e233c2 10Gi RWO efs-sc 5m59s
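To confirm at a glance which claims landed on EFS and which stayed on gp2, the PV list can be grouped by storage class. This is a minimal Python sketch, not part of the original workshop. It assumes the input is the JSON printed by `kubectl get pv -o json`; the helper name `claims_by_storage_class` is hypothetical, and the sample input is hand-built from two rows of the output above.

```python
import json
from collections import defaultdict


def claims_by_storage_class(pv_list_json: str) -> dict[str, list[str]]:
    """Group bound claims ("namespace/name") by the PV's storage class."""
    pv_list = json.loads(pv_list_json)
    grouped = defaultdict(list)
    for pv in pv_list.get("items", []):
        spec = pv.get("spec", {})
        claim = spec.get("claimRef")
        if claim is None:
            continue  # this PV is not bound to any claim
        storage_class = spec.get("storageClassName", "")
        grouped[storage_class].append(f"{claim['namespace']}/{claim['name']}")
    return dict(grouped)


# Hand-built sample mirroring two rows of the `kubectl get pv` output.
sample_pvs = json.dumps({
    "items": [
        {"spec": {"storageClassName": "gp2",
                  "claimRef": {"namespace": "kubeflow", "name": "minio-pvc"}}},
        {"spec": {"storageClassName": "efs-sc",
                  "claimRef": {"namespace": "kubeflow-user-example-com",
                               "name": "workspace-nootbook1"}}},
    ]
})
print(claims_by_storage_class(sample_pvs))
```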
- Finally, we can see the access point in the AWS console.
At this point you can start using this Jupyter notebook.
Now, we can create a PVC for the machine learning training dataset in ReadWriteMany mode, meaning it can be used by many notebooks at the same time. Go to the Kubeflow Dashboard → Volumes → New Volume and create a new volume called dataset with efs-sc as the storage class.
The volume will stay in Pending state until a Pod consumes it. This is because the script created the Storage Class with volumeBindingMode set to WaitForFirstConsumer.
$ kubectl get pvc -A
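For readers who prefer a manifest to the dashboard, the volume described above corresponds roughly to the PersistentVolumeClaim built below. This is a sketch under stated assumptions, not the manifest the dashboard actually generates: the `10Gi` size is an assumption (the text does not state the dataset volume's size), the namespace is taken from the Pod checks earlier, and the function name `dataset_pvc_manifest` is hypothetical.

```python
import json


def dataset_pvc_manifest(name: str, namespace: str,
                         storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim that many Pods can mount read-write."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # ReadWriteMany lets several notebooks share the volume.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


manifest = dataset_pvc_manifest(
    name="dataset",
    namespace="kubeflow-user-example-com",
    storage_class="efs-sc",
    size="10Gi",  # assumed size, not stated in the walkthrough
)
print(json.dumps(manifest, indent=2))
```

Saved to a file, the printed JSON could be applied with `kubectl apply -f <file>`, since kubectl accepts JSON as well as YAML. The claim would still show Pending until a Pod mounts it, for the WaitForFirstConsumer reason given above.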
Let's now create another Jupyter notebook, call it notebook2, with a TensorFlow image, and attach the newly created PVC as a data volume.
Once this notebook is ready, we can see that the PVC is bound to the PV, and the respective EFS access point is created as well.
$ kubectl get pvc -A
We can see 3 access points on the EFS file system: one for the dataset volume we created and the other two for the notebook home directories. We can now open the notebook and use the shared storage volume dataset.
- Download the dataset from your notebook
import pathlib
import tensorflow as tf
dataset_url = ""
!wget ""
!tar -xf flower_photos.tgz --directory /home/jovyan/dataset
!rm -rf flower_photos.tgz
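After extracting the archive, it is worth checking that the images actually landed on the shared volume before starting a training job. The sketch below counts image files per class directory. It is an illustration only: the helper name `count_images_per_class` is hypothetical, and it assumes one sub-directory per class containing `.jpg` files. The demo runs on a throwaway directory so that it is self-contained.

```python
import pathlib
import tempfile


def count_images_per_class(root: pathlib.Path) -> dict[str, int]:
    """Count *.jpg files in each immediate sub-directory of `root`."""
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.jpg"))
    return counts


# Self-contained demo on a temporary directory with hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = pathlib.Path(tmp)
    for class_name, n_files in [("daisy", 2), ("tulips", 3)]:
        (demo_root / class_name).mkdir()
        for i in range(n_files):
            (demo_root / class_name / f"{i}.jpg").touch()
    demo_counts = count_images_per_class(demo_root)
    print(demo_counts)
```

In the notebook, you would point the function at the extracted folder under `/home/jovyan/dataset` (the exact folder name depends on the archive's contents).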
- Build and Push the Docker image
$ aws ecr create-repository \
--repository-name my-repo
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:ap-southeast-1:123912348584:repository/my-repo",
        "registryId": "123912348584",
        "repositoryName": "my-repo",
        "repositoryUri": "",
        "createdAt": "2022-03-25T17:38:29+00:00",
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        },
        "encryptionConfiguration": {
            "encryptionType": "AES256"
        }
    }
}
- Now, we can build the image locally in our Cloud9 instance and push it to our newly created ECR repository
$ cd /home/ec2-user/environment/amazon-efs-developer-zone/application-integration/container/eks/kubeflow/manifests/ml
$ pwd
$ cd training-sample/
$ export IMAGE_URI=<repositoryUri>:latest
$ docker build -t my-repo .
Sending build context to Docker daemon 6.144kB
Step 1/3 : FROM
2.6.0-cpu-py38: Pulling from c9e4w0g3/notebook-servers/jupyter-tensorflow
7b1a6ab2e44d: Pulling fs layer
31cd47bee782: Pulling fs layer
Removing intermediate container 040429cfaecd
---> b3189c0564da
Successfully built b3189c0564da
$ aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $IMAGE_URI
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
Login Succeeded
$ docker tag my-repo $IMAGE_URI
$ docker push $IMAGE_URI
$ cd -
$ pwd
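The `IMAGE_URI` value used above combines the repository URI with a tag. As a sketch, and assuming the usual ECR naming scheme for standard AWS regions (`<account>.dkr.ecr.<region>.amazonaws.com/<repository>`), the URI can be assembled from the account ID and region visible in the `repositoryArn`. Both function names are hypothetical.

```python
def ecr_image_uri(account_id: str, region: str,
                  repository: str, tag: str = "latest") -> str:
    """Assemble an image URI, assuming the usual ECR naming scheme."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


def registry_host(image_uri: str) -> str:
    """Return the registry host, i.e. everything before the first slash."""
    return image_uri.split("/", 1)[0]


# Account ID, region and repository name taken from the repositoryArn above.
uri = ecr_image_uri("123912348584", "ap-southeast-1", "my-repo")
print(uri)
print(registry_host(uri))
```

The registry host on its own is the part that identifies the registry to log in to, while the full URI with its tag is what `docker tag` and `docker push` operate on.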
- Configure the TensorFlow spec file
Once the Docker image is built and pushed to our ECR repo, we can replace the following in the tfjob.yaml file:
- Docker image URI
- Namespace
- Claim name
$ yq eval ".spec.tfReplicaSpecs.Worker.template.spec.containers[0].image = \"$IMAGE_URI\"" -i training-sample/tfjob.yaml
$ export CLAIM_NAME=dataset
$ yq eval ".spec.tfReplicaSpecs.Worker.template.spec.volumes[0].persistentVolumeClaim.claimName = \"$CLAIM_NAME\"" -i training-sample/tfjob.yaml
$ export PVC_NAMESPACE=kubeflow-user-example-com
$ yq eval ".metadata.namespace = \"$PVC_NAMESPACE\"" -i training-sample/tfjob.yaml
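The three `yq` expressions each set one field at a fixed path in the TFJob spec. The same edits, expressed on a plain Python dictionary, look roughly as follows. This is an illustrative sketch: the function name `configure_tfjob` is hypothetical, and the skeleton contains only the fields that the `yq` paths touch, not the full contents of tfjob.yaml.

```python
import copy


def configure_tfjob(tfjob: dict, image_uri: str,
                    claim_name: str, namespace: str) -> dict:
    """Return a copy of `tfjob` with image, claim name and namespace set."""
    updated = copy.deepcopy(tfjob)  # leave the caller's dict untouched
    pod_spec = updated["spec"]["tfReplicaSpecs"]["Worker"]["template"]["spec"]
    pod_spec["containers"][0]["image"] = image_uri
    pod_spec["volumes"][0]["persistentVolumeClaim"]["claimName"] = claim_name
    updated["metadata"]["namespace"] = namespace
    return updated


# Minimal skeleton with placeholder values; only the touched paths exist.
skeleton = {
    "metadata": {"name": "image-classification-pvc", "namespace": "placeholder"},
    "spec": {"tfReplicaSpecs": {"Worker": {"template": {"spec": {
        "containers": [{"image": "placeholder"}],
        "volumes": [{"persistentVolumeClaim": {"claimName": "placeholder"}}],
    }}}}},
}

configured = configure_tfjob(
    skeleton,
    image_uri="<repositoryUri>:latest",  # placeholder, as in the export above
    claim_name="dataset",
    namespace="kubeflow-user-example-com",
)
print(configured["metadata"]["namespace"])
```

Unlike `yq`, which creates missing keys along the path, this sketch raises a `KeyError` when the expected structure is absent, which can help catch a spec file that does not have the assumed layout.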
- Create the TFJob
Now we are ready to train the model using the training-sample/ script and the data available on the shared volume, with the Kubeflow TFJob operator:
$ kubectl apply -f training-sample/tfjob.yaml
created
To check that the training job is running as expected, look at the events in the TFJob describe response as well as the job logs:
$ kubectl describe tfjob image-classification-pvc -n $PVC_NAMESPACE
Name: image-classification-pvc
Namespace: kubeflow-user-example-com
Labels: <none>
Annotations: <none>
API Version:
Kind: TFJob
Creation Timestamp: 2022-03-25T20:30:35Z
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreatePod 7m2s tf-operator Created pod: image-classification-pvc-worker-0
Normal SuccessfulCreatePod 7m2s tf-operator Created pod: image-classification-pvc-worker-1
Normal SuccessfulCreateService 7m2s tf-operator Created service: image-classification-pvc-worker-0
Normal SuccessfulCreateService 7m2s tf-operator Created service: image-classification-pvc-worker-1
Normal ExitedWithCode 5m44s (x2 over 5m44s) tf-operator Pod: kubeflow-user-example-com.image-classification-pvc-worker-1 exited with code 0
Normal ExitedWithCode 5m44s tf-operator Pod: kubeflow-user-example-com.image-classification-pvc-worker-0 exited with code 0
- Lastly, we can check the status of the training in the Pod logs:
$ kubectl logs -n $PVC_NAMESPACE image-classification-pvc-worker-0 -f
Using deprecated annotation `` in pod/image-classification-pvc-worker-0. Please use `` instead
2022-03-25 20:30:41.570601: W tensorflow/core/profiler/internal/] Initializing the SageMaker Profiler.
2022-03-25 20:30:41.570748: W tensorflow/core/profiler/internal/] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2022-03-25 20:30:41.598012: W tensorflow/core/profiler/internal/] Initializing the SageMaker Profiler.
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
2022-03-25 20:30:43.672514: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-25 20:30:43.682354: I tensorflow/core/common_runtime/] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
Model: "sequential"
Layer (type) Output Shape Param #
rescaling (Rescaling) (None, 180, 180, 3) 0
conv2d (Conv2D) (None, 180, 180, 16) 448
max_pooling2d (MaxPooling2D) (None, 90, 90, 16) 0
conv2d_1 (Conv2D) (None, 90, 90, 32) 4640
max_pooling2d_1 (MaxPooling2 (None, 45, 45, 32) 0
conv2d_2 (Conv2D) (None, 45, 45, 64) 18496
max_pooling2d_2 (MaxPooling2 (None, 22, 22, 64) 0
flatten (Flatten) (None, 30976) 0
dense (Dense) (None, 128) 3965056
dense_1 (Dense) (None, 5) 645
Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0
Epoch 1/2
Extension horovod.torch has not been built: /usr/local/lib/python3.8/site-packages/horovod/torch/mpi_lib/ not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
[2022-03-25 20:30:44.564 image-classification-pvc-worker-0:1 INFO] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-03-25 20:30:44.749 image-classification-pvc-worker-0:1 INFO] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
2022-03-25 20:30:45.282342: I tensorflow/compiler/mlir/] None of the MLIR Optimization Passes are enabled (registered 2)
92/92 [==============================] - 40s 338ms/step - loss: 1.3258 - accuracy: 0.4452 - val_loss: 1.1074 - val_accuracy: 0.5286
Epoch 2/2
92/92 [==============================] - 29s 314ms/step - loss: 1.0277 - accuracy: 0.5937 - val_loss: 1.0266 - val_accuracy: 0.6063
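The parameter counts in the model summary above can be checked by hand. The arithmetic below is a sketch that assumes 3×3 convolution kernels; the kernel size is not printed in the log, but it is the size that reproduces the reported counts. Each convolution has `(kernel × kernel × input channels + 1 bias) × filters` weights, and each dense layer has `(inputs + 1 bias) × units`.

```python
def conv2d_params(kernel: int, in_channels: int, filters: int) -> int:
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs: int, units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return (inputs + 1) * units


# Channel counts and shapes are read off the model summary;
# the 3x3 kernel size is an assumption.
layer_params = {
    "conv2d": conv2d_params(3, 3, 16),
    "conv2d_1": conv2d_params(3, 16, 32),
    "conv2d_2": conv2d_params(3, 32, 64),
    "dense": dense_params(22 * 22 * 64, 128),  # flatten output: 30976
    "dense_1": dense_params(128, 5),
}
total_params = sum(layer_params.values())
print(layer_params)
print(total_params)
```

The rescaling, pooling and flatten layers have no trainable weights, so they contribute nothing to the total.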
So, in this demo, we learnt how to set up Kubeflow for a machine learning workflow on Amazon EKS and how to leverage Amazon EFS as a shared file system for storing the training dataset. For more details, you may like to explore the following links: