- Go to Operator Hub. Find OpenShift Data Foundation (OpenShift Container Storage in older versions). Install the operator.
- Instantiate a storage cluster. This will take some time (~ 15 minutes).
- Switch to the `openshift-storage` project.
- Find the `noobaa-admin` secret. Note the S3 credentials (see the sketch below).
- Find the `noobaa-mgmt` route and log into the management console using the Noobaa admin credentials.
- Create buckets:
  - `accounts`
  - `creditcards`
  - `demographics`
  - `features`
  - `labels`
  - `loans`
  - `pachyderm`
  - `trino`
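If you prefer the CLI over the console for the credentials and endpoint, a sketch along these lines should work; the secret keys and the `s3` route name reflect a default ODF/NooBaa installation, so verify them on your cluster.

```sh
# Print the NooBaa admin S3 credentials (key names per a default ODF install):
oc extract secret/noobaa-admin -n openshift-storage \
  --keys=AWS_ACCESS_KEY_ID --keys=AWS_SECRET_ACCESS_KEY --to=-

# Print the https location of the `s3` route, referenced repeatedly below:
oc get route s3 -n openshift-storage -o jsonpath='https://{.spec.host}{"\n"}'
```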
- Install the Crunchy Postgres for Kubernetes community operator from the Operator Hub.
  - Note: the mlflow deployment depends on this.
- Deploy `manifests/postgresql/custom-sql.yaml`.
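A minimal sketch for applying the manifest and watching the resulting database pods; the namespace handling is an assumption (the manifest may pin its own) and the label follows the Crunchy PGO v5 convention.

```sh
oc apply -f manifests/postgresql/custom-sql.yaml

# Wait for the PostgreSQL pods to become ready (label per the Crunchy PGO v5 convention):
oc get pods -A -l postgres-operator.crunchydata.com/cluster=custom-sql -w
```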
- Create a new project `odh`.
- Adapt and deploy `manifests/trino/trino-s3-credentials.yaml`.
- Install the Open Data Hub operator from the Operator Hub.
- Select the Open Data Hub operator under Installed Operators within project `odh`.
- Adapt and deploy `manifests/odh.yaml`.
  - The Trino parameter `s3_endpoint_url` needs to be set to the https location of the `s3` route in `openshift-storage`.
- After all components have been deployed, scale down the `opendatahub-operator` pod to 0 in the `openshift-operators` project (see the sketch below). This is required so we can freely configure the ODH components.
- Verify the deployment by opening the `odh-dashboard` route URL. You should see the ODH dashboard.
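A CLI sketch of the scale-down and verification; the Deployment name for the operator is an assumption based on the pod name above.

```sh
# Stop the ODH operator from reconciling, so manual changes to the components stick
# (Deployment name is an assumption; check `oc get deployment -n openshift-operators`):
oc scale deployment opendatahub-operator -n openshift-operators --replicas=0

# Print the ODH dashboard URL:
oc get route odh-dashboard -n odh -o jsonpath='https://{.spec.host}{"\n"}'
```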
- Adapt and deploy `manifests/trino/hive-config.yaml`.
  - Property `fs.s3a.endpoint` in `hive-site.xml` needs to be set to the https location of the `s3` route in `openshift-storage`.
- Adapt and deploy `manifests/trino/trino-catalog.yaml`.
  - Property `hive.s3.endpoint` in `hive.properties` needs to be set to the https location of the `s3` route in `openshift-storage`.
  - Property `connection-password` in `postgresql.properties` needs to be set to the value of `password` in secret `custom-sql-pguser-custom-sql`.
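  A one-liner sketch for reading that password from the CLI; the namespace of the secret is an assumption:

  ```sh
  # Decode the generated PostgreSQL password (namespace is an assumption):
  oc get secret custom-sql-pguser-custom-sql -n odh -o jsonpath='{.data.password}' | base64 -d; echo
  ```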
- Adapt and deploy `manifests/trino/trino-config.yaml`.
  - Property `s3_endpoint_url` needs to be set to the https location of the `s3` route in `openshift-storage`. Remove the `https://` prefix.
- Adapt and deploy `manifests/trino/hive-metastore.yaml`.
  - `S3_ENDPOINT` in `spec.template.spec.containers[0].env` needs to be set to the https location of the `s3` route in `openshift-storage`. Remove the `https://` prefix.
- Deploy `manifests/trino/trino-coordinator.yaml`.
- Deploy `manifests/trino/trino-worker.yaml`.
- Restart the `hive-metastore`, `trino-coordinator`, and `trino-worker` pods.
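A sketch of these deploy-and-restart steps; the `odh` namespace comes from the earlier steps, while the workload kind (Deployment) and names are assumptions to check with `oc get deployment -n odh`.

```sh
oc apply -n odh -f manifests/trino/trino-coordinator.yaml
oc apply -n odh -f manifests/trino/trino-worker.yaml

# Restart the pods so they pick up the updated configuration:
oc rollout restart -n odh \
  deployment/hive-metastore deployment/trino-coordinator deployment/trino-worker
```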
- Install the Pachyderm operator from the Operator Hub.
- Adapt and deploy `manifests/pachyderm/pachyderm-s3-secret.yaml`.
- Deploy `manifests/pachyderm/pachyderm.yaml`.
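A deployment sketch; the `odh` target namespace is an assumption (the later `pachctl` steps switch to `odh` before port-forwarding), and the secret must carry the credentials of the `pachyderm` bucket created earlier.

```sh
# Namespace is an assumption; the Pachyderm custom resource may also pin one itself.
oc apply -n odh -f manifests/pachyderm/pachyderm-s3-secret.yaml
oc apply -n odh -f manifests/pachyderm/pachyderm.yaml

# Watch until the pachd pod reports Ready:
oc get pods -n odh -w
```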
- Note: The PostgreSQL operator needs to be installed prior to deploying mlflow (see above).
- Log into the bastion host (a machine that has full access to the OCP nodes and is logged into the cluster through `oc`).
- Install Helm:

  ```sh
  sudo su -
  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  sed -i 's/\/local//' get_helm.sh
  chmod 700 get_helm.sh
  bash get_helm.sh
  logout
  ```
- Install the mlflow Helm chart (if S3 is backed by ODF; for other S3 variants, see below):

  ```sh
  oc project odh
  helm repo add strangiato https://strangiato.github.io/helm-charts/
  helm repo update
  helm install mlflow strangiato/mlflow-server
  ```
- Verify the deployment by opening the `mlflow-mlflow-server` route URL. You should see the mlflow dashboard.
- If using other S3 variants:
  - Ensure the bucket `mlflow` is present and note its credentials and S3 endpoint.
  - Run:

    ```sh
    oc project odh
    git clone https://github.com/mamurak/helm-charts.git
    cd helm-charts
    git checkout alternate-s3
    cd charts/mlflow-server
    vim values.yaml
    ```

  - In `values.yaml`, set
    - `objectStorage.objectBucketClaim.enabled` to `false`,
    - `S3EndpointUrl`,
    - `MlflowBucketName` to `mlflow`,
    - `S3AccessKeyId`,
    - `S3SecretAccessKey`.
  - Run:

    ```sh
    helm repo add strangiato https://strangiato.github.io/helm-charts/
    helm dependency build
    helm install mlflow -f values.yaml .
    ```
- Create imagestream `s2i-custom-notebook`.
  - Use metadata from `manifests/jupyterhub/s2i-custom-notebook-imagestream.yaml`.
- Deploy `manifests/jupyterhub/s2i-custom-notebook.yaml`.
- Trigger a build.
- Once the build finishes, adapt and deploy `manifests/jupyterhub/s2i-custom-notebook-ist.yaml`.
  - Replace the image reference (starting with `sha256:`) with the one from the new build in:
    - `tag.from.name`
    - `image.metadata.name`
    - `image.dockerImageReference`
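  A sketch for triggering the build and looking up the resulting digest; the build config name is an assumption derived from the imagestream name:

  ```sh
  # Trigger the custom notebook image build (build config name is an assumption):
  oc start-build s2i-custom-notebook -n odh --follow

  # Copy the sha256 digest of the new image from the imagestream tag listing:
  oc get istag -n odh | grep s2i-custom-notebook
  ```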
- Access JupyterHub through the `jupyterhub` route URL.
- You should see the Jupyter environment configuration page. The `custom notebook image` should appear among the available notebook images.
- In case you're using custom certificates in your environment, you might not be able to access the JupyterHub entry page (HTTP error 599). To work around this issue, do the following:
  - In the `jupyterhub-cfg` configmap, set the `jupyterhub_config.py` parameter to `c.OpenShiftOAuthenticator.validate_cert = False`.
  - Restart the JupyterHub pods.
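A hedged sketch of that workaround; it assumes the configmap lives in the `odh` project and that the JupyterHub pods carry an `app=jupyterhub` label, so check both before running.

```sh
# Set the workaround in the jupyterhub-cfg configmap. This replaces the key's value;
# if jupyterhub_config.py already holds other settings, add the line via
# `oc edit configmap jupyterhub-cfg -n odh` instead.
oc patch configmap jupyterhub-cfg -n odh --type merge \
  -p '{"data":{"jupyterhub_config.py":"c.OpenShiftOAuthenticator.validate_cert = False"}}'

# Restart JupyterHub by deleting its pods (label selector is an assumption):
oc delete pod -n odh -l app=jupyterhub
```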
- The folder `manifests/pipelines` contains a subfolder for each pipeline (`X`) with three resources: a secret, an imagestream, and a build config.
- Adapt and deploy the secrets (`X-pipeline-secret.yaml`).
- Deploy the imagestreams (`X-pipeline-is.yaml`).
- Deploy the build configs (`X-pipeline-bc.yaml`).
- Trigger a build for each pipeline.
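One way to script these deployments and builds; the per-pipeline file names follow the pattern above, while the build config name (`X-pipeline`) and the `odh` namespace are assumptions.

```sh
# Deploy secret, imagestream, and build config for every pipeline, then trigger a build.
for dir in manifests/pipelines/*/; do
  X=$(basename "$dir")
  oc apply -n odh -f "${dir}${X}-pipeline-secret.yaml"
  oc apply -n odh -f "${dir}${X}-pipeline-is.yaml"
  oc apply -n odh -f "${dir}${X}-pipeline-bc.yaml"
  oc start-build "${X}-pipeline" -n odh   # build config name is an assumption
done
```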
- Access the `jupyterhub` route URL.
- On the JupyterHub configuration page,
  - select the `custom notebook image`,
  - choose container size `Medium`,
  - and Start Server.
- If the loading screen is stuck, open the `jupyterhub` route URL again. You should see the JupyterLab screen.
- Click the Git icon in the left toolbar.
- Clone this repository.
- Open `notebooks/s3-upload.ipynb`.
  - Adapt the S3 credentials and run the notebook.
- Run `notebooks/initialize_tables.ipynb`.
- Open a shell session on a management host. It needs the `oc` client and connectivity to the OpenShift cluster (e.g. the bastion host).
- Run:

  ```sh
  curl -o /tmp/pachctl.tar.gz -L https://github.com/pachyderm/pachyderm/releases/download/v2.2.7/pachctl_2.2.7_linux_amd64.tar.gz && tar -xvf /tmp/pachctl.tar.gz -C /tmp && sudo cp /tmp/pachctl_2.2.7_linux_amd64/pachctl /usr/local/bin
  ```

- Verify the installation by running `pachctl version --client-only`. You should see the client version.
- Run:

  ```sh
  oc project odh
  pachctl config import-kube local --overwrite
  pachctl config set active-context local
  pachctl port-forward
  ```

  - The last command (`pachctl port-forward`) blocks. Open a new shell session to continue.
- Verify the integration by running `pachctl version`. You should see the client and instance versions.
- Run:

  ```sh
  git clone {this repository}
  cd {repository}/pipeline-definitions
  ```

- For each pipeline `X`, run `pachctl create pipeline -f X.json`.
- To update an existing pipeline `X`, run `pachctl update pipeline -f X.json`.
- To inspect a running pipeline `X`, run `pachctl inspect pipeline X`.
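A concrete run-through using the preprocessing pipeline definition from this repository; the pipeline name used in the inspect call is an assumption, since it is defined inside the JSON file rather than by the file name.

```sh
# Create the preprocessing pipeline and check its status:
pachctl create pipeline -f preprocessing_pipeline.json
pachctl list pipeline
pachctl inspect pipeline preprocessing   # pipeline name is an assumption
```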
- Develop pipeline code in notebooks, see `notebooks/preprocessing.ipynb`.
- To prepare the containerized pipeline, create a folder in `container-images` containing
  - a `Containerfile`,
  - a `requirements.txt`,
  - the main code.
  - See `container-images/preprocessing-pipeline`.
- To prepare the container builds, create a folder in `manifests/pipelines` with three manifests:
  - a secret,
  - an imagestream,
  - a build config.
  - See `manifests/pipelines/preprocessing`.
- Deploy these resources into the `odh` project and trigger a container build.
- Create a pipeline definition in `pipeline-definitions`, see `pipeline-definitions/preprocessing_pipeline.json`.
- Run the pipeline with `pachctl create pipeline -f <pipeline definition>`.
- Monitor pipeline progress:
  - through the client: `pachctl inspect pipeline <pipeline>`,
  - through OpenShift: access the logs of the pipeline pod,
  - through Trino, for instance via the `trino-access` notebook.
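A sketch of the OpenShift-side monitoring; Pachyderm worker pods typically carry the pipeline name in their pod name, but the exact naming and the `odh` namespace are assumptions.

```sh
# Locate the worker pod of a pipeline (here: preprocessing) and follow its logs:
oc get pods -n odh | grep preprocessing
oc logs -f -n odh <pipeline-pod-name>
```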
- `data`: the dummy data representing raw data to be stored in the data lake and processed by the preprocessing pipelines.
- `notebooks`: scripts and sample procedures for ODH component integration.
- `pipeline-definitions`: Pachyderm pipeline manifests.
- `manifests`: OpenShift deployment artifacts.
- `container-images`: dependencies of container builds.