- For DMT introduction, please refer to /README.md
- For Teradata data migration deployment user guide, please refer to /docs/installation/README_teradata_datamigration.md
- For Hive data migration deployment user guide, please refer to /docs/installation/README_hive_datamigration.md
- For Redshift data migration deployment user guide, please refer to /docs/installation/README_redshift_datamigration.md
-
Firewall rules - enable firewall connectivity between the source database and the Google Cloud VPC
[For Redshift]
For AWS Redshift connectivity, please refer to the guide https://cloud.google.com/bigquery/docs/migration/redshift-vpc
Note: The Cloud Composer (GCC) VPC subnetwork must be created in the same GCP region as the VPN tunnel established to connect to the AWS Redshift cluster
-
[Mandatory] The BigQuery Transfer Service Agent VM requires internet access to update packages in the Agent VM environment. The recommended approach is therefore to have a Cloud NAT in place in the same VPC, subnet, and region as your Teradata Agent VM
Cloud NAT is deployed automatically using default settings through terraform with standard naming conventions and automatic IP allocations.
If you wish to use your own Cloud NAT and Cloud Router with custom allocations, ensure they are created before proceeding with the deployment (a gcloud sketch follows these notes).
When creating the Cloud NAT and Cloud Router manually, turn off automatic NAT creation by setting the variable create_nat = "false" in ~/terraform/datamigration/gce/variables.tf after cloning the DMT repository
If you are okay with automatic NAT creation through terraform, please proceed without any changes.
The Cloud Router and NAT should be created in the same region where the VPC subnets for Cloud Composer and the Migration Agent VM are going to be created
The DVT Cloud Run service and Cloud Composer should be created in the same region.
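If you opt for manual creation, a sketch of creating the Cloud Router and Cloud NAT with gcloud follows. The names mirror those used by the automated deployment; the network and region are examples to adjust to your environment.
# Sketch: manual Cloud Router + Cloud NAT creation (network/region are examples)
gcloud compute routers create dmt-cloud-router \
--network=default \
--region=us-central1
gcloud compute routers nats create dmt-cloud-nat \
--router=dmt-cloud-router \
--region=us-central1 \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges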
-
Using Standard Deployment for infrastructure will ensure you have default (standard) names for GCP services, service accounts, and default folder structures. This is the recommended approach.
-
By default, all service and bucket names are suffixed with the project ID to make them unique. If you wish to change this, provide the _MY_CUSTOMER_NAME substitution parameter as part of the gcloud builds submit command later during deployment
-
By default, the tool deploys using the below locations/regions
- us-central1-a
- US location
To change the locations, go to variables.tf file under each GCP service terraform module and update the location values according to your requirements before running the Bash wrapper script for Infrastructure deployment.
It is recommended to keep the migration pipeline components in the same regions as much as possible.
-
By default, GCP services requiring a VPC and subnet will be deployed in the default VPC and subnet. To provide custom VPCs and subnets, go to the variables.tf file under each applicable GCP service terraform module and update the VPC names and subnet names. This step is applicable for
- Cloud Composer (gcc)
- Teradata Agent Compute Engine VM (gce)
- Cloud Run
- Best practice - provide input DDLs and SQLs in the form of one DDL or SQL query per file.
- Incremental data migration is supported for HIVE only.
- The current release supports DDLs and SELECT queries only. DML statements are supported with the dry run option only.
- The current release doesn't support data transfer for Oracle or incremental data load for Teradata and Redshift.
Deployment of DMT through terraform modules is split into 2 parts
- Translation – For DDL, DML, SQL translation and validation.
- Data Migration - For data migration from different sources (Teradata and Hive)
The code repository is hosted on GitHub and can be cloned using the below command.
git clone https://github.com/GoogleCloudPlatform/data-migration-tool
Navigate to the cloned directory using the below command
cd data-migration-tool
Check out the main branch using the below command
git checkout main
The Admin user who will be executing the deployment of DMT through Cloud Build will require the below set of permissions
- roles/bigquery.dataViewer
- roles/bigquery.user
- roles/cloudbuild.builds.editor
- roles/run.viewer
- roles/compute.admin
- roles/iam.serviceAccountViewer
- roles/logging.viewer
- roles/serviceusage.serviceUsageConsumer
- roles/storage.admin
- projects/${PROJECT_ID}/roles/DMTAdminAdditionalPermissions
The user who will be using the DMT tool will require the below set of permissions
- roles/bigquery.dataViewer
- roles/bigquery.admin
- roles/run.viewer
- roles/composer.user
- roles/storage.admin
- roles/vpcaccess.admin
- roles/logging.viewer
- projects/${PROJECT_ID}/roles/DMTUserAdditionalPermissions
Note - DMTUserAdditionalPermissions is a custom role for the DMT user and DMTAdminAdditionalPermissions is a custom role for the DMT admin
export ADMIN_ACCOUNT=<EXECUTING_ADMIN_ACCOUNT>
export USER_ACCOUNT=<EXECUTING_USER_ACCOUNT>
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
In addition to the predefined roles, DMT requires custom user/admin permissions. Execute the Bash scripts dmt-user-roles-setup.sh and dmt-admin-roles-setup.sh present in the root directory to create the custom DMT user/admin additional-permission roles
bash dmt-user-roles-setup.sh
bash dmt-admin-roles-setup.sh
To assign these roles, you can execute the Bash scripts dmt-user-setup.sh and dmt-admin-user-setup.sh present in the root directory
bash dmt-user-setup.sh
bash dmt-admin-user-setup.sh
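If you prefer to grant the predefined roles listed above by hand instead of (or in addition to) the setup scripts, each binding can be added with a command of this form (a sketch; repeat per role and per account):
# Sketch: grant one of the predefined roles from the lists above to the user account
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="user:$USER_ACCOUNT" \
--role="roles/composer.user"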
From the Cloud Shell, you can enable Google Cloud Services using the gcloud command line interface in your Google Cloud project.
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
gcloud config set project $SOURCE_PROJECT
gcloud services enable serviceusage.googleapis.com \
cloudresourcemanager.googleapis.com \
cloudbuild.googleapis.com
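As an optional check (sketch), you can confirm the APIs are active in the project:
# List the services currently enabled in the project
gcloud services list --enabled --project $SOURCE_PROJECT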
Identify the Cloud Build default Service Account and assign the below IAM roles for the execution of the deployment of DMT architecture.
Google Cloud Console -> Cloud Build -> Settings
IAM roles required for the created Service Account
Artifact Registry Administrator
BigQuery Admin
Cloud Build Service Account
Cloud Run Admin
Composer Administrator
Compute Instance Admin (v1)
Compute Network Admin
Create Service Accounts
Logs Viewer
Project IAM Admin
Pub/Sub Admin
Secret Manager Admin
Service Account User
Service Usage Admin
Storage Admin
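The default Cloud Build service account is typically named after the project number; as a sketch, it can be derived from the CLI (verify against the Cloud Build Settings page, since your project may use a different build service account):
# Sketch: derive the default Cloud Build service account from the project number
PROJECT_NUMBER=$(gcloud projects describe $SOURCE_PROJECT --format='value(projectNumber)')
export BUILD_ACCOUNT=${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com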
To assign these roles, you can execute the Bash script cloudbuild-sa-iam-setup.sh present in the root directory
export BUILD_ACCOUNT=<CLOUDBUILD_SERVICE_ACCOUNT>
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
bash cloudbuild-sa-iam-setup.sh
This script will assign all necessary IAM permissions to the Cloud Build Service Account for deployment of the DMT tool.
Once the execution of the bash script is successful, you can proceed directly to the next step Deploying DMT Infrastructure
Alternatively, you can also assign the roles by running gcloud commands from your Cloud Shell
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/bigquery.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/run.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/composer.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/compute.instanceAdmin.v1"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/compute.networkAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/iam.serviceAccountCreator"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/logging.viewer"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/resourcemanager.projectIamAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/pubsub.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/secretmanager.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/iam.serviceAccountUser"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/serviceusage.serviceUsageAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/storage.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/artifactregistry.admin"
Translations deployment will take care of the workflow needed to perform
- DDL, DML and SQL translations
- Schema migration to BQ (Creation of tables from translated DDLs)
- Schema Validation and Custom query (row/column for SQL validation) using the DVT tool
- Options to run DVT on Cloud Run or a GKE Airflow pod operator depending on scaling requirements.
Note: SQL translation and validation can only be run after data has been loaded into BigQuery
If you choose the deployment of Translation + Data Migration, in addition to the translation activities mentioned above, you will be able to
- Perform initial data loads
- Row and column data validation using DVT tool
1. Before you proceed with DMT terraform deployment, ensure that there is a GCS bucket created to store Terraform infrastructure State Files
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
gcloud config set project ${SOURCE_PROJECT}
gcloud storage buckets create gs://${SOURCE_PROJECT}-dmt-state-bucket
2. [Mandatory for Oracle] Before you proceed with the DMT terraform deployment, ensure that the Oracle client .RPM library is uploaded to a GCS bucket
Guidelines to download the Oracle client RPM file and upload it to a GCS bucket -
- Download the Oracle Instant Client library (.rpm extension) file from Oracle downloads: https://www.oracle.com/in/database/technologies/instant-client/linux-x86-64-downloads.html
- Upload the file to a GCS bucket
gsutil cp <downloaded_oracle_client_rpm_file> gs://<gcs_bucket>/<nested_folders_if_any>/
(The file can also be uploaded using the Cloud Console)
It is advisable to keep the Oracle client RPM file name the same as downloaded.
- Set the below environment variables (these will be used while running the deployment command) -
export ORACLE_RPM_GCS_BUCKET_PATH=<Fully Qualified RPM file GCS bucket path>
For example, a fully qualified RPM file GCS bucket path is
gs://<gcs_bucket>/<nested_folders_if_any>/downloaded_filename.rpm
export ORACLE_INSTANT_CLIENT_LIB_VERSION=<Oracle_Instant_Client_Lib_Version>
The version can be found from the downloaded file or from the Oracle download page from which the library was downloaded
Note: This library is required by the Data Validation Tool; while running the deployment command, the library details must be passed as substitution parameters
In Shared VPC networking, a designated host project provides the VPC network and subnetwork to the service project where the Cloud Composer environment will be deployed. The service project must be attached to the host project to use the VPC (a gcloud sketch for the attachment appears at the end of this section).
Perform the below pre-deployment steps to set up/configure the Shared VPC for Composer -
- Setup the environment variable if it doesn't exist.
export HOST_PROJECT=<HOST_PROJECT_ID>
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
- Network resource configuration - The VPC network should have a subnetwork in a region where Cloud Composer is available (see the regions list under the Products available by location section). The subnetwork should have 2 secondary IP ranges created explicitly, which are required for GKE pods and services.
Note:
- The mask size for the secondary IP ranges should be between /9 and /23 bits based on the Composer environment size. For example, /17 and /22 are recommended for a medium Composer environment.
- While creating a subnet or using an existing one, enable Private Google Access (this is recommended even if you plan to use Public IP environments).
- Enable GKE API in your host and service projects if not enabled
gcloud services enable container.googleapis.com --project $HOST_PROJECT
gcloud services enable container.googleapis.com --project $SOURCE_PROJECT
-
Provide the below permissions to the service project's service agents in the host project -
a. Create an environment variable for the service project number (the service project number is available on the project Dashboard page under the Project info section) -
export SERVICE_PROJECT_NUMBER=<service_project_number>
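Alternatively, a sketch of deriving the number with gcloud instead of copying it from the console:
# Sketch: fetch the service project number programmatically
export SERVICE_PROJECT_NUMBER=$(gcloud projects describe $SOURCE_PROJECT --format='value(projectNumber)')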
b. Provide Compute Network User Permission to Google APIs service agent
export GOOGLE_API_SERVICE_AGENT=$SERVICE_PROJECT_NUMBER@cloudservices.gserviceaccount.com
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$GOOGLE_API_SERVICE_AGENT" \
--role="roles/compute.networkUser"
c. Provide Compute Network User and Kubernetes Engine Host Service Agent User Permission to GKE service agent
export GKE_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$GKE_SERVICE_AGENT" \
--role="roles/compute.networkUser"
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$GKE_SERVICE_AGENT" \
--role="roles/container.hostServiceAgentUser"
d. Provide permission to the Composer Service Agent
- For a Private IP environment, provide the Composer Shared VPC Agent permission to the Composer Service Agent account
export COMPOSER_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@cloudcomposer-accounts.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$COMPOSER_SERVICE_AGENT" \
--role="roles/composer.sharedVpcAgent"
- Or, for a Public IP environment, provide the Compute Network User permission to the Composer Service Agent account
export COMPOSER_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@cloudcomposer-accounts.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$COMPOSER_SERVICE_AGENT" \
--role="roles/compute.networkUser"
For more information and references, please visit the Configure Shared VPC networking documentation.
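If the service project is not yet attached to the host project, the attachment can also be done from the CLI; a sketch (requires the Shared VPC Admin role on the host project):
# Sketch: attach the service project to the Shared VPC host project
gcloud compute shared-vpc associated-projects add $SOURCE_PROJECT \
--host-project $HOST_PROJECT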
- Set the default GCP project in your Cloud Shell, terminal, or bastion host with Terraform, Terragrunt, and the gcloud SDK installed
gcloud config set project ${SOURCE_PROJECT}
- Check the current project
gcloud config get project
- Run the command
gcloud auth application-default login
-
Log in with your Google Cloud credentials and ensure you have the required IAM permissions listed in the step [Assign Executing User Permissions]
-
The DMT tool uses the default VPC network and subnetwork for Cloud Composer and the Teradata Agent VM Compute Engine. If you wish to use a custom VPC and subnet, provide the right values for the custom VPC and subnet in these files:
~/terraform/translation/gcc/variables.tf
~/terraform/datamigration/gce/variables.tf
Note: The DMT tool does not create VPCs and subnets; these should be created beforehand.
It is recommended to use the same VPC and subnet for Cloud Composer and the Teradata Agent VM, with that VPC and subnet able to reach the source Teradata and Hive data warehouses on the necessary ports.
-
If you would like to customize additional variables, please ensure the appropriate values are set for those variables in the variables.tf file in each service's folder - pubsub, gcc, gcs, bq, cloudrun
~/terraform/translation/gcs/variables.tf
~/terraform/translation/pubsub/variables.tf
~/terraform/translation/cloudrun/variables.tf
~/terraform/translation/gcc/variables.tf
~/terraform/translation/bq/variables.tf
(Note: The region for Cloud Composer and the DVT Cloud Run service should be the same.)
- Navigate inside the dmt repo
cd data-migration-tool
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=<source data warehouse>
Cloud Build DMT uses the following default substitution values
- PROJECT_ID=${PROJECT_ID}
- _MY_BUCKET_NAME=${PROJECT_ID}-dmt-state-bucket
- _MY_CUSTOMER_NAME=${PROJECT_ID}
- _DELETE_BQ_TABLES="false"
If you wish to change the default customer name tagged as a suffix for all services, or change the name of the bucket storing the Terraform state files (which should be created beforehand), provide the values as substitution variables in the gcloud builds command.
For example, if the bucket name for state files = demo-state and the required customer name = democust:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_MY_BUCKET_NAME=demo-state,_MY_CUSTOMER_NAME=democust,_DATA_SOURCE=hive
Otherwise, DMT deploys with the default predefined naming conventions for the state file GCS bucket and customer name.
The default false value of _DELETE_BQ_TABLES prevents BigQuery tables from being deleted when you destroy the DMT infrastructure in the future. If you wish to destroy the BigQuery tables with the destroy script, set the value to true in the gcloud builds command.
For example,
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DELETE_BQ_TABLES="true",_DATA_SOURCE=<data_source>
Otherwise DMT deploys with default predefined values.
Additional substitution variable for Oracle data source
- _ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH}
- _ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}
These use the environment variables created in the pre-deployment steps
For instance, if you would like to leverage DMT for
Teradata to BigQuery Migration
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=teradata
Hive to BigQuery Migration
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=hive
Oracle to BigQuery Migration
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH},_ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}
Note:
- If Data Migration is not required, you can skip the _DATA_SOURCE substitution variable, which will deploy only the DMT Translation architecture.
- The initial Oracle release supports only schema migration, so the _DATA_SOURCE substitution variable is not required.
All the available deployment configurations and corresponding gcloud build commands are listed below -
- Translation Only - All Sources (except Oracle):
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml
- Translation for Oracle:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH},_ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}
- Translation + Data Migration - Teradata:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=teradata
- Translation + Data Migration - Hive:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=hive
- Translation + Data Migration - Redshift:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=redshift
Script execution starting logs
starting build "206986f5-4e41-48f6-bff7-xyz"
FETCHSOURCE
Fetching storage object: gs://dmt-1_cloudbuild/source/1684304084.135633-xyz.tgz#1684304086786340
Copying gs://dmt-1_cloudbuild/source/1684304084.135633-xyz.tgz#1684304086786340...
/ [0 files][ 0.0 B/115.2 KiB]
/ [1 files][115.2 KiB/115.2 KiB]
Operation completed over 1 objects/115.2 KiB.
BUILD
Pulling image: gcr.io/google.com/cloudsdktool/cloud-sdk:latest
latest: Pulling from google.com/cloudsdktool/cloud-sdk
918547b94326: Pulling fs layer
9ecbbb83fff4: Pulling fs layer
831b9ef5c145: Pulling fs layer
4718bdb8c7d1: Pulling fs layer
671167368ac0: Pulling fs layer
014f91216bbc: Pulling fs layer
4718bdb8c7d1: Waiting
671167368ac0: Waiting
014f91216bbc: Waiting
831b9ef5c145: Verifying Checksum
9ecbbb83fff4: Verifying Checksum
9ecbbb83fff4: Download complete
918547b94326: Verifying Checksum
918547b94326: Download complete
014f91216bbc: Verifying Checksum
014f91216bbc: Download complete
671167368ac0: Verifying Checksum
671167368ac0: Download complete
918547b94326: Pull complete
9ecbbb83fff4: Pull complete
831b9ef5c145: Pull complete
4718bdb8c7d1: Download complete
4718bdb8c7d1: Pull complete
671167368ac0: Pull complete
014f91216bbc: Pull complete
- This automated infra setup for GCS, Pub/Sub, Cloud Composer, Cloud Run, and the BQ DVT dataset is expected to take approximately 35-45 minutes to complete. Cloud Composer creation takes the most time.
You can monitor the status of the deployment in your Cloud Build logs by navigating to the below path -
Google Cloud Console -> Cloud Build -> History
Since the chosen option in this example is Translation + Data Migration for Teradata, Cloud Build will automatically trigger the deployment of the Data Migration architecture as well.
Successful execution log -
Apply complete! Resources: 2 added, 0 changed, 0 destroyed.
google_compute_router.router[0]: Still creating... [20s elapsed]
google_compute_router.router[0]: Creation complete after 21s [id=projects/dmt-1/regions/us-central1/routers/dmt-cloud-router]
google_compute_router_nat.nat[0]: Creating...
google_compute_router_nat.nat[0]: Still creating... [10s elapsed]
google_compute_router_nat.nat[0]: Still creating... [20s elapsed]
google_compute_router_nat.nat[0]: Creation complete after 22s [id=dmt-1/us-central1/dmt-cloud-router/dmt-cloud-nat]
google_compute_disk.dm-disk-terdata-bq: Creating...
google_compute_disk.dm-disk-terdata-bq: Creation complete after 1s [id=projects/dmt-1/zones/us-central1-a/disks/dm-disk-terdata-bq]
google_compute_instance.dm-vm-teradata-bq: Creating...
google_compute_instance.dm-vm-teradata-bq: Still creating... [10s elapsed]
google_compute_instance.dm-vm-teradata-bq: Creation complete after 12s [id=projects/dmt-1/zones/us-central1-a/instances/dm-vm-teradata-bq]
Apply complete! Resources: 10 added, 0 changed, 0 destroyed.
PUSH
DONE
Follow the below steps for Teradata and Oracle
Check whether the environment variables still exist and are correct; if not, set them again for the post-installation steps
Ensure you have an existing source database environment for testing:
- For Teradata,
- If you do not have one, you can create one following the steps here.
- If you want to follow the qwiklab to set up TD instance with tpch installed when finished, you can follow the steps here.
- For Oracle,
- If you do not have one, you can create one following the steps here.
Set environment variables in Cloud Shell
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
<SOURCE_PROJECT> - Project ID where you will be deploying DMT tool
- Grant the Cloud Run Invoker role to the Pub/Sub service account (dmt-sa-pubsub) under GCP IAM roles.
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:dmt-sa-pubsub@$SOURCE_PROJECT.iam.gserviceaccount.com" \
--role="roles/run.invoker"
-
Go to the GCP console and search for Kubernetes Clusters - select your composer cluster instance and identify the IP subnet range of the Pod Address range.
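If you prefer the CLI, the Pod address range can also be read from the Composer GKE cluster; a sketch (cluster name and location are placeholders):
# Sketch: print the Composer GKE cluster's Pod IPv4 CIDR
gcloud container clusters describe <COMPOSER_GKE_CLUSTER_NAME> \
--region=<COMPOSER_GKE_CLUSTER_REGION> \
--format='value(clusterIpv4Cidr)'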
-
Go to your VPC network (default in this case) and create a firewall rule with the name pod-operator-firewall.
- Direction of Traffic - Ingress
- Action on Match - Allow
- Targets - All instances in the network [Teradata/Oracle GCE instance]
- Source Filter - IPv4
- Source IPv4 ranges - the Pod address range you got in step #2, e.g. 10.57.0.0/17
- Protocols and Ports - Allow all / Allowed Teradata ports
gcloud compute firewall-rules create pod-operator-firewall \
--project=$SOURCE_PROJECT \
--direction=INGRESS \
--priority=1000 \
--network=<YOUR Composer VPC name> \
--action=ALLOW \
--rules=all \
--source-ranges=<YOUR Composer Kubernetes Pod Cluster IP obtained from Step 2>
Note: For real customer use cases, the Composer cluster Pod address range should be able to reach the on-prem Teradata/Oracle instance through the firewall rules being created
-
Go to Secret Manager and create a new secret with the name secret-edw_credentials (a gcloud sketch follows the notes below)
This should contain the password for the Teradata/Oracle DB. It is recommended to keep the name of the secret as secret-edw_credentials
Note (Real customer use cases): please create VPC access connectors in the VPC that has firewall rules open to connect to the customer's on-prem Teradata instance. The Cloud Run service (which executes the DVT tool) requires the VPC access connector to be attached to it in order to reach the Teradata/Oracle DB
Note (If you are a Googler): If you are a Googler and are running a test environment with Teradata Express edition or Oracle on GCP Compute VMs, ensure you add the network tag cc-dvt to the TD/Oracle instance
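A sketch of creating that secret from the CLI (the password value is a placeholder):
# Sketch: create the secret and add the database password as its first version
echo -n '<TERADATA_OR_ORACLE_DB_PASSWORD>' | gcloud secrets create secret-edw_credentials \
--project=$SOURCE_PROJECT \
--data-file=-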
-
Validate that all Airflow DAGs in the Cloud Composer instance are in a paused state through the Airflow UI
-
Switch all the DAG status to Active
-
Validate that workload_identity_creator_dag in the Airflow UI is executed once automatically on turning the DAGs active and completes successfully
-
If you intend to migrate tables to a BQ dataset other than dmt-teradata-dataset, please ensure this dataset is created manually and is provided in the config files later (create tables from DDLs); see the bq sketch below
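A minimal sketch of creating such a dataset manually with the bq CLI (dataset name and location are placeholders):
# Sketch: create the custom target dataset before referencing it in the config files
bq --project_id=$SOURCE_PROJECT mk --dataset --location=US <YOUR_TARGET_DATASET>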
-
(Mandatory for Teradata DDL extraction) Teradata DDL extraction requires the Teradata JDBC jar. Upload the JDBC jar file to your GCS config bucket under the software/teradata folder as shown below:
- Download the jar from Teradata downloads: https://downloads.teradata.com/download/connectivity/jdbc-driver
- Upload the jar to your GCS config bucket and ensure that the jar file name exactly matches terajdbc4.jar. Make sure the file is placed in the software/teradata path (e.g. gs://YOUR_DMT_CONFIG_BUCKET/software/teradata/terajdbc4.jar).
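A sketch of the upload with gsutil (the bucket name is a placeholder):
# Sketch: place the Teradata JDBC driver at the expected path in the config bucket
gsutil cp terajdbc4.jar gs://<YOUR_DMT_CONFIG_BUCKET>/software/teradata/terajdbc4.jar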
- (Optional - Only for Oracle) Remove the GCS bucket created to store the .RPM file for setup.
Build the connection between Cloud Run and the source database [applicable when network connectivity uses VPN connectivity]
- Create a serverless VPC connector (specify the same network and region that has firewall rules allowing connection to the on-prem/cloud data source)
gcloud compute networks vpc-access connectors create crun-dvt-connector \
--network default \
--region us-central1 \
--range 10.6.0.0/28 \
--min-instances 2 \
--max-instances 3 \
--machine-type f1-micro
-
Attach the network tag cc-dvt as the target, or directly provide the IP address of the on-prem TD/Oracle instance. (This tag can be used as the target tag while creating a firewall rule at a later stage.)
-
Create an ingress firewall rule (in the source database network) to deny traffic from the connector network tag:
gcloud compute firewall-rules create deny-vpc-connector \
--action=DENY \
--rules=all \
--source-tags=vpc-connector-us-central1-crun-dvt-connector \
--direction=INGRESS \
--network=default \
--priority=990
- Create an ingress firewall rule targeting the Teradata/Oracle IP that you want the VPC connector to access. Set the priority for this rule to a lower value than the priority of the deny rule created above. (Here cc-dvt is the network tag attached to the Teradata/Oracle VM.)
gcloud compute firewall-rules create allow-vpc-connector-for-select-resources \
--allow=all \
--source-tags=vpc-connector-us-central1-crun-dvt-connector \
--direction=INGRESS \
--network=default \
--target-tags=cc-dvt \
--priority=980
-
Deploy a new version of the Cloud Run DVT service (edw-dvt-tool-<customerid>) after performing the below changes:
- Secret: secret-edw_credentials
- Reference Method: Exposed as environment variable
- Environment variables - Name 1: edw_credentials
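If you prefer the CLI over the console for this change, a sketch of the equivalent update (the service name, region, and connector name are the examples used elsewhere in this guide; adjust to your deployment):
# Sketch: expose the secret as the edw_credentials environment variable and attach the VPC connector
gcloud run services update edw-dvt-tool-<customerid> \
--project=$SOURCE_PROJECT \
--region=us-central1 \
--update-secrets=edw_credentials=secret-edw_credentials:latest \
--vpc-connector=crun-dvt-connector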
- Translations
- Schema Migration to Google BigQuery
- Schema Validation
-
Prepare a sample config.json file including input/output buckets, translation type (DDL, SQL or DML), dialects, name mapping properties, and DVT connection details
-
Upload required SQL and DDL files into respective bucket paths defined in config.json
For example,
DDL = gs://dmt-translation-control/input/ddl
SQL = gs://dmt-translation-control/input/sql (SQL files), gs://dmt-translation-control/input/sql/ddl (DDL/Metadata files)
DML = gs://dmt-translation-control/input/dml (DML files), gs://dmt-translation-control/input/sql/ddl (DDL/Metadata files)
-
Upload config.json into the config folder of the bucket
For example,
DDL = gs://dmt-config-control/input/ddl
SQL = gs://dmt-config-control/input/sql
DML = gs://dmt-config-control/input/dml
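For reference, a sketch of uploading the inputs and the config file with gsutil, using the example bucket paths above (adjust bucket names, folders, and local paths to your environment):
# Sketch: upload input DDL files and the corresponding config file
gsutil cp ./ddl/*.sql gs://dmt-translation-control/input/ddl/
gsutil cp config.json gs://dmt-config-control/input/ddl/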
-
In case of DDL - validate execution of DAGs
- controller_dag
- extract_ddl_dag - if extract_ddl is yes
- batch_sql_translation
- schema_dag
- validation_dag - if validation_mode is gke
- validation_crun - if validation_mode is cloudrun
-
In case of SQL - validate execution of DAGs
- controller_dag
- batch_sql_translation
- validation_dag - if validation_mode is gke
- validation_crun - if validation_mode is cloudrun
-
In case of DML - validate execution of DAGs
- controller_dag
- batch_sql_translation
- dml_validation_dag
-
Translated DDL and SQL files can be found in output buckets
DDL = gs://dmt-translation-control/output/ddl
SQL = gs://dmt-translation-control/output/sql
DML = gs://dmt-translation-control/output/dml
-
Translation Summary results can be found in the output buckets along with translated files - batch_translation_report.csv
Click here for sample config.json files
config.json attributes for DDL translation:
JSON attribute | Description |
batchDistribution | Defines the number of parallel Composer Airflow DAGs [Number of DAGs = Total DDL files/batchDistribution] |
extract_ddl | Provide flag for ddl/schema extraction from source database |
unique_id | Provide a unique name for every batch run identification |
type | Type of Translation : ddl |
source | Source datawarehouse : teradata |
translationErrorFilterRules | Field : Category, Matchtype : equals, Value : <> Allows bypassing of warnings and BQ dialect translation errors; Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics |
migrationTask: type | Translation_Teradata2BQ |
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode | Teradata - SQL, BTEQ |
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase | BigQuery GCP Project ID |
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect | {} |
gcsSourcePath | GCS location of input DDL Teradata files |
gcsTargetPath | GCS location where translated output DDL BQ dialect files must be placed |
nameMappingList:name_map:source | Create output name mapping for schema translations for source. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
nameMappingList:name_map:target | Create output name mapping for schema translations for target BQ. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
validation_config:source_config:source_type | DVT validation source datawarehouse : Teradata |
validation_config:source_config:host | Hostname/IP address of on-prem Teradata |
validation_config:source_config:port | Port number of on-prem Teradata |
validation_config:source_config:user-name | User name with permissions to query on source tables |
validation_config:source_config:password | Secret Manager key name, secret:<secret key name>; for example - secret:edw_credentials |
validation_config:target_config:target_type | BigQuery |
validation_config:target_config:project-id | BigQuery GCP Project ID |
validation_config:validation_params_file_path | GCS location of the CSV file or Excel sheet containing table or file names along with DVT validation flags. Examples: gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.csv, gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.xlsx. Read the instructions below to understand how to populate and upload this file/sheet |
validation_type | DVT validation type : schema |
validation_mode | DVT validation mode. Possible values - gke, cloudrun |
pod_operator_mem | Pod Operator memory configuration when DVT mode is chosen as GKE. Default value - 4000M |
pod_operator_cpu | Pod Operator cpu configuration when DVT mode is chosen as GKE. Default value - 800m |
config.json attributes for SQL translation:
JSON attribute | Description |
batchDistribution | Defines the number of parallel Composer Airflow DAGs [Number of DAGs = Total SQL files/batchDistribution] |
unique_id | Provide a unique name for every batch run identification |
type | Type of Translation : sql |
source | Source datawarehouse : teradata |
translationErrorFilterRules | Field : Category, Matchtype : equals, Value : <> Allows bypassing of warnings and BQ dialect translation errors; Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics |
migrationTask: type | Translation_Teradata2BQ |
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode | Teradata - SQL, BTEQ |
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase | BigQuery GCP Project ID |
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect | {} |
gcsSourcePath | GCS location of input SQL Teradata files |
gcsTargetPath | GCS location where translated output SQL BQ dialect files must be placed |
nameMappingList:name_map:source | Create output name mapping for schema translations for source. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
nameMappingList:name_map:target | Create output name mapping for schema translations for target BQ. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
validation_config:source_config:source_type | DVT validation source datawarehouse : Teradata |
validation_config:source_config:host | Hostname/IP address of on-prem Teradata |
validation_config:source_config:port | Port number of on-prem Teradata |
validation_config:source_config:user-name | User name with permissions to query on source tables |
validation_config:source_config:password | Secret Manager key name, secret:<secret key name>; for example - secret:edw_credentials |
validation_config:target_config:target_type | BigQuery |
validation_config:target_config:project-id | BigQuery GCP Project ID |
validation_config:validation_params_file_path | GCS location of the CSV file or Excel sheet containing table or file names along with DVT validation flags. Examples: gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.csv, gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.xlsx. Read the instructions below to understand how to populate and upload this file/sheet |
validation_type | DVT validation type : row/column |
validation_mode | DVT validation mode. Possible values - gke, cloudrun |
pod_operator_mem | Pod Operator memory configuration when DVT mode is chosen as GKE. Default value - 4000M |
pod_operator_cpu | Pod Operator cpu configuration when DVT mode is chosen as GKE. Default value - 800m |
config.json attributes for DML translation:
JSON attribute | Description |
batchDistribution | N/A |
unique_id | Provide a unique name for every batch run identification |
type | Type of Translation : dml |
source | Source datawarehouse : teradata |
translationErrorFilterRules | Field : Category, Matchtype : equals, Value : <> Allows bypassing of warnings and BQ dialect translation errors; Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics |
migrationTask: type | Translation_Teradata2BQ |
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode | Teradata - SQL, BTEQ |
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase | BigQuery GCP Project ID |
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect | {} |
gcsSourcePath | GCS location of input DML Teradata files |
gcsTargetPath | GCS location where translated output DML BQ dialect files must be placed |
nameMappingList:name_map:source | Create output name mapping for dml translations for source. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
nameMappingList:name_map:target | Create output name mapping for dml translations for target BQ. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
Please do not edit headers, data validation rules, or other sheets in the Excel/Google Sheet, as doing so will alter the data validation checks put in place.
Fill the columns according to the use case. Data validation rules within the sheet will help to fill the fields correctly.
To know more about which flags are relevant for which translation (ddl, sql, custom-query) and migration (data), please refer to the tab named DVT Guide. This sheet contains links to the DVT GitHub pages, which explain all possible flags in the DVT CLI.
You can refer to the tab Example Sheet to get an idea of which flags can be supplied for which validation.
- To follow along with the Excel sheet
- Open this Excel Template File in Microsoft Excel (not in Google Sheets, as opening .xlsx in Google Sheets will corrupt the existing Data Validation Rules).
- Fill the sheet Validation. Once filled, save the file in .xlsx format. (Excel will convert the template file format .xltx to .xlsx for you.)
- To follow along with the CSV file format
- Open this Google Sheet in browser (Viewer Access).
- Make a copy, to get an editable sheet.
- Once filled, download CSV file using the Menu Option. (File --> Download --> Comma Separated Values (.csv))
Once filled, upload the corresponding Excel (.xlsx) or CSV file to the GCS bucket (inside the relevant folder within dmt-config-dmt-demo-project/validation) and specify the GCS path in the configuration file.
- Request access to template
- Access Dashboard template by clicking here
- Click on the three dots as shown in the below image and click on Make a copy
Existing Data Source in Template | BQ Table in Customer’s GCP Project |
DVT Aggregated Results | <GCP_PROJECT_ID>.dmt_logs.dmt_dvt_aggregated_results |
Validation Results | <GCP_PROJECT_ID>.dmt_logs.dmt_dvt_results |
Schema Creation Results | <GCP_PROJECT_ID>.dmt_logs.dmt_schema_aggregated_results |
Schema Creation Aggregated Results | <GCP_PROJECT_ID>.dmt_logs.dmt_schema_results |
Translation Results | <GCP_PROJECT_ID>.dmt_logs.dmt_translation_results |
Translation Aggregated Results | <GCP_PROJECT_ID>.dmt_logs.dmt_translation_aggregated_results |
DML Validation Results | <GCP_PROJECT_ID>.dmt_logs.dmt_dml_validation_results |
DML Validation Aggregated Results | <GCP_PROJECT_ID>.dmt_logs.dmt_dml_validation_aggregated_results |
Report Table | <GCP_PROJECT_ID>.dmt_logs.dmt_report_table |
File Table Mapping | <GCP_PROJECT_ID>.dmt_logs.dmt_file_table_mapping |
{Data Source} Transfer Run Jobwise Details | <GCP_PROJECT_ID>.dmt_logs.dmt_<data_source>_transfer_run_jobwise_details |
{Data Source} Transfer Tracking | <GCP_PROJECT_ID>.dmt_logs.dmt_<data_source>_transfer_tracking |
{Data Source} Transfer Run Summary | <GCP_PROJECT_ID>.dmt_logs.dmt_<data_source>_transfer_run_summary |
- Now click on Copy Report. This will create a copy of the dashboard template.
5. In the new template, click on Resource and then select Manage added data sources to add/replace BigQuery datasets.
6. You'll now see a new page with the list of tables used in this template. Click on EDIT for one table at a time. This will navigate you to the next page.
8. In the search bar, type your GCP project name and select it. Then select the logs dataset and the respective table. Now click on RECONNECT.
9. You'll get a pop-up; click on Apply and then click on Done.
10. Repeat Steps 5 to 8 for all the remaining data sources, until all the data sources have been replaced.
11. Your dashboard will be named Copy of DMT Dashboard. Click on the name to rename it to whatever you prefer for the dashboard.
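Before (or after) wiring up the dashboard, you can spot-check that a logging table is populated; a sketch using one of the default table names above:
# Sketch: count rows in one of the DMT logging tables
bq query --use_legacy_sql=false \
"SELECT COUNT(*) AS row_count FROM \`${SOURCE_PROJECT}.dmt_logs.dmt_dvt_results\`"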
Pre-destroy step
- If you did not set the _DELETE_BQ_TABLES substitution parameter to true at the time of deployment, delete the dmt_dataset and dmt_logs BigQuery datasets manually before running the destroy command (a bq sketch follows).
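A minimal sketch of that manual cleanup with the bq CLI (dataset names follow the note above; your actual dataset names may differ if customized):
# Irreversible: -r removes all tables in the dataset, -f skips the confirmation prompt
bq rm -r -f -d ${SOURCE_PROJECT}:dmt_logs
bq rm -r -f -d ${SOURCE_PROJECT}:dmt_dataset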
Destroy DMT for Teradata using the following command:
gcloud builds submit . \
--config cloudbuild_destroy.yaml \
--substitutions _DATA_SOURCE="teradata"
Cloud Build DMT uses the following default substitution values
- PROJECT_ID=${PROJECT_ID}
- _MY_BUCKET_NAME=${PROJECT_ID}-dmt-state-bucket
- _MY_CUSTOMER_NAME=${PROJECT_ID}
- _DELETE_BQ_TABLES="false"
These default values need to be updated to match the values given at the time of deployment.