- For DMT introduction, please refer to /README.md
- For Teradata data migration deployment user guide, please refer to /docs/installation/README_teradata_datamigration.md
- For Hive data migration deployment user guide, please refer to /docs/installation/README_hive_datamigration.md
- For Redshift data migration deployment user guide, please refer to /docs/installation/README_redshift_datamigration.md
-
Firewall rules - enable firewall connectivity between the source database and the Google Cloud VPC
[For Redshift]
For AWS Redshift connectivity, please refer to the guide https://cloud.google.com/bigquery/docs/migration/redshift-vpc
Note: The Cloud Composer (GCC) VPC subnetwork must be created in the same GCP region as the VPN tunnel established to connect to the AWS Redshift cluster
-
[Mandatory] The BigQuery Transfer Service Agent VM requires internet access to update packages in the Agent VM environment. The recommended approach is therefore to have a Cloud NAT in place in the same VPC, subnet, and region as your Teradata Agent VM
Cloud NAT is deployed automatically using default settings through terraform with standard naming conventions and automatic IP allocations.
If you wish to use your own Cloud NAT and Cloud Router with custom allocations, ensure they are created before proceeding with the deployment (a gcloud sketch follows these notes).
When creating the Cloud NAT and Cloud Router manually, turn off automatic NAT creation by setting the variable create_nat = "false" in ~/terraform/datamigration/gce/variables.tf after cloning the DMT repository
If you are okay with automatic NAT creation through terraform, please proceed without any changes.
The Cloud Router and NAT should be created in the same region where the VPC subnets for Cloud Composer and the Migration Agent VM are going to be created
The DVT Cloud Run service and Cloud Composer should be created in the same region.
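If you opt for manual creation, a sketch of creating the Cloud Router and Cloud NAT with gcloud follows. The names mirror those used by the automated deployment; the network and region are examples to adjust to your environment.
# Sketch: manual Cloud Router + Cloud NAT creation (network/region are examples)
gcloud compute routers create dmt-cloud-router \
--network=default \
--region=us-central1
gcloud compute routers nats create dmt-cloud-nat \
--router=dmt-cloud-router \
--region=us-central1 \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges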
-
Using Standard Deployment for infrastructure will ensure you have default (standard) names for GCP services, service accounts, and default folder structures. This is the recommended approach.
-
By default, all service and bucket names are suffixed with the project ID to make them unique. If you wish to change this, provide the _MY_CUSTOMER_NAME substitution parameter as part of the gcloud builds submit command later during deployment
-
By default, the tool deploys using the below locations/regions
- us-central1-a
- US location
To change the locations, go to variables.tf file under each GCP service terraform module and update the location values according to your requirements before running the Bash wrapper script for Infrastructure deployment.
It is recommended to keep the migration pipeline components in the same regions as much as possible.
-
By default, GCP services requiring a VPC and subnet will be deployed in the default VPC and subnet. To provide custom VPCs and subnets, go to the variables.tf file under each applicable GCP service terraform module and update the VPC names and subnet names. This step is applicable for
- Cloud Composer (gcc)
- Teradata Agent Compute Engine VM (gce)
- Cloud Run
- Best practice - provide input DDLs and SQLs in the form of one DDL or SQL query per file.
- Incremental data migration is supported for HIVE only.
- The current release supports DDLs and SELECT queries only. DML statements are supported with the dry run option only.
- The current release doesn't support data transfer for Oracle or incremental data load for Teradata and Redshift.
Deployment of DMT through terraform modules is split into 2 parts
- Translation – For DDL, DML, SQL translation and validation.
- Data Migration - For data migration from different sources (Teradata and Hive)
The code repository is hosted on GitHub and can be cloned using the below command.
git clone https://github.com/GoogleCloudPlatform/data-migration-tool
Navigate to the cloned directory using the below command
cd data-migration-tool
Check out the main branch using the below command
git checkout main
The Admin user who will be executing the deployment of DMT through Cloud Build will require the below set of permissions
- roles/bigquery.dataViewer
- roles/bigquery.user
- roles/cloudbuild.builds.editor
- roles/run.viewer
- roles/compute.admin
- roles/iam.serviceAccountViewer
- roles/logging.viewer
- roles/serviceusage.serviceUsageConsumer
- roles/storage.admin
- projects/${PROJECT_ID}/roles/DMTAdminAdditionalPermissions
The user who will be using the DMT tool will require the below set of permissions
- roles/bigquery.dataViewer
- roles/bigquery.admin
- roles/run.viewer
- roles/composer.user
- roles/storage.admin
- roles/vpcaccess.admin
- roles/logging.viewer
- projects/${PROJECT_ID}/roles/DMTUserAdditionalPermissions
Note - DMTUserAdditionalPermissions is a custom role for the DMT user and DMTAdminAdditionalPermissions is a custom role for the DMT admin
export ADMIN_ACCOUNT=<EXECUTING_ADMIN_ACCOUNT>
export USER_ACCOUNT=<EXECUTING_USER_ACCOUNT>
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
In addition to the predefined roles, DMT requires custom user/admin permissions. Execute the Bash scripts dmt-user-roles-setup.sh and dmt-admin-roles-setup.sh present in the root directory to create the custom DMT user/admin additional-permission roles
bash dmt-user-roles-setup.sh
bash dmt-admin-roles-setup.sh
To assign these roles, you can execute the Bash scripts dmt-user-setup.sh and dmt-admin-user-setup.sh present in the root directory
bash dmt-user-setup.sh
bash dmt-admin-user-setup.sh
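If you prefer to grant the predefined roles listed above by hand instead of (or in addition to) the setup scripts, each binding can be added with a command of this form (a sketch; repeat per role and per account):
# Sketch: grant one of the predefined roles from the lists above to the user account
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="user:$USER_ACCOUNT" \
--role="roles/composer.user"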
From the Cloud Shell, you can enable Google Cloud Services using the gcloud command line interface in your Google Cloud project.
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
gcloud config set project $SOURCE_PROJECT
gcloud services enable serviceusage.googleapis.com \
cloudresourcemanager.googleapis.com \
cloudbuild.googleapis.com
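As an optional check (sketch), you can confirm the APIs are active in the project:
# List the services currently enabled in the project
gcloud services list --enabled --project $SOURCE_PROJECT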
Identify the Cloud Build default Service Account and assign the below IAM roles for the execution of the deployment of DMT architecture.
Google Cloud Console -> Cloud Build -> Settings
IAM roles required for the created Service Account
Artifact Registry Administrator
BigQuery Admin
Cloud Build Service Account
Cloud Run Admin
Composer Administrator
Compute Instance Admin (v1)
Compute Network Admin
Create Service Accounts
Logs Viewer
Project IAM Admin
Pub/Sub Admin
Secret Manager Admin
Service Account User
Service Usage Admin
Storage Admin
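The default Cloud Build service account is typically named after the project number; as a sketch, it can be derived from the CLI (verify against the Cloud Build Settings page, since your project may use a different build service account):
# Sketch: derive the default Cloud Build service account from the project number
PROJECT_NUMBER=$(gcloud projects describe $SOURCE_PROJECT --format='value(projectNumber)')
export BUILD_ACCOUNT=${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com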
To assign these roles, you can execute the Bash script cloudbuild-sa-iam-setup.sh present in the root directory
export BUILD_ACCOUNT=<CLOUDBUILD_SERVICE_ACCOUNT>
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
bash cloudbuild-sa-iam-setup.sh
This script will assign all necessary IAM permissions to the Cloud Build Service Account for deployment of the DMT tool.
Once the execution of the bash script is successful, you can proceed directly to the next step Deploying DMT Infrastructure
Alternatively, you can also assign the roles by running gcloud commands from your Cloud Shell
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/bigquery.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/run.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/composer.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/compute.instanceAdmin.v1"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/compute.networkAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/iam.serviceAccountCreator"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/logging.viewer"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/resourcemanager.projectIamAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/pubsub.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/secretmanager.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/iam.serviceAccountUser"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/serviceusage.serviceUsageAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/storage.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/artifactregistry.admin"
Translations deployment will take care of the workflow needed to perform
- DDL, DML and SQL translations
- Schema migration to BQ (Creation of tables from translated DDLs)
- Schema Validation and Custom query (row/column for SQL validation) using the DVT tool
- Options to run DVT on Cloud Run or a GKE Airflow pod operator depending on scaling requirements.
Note: SQL translation and validation can only be run after data has been loaded into BigQuery
If you choose the deployment of Translation + Data Migration, in addition to the translation activities mentioned above, you will be able to
- Perform initial data loads
- Row and column data validation using DVT tool
1. Before you proceed with DMT terraform deployment, ensure that there is a GCS bucket created to store Terraform infrastructure State Files
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
gcloud config set project ${SOURCE_PROJECT}
gcloud storage buckets create gs://${SOURCE_PROJECT}-dmt-state-bucket
2. [Mandatory for Oracle] Before you proceed with the DMT terraform deployment, ensure that the Oracle client .RPM library is uploaded to a GCS bucket
Guidelines to download the Oracle client RPM file and upload it to a GCS bucket -
- Download the Oracle Instant Client library (.rpm extension) file from Oracle downloads: https://www.oracle.com/in/database/technologies/instant-client/linux-x86-64-downloads.html
- Upload the file to a GCS bucket
gsutil cp <downloaded_oracle_client_rpm_file> gs://<gcs_bucket>/<nested_folders_if_any>/
(The file can also be uploaded using the Cloud Console)
It is advisable to keep the Oracle client RPM file name the same as downloaded.
- Set the below environment variables (these will be used while running the deployment command) -
export ORACLE_RPM_GCS_BUCKET_PATH=<Fully Qualified RPM file GCS bucket path>
For example, a fully qualified RPM file GCS bucket path is
gs://<gcs_bucket>/<nested_folders_if_any>/downloaded_filename.rpm
export ORACLE_INSTANT_CLIENT_LIB_VERSION=<Oracle_Instant_Client_Lib_Version>
The version can be found from the downloaded file or from the Oracle download page from which the library was downloaded
Note: This library is required by the Data Validation Tool; while running the deployment command, the library details must be passed as substitution parameters
In Shared VPC networking, a designated host project provides the VPC network and subnetwork to the service project where the Cloud Composer environment will be deployed. The service project must be attached to the host project to use the VPC (a gcloud sketch for the attachment appears at the end of this section).
Perform the below pre-deployment steps to set up/configure the Shared VPC for Composer -
- Setup the environment variable if it doesn't exist.
export HOST_PROJECT=<HOST_PROJECT_ID>
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
- Network resource configuration - The VPC network should have a subnetwork in a region where Cloud Composer is available (see the regions list under the Products available by location section). The subnetwork should have 2 secondary IP ranges created explicitly, which are required for GKE pods and services.
Note:
- The mask size for the secondary IP ranges should be between /9 and /23 bits based on the Composer environment size. For example, /17 and /22 are recommended for a medium Composer environment.
- While creating a subnet or using an existing one, enable Private Google Access (this is recommended even if you plan to use Public IP environments).
- Enable GKE API in your host and service projects if not enabled
gcloud services enable container.googleapis.com --project $HOST_PROJECT
gcloud services enable container.googleapis.com --project $SOURCE_PROJECT
-
Provide the below permissions to the service project's service agents in the host project -
a. Create an environment variable for the service project number (the service project number is available on the project Dashboard page under the Project info section) -
export SERVICE_PROJECT_NUMBER=<service_project_number>
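Alternatively, a sketch of deriving the number with gcloud instead of copying it from the console:
# Sketch: fetch the service project number programmatically
export SERVICE_PROJECT_NUMBER=$(gcloud projects describe $SOURCE_PROJECT --format='value(projectNumber)')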
b. Provide Compute Network User Permission to Google APIs service agent
export GOOGLE_API_SERVICE_AGENT=$SERVICE_PROJECT_NUMBER@cloudservices.gserviceaccount.com
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$GOOGLE_API_SERVICE_AGENT" \
--role="roles/compute.networkUser"
c. Provide Compute Network User and Kubernetes Engine Host Service Agent User Permission to GKE service agent
export GKE_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$GKE_SERVICE_AGENT" \
--role="roles/compute.networkUser"
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$GKE_SERVICE_AGENT" \
--role="roles/container.hostServiceAgentUser"
d. Provide permission to the Composer Service Agent
- For a Private IP environment, provide the Composer Shared VPC Agent permission to the Composer Service Agent account
export COMPOSER_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@cloudcomposer-accounts.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$COMPOSER_SERVICE_AGENT" \
--role="roles/composer.sharedVpcAgent"
- Or, for a Public IP environment, provide the Compute Network User permission to the Composer Service Agent account
export COMPOSER_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@cloudcomposer-accounts.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding $HOST_PROJECT \
--member="serviceAccount:$COMPOSER_SERVICE_AGENT" \
--role="roles/compute.networkUser"
For more information and references, please visit the Configure Shared VPC networking documentation.
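If the service project is not yet attached to the host project, the attachment can also be done from the CLI; a sketch (requires the Shared VPC Admin role on the host project):
# Sketch: attach the service project to the Shared VPC host project
gcloud compute shared-vpc associated-projects add $SOURCE_PROJECT \
--host-project $HOST_PROJECT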
- Set the default GCP project in your Cloud Shell, terminal, or bastion host with Terraform, Terragrunt, and the gcloud SDK installed
gcloud config set project ${SOURCE_PROJECT}
- Check the current project
gcloud config get project
- Run the command
gcloud auth application-default login
-
Log in with your Google Cloud credentials and ensure you have the required IAM permissions listed in the step [Assign Executing User Permissions]
-
The DMT tool uses the default VPC network and subnetwork for Cloud Composer and the Teradata Agent VM Compute Engine. If you wish to use a custom VPC and subnet, provide the right values for the custom VPC and subnet in these files:
~/terraform/translation/gcc/variables.tf
~/terraform/datamigration/gce/variables.tf
Note: The DMT tool does not create VPCs and subnets; these should be created beforehand.
It is recommended to use the same VPC and subnet for Cloud Composer and the Teradata Agent VM, with that VPC and subnet able to reach the source Teradata and Hive data warehouses on the necessary ports.
-
If you would like to customize additional variables, please ensure the appropriate values are set for those variables in the variables.tf file in each service's folder - pubsub, gcc, gcs, bq, cloudrun
~/terraform/translation/gcs/variables.tf
~/terraform/translation/pubsub/variables.tf
~/terraform/translation/cloudrun/variables.tf
~/terraform/translation/gcc/variables.tf
~/terraform/translation/bq/variables.tf
(Note: The region for Cloud Composer and the DVT Cloud Run service should be the same.)
- Navigate inside the dmt repo
cd data-migration-tool
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=<source data warehouse>
Cloud Build DMT uses the following default substitution values
- PROJECT_ID=${PROJECT_ID}
- _MY_BUCKET_NAME=${PROJECT_ID}-dmt-state-bucket
- _MY_CUSTOMER_NAME=${PROJECT_ID}
- _DELETE_BQ_TABLES="false"
If you wish to change the default customer name tagged as a suffix for all services, or change the name of the bucket storing the Terraform state files (which should be created beforehand), provide the values as substitution variables in the gcloud builds command.
For example, if the bucket name for state files = demo-state and the required customer name = democust:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_MY_BUCKET_NAME=demo-state,_MY_CUSTOMER_NAME=democust,_DATA_SOURCE=hive
Otherwise, DMT deploys with the default predefined naming conventions for the state file GCS bucket and customer name.
The default false value of _DELETE_BQ_TABLES prevents BigQuery tables from being deleted when you destroy the DMT infrastructure in the future. If you wish to destroy the BigQuery tables with the destroy script, set the value to true in the gcloud builds command.
For example,
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DELETE_BQ_TABLES="true",_DATA_SOURCE=<data_source>
Otherwise DMT deploys with default predefined values.
Additional substitution variable for Oracle data source
- _ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH}
- _ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}
These use the environment variables created in the pre-deployment steps
For instance, if you would like to leverage DMT for
Teradata to BigQuery Migration
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=teradata
Hive to BigQuery Migration
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=hive
Oracle to BigQuery Migration
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH},_ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}
Note:
- If Data Migration is not required, you can skip the _DATA_SOURCE substitution variable, which will deploy only the DMT Translation architecture.
- The initial Oracle release supports only schema migration, so the _DATA_SOURCE substitution variable is not required.
All the available deployment configurations and corresponding gcloud build commands are listed below -
- Translation Only - All Sources (except Oracle):
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml
- Translation for Oracle:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH},_ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}
- Translation + Data Migration - Teradata:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=teradata
- Translation + Data Migration - Hive:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=hive
- Translation + Data Migration - Redshift:
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=redshift
Script execution starting logs
starting build "206986f5-4e41-48f6-bff7-xyz"
FETCHSOURCE
Fetching storage object: gs://dmt-1_cloudbuild/source/1684304084.135633-xyz.tgz#1684304086786340
Copying gs://dmt-1_cloudbuild/source/1684304084.135633-xyz.tgz#1684304086786340...
/ [0 files][ 0.0 B/115.2 KiB]
/ [1 files][115.2 KiB/115.2 KiB]
Operation completed over 1 objects/115.2 KiB.
BUILD
Pulling image: gcr.io/google.com/cloudsdktool/cloud-sdk:latest
latest: Pulling from google.com/cloudsdktool/cloud-sdk
918547b94326: Pulling fs layer
9ecbbb83fff4: Pulling fs layer
831b9ef5c145: Pulling fs layer
4718bdb8c7d1: Pulling fs layer
671167368ac0: Pulling fs layer
014f91216bbc: Pulling fs layer
4718bdb8c7d1: Waiting
671167368ac0: Waiting
014f91216bbc: Waiting
831b9ef5c145: Verifying Checksum
9ecbbb83fff4: Verifying Checksum
9ecbbb83fff4: Download complete
918547b94326: Verifying Checksum
918547b94326: Download complete
014f91216bbc: Verifying Checksum
014f91216bbc: Download complete
671167368ac0: Verifying Checksum
671167368ac0: Download complete
918547b94326: Pull complete
9ecbbb83fff4: Pull complete
831b9ef5c145: Pull complete
4718bdb8c7d1: Download complete
4718bdb8c7d1: Pull complete
671167368ac0: Pull complete
014f91216bbc: Pull complete
- This automated infra setup for GCS, Pub/Sub, Cloud Composer, Cloud Run, and the BQ DVT dataset is expected to take approximately 35-45 minutes to complete. Cloud Composer creation takes the most time.
You can monitor the status of the deployment in your Cloud Build logs by navigating to the below path -
Google Cloud Console -> Cloud Build -> History
Since the chosen option in this example is Translation + Data Migration for Teradata, Cloud Build will automatically trigger the deployment of the Data Migration architecture as well.
Successful execution log -
Apply complete! Resources: 2 added, 0 changed, 0 destroyed.
google_compute_router.router[0]: Still creating... [20s elapsed]
google_compute_router.router[0]: Creation complete after 21s [id=projects/dmt-1/regions/us-central1/routers/dmt-cloud-router]
google_compute_router_nat.nat[0]: Creating...
google_compute_router_nat.nat[0]: Still creating... [10s elapsed]
google_compute_router_nat.nat[0]: Still creating... [20s elapsed]
google_compute_router_nat.nat[0]: Creation complete after 22s [id=dmt-1/us-central1/dmt-cloud-router/dmt-cloud-nat]
google_compute_disk.dm-disk-terdata-bq: Creating...
google_compute_disk.dm-disk-terdata-bq: Creation complete after 1s [id=projects/dmt-1/zones/us-central1-a/disks/dm-disk-terdata-bq]
google_compute_instance.dm-vm-teradata-bq: Creating...
google_compute_instance.dm-vm-teradata-bq: Still creating... [10s elapsed]
google_compute_instance.dm-vm-teradata-bq: Creation complete after 12s [id=projects/dmt-1/zones/us-central1-a/instances/dm-vm-teradata-bq]
Apply complete! Resources: 10 added, 0 changed, 0 destroyed.
PUSH
DONE
Follow the below steps for Teradata and Oracle
Check whether the environment variables still exist and are correct; if not, set them again for the post-installation steps
Ensure you have an existing source database environment for testing:
- For Teradata,
- If you do not have one, you can create one following the steps here.
- If you want to follow the qwiklab to set up TD instance with tpch installed when finished, you can follow the steps here.
- For Oracle,
- If you do not have one, you can create one following the steps here.
Set environment variables in Cloud Shell
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
<SOURCE_PROJECT> - Project ID where you will be deploying DMT tool
- Grant the Cloud Run Invoker role to the Pub/Sub service account (dmt-sa-pubsub) under GCP IAM roles.
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:dmt-sa-pubsub@$SOURCE_PROJECT.iam.gserviceaccount.com" \
--role="roles/run.invoker"
-
Go to the GCP console and search for Kubernetes Clusters - select your composer cluster instance and identify the IP subnet range of the Pod Address range.
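If you prefer the CLI, the Pod address range can also be read from the Composer GKE cluster; a sketch (cluster name and location are placeholders):
# Sketch: print the Composer GKE cluster's Pod IPv4 CIDR
gcloud container clusters describe <COMPOSER_GKE_CLUSTER_NAME> \
--region=<COMPOSER_GKE_CLUSTER_REGION> \
--format='value(clusterIpv4Cidr)'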
-
Go to your VPC network (default in this case) and create a firewall rule with the name pod-operator-firewall.
- Direction of Traffic - Ingress
- Action on Match - Allow
- Targets - All instances in the network [Teradata/Oracle GCE instance]
- Source Filter - IPv4
- Source IPv4 ranges - the Pod address range you got in step #2, e.g. 10.57.0.0/17
- Protocols and Ports - Allow all / Allowed Teradata ports
gcloud compute firewall-rules create pod-operator-firewall \
--project=$SOURCE_PROJECT \
--direction=INGRESS \
--priority=1000 \
--network=<YOUR Composer VPC name> \
--action=ALLOW \
--rules=all \
--source-ranges=<YOUR Composer Kubernetes Pod Cluster IP obtained from Step 2>
Note: For real customer use cases, the Composer cluster Pod address range should be able to reach the on-prem Teradata/Oracle instance through the firewall rules being created
-
Go to Secret Manager and create a new secret with the name secret-edw_credentials (a gcloud sketch follows the notes below)
This should contain the password for the Teradata/Oracle DB. It is recommended to keep the name of the secret as secret-edw_credentials
Note (Real customer use cases): please create VPC access connectors in the VPC that has firewall rules open to connect to the customer's on-prem Teradata instance. The Cloud Run service (which executes the DVT tool) requires the VPC access connector to be attached to it in order to reach the Teradata/Oracle DB
Note (If you are a Googler): If you are a Googler and are running a test environment with Teradata Express edition or Oracle on GCP Compute VMs, ensure you add the network tag cc-dvt to the TD/Oracle instance
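A sketch of creating that secret from the CLI (the password value is a placeholder):
# Sketch: create the secret and add the database password as its first version
echo -n '<TERADATA_OR_ORACLE_DB_PASSWORD>' | gcloud secrets create secret-edw_credentials \
--project=$SOURCE_PROJECT \
--data-file=-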
-
Validate that all Airflow DAGs in the Cloud Composer instance are in a paused state through the Airflow UI
-
Switch all the DAG status to Active
-
Validate that workload_identity_creator_dag in the Airflow UI is executed once automatically on turning the DAGs active and completes successfully
-
If you intend to migrate tables to a BQ dataset other than dmt-teradata-dataset, please ensure this dataset is created manually and is provided in the config files later (create tables from DDLs); see the bq sketch below
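A minimal sketch of creating such a dataset manually with the bq CLI (dataset name and location are placeholders):
# Sketch: create the custom target dataset before referencing it in the config files
bq --project_id=$SOURCE_PROJECT mk --dataset --location=US <YOUR_TARGET_DATASET>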
-
(Mandatory for Teradata DDL extraction) Teradata DDL extraction requires the Teradata JDBC jar. Upload the JDBC jar file to your GCS config bucket under the software/teradata folder as shown below:
- Download the jar from Teradata downloads: https://downloads.teradata.com/download/connectivity/jdbc-driver
- Upload the jar to your GCS config bucket and ensure that the jar file name exactly matches terajdbc4.jar. Make sure the file is placed in the software/teradata path (e.g. gs://YOUR_DMT_CONFIG_BUCKET/software/teradata/terajdbc4.jar).
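A sketch of the upload with gsutil (the bucket name is a placeholder):
# Sketch: place the Teradata JDBC driver at the expected path in the config bucket
gsutil cp terajdbc4.jar gs://<YOUR_DMT_CONFIG_BUCKET>/software/teradata/terajdbc4.jar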
- (Optional - Only for Oracle) Remove the GCS bucket created to store the .RPM file for setup.
Build the connection between Cloud Run and the source database [applicable when network connectivity uses VPN connectivity]
- Create a serverless VPC connector (specify the same network and region that has firewall rules allowing connection to the on-prem/cloud data source)
gcloud compute networks vpc-access connectors create crun-dvt-connector \
--network default \
--region us-central1 \
--range 10.6.0.0/28 \
--min-instances 2 \
--max-instances 3 \
--machine-type f1-micro
-
Attach the network tag cc-dvt as the target, or directly provide the IP address of the on-prem TD/Oracle instance. (This tag can be used as the target tag while creating a firewall rule at a later stage.)
-
Create an ingress firewall rule (in the source database network) to deny traffic from the connector network tag:
gcloud compute firewall-rules create deny-vpc-connector \
--action=DENY \
--rules=all \
--source-tags=vpc-connector-us-central1-crun-dvt-connector \
--direction=INGRESS \
--network=default \
--priority=990
- Create an ingress firewall rule targeting the Teradata/Oracle IP that you want the VPC connector to access. Set the priority for this rule to a lower value than the priority of the deny rule created above. (Here cc-dvt is the network tag attached to the Teradata/Oracle VM.)
gcloud compute firewall-rules create allow-vpc-connector-for-select-resources \
--allow=all \
--source-tags=vpc-connector-us-central1-crun-dvt-connector \
--direction=INGRESS \
--network=default \
--target-tags=cc-dvt \
--priority=980
-
Deploy a new version of the Cloud Run DVT service (edw-dvt-tool-<customerid>) after performing the below changes:
- Secret: secret-edw_credentials
- Reference Method: Exposed as environment variable
- Environment variables - Name 1: edw_credentials
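If you prefer the CLI over the console for this change, a sketch of the equivalent update (the service name, region, and connector name are the examples used elsewhere in this guide; adjust to your deployment):
# Sketch: expose the secret as the edw_credentials environment variable and attach the VPC connector
gcloud run services update edw-dvt-tool-<customerid> \
--project=$SOURCE_PROJECT \
--region=us-central1 \
--update-secrets=edw_credentials=secret-edw_credentials:latest \
--vpc-connector=crun-dvt-connector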
- Translations
- Schema Migration to Google BigQuery
- Schema Validation
-
Prepare a sample config.json file including input/output buckets, translation type (DDL, SQL or DML), dialects, name mapping properties, and DVT connection details
-
Upload required SQL and DDL files into respective bucket paths defined in config.json
For example,
DDL = gs://dmt-translation-control/input/ddl
SQL = gs://dmt-translation-control/input/sql (SQL files), gs://dmt-translation-control/input/sql/ddl (DDL/Metadata files)
DML = gs://dmt-translation-control/input/dml (DML files), gs://dmt-translation-control/input/sql/ddl (DDL/Metadata files)
-
Upload config.json into the config folder of the bucket
For example,
DDL = gs://dmt-config-control/input/ddl
SQL = gs://dmt-config-control/input/sql
DML = gs://dmt-config-control/input/dml
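For reference, a sketch of uploading the inputs and the config file with gsutil, using the example bucket paths above (adjust bucket names, folders, and local paths to your environment):
# Sketch: upload input DDL files and the corresponding config file
gsutil cp ./ddl/*.sql gs://dmt-translation-control/input/ddl/
gsutil cp config.json gs://dmt-config-control/input/ddl/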
-
In case of DDL - validate execution of DAGs
- controller_dag
- extract_ddl_dag - if extract_ddl is yes
- batch_sql_translation
- schema_dag
- validation_dag - if validation_mode is gke
- validation_crun - if validation_mode is cloudrun
-
In case of SQL - validate execution of DAGs
- controller_dag
- batch_sql_translation
- validation_dag - if validation_mode is gke
- validation_crun - if validation_mode is cloudrun
-
In case of DML - validate execution of DAGs
- controller_dag
- batch_sql_translation
- dml_validation_dag
-
Translated DDL and SQL files can be found in output buckets
DDL = gs://dmt-translation-control/output/ddl
SQL = gs://dmt-translation-control/output/sql
DML = gs://dmt-translation-control/output/dml
-
Translation Summary results can be found in the output buckets along with translated files - batch_translation_report.csv
Click here for sample config.json files
config.json attributes for DDL translation:
JSON attribute | Description |
batchDistribution | Defines the number of parallel Composer Airflow DAGs [Number of DAGs = Total DDL files/batchDistribution] |
extract_ddl | Provide flag for ddl/schema extraction from source database |
unique_id | Provide a unique name for every batch run identification |
type | Type of Translation : ddl |
source | Source datawarehouse : teradata |
translationErrorFilterRules | Field : Category, Matchtype : equals, Value : <> Allows bypassing of warnings and BQ dialect translation errors; Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics |
migrationTask: type | Translation_Teradata2BQ |
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode | Teradata - SQL, BTEQ |
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase | BigQuery GCP Project ID |
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect | {} |
gcsSourcePath | GCS location of input DDL Teradata files |
gcsTargetPath | GCS location where translated output DDL BQ dialect files must be placed |
nameMappingList:name_map:source | Create output name mapping for schema translations for source. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
nameMappingList:name_map:target | Create output name mapping for schema translations for target BQ. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
validation_config:source_config:source_type | DVT validation source datawarehouse : Teradata |
validation_config:source_config:host | Hostname/IP address of on-prem Teradata |
validation_config:source_config:port | Port number of on-prem Teradata |
validation_config:source_config:user-name | User name with permissions to query on source tables |
validation_config:source_config:password | Secret Manager key name, secret:<secret key name>; for example - secret:edw_credentials |
validation_config:target_config:target_type | BigQuery |
validation_config:target_config:project-id | BigQuery GCP Project ID |
validation_config:validation_params_file_path | GCS location of the CSV file or Excel sheet containing table or file names along with DVT validation flags. Examples: gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.csv, gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.xlsx. Read the instructions below to understand how to populate and upload this file/sheet |
validation_type | DVT validation type : schema |
validation_mode | DVT validation mode. Possible values - gke, cloudrun |
pod_operator_mem | Pod Operator memory configuration when DVT mode is chosen as GKE. Default value - 4000M |
pod_operator_cpu | Pod Operator cpu configuration when DVT mode is chosen as GKE. Default value - 800m |
config.json attributes for SQL translation:
JSON attribute | Description |
batchDistribution | Defines the number of parallel Composer Airflow DAGs [Number of DAGs = Total SQL files/batchDistribution] |
unique_id | Provide a unique name for every batch run identification |
type | Type of Translation : sql |
source | Source datawarehouse : teradata |
translationErrorFilterRules | Field : Category, Matchtype : equals, Value : <> Allows bypassing of warnings and BQ dialect translation errors; Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics |
migrationTask: type | Translation_Teradata2BQ |
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode | Teradata - SQL, BTEQ |
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase | BigQuery GCP Project ID |
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect | {} |
gcsSourcePath | GCS location of input SQL Teradata files |
gcsTargetPath | GCS location where translated output SQL BQ dialect files must be placed |
nameMappingList:name_map:source | Create output name mapping for schema translations for source. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
nameMappingList:name_map:target | Create output name mapping for schema translations for target BQ. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
validation_config:source_config:source_type | DVT validation source datawarehouse : Teradata |
validation_config:source_config:host | Hostname/IP address of on-prem Teradata |
validation_config:source_config:port | Port number of on-prem Teradata |
validation_config:source_config:user-name | User name with permissions to query on source tables |
validation_config:source_config:password | Secret Manager key name, secret:<secret key name>; for example - secret:edw_credentials |
validation_config:target_config:target_type | BigQuery |
validation_config:target_config:project-id | BigQuery GCP Project ID |
validation_config:validation_params_file_path | GCS location of the CSV file or Excel sheet containing table or file names along with DVT validation flags. Examples: gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.csv, gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.xlsx. Read the instructions below to understand how to populate and upload this file/sheet |
validation_type | DVT validation type : row/column |
validation_mode | DVT validation mode. Possible values - gke, cloudrun |
pod_operator_mem | Pod Operator memory configuration when DVT mode is chosen as GKE. Default value - 4000M |
pod_operator_cpu | Pod Operator cpu configuration when DVT mode is chosen as GKE. Default value - 800m |
config.json attributes for DML translation:
JSON attribute | Description |
batchDistribution | N/A |
unique_id | Provide a unique name for every batch run identification |
type | Type of Translation : dml |
source | Source datawarehouse : teradata |
translationErrorFilterRules | Field : Category, Matchtype : equals, Value : <> Allows bypassing of warnings and BQ dialect translation errors; Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics |
migrationTask: type | Translation_Teradata2BQ |
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode | Teradata - SQL, BTEQ |
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase | BigQuery GCP Project ID |
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect | {} |
gcsSourcePath | GCS location of input DML Teradata files |
gcsTargetPath | GCS location where translated output DML BQ dialect files must be placed |
nameMappingList:name_map:source | Create output name mapping for dml translations for source. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
nameMappingList:name_map:target | Create output name mapping for dml translations for target BQ. Refer to https://cloud.google.com/bigquery/docs/output-name-mapping |
Please do not edit headers, data validation rules, or other sheets in the Excel/Google Sheet, as doing so will alter the data validation checks put in place.
Fill the columns according to the use case. Data validation rules within the sheet will help to fill the fields correctly.
To know more about which flags are relevant for which translation (ddl, sql, custom-query) and migration (data), please refer to the tab named DVT Guide. This sheet contains links to the DVT GitHub pages, which explain all possible flags in the DVT CLI.
You can refer to the tab Example Sheet to get an idea of which flags can be supplied for which validation.
- To follow along with the Excel sheet
- Open this Excel Template File in Microsoft Excel (not in Google Sheets, as opening .xlsx in Google Sheets will corrupt the existing Data Validation Rules).
- Fill the sheet Validation. Once filled, save the file in .xlsx format. (Excel will convert the template file format .xltx to .xlsx for you.)
- To follow along with the CSV file format
- Open this Google Sheet in browser (Viewer Access).
- Make a copy, to get an editable sheet.
- Once filled, download CSV file using the Menu Option. (File --> Download --> Comma Separated Values (.csv))
Once filled, upload the corresponding Excel (.xlsx) or CSV file to the GCS bucket (inside the relevant folder within dmt-config-dmt-demo-project/validation) and specify the GCS path in the configuration file.
- Request access to template
- Access Dashboard template by clicking here
- Click on the three dots as shown in the below image and click on Make a copy
Existing Data Source in Template | BQ Table in Customer’s GCP Project |
DVT Aggregated Results | <GCP_PROJECT_ID>.dmt_logs.dmt_dvt_aggregated_results |
Validation Results | <GCP_PROJECT_ID>.dmt_logs.dmt_dvt_results |
Schema Creation Results | <GCP_PROJECT_ID>.dmt_logs.dmt_schema_aggregated_results |
Schema Creation Aggregated Results | <GCP_PROJECT_ID>.dmt_logs.dmt_schema_results |
Translation Results | <GCP_PROJECT_ID>.dmt_logs.dmt_translation_results |
Translation Aggregated Results | <GCP_PROJECT_ID>.dmt_logs.dmt_translation_aggregated_results |
DML Validation Results | <GCP_PROJECT_ID>.dmt_logs.dmt_dml_validation_results |
DML Validation Aggregated Results | <GCP_PROJECT_ID>.dmt_logs.dmt_dml_validation_aggregated_results |
Report Table | <GCP_PROJECT_ID>.dmt_logs.dmt_report_table |
File Table Mapping | <GCP_PROJECT_ID>.dmt_logs.dmt_file_table_mapping |
{Data Source} Transfer Run Jobwise Details | <GCP_PROJECT_ID>.dmt_logs.dmt_<data_source>_transfer_run_jobwise_details |
{Data Source} Transfer Tracking | <GCP_PROJECT_ID>.dmt_logs.dmt_<data_source>_transfer_tracking |
{Data Source} Transfer Run Summary | <GCP_PROJECT_ID>.dmt_logs.dmt_<data_source>_transfer_run_summary |
- Now click on Copy Report. This will create a copy of the dashboard template.
5. In the new template, click on Resource and then select Manage added data sources to add/replace BigQuery datasets.
6. You'll now see a new page with the list of tables used in this template. Click on EDIT for one table at a time. This will navigate you to the next page.
8. In the search bar, type your GCP project name and select it. Then select the logs dataset and the respective table. Now click on RECONNECT.
9. You'll get a pop-up; click on Apply and then click on Done.
10. Repeat Steps 5 to 8 for all the remaining data sources, until all the data sources have been replaced.
11. Your dashboard will be named Copy of DMT Dashboard. Click on the name to rename it to whatever you prefer for the dashboard.
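Before (or after) wiring up the dashboard, you can spot-check that a logging table is populated; a sketch using one of the default table names above:
# Sketch: count rows in one of the DMT logging tables
bq query --use_legacy_sql=false \
"SELECT COUNT(*) AS row_count FROM \`${SOURCE_PROJECT}.dmt_logs.dmt_dvt_results\`"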
Pre-destroy step
- If you did not set the _DELETE_BQ_TABLES substitution parameter to true at the time of deployment, delete the dmt_dataset and dmt_logs BigQuery datasets manually before running the destroy command (a bq sketch follows).
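A minimal sketch of that manual cleanup with the bq CLI (dataset names follow the note above; your actual dataset names may differ if customized):
# Irreversible: -r removes all tables in the dataset, -f skips the confirmation prompt
bq rm -r -f -d ${SOURCE_PROJECT}:dmt_logs
bq rm -r -f -d ${SOURCE_PROJECT}:dmt_dataset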
Destroy DMT for Teradata using the following command:
gcloud builds submit . \
--config cloudbuild_destroy.yaml \
--substitutions _DATA_SOURCE="teradata"
Cloud Build DMT uses the following default substitution values
- PROJECT_ID=${PROJECT_ID}
- _MY_BUCKET_NAME=${PROJECT_ID}-dmt-state-bucket
- _MY_CUSTOMER_NAME=${PROJECT_ID}
- _DELETE_BQ_TABLES="false"
These default values need to be updated to match the values given at the time of deployment.