
DMT User Guide

About the Tool

Prerequisites

  • Firewall rules - enable firewall rules between the source database and the Google Cloud VPC

    [For Redshift]

    For AWS Redshift connectivity, please refer to the guide https://cloud.google.com/bigquery/docs/migration/redshift-vpc

    Note:- The Cloud Composer (GCC) subnetwork must be created in the same GCP region as the region to which the VPN tunnel is established (to connect to the AWS Redshift cluster)

  • [Mandatory] The BigQuery Transfer Service Agent VM requires internet access to update packages in the Agent VM environment, so the recommended approach is to have a Cloud NAT in place in the same VPC, subnet and region as your Teradata Agent VM

    Cloud NAT is deployed automatically through Terraform using default settings, standard naming conventions and automatic IP allocation.

    If you wish to use your own Cloud NAT and Cloud Router with custom allocations, ensure they are created before proceeding with the deployment (a command-line sketch is provided at the end of this section).

    If you create the Cloud NAT and Cloud Router manually, turn off automatic Cloud NAT creation by changing the value of the variable create_nat = "false" in ~/terraform/datamigration/gce/variables.tf after cloning the DMT repository

    If you are okay with automatic NAT creation through Terraform, please proceed without any changes.

The Cloud Router and NAT should be created only in the region where the VPC subnets for Cloud Composer and the Migration Agent VM are going to be created.
The DVT Cloud Run service and Cloud Composer should be created in the same region.
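If you choose to create the Cloud Router and Cloud NAT manually, a minimal gcloud sketch might look like the following. The router and NAT names, network and region here are illustrative placeholders - align them with the VPC, subnet and region you plan to use, and remember to set create_nat = "false" as described above.

# Manually create a Cloud Router and a Cloud NAT gateway (names, network and region are examples)
gcloud compute routers create dmt-demo-router \
    --project=${SOURCE_PROJECT} \
    --network=default \
    --region=us-central1

gcloud compute routers nats create dmt-demo-nat \
    --project=${SOURCE_PROJECT} \
    --router=dmt-demo-router \
    --region=us-central1 \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges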

Assumptions

  • Using the Standard Deployment for infrastructure ensures you get default (standard) names for GCP services, service accounts, and default folder structures. This is the recommended approach.

  • By default, all bucket and service names are suffixed with the project ID to keep them unique. If you wish to change this, provide the substitution parameter _MY_CUSTOMER_NAME as part of the gcloud builds submit command later during deployment.

  • The tool will deploy by default using the below locations/regions

    • us-central1-a
    • US location

    To change the locations, go to the variables.tf file under each GCP service's Terraform module and update the location values according to your requirements before running the Bash wrapper script for infrastructure deployment.

    It is recommended to keep the migration pipeline components in the same regions as much as possible.

  • By default, GCP services requiring a VPC and subnet will be deployed in the default VPC and subnet. To provide custom VPCs and subnets, go to the variables.tf file under each applicable GCP service's Terraform module and update the VPC and subnet names. This step is applicable for:

    • Cloud Composer (gcc)
    • Teradata Agent Compute Engine VM (gce)
    • Cloud Run

General Limitations

  • Best practice - provide input DDLs and SQLs in the form of one DDL or SQL query per file.
  • Incremental data migration is supported for HIVE only.

Feature Limitations

  • The current release supports DDLs and SELECT queries only. DML statements are supported with the dry-run option only.
  • The current release doesn't support data transfer for Oracle, or incremental data load for Teradata and Redshift.

DMT Deployment

Deployment of DMT through Terraform modules is split into 2 parts:

  • Translation – for DDL, DML and SQL translation and validation.
  • Data Migration – for data migration from different sources (Teradata and Hive)

Clone the DMT git repository

The code repository is hosted on GitHub and can be cloned using the below command.

git clone https://github.com/GoogleCloudPlatform/data-migration-tool

Navigate to the cloned directory using the below command

cd data-migration-tool

Check out the main branch using the below command

git checkout main

Assign Executing User and Admin Permissions

The Admin user who will be executing the deployment of DMT through Cloud Build will require the below set of permissions

  • roles/bigquery.dataViewer
  • roles/bigquery.user
  • roles/cloudbuild.builds.editor
  • roles/run.viewer
  • roles/compute.admin
  • roles/iam.serviceAccountViewer
  • roles/logging.viewer
  • roles/serviceusage.serviceUsageConsumer
  • roles/storage.admin
  • projects/${PROJECT_ID}/roles/DMTAdminAdditionalPermissions

The user who will be operating the DMT tool will require the below set of permissions

  • roles/bigquery.dataViewer
  • roles/bigquery.admin
  • roles/run.viewer
  • roles/composer.user
  • roles/storage.admin
  • roles/vpcaccess.admin
  • roles/logging.viewer
  • projects/${PROJECT_ID}/roles/DMTUserAdditionalPermissions

Note - DMTUserAdditionalPermissions is a custom role for the DMT user and DMTAdminAdditionalPermissions is a custom role for the DMT admin

 export ADMIN_ACCOUNT=<EXECUTING_ADMIN_ACCOUNT>
 export USER_ACCOUNT=<EXECUTING_USER_ACCOUNT>
 export SOURCE_PROJECT=<YOUR_PROJECT_ID>

DMT requires additional user/admin permissions beyond the predefined roles. Execute the Bash scripts dmt-user-roles-setup.sh and dmt-admin-roles-setup.sh present in the root directory to create the custom DMT user/admin additional-permission roles

bash dmt-user-roles-setup.sh
bash dmt-admin-roles-setup.sh

To assign these roles, you can execute the Bash scripts dmt-user-setup.sh and dmt-admin-user-setup.sh present in the root directory

bash dmt-user-setup.sh
bash dmt-admin-user-setup.sh
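If you prefer to grant the predefined roles manually instead of running the setup scripts, each binding can be added with gcloud. A minimal sketch for a single role is shown below; repeat it for every role in the lists above (the custom DMT* roles still need to be created by the *-roles-setup.sh scripts first).

gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="user:$USER_ACCOUNT" \
--role="roles/composer.user"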

Enable Google Cloud APIs

From the Cloud Shell, you can enable Google Cloud Services using the gcloud command line interface in your Google Cloud project.

export SOURCE_PROJECT=<YOUR_PROJECT_ID>
gcloud config set project $SOURCE_PROJECT
gcloud services enable serviceusage.googleapis.com \
			     cloudresourcemanager.googleapis.com \
			     cloudbuild.googleapis.com
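To confirm the APIs were enabled, you can optionally list the enabled services and filter for the ones above:

gcloud services list --enabled | grep -E "serviceusage|cloudresourcemanager|cloudbuild"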

IAM permissions for Cloud Build Service Account

Identify the Cloud Build default Service Account and assign it the below IAM roles for the deployment of the DMT architecture.

Google Cloud Console -> Cloud Build -> Settings

alt_text
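Alternatively, the default Cloud Build service account can usually be derived from the project number on the command line. The sketch below assumes the classic default Cloud Build service account (PROJECT_NUMBER@cloudbuild.gserviceaccount.com); newer projects may use a different default, so always confirm against what the Cloud Build Settings page shows.

PROJECT_NUMBER=$(gcloud projects describe ${SOURCE_PROJECT} --format="value(projectNumber)")
export BUILD_ACCOUNT=${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com
echo ${BUILD_ACCOUNT}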

IAM roles required for the Cloud Build Service Account

Artifact Registry Administrator
BigQuery Admin
Cloud Build Service Account
Cloud Run Admin
Composer Administrator
Compute Instance Admin (v1)
Compute Network Admin
Service Account Creator
Logs Viewer
Project IAM Admin
Pub/Sub Admin
Secret Manager Admin
Service Account User
Service Usage Admin
Storage Admin

To assign these roles, you can execute the Bash script cloudbuild-sa-iam-setup.sh present in the root directory

export BUILD_ACCOUNT=<CLOUDBUILD_SERVICE_ACCOUNT>
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
bash cloudbuild-sa-iam-setup.sh

This script will assign all necessary IAM permissions to the Cloud Build Service Account for deployment of the DMT tool.

Once the execution of the Bash script is successful, you can proceed directly to the next step, Deploying DMT Infrastructure

Alternatively, you can also assign the roles by running gcloud commands from your Cloud Shell

gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/bigquery.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/run.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/composer.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/compute.instanceAdmin.v1"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/compute.networkAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/iam.serviceAccountCreator"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/logging.viewer"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/resourcemanager.projectIamAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/pubsub.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/secretmanager.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/iam.serviceAccountUser"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/serviceusage.serviceUsageAdmin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/storage.admin"
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:$BUILD_ACCOUNT" \
--role="roles/artifactregistry.admin"

Deploying DMT infrastructure

Translations deployment will take care of the workflow needed to perform

  • DDL, DML and SQL translations
  • Schema migration to BQ (Creation of tables from translated DDLs)
  • Schema Validation and Custom query (row/column for SQL validation) using the DVT tool
    • Options to run DVT on Cloud Run or a GKE Airflow pod operator depending on scaling requirements.

Note : SQL translation and validation can only be run after data has been loaded into BigQuery

If you choose the deployment of Translation + Data Migration, in addition to the translation activities mentioned above, you will be able to

  • Perform initial data loads
  • Row and column data validation using DVT tool

Pre deployment Steps

1. Before you proceed with the DMT Terraform deployment, ensure that a GCS bucket exists to store the Terraform infrastructure state files

export SOURCE_PROJECT=<YOUR_PROJECT_ID>
gcloud config set project ${SOURCE_PROJECT}
gcloud storage buckets create gs://${SOURCE_PROJECT}-dmt-state-bucket

2. [Mandatory for Oracle] Before you proceed with the DMT Terraform deployment, ensure that the Oracle client .RPM library is uploaded to a GCS bucket

Guidelines to download the Oracle client RPM file and upload it to a GCS bucket -

  1. Download the Oracle Instant Client library (.rpm extension) file from Oracle downloads: https://www.oracle.com/in/database/technologies/instant-client/linux-x86-64-downloads.html

  2. Upload the file to a GCS bucket

gsutil cp <downloaded_oracle_client_rpm_file> gs://<gcs_bucket>/<nested_folders_if_any>/

(The file can also be uploaded using the Cloud console)

It is advisable to keep the Oracle client RPM file name the same as the downloaded file name

  3. Set the below environment variables (these will be used while running the deployment command; example values are sketched below) -
export ORACLE_RPM_GCS_BUCKET_PATH=<Fully Qualified RPM file GCS bucket path>

For example, a fully qualified RPM file GCS bucket path is gs://<gcs_bucket>/<nested_folders_if_any>/downloaded_filename.rpm

export ORACLE_INSTANT_CLIENT_LIB_VERSION=<Oracle_Instant_Client_Lib_Version>

The version can be found from the downloaded file or from the Oracle download page from where the library was downloaded

Note: This library is required by the data validation tool, and the library details must be passed as substitution parameters while running the deployment command
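For illustration, the two exports might look like the following; the bucket path, file name and version number here are purely hypothetical placeholders - use the values from your own download and upload.

export ORACLE_RPM_GCS_BUCKET_PATH=gs://my-dmt-setup-bucket/oracle/oracle-instantclient-basic-21.9.0.0.0-1.x86_64.rpm
export ORACLE_INSTANT_CLIENT_LIB_VERSION=21.9.0.0.0-1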

3. Mandatory for Shared VPC network configuration in Cloud Composer

In Shared VPC networking, a designated host project provides the VPC network and subnetwork to the service project where the Cloud Composer environment will be deployed. The service project must be attached to the host project to use the VPC.

Perform the below pre-deployment steps to set up/configure Shared VPC for Composer -

  1. Set up the environment variables if they don't exist.
export HOST_PROJECT=<HOST_PROJECT_ID>
export SOURCE_PROJECT=<YOUR_PROJECT_ID>
  2. Network resource configuration - the VPC network should have a subnetwork in a region available for Cloud Composer (see the available regions list under the Products available by location section). The subnetwork should have 2 secondary IP ranges created explicitly, which are required for GKE pods and services.

Note:-

  • The mask size for secondary IP ranges should be between /9 and /23 bits based on the Composer environment size.

    For example, /17 and /22 are recommended for a medium Composer environment.

  • While creating a subnet or using an existing one, enable Private Google Access (this is recommended even if you plan to use a Public IP environment)

  3. Enable the GKE API in your host and service projects if it is not already enabled
gcloud services enable container.googleapis.com --project $HOST_PROJECT
gcloud services enable container.googleapis.com --project $SOURCE_PROJECT
  4. Provide the below permissions to the service project's service agents in the host project -

    a. Create an environment variable for the service project number (the service project number is available on the project's Dashboard page under the Project info section) -

    export SERVICE_PROJECT_NUMBER=<service_project_number>
    

    b. Provide Compute Network User Permission to Google APIs service agent

    export GOOGLE_API_SERVICE_AGENT=$SERVICE_PROJECT_NUMBER@cloudservices.gserviceaccount.com
    
    gcloud projects add-iam-policy-binding $HOST_PROJECT \
    --member="serviceAccount:$GOOGLE_API_SERVICE_AGENT" \
    --role="roles/compute.networkUser"
    

    c. Provide Compute Network User and Kubernetes Engine Host Service Agent User Permission to GKE service agent

    export GKE_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
    
    gcloud projects add-iam-policy-binding $HOST_PROJECT \
    --member="serviceAccount:$GKE_SERVICE_AGENT" \
    --role="roles/compute.networkUser"
    
    gcloud projects add-iam-policy-binding $HOST_PROJECT \
    --member="serviceAccount:$GKE_SERVICE_AGENT" \
    --role="roles/container.hostServiceAgentUser"
    

    d. Provide Permission to the Composer Service Agent

    For a Private IP environment, provide the Composer Shared VPC Agent permission to the Composer Service Agent account
    export COMPOSER_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@cloudcomposer-accounts.iam.gserviceaccount.com
    
       gcloud projects add-iam-policy-binding $HOST_PROJECT \
       --member="serviceAccount:$COMPOSER_SERVICE_AGENT" \
       --role="roles/composer.sharedVpcAgent"
    

    Or, for a Public IP environment, provide the Compute Network User permission to the Composer Service Agent account

    export COMPOSER_SERVICE_AGENT=service-$SERVICE_PROJECT_NUMBER@cloudcomposer-accounts.iam.gserviceaccount.com
    
       gcloud projects add-iam-policy-binding $HOST_PROJECT \
       --member="serviceAccount:$COMPOSER_SERVICE_AGENT" \
       --role="roles/compute.networkUser"
    

For more information and references please visit Configure Shared VPC networking documentation

Terraform Deployment

  1. Set the default GCP project in your Cloud Shell, terminal, or bastion host with Terraform, Terragrunt and the gcloud SDK installed
gcloud config set project ${SOURCE_PROJECT}
  2. Check the current project
gcloud config get project
  3. Run the command
gcloud auth application-default login
  4. Log in with your Google Cloud credentials and ensure you have the required IAM permissions listed in the step [Assign Executing User and Admin Permissions]

  5. The DMT tool leverages the default VPC network and subnetwork for Cloud Composer and the Teradata Agent VM Compute Engine. If you wish to use a custom VPC and subnet, provide the right values for the custom VPC and subnet in these files

~/terraform/translation/gcc/variables.tf
~/terraform/datamigration/gce/variables.tf

Note : The DMT tool does not create VPCs and subnets; these should be created beforehand.

It is recommended to use the same VPC and subnet for Cloud Composer and the Teradata Agent VM, with the VPC and subnet able to reach the source Teradata and Hive data warehouses on the necessary ports.

~/terraform/translation/gcc/variables.tf (Click here to view source file)

~/terraform/datamigration/gce/variables.tf

  6. If you would like to customize additional variables, please ensure the appropriate values are set for those variables in the variables.tf file in each service's folder - pubsub, gcc, gcs, bq, cloudrun

    ~/terraform/translation/gcs/variables.tf

    ~/terraform/translation/pubsub/variables.tf

    ~/terraform/translation/cloudrun/variables.tf

    ~/terraform/translation/gcc/variables.tf

    ~/terraform/translation/bq/variables.tf

(Note: The region for Cloud Composer and the DVT Cloud Run service should be the same)

  7. Navigate inside the DMT repo and submit the build
cd data-migration-tool
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=<source data warehouse>

Cloud build DMT uses the following default values

  • PROJECT_ID=${PROJECT_ID}

  • _MY_BUCKET_NAME=${PROJECT_ID}-dmt-state-bucket

  • _MY_CUSTOMER_NAME=${PROJECT_ID}

  • _DELETE_BQ_TABLES="false"

If you wish to change the default customer name tagged as a suffix for all services, or change the name of the bucket for storing Terraform state files (which should be created beforehand),

provide the values as substitution variables in the gcloud builds command.

For example,

if the bucket name for state files = demo-state

and the required customer name = democust

gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_MY_BUCKET_NAME=demo-state,_MY_CUSTOMER_NAME=democust,_DATA_SOURCE=hive

Otherwise, DMT deploys with the default predefined naming conventions for the state file GCS bucket and customer name.

The default value of _DELETE_BQ_TABLES ("false") prevents BigQuery tables from being deleted when you destroy the DMT infrastructure in the future. If you wish to delete BigQuery tables with the destroy script, set the value to "true" in the gcloud builds command.

For example,

gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DELETE_BQ_TABLES="true",_DATA_SOURCE=<data_source>

Otherwise DMT deploys with default predefined values.

Additional substitution variable for Oracle data source

  • _ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH}
  • _ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}

These use the environment variables created in the pre-deployment steps.

For instance, if you would like to leverage DMT for

Teradata to BigQuery Migration

gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=teradata

Hive to BigQuery Migrations

gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=hive

Oracle to BigQuery Migrations

gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH},_ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}

Note:

If Data Migration is not required, you can skip the _DATA_SOURCE substitution variable, which will deploy only the DMT Translation architecture

  • The initial Oracle release supports only schema migration, so the _DATA_SOURCE substitution variable is not required.

All the available deployment configurations and corresponding gcloud build commands are listed below -

DMT Purpose Source Command
Translation Only All Sources (except Oracle) gcloud builds submit . --project ${SOURCE_PROJECT} \ --config cloudbuild_deploy.yaml
Translation for Oracle Oracle gcloud builds submit . --project ${SOURCE_PROJECT} \ --config cloudbuild_deploy.yaml \ --substitutions \ _ORACLE_INSTANTCLIENT_FILE_PATH=${ORACLE_RPM_GCS_BUCKET_PATH},_ORACLE_ODBC_VERSION_NUMBER=${ORACLE_INSTANT_CLIENT_LIB_VERSION}
Translation + Data Migration Teradata gcloud builds submit . --project ${SOURCE_PROJECT} \ --config cloudbuild_deploy.yaml \ --substitutions \ _DATA_SOURCE=teradata
Translation + Data Migration Hive gcloud builds submit . --project ${SOURCE_PROJECT} \ --config cloudbuild_deploy.yaml \ --substitutions \ _DATA_SOURCE=hive
Translation + Data Migration Redshift gcloud builds submit . --project ${SOURCE_PROJECT} \ --config cloudbuild_deploy.yaml \ --substitutions \ _DATA_SOURCE=redshift

Script execution starting logs

starting build "206986f5-4e41-48f6-bff7-xyz"

FETCHSOURCE
Fetching storage object: gs://dmt-1_cloudbuild/source/1684304084.135633-xyz.tgz#1684304086786340
Copying gs://dmt-1_cloudbuild/source/1684304084.135633-xyz.tgz#1684304086786340...
/ [0 files][    0.0 B/115.2 KiB]                                                
/ [1 files][115.2 KiB/115.2 KiB]                                                
Operation completed over 1 objects/115.2 KiB.                                    
BUILD
Pulling image: gcr.io/google.com/cloudsdktool/cloud-sdk:latest
latest: Pulling from google.com/cloudsdktool/cloud-sdk
918547b94326: Pulling fs layer
9ecbbb83fff4: Pulling fs layer
831b9ef5c145: Pulling fs layer
4718bdb8c7d1: Pulling fs layer
671167368ac0: Pulling fs layer
014f91216bbc: Pulling fs layer
4718bdb8c7d1: Waiting
671167368ac0: Waiting
014f91216bbc: Waiting
831b9ef5c145: Verifying Checksum
9ecbbb83fff4: Verifying Checksum
9ecbbb83fff4: Download complete
918547b94326: Verifying Checksum
918547b94326: Download complete
014f91216bbc: Verifying Checksum
014f91216bbc: Download complete
671167368ac0: Verifying Checksum
671167368ac0: Download complete
918547b94326: Pull complete
9ecbbb83fff4: Pull complete
831b9ef5c145: Pull complete
4718bdb8c7d1: Download complete
4718bdb8c7d1: Pull complete
671167368ac0: Pull complete
014f91216bbc: Pull complete
  8. This automated infra setup for GCS, Pub/Sub, Cloud Composer, Cloud Run and the BQ DVT dataset is expected to take approximately 35-45 minutes to complete; Cloud Composer creation takes the most time

You can monitor the status of the deployment in your Cloud Build logs by navigating to the below path -

Google Cloud Console -> Cloud Build -> History

Since the chosen option is Translation + Data Migration for Teradata,

Cloud Build will automatically trigger the deployment of the Data Migration architecture as well.

Successful execution log - 
Apply complete! Resources: 2 added, 0 changed, 0 destroyed.
google_compute_router.router[0]: Still creating... [20s elapsed]
google_compute_router.router[0]: Creation complete after 21s [id=projects/dmt-1/regions/us-central1/routers/dmt-cloud-router]
google_compute_router_nat.nat[0]: Creating...
google_compute_router_nat.nat[0]: Still creating... [10s elapsed]
google_compute_router_nat.nat[0]: Still creating... [20s elapsed]
google_compute_router_nat.nat[0]: Creation complete after 22s [id=dmt-1/us-central1/dmt-cloud-router/dmt-cloud-nat]
google_compute_disk.dm-disk-terdata-bq: Creating...
google_compute_disk.dm-disk-terdata-bq: Creation complete after 1s [id=projects/dmt-1/zones/us-central1-a/disks/dm-disk-terdata-bq]
google_compute_instance.dm-vm-teradata-bq: Creating...
google_compute_instance.dm-vm-teradata-bq: Still creating... [10s elapsed]
google_compute_instance.dm-vm-teradata-bq: Creation complete after 12s [id=projects/dmt-1/zones/us-central1-a/instances/dm-vm-teradata-bq]

Apply complete! Resources: 10 added, 0 changed, 0 destroyed.
PUSH
DONE

Post-Installation setup

Follow the below steps for Teradata and Oracle

Check that the environment variables still exist and are correct; if not, set them again for the post-installation steps

Ensure you have an existing source database environment for testing:

Set environment variables in Cloud Shell

export SOURCE_PROJECT=<YOUR_PROJECT_ID>

<SOURCE_PROJECT> - Project ID where you will be deploying DMT tool

  1. Grant the Cloud Run Invoker role to the Pub/Sub service account (dmt-sa-pubsub) under GCP IAM roles.
gcloud projects add-iam-policy-binding $SOURCE_PROJECT \
--member="serviceAccount:dmt-sa-pubsub@$SOURCE_PROJECT.iam.gserviceaccount.com" \
--role="roles/run.invoker"
  2. Go to the GCP console and search for Kubernetes Clusters - select your Composer cluster instance and identify the IP subnet range of the Pod address range (a command-line alternative is sketched after the firewall rule below).

  3. Go to your VPC network (default in this case) and create a firewall rule with the name pod-operator-firewall.

  • Direction of Traffic - Ingress
  • Action on Match - Allow
  • Targets - All instances in the network [Teradata/Oracle GCE instance]
  • Source Filter - IPv4
  • Source IPv4 ranges - the Pod address range you obtained in step #2, e.g. 10.57.0.0/17
  • Protocols and Ports - Allow all / Allowed Teradata ports
gcloud compute firewall-rules create pod-operator-firewall \
--project=$SOURCE_PROJECT \
--direction=INGRESS \
--priority=1000 \
--network=<YOUR Composer VPC name> \
--action=ALLOW \
--rules=all \
--source-ranges=<YOUR Composer Kubernetes Pod Cluster IP obtained from Step 2>

Note: For real customer use cases, the Composer cluster Pod address range must be able to reach the on-prem Teradata/Oracle instance through the firewall rules being created
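If you prefer the command line over the console for step #2, the Pod address range can also be read directly from the Composer GKE cluster. This is an optional sketch; the cluster name below is a placeholder (find the real one under Kubernetes Clusters), and depending on your Composer version the cluster may be zonal (use --zone) rather than regional.

# List GKE clusters and note the name of the Composer cluster
gcloud container clusters list --project=${SOURCE_PROJECT}

# Read the Pod address range (clusterIpv4Cidr) of that cluster
gcloud container clusters describe <YOUR_COMPOSER_GKE_CLUSTER> \
--region=us-central1 \
--project=${SOURCE_PROJECT} \
--format="value(clusterIpv4Cidr)"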

  4. Go to Secret Manager and create a new secret with the name secret-edw_credentials (a gcloud sketch is shown below)

    This should contain the password for the Teradata/Oracle DB. It is recommended to keep the name of the secret as secret-edw_credentials
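As an alternative to the console, the secret can be created from the command line; this is a sketch only - replace the placeholder password with the real database password and avoid leaving it in your shell history.

printf '%s' '<YOUR_TERADATA_OR_ORACLE_DB_PASSWORD>' | \
gcloud secrets create secret-edw_credentials \
--project=${SOURCE_PROJECT} \
--replication-policy="automatic" \
--data-file=-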

Note (real customer use cases): please create VPC access connectors as part of the VPC that has firewalls open to connect to the customer's on-prem Teradata instance. The Cloud Run service (which executes the DVT tool) requires the VPC access connector to be attached to it in order to reach the Teradata/Oracle DB

Note (if you are a Googler): If you are running a test environment with Teradata Express edition or Oracle on GCP Compute VMs, ensure you add the network tag cc-dvt to the TD/Oracle instance

  5. Validate that all Airflow DAGs in the Cloud Composer instance are in the paused state through the Airflow UI

  6. Switch all the DAGs to Active

  7. Validate that workload_identity_creator_dag in the Airflow UI is executed once automatically on turning the DAGs active and completes successfully

alt_text

  8. If you intend to migrate tables over to a BQ dataset other than dmt-teradata-dataset, please ensure this dataset is manually created and is provided in the config files later (create tables from DDLs)

  9. (Mandatory for Teradata DDL extraction) Teradata DDL extraction requires the Teradata JDBC JAR. Upload the JDBC JAR file to your GCS config bucket under the software/teradata folder as shown in the sketch below:

    1. Download the JAR from Teradata downloads: https://downloads.teradata.com/download/connectivity/jdbc-driver

    2. Upload the JAR to your GCS config bucket and ensure that the JAR file name exactly matches terajdbc4.jar.
      Make sure the file is placed in the software/teradata path
      (e.g. gs://YOUR_DMT_CONFIG_BUCKET/software/teradata/terajdbc4.jar)
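A minimal upload sketch, assuming the driver has been downloaded to the current directory and YOUR_DMT_CONFIG_BUCKET is replaced with your actual config bucket name:

gsutil cp terajdbc4.jar gs://YOUR_DMT_CONFIG_BUCKET/software/teradata/terajdbc4.jar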

  10. (Optional - Only for Oracle) Remove the GCS bucket created to store the .RPM file during setup.

Build connection between Cloud Run and the source database [Applicable when network connectivity uses VPN]

  1. Create a serverless VPC connector (specify the same network and region that has firewall rules allowing connections to the on-prem/cloud data source)
gcloud compute networks vpc-access connectors create crun-dvt-connector \
--network default \
--region us-central1 \
--range 10.6.0.0/28 \
--min-instances 2 \
--max-instances 3 \
--machine-type f1-micro
  2. Attach the network tag cc-dvt as the target, or directly provide the IP address of the on-prem TD/Oracle instance. (This tag can be used as the target tag while creating a firewall rule at a later stage)

  3. Create an ingress firewall rule (in the source database network) to deny traffic from the connector network tag:

gcloud compute firewall-rules create deny-vpc-connector --action=DENY --rules=all --source-tags=vpc-connector-us-central1-crun-dvt-connector --direction=INGRESS --network=default --priority=990
  4. Create an ingress firewall rule targeting the Teradata/Oracle IP that you want the VPC connector to access. Set the priority for this rule to a lower value than the priority of the rule you created in the previous step. (Here cc-dvt is the network tag attached to the Teradata/Oracle VM)
gcloud compute firewall-rules create allow-vpc-connector-for-select-resources --allow=all --source-tags=vpc-connector-us-central1-crun-dvt-connector --direction=INGRESS --network=default --target-tags=cc-dvt --priority=980
  5. Deploy a new version of the Cloud Run DVT service (edw-dvt-tool-<customerid>) after performing the below changes

    1. Set the VPC connector alt_text
    2. Go to the Secrets section of the Cloud Run service and configure as shown below

    Secret: secret-edw_credentials

    Reference Method : Exposed as Environment variable

    Environment_variables

    Name 1: edw_credentials

alt_text

Trigger DMT

  • Translations
  • Schema Migration to Google BigQuery
  • Schema Validation
  1. Prepare a sample config.json file including input/output buckets, translation type (DDL, SQL or DML), dialects, name mapping properties and DVT connection details

  2. Upload required SQL and DDL files into respective bucket paths defined in config.json

    For example,

    DDL = gs://dmt-translation-control/input/ddl

    SQL = gs://dmt-translation-control/input/sql (SQL files), gs://dmt-translation-control/input/sql/ddl (DDL/Metadata files)

    DML = gs://dmt-translation-control/input/dml (DML files), gs://dmt-translation-control/input/sql/ddl (DDL/Metadata files)

  3. Upload config.json into the config folder of the bucket (see the upload sketch after this list)

    For example,

    DDL = gs://dmt-config-control/input/ddl

    SQL = gs://dmt-config-control/input/sql

    DML = gs://dmt-config-control/input/dml

  4. In case of DDL - validate execution of DAGs

    1. controller_dag
    2. extract_ddl_dag - if extract_ddl is yes
    3. batch_sql_translation
    4. schema_dag
    5. validation_dag
      1. validation_dag - if validation_mode is gke
      2. validation_crun - if validation_mode is cloudrun
  5. In case of SQL - validate execution of DAGs

    1. controller_dag
    2. batch_sql_translation
    3. validation_dag
      1. validation_dag - if validation_mode is gke
      2. validation_crun - if validation_mode is cloudrun
  6. In case of DML - validate execution of DAGs

    1. controller_dag
    2. batch_sql_translation
    3. dml_validation_dag
  7. Translated DDL and SQL files can be found in output buckets

    DDL = gs://dmt-translation-control/output/ddl

    SQL = gs://dmt-translation-control/output/sql

    DML = gs://dmt-translation-control/output/dml

  8. Translation summary results can be found in the output buckets along with the translated files - batch_translation_report.csv
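As a recap of steps 2 and 3 above, a minimal upload sequence for a DDL batch might look like the following. The bucket names are the example paths from those steps and the local file names are placeholders - substitute the buckets created by your own deployment.

# Step 2: copy the input DDL files (one statement per file) to the translation input path
gsutil -m cp ./ddl/*.sql gs://dmt-translation-control/input/ddl/

# Step 3: copy the batch configuration to the config path; the DAGs listed above pick up the run from here
gsutil cp ./config.json gs://dmt-config-control/input/ddl/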

Sample config.json

Click here for sample config.json files

Configuration JSON Attributes

DDL

JSON attribute Description
batchDistribution Defines the number of parallel Composer Airflow DAGs

[Number of DAGs = Total DDL files/batchDistribution]

extract_ddl Provide flag for ddl/schema extraction from source database
unique_id Provide a unique name for every batch run identification
type Type of Translation : ddl
source Source datawarehouse : teradata
translationErrorFilterRules

{field, matchtype, value}
Field : Category

Matchtype : equals

Value : <>

Allows bypassing of warnings and BQ dialect translation errors;

Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics

migrationTask: type Translation_Teradata2BQ
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode Teradata - SQL, BTEQ
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase BigQuery GCP Project ID
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect {}
gcsSourcePath GCS location of input DDL Teradata Files
gcsTargetPath GCS location where translated output DDL BQ dialect Files must be placed
nameMappingList:name_map:source

{type, database,schema}
Create output name mapping for schema translations for source

Refer to https://cloud.google.com/bigquery/docs/output-name-mapping

nameMappingList:name_map:target

{database,schema}
Create output name mapping for schema translations for target BQ

Refer to https://cloud.google.com/bigquery/docs/output-name-mapping

validation_config:source_config:source_type DVT Validation

Source datawarehouse : Teradata

validation_config:source_config:host hostname/ IP address of on-prem Teradata
validation_config:source_config:port Port number of on-prem Teradata
validation_config:source_config:user-name User-name with permissions to query on source tables
validation_config:source_config:password Secret Manager key name

secret: <secret key name>

For example - secret:edw_credentials

validation_config:target_config:target_type BigQuery
validation_config:target_config:project-id BigQuery GCP Project ID
validation_config:validation_params_file_path GCS location of the CSV file or Excel sheet, containing table or file names along with DVT Validation Flags.

Examples:

gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.csv

gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.xlsx

Read Instructions below to understand how to populate and upload this file/sheet

validation_type DVT validation type schema
validation_mode DVT validation mode

Possible values - gke, cloudrun

pod_operator_mem Pod Operator memory configuration when DVT mode is chosen as GKE

Default Value - 4000M

pod_operator_cpu Pod Operator cpu configuration when DVT mode is chosen as GKE

Default Value - 800m

SQL

JSON attribute Description
batchDistribution Defines the number of parallel Composer Airflow DAGs

[Number of DAGs = Total SQL files/batchDistribution]

unique_id Provide a unique name for every batch run identification
type Type of Translation : sql
source Source datawarehouse : teradata
translationErrorFilterRules

{field, matchtype, value}
Field : Category

Matchtype : equals

Value : <>

Allows bypassing of warnings and BQ dialect translation errors;

Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics

migrationTask: type Translation_Teradata2BQ
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode Teradata - SQL, BTEQ
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase BigQuery GCP Project ID
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect {}
gcsSourcePath GCS location of input SQL Teradata Files
gcsTargetPath GCS location where translated output SQL BQ dialect Files must be placed
nameMappingList:name_map:source

{type, database,schema}
Create output name mapping for schema translations for source

Refer to https://cloud.google.com/bigquery/docs/output-name-mapping

nameMappingList:name_map:target

{database,schema}
Create output name mapping for schema translations for target BQ

Refer to https://cloud.google.com/bigquery/docs/output-name-mapping

validation_config:source_config:source_type DVT Validation

Source datawarehouse : Teradata

validation_config:source_config:host hostname/ IP address of on-prem Teradata
validation_config:source_config:port Port number of on-prem Teradata
validation_config:source_config:user-name User-name with permissions to query on source tables
validation_config:source_config:password Secret Manager key name

secret: <secret key name>

For example - secret:edw_credentials

validation_config:target_config:target_type BigQuery
validation_config:target_config:project-id BigQuery GCP Project ID
validation_config:validation_params_file_path GCS location of the CSV file or Excel sheet, containing table or file names along with DVT Validation Flags.

Examples:

gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.csv

gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.xlsx

Read Instructions below to understand how to populate and upload this file/sheet

validation_type DVT validation type

row/column

validation_mode DVT validation mode

Possible values - gke, cloudrun

pod_operator_mem Pod Operator memory configuration when DVT mode is chosen as GKE

Default Value - 4000M

pod_operator_cpu Pod Operator cpu configuration when DVT mode is chosen as GKE

Default Value - 800m

DML

JSON attribute Description
batchDistribution N/A
unique_id Provide a unique name for every batch run identification
type Type of Translation : dml
source Source datawarehouse : teradata
translationErrorFilterRules

{field, matchtype, value}
Field : Category

Matchtype : equals

Value : <>

Allows bypassing of warnings and BQ dialect translation errors;

Example Values - NoTargetSupportForFeature, NoTargetSupportForSetTables, NoTargetSupportForIndexBackedConstraints, NoTargetSupportForPartitionSemantics

migrationTask: type Translation_Teradata2BQ
migrationTask:translationConfigDetails:sourceDialect:teradataDialect:mode Teradata - SQL, BTEQ
migrationTask:translationConfigDetails:sourceEnv:defaultDatabase BigQuery GCP Project ID
migrationTask:translationConfigDetails:targetDialect:bigqueryDialect {}
gcsSourcePath GCS location of input DML Teradata Files
gcsTargetPath GCS location where translated output DML BQ dialect Files must be placed
nameMappingList:name_map:source

{type, database,schema}
Create output name mapping for dml translations for source

Refer to https://cloud.google.com/bigquery/docs/output-name-mapping

nameMappingList:name_map:target

{database,schema}
Create output name mapping for dml translations for target BQ

Refer to https://cloud.google.com/bigquery/docs/output-name-mapping

Instructions To Populate And Upload Validation Parameters File

Please do not edit headers, data validation rules and other sheets in the Excel/Google Sheet as this will alter the data validation checks put in place.


Fill the columns according to the use case. The data validation rules within the sheet will help you fill the fields correctly. To know which flags are relevant for which translation (ddl, sql, custom-query) and migration (data), please refer to the tab named DVT Guide. This sheet contains links to the DVT GitHub pages which explain all possible flags in the DVT CLI.

You can refer to the tab Example Sheet to get an idea of which flags can be supplied for which validation.

  • To follow along with Excel sheet

    • Open this Excel Template File in Microsoft Excel. (not in Google Sheets, as opening .xlsx in Google Sheets will corrupt the existing Data Validation Rules)
    • Fill the sheet Validation . Once filled, save the file in .xlsx format. (Excel will convert the template file format .xltx to .xlsx file for you)

  • To follow along with CSV file format

    • Open this Google Sheet in browser (Viewer Access).
    • Make a copy, to get an editable sheet.
    • Once filled, download CSV file using the Menu Option. (File --> Download --> Comma Separated Values (.csv))

    Once filled, upload the corresponding Excel (.xlsx) file or CSV file to the GCS bucket (inside the relevant folder within dmt-config-dmt-demo-project/validation) and specify the GCS path in the configuration file, for example as sketched below.
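A minimal upload sketch, assuming the CSV was downloaded to the current directory and the example config bucket from earlier (dmt-config-dmt-demo-project) - replace the bucket and folder with the ones from your own deployment:

gsutil cp validation_params.csv gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.csv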

Create LookerStudio Dashboard using template

  1. Request access to template
  2. Access Dashboard template by clicking here
  3. Click on the three dots as shown in the below image and click on Make a copy

alt_text

Existing Data Source in Template BQ Table in Customer’s GCP Project
DVT Aggregated Results <GCP_PROJECT_ID>.dmt_logs.dmt_dvt_aggregated_results
Validation Results <GCP_PROJECT_ID>.dmt_logs.dmt_dvt_results
Schema Creation Results <GCP_PROJECT_ID>.dmt_logs.dmt_schema_results
Schema Creation Aggregated Results <GCP_PROJECT_ID>.dmt_logs.dmt_schema_aggregated_results
Translation Results <GCP_PROJECT_ID>.dmt_logs.dmt_translation_results
Translation Aggregated Results <GCP_PROJECT_ID>.dmt_logs.dmt_translation_aggregated_results
DML Validation Results <GCP_PROJECT_ID>.dmt_logs.dmt_dml_validation_results
DML Validation Aggregated Results <GCP_PROJECT_ID>.dmt_logs.dmt_dml_validation_aggregated_results
Report Table <GCP_PROJECT_ID>.dmt_logs.dmt_report_table
File Table Mapping <GCP_PROJECT_ID>.dmt_logs.dmt_file_table_mapping
{Data Source} Transfer Run Jobwise Details <GCP_PROJECT_ID>.dmt_logs.dmt__transfer_run_jobwise_details
{Data Source} Transfer Tracking <GCP_PROJECT_ID>.dmt_logs.dmt__transfer_tracking
{Data Source} Transfer Run Summary <GCP_PROJECT_ID>.dmt_logs.dmt__transfer_run_summary
  4. Now click on Copy Report. This will create a copy of the dashboard template.

alt_text 5. In the new template, click on Resource and then select Manage added data sources to add/replace BigQuery datasets.

alt_text 6. You'll now see a new page with the list of tables used in this template. Click on EDIT for one table at a time. This will navigate you to the next page.

alt_text 7. Click on EDIT CONNECTION

alt_text 8. In the search bar, type your GCP project name and select it. Then select the logs dataset and the respective table. Now click on RECONNECT

alt_text 9. You'll get a pop-up, click on Apply and then click on Done

alt_text alt_text

10. Repeat Steps 5 to 8 for all the remaining data sources, until all the data sources have been replaced.

11. Your dashboard will be named Copy of DMT Dashboard. Click on the name to rename it to any name that you would prefer for the dashboard

alt_text

DMT Infrastructure Destroy

Pre destroy step

  • If you did not set the _DELETE_BQ_TABLES substitution parameter to true at the time of deployment, delete the dmt_dataset and dmt_logs BigQuery datasets manually before running the destroy command, for example as sketched below.
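A minimal sketch for the manual cleanup, assuming the default dataset names dmt_dataset and dmt_logs - if your deployment used custom names, substitute them (note that this permanently deletes the datasets and their tables):

bq rm -r -f -d ${SOURCE_PROJECT}:dmt_dataset
bq rm -r -f -d ${SOURCE_PROJECT}:dmt_logs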

Destroy DMT for Teradata using the following command:

gcloud builds submit . \
   --config cloudbuild_destroy.yaml \
   --substitutions _DATA_SOURCE="teradata"

Cloud Build DMT uses the following default substitution values

  • PROJECT_ID=${PROJECT_ID}

  • _MY_BUCKET_NAME=${PROJECT_ID}-dmt-state-bucket

  • _MY_CUSTOMER_NAME=${PROJECT_ID}

  • _DELETE_BQ_TABLES="false"

These default values need to be updated to match the values given at the time of deployment.
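For example, if the deployment used the custom values from the earlier example (state bucket demo-state, customer name democust, _DELETE_BQ_TABLES set to "true"), the destroy command would pass the same substitutions; these values are illustrative only:

gcloud builds submit . \
   --config cloudbuild_destroy.yaml \
   --substitutions _MY_BUCKET_NAME=demo-state,_MY_CUSTOMER_NAME=democust,_DELETE_BQ_TABLES="true",_DATA_SOURCE="teradata"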