ETL - A project demonstrating batch processing with Google Cloud Dataflow and BigQuery


ETL Job - Cloud-Dataflow-Batch-Processing

  1. Store the dataset (.csv) in a Google Cloud Storage bucket.
  2. Create a Dataflow batch job that reads and processes the CSV file.
  3. In the Dataflow job, apply a "Group By" transform to count listings by the "neighbourhood" field.
  4. Store both the original CSV data and the transformed data in their own separate BigQuery tables (see the pipeline sketch below).
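
Steps 2-4 correspond roughly to the pipeline sketched below. This is a minimal outline rather than the repository's actual dataflow_pipeline.py: the column list, the bucket path, and the dataset/table names (listings_dataset, raw_listings, listings_by_neighbourhood) are assumptions for illustration.

    import csv
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumed column order of the CSV; adjust to match the real dataset.
    COLUMNS = ['id', 'name', 'neighbourhood', 'price']

    def parse_csv_line(line):
        # Turn one CSV line into a dict keyed by column name.
        return dict(zip(COLUMNS, next(csv.reader([line]))))

    def run():
        with beam.Pipeline(options=PipelineOptions()) as p:
            rows = (p
                    | 'ReadCSV' >> beam.io.ReadFromText(
                        'gs://<your-bucket>/datafilename.csv', skip_header_lines=1)
                    | 'ParseLines' >> beam.Map(parse_csv_line))

            # Step 4a: the original rows land in their own table.
            rows | 'WriteRaw' >> beam.io.WriteToBigQuery(
                '<project>:listings_dataset.raw_listings',
                schema='id:STRING,name:STRING,neighbourhood:STRING,price:STRING',
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)

            # Steps 3 and 4b: "Group By" neighbourhood, count, write to a second table.
            (rows
             | 'KeyByNeighbourhood' >> beam.Map(lambda r: (r['neighbourhood'], 1))
             | 'CountListings' >> beam.CombinePerKey(sum)
             | 'ToTableRow' >> beam.Map(
                 lambda kv: {'neighbourhood': kv[0], 'listing_count': kv[1]})
             | 'WriteCounts' >> beam.io.WriteToBigQuery(
                 '<project>:listings_dataset.listings_by_neighbourhood',
                 schema='neighbourhood:STRING,listing_count:INTEGER',
                 create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

    if __name__ == '__main__':
        run()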


gcloud commands - connect to Cloud Shell, set the project, and verify the setup

  • gcloud auth list

  • gcloud config list project

  • export PROJECT=""

  • gcloud config set project $PROJECT

  • gsutil mb -c regional -l us-east4 gs://$PROJECT

  • gsutil cp ./datafilename.csv gs://$PROJECT/

  • bq mk <dataset_name> (create the target BigQuery dataset; a client-library equivalent is sketched after this list)

  • export GOOGLE_APPLICATION_CREDENTIALS="/filename.json" (path to the service-account key file)

  • bq show -j --project_id=<project_id> <job_id> (inspect the BigQuery load job issued by the pipeline)
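
The bq mk step above creates the target BigQuery dataset. A roughly equivalent sketch using the google-cloud-bigquery client is shown below; the dataset name listings_dataset is a hypothetical placeholder, and the location matches the bucket region used above.

    import os
    from google.cloud import bigquery

    # Uses the same $PROJECT environment variable exported above.
    client = bigquery.Client(project=os.environ['PROJECT'])
    dataset_id = '{}.listings_dataset'.format(client.project)

    dataset = bigquery.Dataset(dataset_id)
    dataset.location = 'us-east4'  # same region as the storage bucket

    # exists_ok avoids an error if the dataset was already created with bq mk.
    client.create_dataset(dataset, exists_ok=True)
    print('Dataset {} is ready'.format(dataset_id))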

Additional setup and apache-beam installation

  • python2.7 -m virtualenv env (create the virtual environment)

  • source env/bin/activate

  • deactivate (after job)

  • pip freeze -r requirements.txt (environment setup)

  • pip install apache-beam[gcp] (the [gcp] extra pulls in the Dataflow runner and GCS/BigQuery I/O dependencies)
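
To confirm that Apache Beam is importable inside the virtual environment, a quick, project-independent check:

    # Print the installed Apache Beam version as a sanity check.
    from apache_beam import version
    print(version.__version__)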

Execute the pipeline locally (DirectRunner)

  • python local_directrunner_pipeline.py

Execute the pipeline on Dataflow (DataflowRunner)

  • python dataflow_pipeline.py \
      --project=$PROJECT \
      --runner=DataflowRunner \
      --staging_location=gs://$PROJECT/temp \
      --temp_location=gs://$PROJECT/temp \
      --input=gs://$PROJECT/datafilename.csv \
      --save_main_session
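
The --project, --runner, --staging_location, and --temp_location flags are standard Dataflow options, while --input is a custom flag that dataflow_pipeline.py has to declare itself. A minimal sketch of how that wiring might look is below; the class name ListingOptions is hypothetical, and --save_main_session is a standard SetupOptions flag that ships the main module's state (imports, globals) to the Dataflow workers.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class ListingOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Custom flag: GCS path of the CSV file to process.
            parser.add_argument('--input', required=True,
                                help='gs:// path of the input CSV file')

    def run():
        # PipelineOptions() parses sys.argv, picking up both the standard
        # Dataflow flags and the custom --input flag declared above.
        options = PipelineOptions()
        custom = options.view_as(ListingOptions)

        with beam.Pipeline(options=options) as p:
            lines = (p | 'ReadCSV' >> beam.io.ReadFromText(custom.input,
                                                           skip_header_lines=1))
            # ... parsing, the "Group By" count, and the BigQuery writes
            # would follow as in the overview sketch near the top.

    if __name__ == '__main__':
        run()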
