ETL - A project demonstrating batch processing with Google Cloud Dataflow and BigQuery


ETL Job - Cloud-Dataflow-Batch-Processing

  1. Store the dataset (.csv) in a Google Cloud Storage bucket.
  2. Create a Dataflow batch job that reads and processes the CSV file.
  3. In the Dataflow job, apply a "Group By" transform to count listings by the "neighbourhood" field.
  4. Store both the original CSV data and the transformed data in their own separate BigQuery tables (see the pipeline sketch below).
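
Steps 2-4 correspond roughly to the pipeline sketched below. This is a minimal outline rather than the repository's actual dataflow_pipeline.py: the column list, the bucket path, and the dataset/table names (listings_dataset, raw_listings, listings_by_neighbourhood) are assumptions for illustration.

    import csv
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumed column order of the CSV; adjust to match the real dataset.
    COLUMNS = ['id', 'name', 'neighbourhood', 'price']

    def parse_csv_line(line):
        # Turn one CSV line into a dict keyed by column name.
        return dict(zip(COLUMNS, next(csv.reader([line]))))

    def run():
        with beam.Pipeline(options=PipelineOptions()) as p:
            rows = (p
                    | 'ReadCSV' >> beam.io.ReadFromText(
                        'gs://<your-bucket>/datafilename.csv', skip_header_lines=1)
                    | 'ParseLines' >> beam.Map(parse_csv_line))

            # Step 4a: the original rows land in their own table.
            rows | 'WriteRaw' >> beam.io.WriteToBigQuery(
                '<project>:listings_dataset.raw_listings',
                schema='id:STRING,name:STRING,neighbourhood:STRING,price:STRING',
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)

            # Steps 3 and 4b: "Group By" neighbourhood, count, write to a second table.
            (rows
             | 'KeyByNeighbourhood' >> beam.Map(lambda r: (r['neighbourhood'], 1))
             | 'CountListings' >> beam.CombinePerKey(sum)
             | 'ToTableRow' >> beam.Map(
                 lambda kv: {'neighbourhood': kv[0], 'listing_count': kv[1]})
             | 'WriteCounts' >> beam.io.WriteToBigQuery(
                 '<project>:listings_dataset.listings_by_neighbourhood',
                 schema='neighbourhood:STRING,listing_count:INTEGER',
                 create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

    if __name__ == '__main__':
        run()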


gcloud commands - connect to Cloud Shell, set the project, and verify the setup

  • gcloud auth list

  • gcloud config list project

  • export PROJECT=""

  • gcloud config set project $PROJECT

  • gsutil mb -c regional -l us-east4 gs://$PROJECT

  • gsutil cp ./datafilename.csv gs://$PROJECT/

  • bq mk <dataset_name> (create the target BigQuery dataset; a client-library equivalent is sketched after this list)

  • export GOOGLE_APPLICATION_CREDENTIALS="/filename.json" (path to the service-account key file)

  • bq show -j --project_id=<project_id> <job_id> (inspect the BigQuery load job issued by the pipeline)
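
The bq mk step above creates the target BigQuery dataset. A roughly equivalent sketch using the google-cloud-bigquery client is shown below; the dataset name listings_dataset is a hypothetical placeholder, and the location matches the bucket region used above.

    import os
    from google.cloud import bigquery

    # Uses the same $PROJECT environment variable exported above.
    client = bigquery.Client(project=os.environ['PROJECT'])
    dataset_id = '{}.listings_dataset'.format(client.project)

    dataset = bigquery.Dataset(dataset_id)
    dataset.location = 'us-east4'  # same region as the storage bucket

    # exists_ok avoids an error if the dataset was already created with bq mk.
    client.create_dataset(dataset, exists_ok=True)
    print('Dataset {} is ready'.format(dataset_id))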

Additional setup and apache-beam installation

  • python2.7 -m virtualenv env (create the virtual environment)

  • source env/bin/activate

  • deactivate (after job)

  • pip freeze -r requirements.txt (environment setup)

  • pip install apache-beam[gcp] (the [gcp] extra pulls in the Dataflow runner and GCS/BigQuery I/O dependencies)
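
To confirm that Apache Beam is importable inside the virtual environment, a quick, project-independent check:

    # Print the installed Apache Beam version as a sanity check.
    from apache_beam import version
    print(version.__version__)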

Execute the pipeline locally (DirectRunner)

  • python local_directrunner_pipeline.py

Execute the pipeline on Dataflow (DataflowRunner)

  • python dataflow_pipeline.py \
      --project=$PROJECT \
      --runner=DataflowRunner \
      --staging_location=gs://$PROJECT/temp \
      --temp_location=gs://$PROJECT/temp \
      --input=gs://$PROJECT/datafilename.csv \
      --save_main_session
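
The --project, --runner, --staging_location, and --temp_location flags are standard Dataflow options, while --input is a custom flag that dataflow_pipeline.py has to declare itself. A minimal sketch of how that wiring might look is below; the class name ListingOptions is hypothetical, and --save_main_session is a standard SetupOptions flag that ships the main module's state (imports, globals) to the Dataflow workers.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class ListingOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Custom flag: GCS path of the CSV file to process.
            parser.add_argument('--input', required=True,
                                help='gs:// path of the input CSV file')

    def run():
        # PipelineOptions() parses sys.argv, picking up both the standard
        # Dataflow flags and the custom --input flag declared above.
        options = PipelineOptions()
        custom = options.view_as(ListingOptions)

        with beam.Pipeline(options=options) as p:
            lines = (p | 'ReadCSV' >> beam.io.ReadFromText(custom.input,
                                                           skip_header_lines=1))
            # ... parsing, the "Group By" count, and the BigQuery writes
            # would follow as in the overview sketch near the top.

    if __name__ == '__main__':
        run()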
