Serverless BigData ETL

We need to:

  • read text files from a folder
  • send each line to a message queue
  • process each message with C++ and write the result to the database
  • publish a new message per file once all of its lines have been written to the database
  • read the data back from the database and feed it to the neural network

We'll use

  • Google Cloud Storage (GCS)
  • Google DataProc (Apache Spark)
  • Google Cloud Functions
  • Google Compute Engine
  • Google PubSub
  • Google Datastore
  • Python
  • Node.js
  • C++

Step 1

  1. Create a GCS bucket named dataproc-cluster
  2. Upload pip-install.sh and producer.py to the bucket (see the sketch after this list)
  3. Upload the input text files to the data folder
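
The uploads in items 2 and 3 can also be scripted. Below is a minimal sketch using the google-cloud-storage Python client; the bucket name comes from item 1, while the local file names (in particular the input file) are assumptions for illustration.

# upload_inputs.py: hypothetical helper, not part of the repo
from google.cloud import storage

BUCKET = "dataproc-cluster"   # bucket created in item 1

client = storage.Client()
bucket = client.bucket(BUCKET)

# Cluster bootstrap script and the Spark producer job
for local_file in ["pip-install.sh", "producer.py"]:
    bucket.blob(local_file).upload_from_filename(local_file)

# Input text files go under the data folder
bucket.blob("data/input-01.txt").upload_from_filename("input-01.txt")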

Step 2

Create a DataProc cluster (Apache Spark)

gcloud beta dataproc clusters create \
cluster-test \
--enable-component-gateway \
--bucket dataproc-cluster-test \
--region europe-west1 \
--subnet default \
--zone "" \
--master-machine-type n1-highmem-4 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-4 \
--worker-boot-disk-size 500 \
--image-version 1.3-deb9 \
--optional-components ANACONDA,JUPYTER \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--tags dataproc \
--project pq-stream \
--initialization-actions 'gs://dataproc-cluster/pip-install.sh' \
--metadata 'PIP_PACKAGES=google-cloud-pubsub==0.42.1' \
--properties 'spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=20g,spark:spark.driver.maxResultSize=16g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g'

Notes

  1. pip-install.sh together with --metadata installs the listed pip packages (here google-cloud-pubsub) on every node of the Spark cluster.
  2. The --properties flag raises the JVM limits: a larger thread stack (-Xss4M), more driver memory and result size, a higher task-failure tolerance, and a bigger Kryo serializer buffer.

Step 3

Create the PubSub topics (a Python sketch follows the list):

  • data-raw
  • data-saved
  • data-final
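
The topics can be created in the console, with gcloud, or programmatically. A minimal Python sketch using the google-cloud-pubsub client pinned for the cluster (the 0.42.x API); the project ID pq-stream is taken from the cluster command above.

# create_topics.py: hypothetical helper for this step
from google.cloud import pubsub_v1

PROJECT = "pq-stream"   # project ID from the cluster command

publisher = pubsub_v1.PublisherClient()
for topic in ["data-raw", "data-saved", "data-final"]:
    # topic_path() builds the full name "projects/pq-stream/topics/<topic>"
    publisher.create_topic(publisher.topic_path(PROJECT, topic))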

Step 4

Create a Cloud Datastore database; this is where the processed results are written.
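
For reference, a Datastore write from Python could look like the sketch below; the entity kind LineResult and its properties are illustrative, not taken from the repo.

# save_result.py: illustrative Datastore write; kind and fields are assumptions
from google.cloud import datastore

client = datastore.Client(project="pq-stream")

def save_result(file_name, line_no, value):
    entity = datastore.Entity(key=client.key("LineResult"))  # ID is auto-generated
    entity.update({"file": file_name, "line": line_no, "value": value})
    client.put(entity)

save_result("input-01.txt", 1, 42.0)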

Step 5

Create two Google Cloud Functions (a sketch follows the list):

  • consumer
  • finalNotification
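
The repo lists Node.js for Cloud Functions, so the sketch below shows equivalent logic in Python rather than the actual code; it also assumes consumer is triggered by the data-raw topic and finalNotification by the data-saved topic.

# main.py: hypothetical Python equivalents of the two functions
import base64

def process(line):
    # Stand-in for the real C++ processing of a line
    return len(line)

def save_result(line, result):
    # Stand-in for the database write (see the Datastore sketch in Step 4)
    print("saving:", line, result)

def consumer(event, context):
    # One Pub/Sub message per text line from the data-raw topic
    line = base64.b64decode(event["data"]).decode("utf-8")
    save_result(line, process(line))

def finalNotification(event, context):
    # One message per file, once all of its lines are in the database
    file_name = base64.b64decode(event["data"]).decode("utf-8")
    print("All lines of", file_name, "have been saved")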

Step 6

Run the aggregation app on a Google Compute Engine instance.
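
The aggregation app itself is not shown in this README. One skeleton it could be built around is a Pub/Sub pull subscriber running on the instance; the subscription name data-saved-sub (on the data-saved topic) is an assumption.

# aggregator.py: hypothetical skeleton of the Compute Engine app
import time
from google.cloud import pubsub_v1

PROJECT = "pq-stream"
SUBSCRIPTION = "data-saved-sub"   # assumed subscription name

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

def callback(message):
    # Each message names a file whose lines are all in the database;
    # here the app would read them back, aggregate, and hand the data on.
    print("file ready:", message.data.decode("utf-8"))
    message.ack()

subscriber.subscribe(sub_path, callback=callback)

# Keep the process alive while messages are handled in the background.
while True:
    time.sleep(60)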

Step 7

Start a DataProc job

gcloud beta dataproc jobs submit pyspark \
gs://dataproc-cluster-test/producer.py \
--cluster cluster-test \
--region europe-west1
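
producer.py is not reproduced in this README. A minimal sketch of what such a job could look like: a PySpark job that reads the uploaded text files and publishes every line to the data-raw topic, using the google-cloud-pubsub package installed on every node by the init action. The input path is an assumption.

# producer_sketch.py: illustrative, not the repo's producer.py
from pyspark.sql import SparkSession
from google.cloud import pubsub_v1

PROJECT = "pq-stream"                          # project ID from the cluster command
TOPIC = "data-raw"                             # topic created in Step 3
INPUT = "gs://dataproc-cluster/data/*.txt"     # assumed location of the input files

def publish_partition(lines):
    # One Pub/Sub client per partition, reused for all lines in it
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, TOPIC)
    futures = [publisher.publish(topic_path, data=line.encode("utf-8")) for line in lines]
    for f in futures:
        f.result()   # wait until every message has been accepted

spark = SparkSession.builder.appName("producer").getOrCreate()
spark.sparkContext.textFile(INPUT).foreachPartition(publish_partition)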
