Sparkify, a music streaming startup, wanted to centralize the user activity and song logs they collect so they could run analytics on them. This AWS S3 data lake, organized as a star schema, gives them intuitive access to their data and lets them start drawing rich insights about their user base.
I set up an EMR cluster running Spark to process their logs, reading them in from an S3 bucket. I then ran transformations on that big data, splitting it out into separate tables and writing the results back into an S3 data lake.
My client Sparkify has moved to a cloud-based system and now keeps their big data logs in an AWS S3 bucket. The end goal was to get the raw .json data from those logs into fact and dimension tables in an S3 data lake stored as parquet files.
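To give a sense of what the Spark job does, here is a minimal sketch of one such transformation. The bucket paths and column names are illustrative only; the actual `etl.py` builds the full set of fact and dimension tables.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkifyDataLake").getOrCreate()

# Read raw JSON song data from the input S3 bucket (path is illustrative)
song_data = spark.read.json("s3a://example-input-bucket/song_data/*/*/*/*.json")

# Build a simple dimension table by selecting and de-duplicating columns
songs_table = (
    song_data
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Write the table back to the data lake as parquet, partitioned for query efficiency
(
    songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://example-output-bucket/songs/")
)
```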
- Start by cloning this repository
- Install all Python requirements from requirements.txt
- Create an S3 bucket and fill in those details in the `output_data` variable in `etl.py`'s `main()`
- Initialize an EMR cluster with Spark
- Fill in `dl_template.cfg` with your own custom details (a sketch of how these values might be read follows this list)
- SSH into the EMR cluster and upload your `dl_template.cfg` and `etl.py` files
- Run `spark-submit etl.py` to start the Spark job and write the resulting tables to parquet files in your S3 output path
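For reference, here is a rough sketch of how `etl.py` might pick up your values from `dl_template.cfg`. The section and key names (`AWS`, `S3`, `OUTPUT_BUCKET`, and the credential keys) are assumptions, so match them to whatever your config file actually contains.

```python
import configparser
import os

# Read the config file uploaded alongside etl.py on the EMR cluster.
# Section/key names below are assumptions; adjust to your dl_template.cfg.
config = configparser.ConfigParser()
config.read("dl_template.cfg")

# Export AWS credentials so Spark's S3 connector can authenticate
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

# The output path points at the S3 bucket created earlier
output_data = config["S3"]["OUTPUT_BUCKET"]  # e.g. "s3a://your-output-bucket/"
```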