- Description
- Project Dependencies
- The Dataset
- Utilised Tools
- Setup
- Batch Processing
- Stream Processing
- File Structure
- License
Pinterest crunches billions of data points every day to provide more value to its users. This project emulates that system using the AWS Cloud, setting up a comprehensive data pipeline that mirrors Pinterest's data processing operations.
The project is divided into three main parts:
- Setup
- Batch Processing
- Stream Processing
To run this project, the required Python packages are listed in environment.yml. Follow the "Creating an environment from an environment.yml file" instructions in the conda documentation to replicate the conda environment.
The project includes a script (user_posting_emulation.py) that simulates the flow of data points similar to those received by Pinterest's API during user data uploads. This data is stored in an AWS RDS database with the following tables:
- pinterest_data: Data about posts being uploaded to Pinterest.
- geolocation_data: Geolocated data related to each post.
- user_data: Information about the users uploading the posts.
The script loops continuously, sleeping for a random interval on each iteration, selecting a row from each table, and compiling the results into dictionaries for further processing.
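A simplified sketch of this loop is shown below. The connection string, row counts, and printing are illustrative only; the actual implementation lives in user_posting_emulation.py.

```python
import random
from time import sleep

import sqlalchemy

# Illustrative connection string — the real credentials are kept outside the repository
engine = sqlalchemy.create_engine("mysql+pymysql://<USER>:<PASSWORD>@<HOST>:3306/<DATABASE>")


def run_infinite_post_data_loop():
    """Select a random row from each table at random intervals."""
    while True:
        sleep(random.randrange(0, 2))            # wait a random interval
        random_row = random.randint(0, 11000)    # hypothetical upper bound on row index

        with engine.connect() as connection:
            for table in ("pinterest_data", "geolocation_data", "user_data"):
                query = sqlalchemy.text(f"SELECT * FROM {table} LIMIT {random_row}, 1")
                row = connection.execute(query).first()
                result = dict(row._mapping)      # compile the row into a dictionary
                print(table, result)
```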
Apache Kafka is an event streaming platform used to process streaming data in real time.
Amazon MSK is a fully managed service for Apache Kafka, enabling the easy setup and management of Kafka clusters.
MSK Connect simplifies the process of streaming data to and from Apache Kafka clusters.
The Confluent REST Proxy provides a RESTful interface to interact with Kafka clusters, allowing message production, consumption, and cluster administration via HTTP requests.
Amazon API Gateway is a managed service that facilitates the creation, publication, maintenance, and security of APIs.
Apache Spark is a powerful engine for data processing, enabling large-scale data engineering and machine learning.
PySpark is the Python API for Apache Spark, used for real-time, distributed data processing.
Databricks is a platform that provides tools for running Apache Spark applications, used in this project for batch and stream data processing.
- **Create a Key Pair File**: In the AWS console, generate a key-pair file for authentication.
- **Configure Security Groups**: Create a security group with rules allowing HTTP, HTTPS, and SSH access.
- **Launch an EC2 Instance**: Create an EC2 instance from the Amazon Linux 2 AMI and install the Kafka and IAM MSK authentication packages on the client EC2 machine.
- **Configure Kafka Client Properties**: Edit the Kafka client's properties file on the EC2 instance so that it authenticates with the MSK cluster using IAM.
- **Create MSK Clusters and Kafka Topics**: Set up an Amazon MSK cluster and create Kafka topics for the three data tables.
- **Create an S3 Bucket**: Create an S3 bucket to store the data extracted from the MSK cluster.
- **Create an IAM Role**: Set up an IAM role with permissions to write to the S3 bucket.
- **Create a VPC Endpoint**: Establish a VPC endpoint to connect the MSK cluster directly to the S3 bucket.
- **Set up MSK Connect**: Create a connector, using a custom plug-in associated with the IAM role, to stream data from MSK to the S3 bucket.
- **Create and Configure an API**: Set up a REST API in AWS API Gateway, create child resources, and configure methods that forward requests to the Kafka REST Proxy on the EC2 client machine.

  *Configuration of API endpoints*

- **Deploy the API**: Deploy the API to obtain an invoke URL, which is used to send data to the Kafka topics via the API.
- **Install the Confluent Package**: Install the Confluent package for the Kafka REST Proxy on the EC2 client machine.
- **Configure kafka-rest.properties**: Modify the properties file to specify the bootstrap server and the IAM role; see the example below.
- **Deploy the REST Proxy API**: Deploy the REST Proxy and use the API to send data to the Kafka topics, as sketched below.
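A minimal kafka-rest.properties sketch is shown below; the Zookeeper and bootstrap server strings and the IAM role ARN are placeholders for your own MSK cluster and role.

```properties
# kafka-rest.properties — placeholder values for your own MSK cluster and IAM role
zookeeper.connect=<ZOOKEEPER_CONNECTION_STRING>
bootstrap.servers=<BOOTSTRAP_SERVER_STRING>

# IAM authentication for the underlying Kafka clients
client.security.protocol = SASL_SSL
client.sasl.mechanism = AWS_MSK_IAM
client.sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="<IAM_ROLE_ARN>";
client.sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler
```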
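Once the API and the REST Proxy are running, a record can be sent to a topic through the invoke URL. The sketch below is an illustrative example: the invoke URL, topic name, and record fields are placeholders, and the payload follows the Kafka REST Proxy JSON format.

```python
import json

import requests

# Placeholder invoke URL and topic name — replace with your own API deployment
invoke_url = "https://<API_ID>.execute-api.<REGION>.amazonaws.com/<STAGE>/topics/<TOPIC_NAME>"

record = {"index": 1, "unique_id": "abc-123", "title": "Example post"}

# The Kafka REST Proxy expects records wrapped in a "records" array
payload = json.dumps({"records": [{"value": record}]})
headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

response = requests.post(invoke_url, data=payload, headers=headers)
print(response.status_code)  # 200 indicates the record was accepted
```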
To perform batch processing:
- **Mount the S3 Bucket in Databricks**: Mount the S3 bucket to Databricks and load the data into DataFrames.
- **Clean the Data**: Use PySpark to clean the data by removing duplicates, renaming columns, handling null values, and converting data types (see the sketch after this list).
- **Automate Processing with MWAA**: Set up MWAA to trigger the Databricks notebooks automatically, using a DAG file for scheduling (an example DAG is sketched below).
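As an illustration of the cleaning step, the sketch below assumes a DataFrame named df_pin with hypothetical column names; the actual transformations live in batch_data_cleaning.ipynb.

```python
from pyspark.sql.functions import col, when

# Illustrative cleaning of a hypothetical posts DataFrame, df_pin
df_pin_clean = (
    df_pin
    .dropDuplicates()                                    # remove duplicate rows
    .withColumnRenamed("index", "ind")                   # rename a column
    .withColumn(                                         # treat empty strings as nulls
        "description",
        when(col("description") == "", None).otherwise(col("description")),
    )
    .withColumn("follower_count", col("follower_count").cast("int"))  # convert data types
)
```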
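The MWAA automation can follow the pattern sketched below, using the Airflow Databricks provider; the connection ID, cluster ID, and notebook path are placeholders, and the real schedule lives in 0abb070c336b_dag.py.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

default_args = {
    "owner": "<OWNER>",
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="0abb070c336b_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",        # run the batch cleaning once a day
    catchup=False,
    default_args=default_args,
) as dag:
    run_cleaning_notebook = DatabricksSubmitRunOperator(
        task_id="run_batch_data_cleaning",
        databricks_conn_id="databricks_default",             # Databricks connection configured in MWAA
        existing_cluster_id="<CLUSTER_ID>",                   # placeholder cluster ID
        notebook_task={"notebook_path": "<NOTEBOOK_PATH>"},   # e.g. the batch_data_cleaning notebook
    )
```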
The batch_queries.ipynb notebook contains queries to analyze the cleaned data, providing insights such as popular Pinterest categories, follower counts, and user activity over time.
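As an example of the kind of query in that notebook, the sketch below counts posts per category per year; it assumes cleaned DataFrames df_pin and df_geo that share an ind column, which is an assumption rather than the notebook's exact code.

```python
from pyspark.sql.functions import count, year

# Hypothetical query: number of posts per category per year
category_counts = (
    df_pin.join(df_geo, on="ind", how="inner")
    .withColumn("post_year", year("timestamp"))
    .groupBy("post_year", "category")
    .agg(count("*").alias("category_count"))
    .orderBy("post_year", "category_count", ascending=False)
)
category_counts.show()
```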
- **Create Kinesis Streams**: Create three streams in Kinesis for the `pin`, `geo`, and `user` data.
- **Set up API Resources**: Create resources and methods in AWS API Gateway to interact with the Kinesis streams via HTTP requests.
- **Deploy the API**: Deploy the API to enable real-time data streaming.
- **Send Data to Kinesis Streams**: Modify the provided script to send data to the Kinesis streams using the configured API (see the sketch after this list).
- **Ingest and Clean Streaming Data**: In Databricks, read data from the Kinesis streams, clean it, and prepare it for storage (see the sketch after this list).
- **Save Cleaned Data**: Save the cleaned streaming data into Delta Tables in Databricks.
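A sketch of sending a single record to one of the Kinesis streams through the API Gateway invoke URL is shown below; the URL, stream name, and payload fields are placeholders and depend on how the API methods and mapping templates were configured.

```python
import json

import requests

# Placeholder invoke URL and stream name — replace with your own deployment
invoke_url = "https://<API_ID>.execute-api.<REGION>.amazonaws.com/<STAGE>/streams/<STREAM_NAME>/record"

record = {"index": 1, "unique_id": "abc-123", "title": "Example post"}

# Kinesis PutRecord payload forwarded by the API Gateway integration
payload = json.dumps({
    "StreamName": "<STREAM_NAME>",
    "Data": record,
    "PartitionKey": "partition-1",
})
headers = {"Content-Type": "application/json"}

response = requests.put(invoke_url, data=payload, headers=headers)
print(response.status_code)
```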
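Inside Databricks, the ingest, clean, and save steps can follow the pattern below, which uses the Databricks Kinesis connector and a Delta table; the stream name, region, schema, and table name are placeholders, and AWS credentials are assumed to be supplied by the cluster (for example via an instance profile).

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Placeholder schema for the incoming JSON records
pin_schema = StructType([
    StructField("unique_id", StringType(), True),
    StructField("title", StringType(), True),
])

# Read from a Kinesis stream with the Databricks Kinesis connector
# (the `spark` session is provided by the Databricks notebook environment)
raw_stream = (
    spark.readStream.format("kinesis")
    .option("streamName", "<STREAM_NAME>")
    .option("region", "<REGION>")
    .option("initialPosition", "earliest")
    .load()
)

# The payload arrives in the binary "data" column; decode and parse it
parsed_stream = (
    raw_stream
    .selectExpr("CAST(data AS STRING) AS json_data")
    .select(from_json(col("json_data"), pin_schema).alias("record"))
    .select("record.*")
)

# Save the cleaned stream into a Delta table (placeholder table name)
(
    parsed_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/kinesis/_checkpoints/")
    .table("pin_streaming_table")
)
```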
Your main project folder should have the following structure:
pinterest-data-pipeline/
├── images/ # Images used in the documentation
├── 0abb070c336b_dag.py # DAG file for Apache Airflow
├── batch_data_cleaning.ipynb # Jupyter Notebook for batch data cleaning
├── batch_queries.ipynb # Jupyter Notebook for batch data analysis
├── README.md # Main documentation file
├── streaming_data.ipynb # Jupyter Notebook for streaming data processing
├── user_posting_emulation.py # Script for emulating user posting data
└── user_posting_emulation_streaming.py # Script for emulating user posting data with streaming
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) [YEAR] [YOUR NAME OR ORGANIZATION]
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.