This project implements an ETL pipeline for analyzing customer sales data. The pipeline extracts sales data from multiple CSV files, transforms it by cleaning and aggregating the data, and loads it into a MySQL database for analysis. The process is automated and scheduled using Apache Airflow.
Features:

- Data Extraction: Extracts data from multiple CSV files and consolidates it into a single dataset.
- Data Transformation: Cleans the data by handling missing values, standardizes data formats, engineers features, and aggregates the results.
- Data Loading: Loads the transformed data into a MySQL database for further analysis.
- Orchestration: The ETL process is managed using Apache Airflow.
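The extraction and transformation steps above can be sketched with pandas. The file pattern and column names (`data/*.csv`, `order_id`, `amount`, `region`) are illustrative assumptions, not the project's actual schema:

```python
import glob

import pandas as pd


def extract(pattern: str = "data/*.csv") -> pd.DataFrame:
    """Read every matching CSV and consolidate into one DataFrame."""
    frames = [pd.read_csv(path) for path in glob.glob(pattern)]
    return pd.concat(frames, ignore_index=True)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, standardize, and aggregate sales rows (illustrative columns)."""
    df = df.dropna(subset=["order_id"])                   # drop rows missing the key
    df["amount"] = df["amount"].fillna(0.0)               # handle missing values
    df["region"] = df["region"].str.strip().str.upper()   # standardize format
    # Aggregate: total and average sale amount per region.
    return df.groupby("region", as_index=False).agg(
        total_amount=("amount", "sum"),
        avg_amount=("amount", "mean"),
    )
```

The real scripts in `scripts/` may clean different columns, but the shape of the step is the same: consolidate, clean, then aggregate.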
Tech stack:

- Python
- Apache Airflow
- Pandas
- MySQL
- Docker (optional for Airflow setup)
Setup:

- Clone the repository.
- Install the required packages: `pip install -r requirements.txt`
- Configure your MySQL connection in `load.py`.
- Run the Airflow DAG to execute the ETL pipeline.
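A minimal sketch of the loading step. The project's `load.py` targets MySQL, but the snippet below uses the stdlib `sqlite3` driver as a stand-in so it runs anywhere; for MySQL you would swap in a connector such as `mysql-connector-python` and its connection parameters. The table and column names are illustrative:

```python
import sqlite3


def load(rows, db_path=":memory:"):
    """Insert aggregated sales rows into a database table.

    `rows` is an iterable of (region, total_amount) tuples.
    sqlite3 is used here only as a stand-in: a MySQL driver exposes the
    same DB-API calls (connect / execute / executemany / commit), though
    its upsert syntax differs (MySQL uses ON DUPLICATE KEY UPDATE).
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary ("
        "region TEXT PRIMARY KEY, total_amount REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sales_summary (region, total_amount) "
        "VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return conn
```

Using parameterized `executemany` keeps the load step safe from SQL injection and fast for batch inserts.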
Project structure:

- `dags/`: Contains the Airflow DAG.
- `scripts/`: Python scripts for extraction, transformation, and loading.
- `data/`: Sample sales data in CSV format.
- `requirements.txt`: List of required Python packages.
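The DAG in `dags/` presumably wires the three steps together in order. A minimal sketch using Airflow 2.x's `PythonOperator` might look like the following; the DAG id, schedule, and the `scripts.*` import paths are assumptions for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module paths; adjust to the actual functions in scripts/.
from scripts.extract import extract
from scripts.transform import transform
from scripts.load import load

with DAG(
    dag_id="sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run the steps strictly in sequence: extract, then transform, then load.
    t_extract >> t_transform >> t_load
```

Because a DAG file is essentially pipeline configuration, Airflow picks it up from `dags/` automatically once the scheduler is running.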
This project showcases an end-to-end ETL process that can be applied to real-world data analysis scenarios.