2021-09-10
- ETL script is written in Python
- Python libraries used include pandas, PySpark, requests, SQLAlchemy, and the standard-library glob module
The ETL process consists of the following steps:
- Make API calls to extract dimension tables and fact data
- Store raw dimension and fact data as partitioned CSV files
- Load partitioned CSV files into SQLite database for staging and analysis
- Run SQL queries to answer the specific analysis questions and print the results to the console (see the sketch after this list)
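For reference, the extract-and-store steps might look roughly like the sketch below; the API endpoint, JSON layout, and partition column are illustrative assumptions, not the actual pipeline code.

# Sketch of the extract-and-store steps (hypothetical endpoint and schema).
import os
import pandas as pd
import requests

API_URL = "https://example.com/api/sales"  # placeholder endpoint

def extract_and_store(raw_dir="datalake/raw"):
    # Extract: pull fact data from the API as JSON.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    df = pd.DataFrame(response.json())

    # Store: write one CSV file per partition value (here, partitioned by date).
    for date_value, partition in df.groupby("date"):
        partition_dir = os.path.join(raw_dir, f"date={date_value}")
        os.makedirs(partition_dir, exist_ok=True)
        partition.to_csv(os.path.join(partition_dir, "part-000.csv"), index=False)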
Follow the steps below to test the ETL process using sample JSON data files.
- Install the Python libraries
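For example, assuming pip is available (glob ships with the Python standard library, so only the third-party packages need installing):
pip install pandas pyspark requests sqlalchemy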
- Open a terminal window and cd to the 'pipeline' folder that contains the etl.py and query.py files:
cd c:/usr/documents/Project/pipeline
- Run the etl.py script to extract data from the API and load it into the data lake:
python etl.py
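When testing with the sample JSON data files, the extract step can read local files instead of calling the live API; a minimal sketch, with the folder name assumed for illustration:

# Sketch of the test path: read sample JSON files instead of calling the API.
import glob
import pandas as pd

def extract_from_samples(sample_dir="sample_data"):  # placeholder folder name
    frames = [pd.read_json(path) for path in glob.glob(f"{sample_dir}/*.json")]
    # Combine all sample files into a single raw DataFrame.
    return pd.concat(frames, ignore_index=True)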
- Run the query.py script to load the data-lake files into SQLite and run the queries:
python query.py
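Internally, the staging-and-query step could look roughly like the sketch below; the table name, database file, and SQL are placeholders, not the actual queries in query.py.

# Sketch of staging the partitioned CSVs in SQLite and running a query (placeholder names).
import glob
import pandas as pd
from sqlalchemy import create_engine, text

def stage_and_query(raw_dir="datalake/raw"):
    engine = create_engine("sqlite:///staging.db")

    # Stage: append every partitioned CSV into one staging table.
    for path in glob.glob(f"{raw_dir}/*/*.csv"):
        pd.read_csv(path).to_sql("fact_sales", engine, if_exists="append", index=False)

    # Query: run SQL against the staging table and print the results to the console.
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT COUNT(*) AS row_count FROM fact_sales")).fetchall()
        print(rows)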
Please see the "output.txt" file for an example of the console log of the pipeline after a test run.