## Overview

This project centralises data from multiple sources for a retail company. It extracts data from several databases and other sources, cleans and transforms it, and uploads the cleaned data into a single database for querying.
## Installation

If you are unsure how to clone a repository, follow this guide.

To set up the project, follow these steps:
- **Clone the repository:**

  ```bash
  git clone https://github.com/TDAYENI/multinational-retail-data-centralisation771.git
  cd multinational-retail-data-centralisation771
  ```
- **Install the environment from the `environment.yml` file:**

  The `environment.yml` file is in the repository root (see Project Structure). Navigate to the directory containing it:

  ```bash
  cd /path/to/directory
  ```
- **Create the environment, specifying a name for it:**

  ```bash
  conda env create -f environment.yml -n new_env_name
  ```
- **Activate the new environment:**

  ```bash
  conda activate new_env_name
  ```
## Project Structure

```
multinational-retail-data-centralisation/
├── SQL/
│   ├── alter_statements.sql
│   ├── drop_statements.sql
│   ├── erd_diagram.png
│   ├── primary_and_foreign_key.sql
│   ├── t1_how_many_Stores_.sql
│   ├── t2_most_stores.sql
│   ├── t3_month_largest_sale.sql
│   ├── t4_on_vs_offline.sql
│   ├── t5_total_percent_per_store.sql
│   ├── t6_month_highest_cost.sql
│   ├── t7_staff_headcount.sql
│   ├── t8_ger_stores.sql
│   └── t9_speed_of_sales.sql
├── data_cleaning.py
├── data_extraction.py
├── database_utils.py
├── main.py
├── README.md
└── environment.yml
```
The project directory contains the following key files and directories:

- `main.py`: The main script that orchestrates the data extraction, cleaning, and uploading processes (see the sketch after this list).
- `data_extraction.py`: Contains methods for extracting data from various sources such as RDS databases, PDFs, APIs, and S3 buckets.
- `data_cleaning.py`: Contains methods for cleaning data, including replacing null values, converting date formats, and cleaning numeric data.
- `database_utils.py`: Contains methods for managing database connections and uploading data to databases.
- `SQL/`: Directory containing SQL scripts for database operations.
- `environment.yml`: Conda environment file listing the project's dependencies.
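To illustrate how these pieces fit together, here is a minimal sketch of the extract, clean, and upload flow. The class and method names (`DatabaseConnector`, `DataExtractor`, `DataCleaning`, and the methods called on them) are assumptions based on the file descriptions above, not necessarily the repository's actual API:

```python
# Minimal sketch of the ETL flow in main.py.
# All class and method names are assumptions based on the file
# descriptions above, not the repository's actual API.
from database_utils import DatabaseConnector
from data_extraction import DataExtractor
from data_cleaning import DataCleaning


def run_pipeline():
    connector = DatabaseConnector()   # manages database connections
    extractor = DataExtractor()       # pulls data from RDS, PDFs, APIs, S3
    cleaner = DataCleaning()          # fixes nulls, dates, numeric columns

    # Extract a table from the source database (hypothetical names).
    raw_users = extractor.read_rds_table(connector, "legacy_users")

    # Clean it: replace nulls, normalise date formats, and so on.
    clean_users = cleaner.clean_user_data(raw_users)

    # Upload the cleaned data to the target PostgreSQL database.
    connector.upload_to_db(clean_users, "dim_users")


if __name__ == "__main__":
    run_pipeline()
```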
## Configuration

### Database Credentials

Database credentials are stored in YAML files located in the `cred` directory. Ensure you have created these files with the correct database details.

Example `db_creds.yaml`:
```yaml
database:
  host: your_host
  port: your_port
  user: your_username
  password: your_password
  database_name: your_database_name
```
Example `pg_admin_creds.yaml`:
```yaml
pgadmin:
  host: your_host
  port: your_port
  user: your_username
  password: your_password
```
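As a quick illustration, credentials in this format can be read with PyYAML and turned into a SQLAlchemy connection. The key names follow the `db_creds.yaml` example above; `read_db_creds` is a hypothetical helper for this sketch, not necessarily the function defined in `database_utils.py`:

```python
# Hedged sketch: loading the YAML credentials shown above.
# `read_db_creds` is a hypothetical helper; the real project may
# implement this differently in database_utils.py.
import yaml
from sqlalchemy import create_engine


def read_db_creds(path: str) -> dict:
    """Parse a YAML credentials file into a dictionary."""
    with open(path) as f:
        return yaml.safe_load(f)


creds = read_db_creds("cred/db_creds.yaml")["database"]
engine = create_engine(
    f"postgresql://{creds['user']}:{creds['password']}"
    f"@{creds['host']}:{creds['port']}/{creds['database_name']}"
)
```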
## Usage

- **Set up the environment:** Ensure you have Conda installed, then create and activate the environment using the `environment.yml` file as shown in Installation.
- **Update database credentials:** Update the `cred/pg_admin_creds.yaml` path in the `main.py` script with your PostgreSQL database credentials, as well as any API keys and headers. An example can be found in Configuration.
- **Run the main script:** The `main.py` script creates and sets up the databases.

  ```bash
  python main.py
  ```
- **Database management and analysis:** Use your preferred SQL client to connect to your database, then execute the SQL scripts located in the `SQL/` directory to manage and analyse it.
Example SQL scripts:

- `alter_statements.sql`: SQL statements to alter existing database structures.
- `drop_statements.sql`: SQL statements to drop database objects.
- `primary_and_foreign_key.sql`: Defines primary and foreign key constraints.
- `t1_how_many_Stores_.sql` to `t9_speed_of_sales.sql`: Queries to gather insights from the database.
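If you prefer to run these scripts from Python rather than a SQL client, something like the following works. It reuses the YAML credential loading shown in Configuration; treat it as a sketch under those assumptions, not the project's prescribed workflow:

```python
# Hedged sketch: executing one of the SQL/ scripts from Python
# instead of a SQL client. Connection details reuse the YAML
# format shown in Configuration; adjust paths to your setup.
import yaml
from sqlalchemy import create_engine, text

with open("cred/db_creds.yaml") as f:
    creds = yaml.safe_load(f)["database"]

engine = create_engine(
    f"postgresql://{creds['user']}:{creds['password']}"
    f"@{creds['host']}:{creds['port']}/{creds['database_name']}"
)

# Read one of the analysis scripts listed above and execute it.
with open("SQL/t2_most_stores.sql") as f:
    query = text(f.read())

with engine.connect() as conn:
    for row in conn.execute(query):
        print(row)
```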