Commit

Added github workflows
adityajaroli committed Jan 16, 2024
1 parent eb649cf commit 8eaf630
Showing 5 changed files with 99 additions and 19 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/pre-commit.yml
@@ -0,0 +1,26 @@
name: pre-commit

on:
pull_request:
branches: ['*']
push:
branches: [main]

jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v3
with:
python-version: '3.11'

- name: "Install project dependencies"
run: |
python -m pip install --upgrade pip
pip install setuptools_scm wheel
pip install -r requirements.txt
pip install -r test-requirements.txt
- uses: pre-commit/[email protected]
31 changes: 31 additions & 0 deletions .github/workflows/publish_package.yml
@@ -0,0 +1,31 @@
name: Upload Release to PyPI

on:
release:
types: [published]

jobs:
publish:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: '3.11'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install build wheel twine
python -m pip install --upgrade setuptools build wheel twine
- name: Build package
run: |
python -m build
twine check dist/*
- name: Publish package
uses: pypa/gh-action-pypi-publish@release/v1
with:
user: __token__
password: ${{ secrets.PG_BULK_LOADER_PYPI }}
verify_metadata: false
verbose: true
23 changes: 23 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,23 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.2.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
- id: check-added-large-files

- repo: https://github.com/PyCQA/autoflake
rev: v2.2.1
hooks:
- id: autoflake
args: [--remove-all-unused-imports, --in-place]

- repo: local
hooks:
- id: code-coverage-checker
name: pytest-coverage-checker
entry: pytest --cov=src.pg_bulk_loader --cov-fail-under=95
language: system
types: [python]
pass_filenames: false
35 changes: 17 additions & 18 deletions README.md
@@ -3,7 +3,7 @@
<h2>Overview</h2>

**pg-bulk-loader** is a utility package designed to facilitate faster bulk insertion of a DataFrame into a PostgreSQL database.
Currently, it supports loading from a pandas DataFrame only.

<h2>Purpose</h2>

@@ -17,20 +17,20 @@ This utility leverages the power of PostgreSQL in combination with Python to eff

<h2>Package Efficiency</h2>

**Machine:**
- Resource config - 5 core, 8GB
- Azure hosted PostgreSQL Server
- Azure hosted Python service (Jupyter notebook)

**Table info:**
- 12 columns (3 texts, 2 date, 7 double)
- Primary key: 3 columns (2 text and 1 date)
- Indexes: 2 b-tree. (1 on single column and another on three columns)

**Runtime:**
- Data Size: 20M
- without PK and Indexes: ~55s
- with PK and indexes: ~150s (~85s to insert data with PK enabled and ~65 seconds to create indexes)
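
In other words, roughly 20,000,000 rows / ~55 s ≈ 360k rows per second without PK and indexes, and 20,000,000 rows / ~150 s ≈ 130k rows per second with them.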

**Running with 1M records (without PK and indexes) using different approaches:**

@@ -58,7 +58,7 @@ The utility provides the following useful functions and classes:

**Note:** Provide input either as a DataFrame or as a DataFrame generator (see the sketch below).
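
A minimal sketch of the two accepted input forms, assuming the data comes from a CSV file; the file name and chunk size are illustrative (the chunked read mirrors the generator example further below):

```python
import pandas as pd

# Option 1: a single DataFrame loaded fully into memory
input_data_df = pd.read_csv("data.csv")  # "data.csv" is a placeholder path

# Option 2: a DataFrame generator that yields chunks, useful for very large datasets
input_data_df_generator = pd.read_csv("data.csv", chunksize=250000)
```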

<h3>batch_insert_to_postgres_with_multi_process() function</h3>

- `pg_conn_details`: Instance of the PgConnectionDetail class containing PostgreSQL server connection details.
- `table_name`: Name of the table for bulk insertion.
Expand Down Expand Up @@ -101,8 +101,8 @@ from pg_bulk_loader import PgConnectionDetail, batch_insert_to_postgres
async def run():
# Read data. Suppose the DataFrame below has 20M records
input_data_df = pd.DataFrame()

# Create Postgres Connection Details object. This will help in creating and managing the database connections
pg_conn_details = PgConnectionDetail(
user="<postgres username>",
password="<postgres password>",
@@ -111,7 +111,7 @@ async def run():
port="<port>",
schema="<schema name where the table exists>"
)

# Data will be inserted and committed in batches of 250,000
await batch_insert_to_postgres(
pg_conn_details=pg_conn_details,
@@ -140,8 +140,8 @@ from pg_bulk_loader import PgConnectionDetail, batch_insert_to_postgres
async def run():
# Read data in chunks. Suppose the full dataset below has 20M records
input_data_df_generator = pd.read_csv("file.csv", chunksize=1000000)

# Create Postgres Connection Details object. This will help in creating and managing the database connections
pg_conn_details = PgConnectionDetail(
user="<postgres username>",
password="<postgres password>",
@@ -150,7 +150,7 @@ async def run():
port="<port>",
schema="<schema name where the table exists>"
)

# Data will be inserted and committed in batches of 250,000
await batch_insert_to_postgres(
pg_conn_details=pg_conn_details,
@@ -181,7 +181,7 @@ from pg_bulk_loader import PgConnectionDetail, batch_insert_to_postgres_with_multi_process


async def run():
# Create Postgres Connection Details object. This will help in creating and managing the database connections
pg_conn_details = PgConnectionDetail(
user="<postgres username>",
password="<postgres password>",
@@ -190,9 +190,9 @@ async def run():
port="<port>",
schema="<schema name where the table exists>"
)

df_generator = pd.read_csv("20M-file.csv", chunksize=1000000)

# Data will be inserted and committed in batches of 250,000
await batch_insert_to_postgres_with_multi_process(
pg_conn_details=pg_conn_details,
@@ -214,6 +214,5 @@ if __name__ == '__main__':
<h2> Development: </h2>

- Run this command to install the required development dependencies: `pip install -r dev-requirements.txt`
- Run `pre-commit install` so that it creates a `git commit` hook and runs basic sanity checks before you make any commit.
- Run one of the following commands to run the unit tests: `pytest` or `coverage run --source=src.pg_bulk_loader --module pytest --verbose && coverage report --show-missing`


3 changes: 2 additions & 1 deletion dev-requirements.txt
@@ -1,5 +1,6 @@
-r requirements.txt
-r test-requirements.txt

pre-commit
build
twine
