Merge pull request #5 from qalita-io/dev
Dev
armandleopold authored Jan 28, 2024
2 parents c498d21 + b385c43 commit bbe103a
Showing 71 changed files with 4,875 additions and 256 deletions.
92 changes: 92 additions & 0 deletions accuracy_pack/LICENSE
@@ -0,0 +1,92 @@
QALITA SOFTWARE LICENSE AGREEMENT

THIS IS AN AGREEMENT BETWEEN YOU ("LICENSEE") AND QALITA SAS, A CORPORATION INCORPORATED UNDER THE LAWS OF FRANCE, COMPANY ID: 951 829 803 ("QALITA"). BY INSTALLING, COPYING, OR OTHERWISE USING THE QALITA SOFTWARE ("SOFTWARE"), YOU AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE AGREEMENT.

Copyright (c) - 2023-2024 - QALITA SAS - All Rights Reserved

1. GRANT OF LICENSE. Subject to the terms and conditions of this License Agreement, QALITA grants to Licensee a non-exclusive, non-transferable license to use the Software solely for Licensee's internal business purposes.

2. RESTRICTIONS. Licensee may not rent, lease, distribute, sublicense, transfer, or sell the Software, or any portion thereof. Licensee may not modify, translate, reverse engineer, decompile, disassemble, or create derivative works based on the Software, except to the extent that enforcement of the foregoing restriction is prohibited by applicable law.

3. COPYRIGHT AND OWNERSHIP. The Software is owned by QALITA and is protected by French copyright laws and international treaty provisions. QALITA retains all rights not expressly granted to Licensee in this License Agreement.

4. SOFTWARE DEPENDENCIES. The Software may include or depend on other software components which are licensed under terms and conditions different from this License Agreement. The licenses for these dependencies are included below, or in the documentation or files accompanying the Software.

5. NO WARRANTIES. The Software is provided "AS IS" and QALITA makes no warranty as to its use or performance. QALITA AND ITS SUPPLIERS DO NOT AND CANNOT WARRANT THE PERFORMANCE OR RESULTS YOU MAY OBTAIN BY USING THE SOFTWARE.

6. LIMITATION OF LIABILITY. In no event will QALITA or its suppliers be liable for any loss, damages or costs, whether direct, indirect, incidental, special, consequential, or punitive, arising out of Licensee's use of, or inability to use, the Software, even if QALITA has been advised of the possibility of such damages.

7. TERMINATION. QALITA may terminate this License Agreement if Licensee fails to comply with the terms and conditions of this License Agreement. In such event, Licensee must destroy all copies of the Software.

8. GENERAL. This License Agreement is governed by the laws of France. If any provision of this License Agreement is held to be void, invalid, unenforceable or illegal, the other provisions shall continue in full force and effect.

BY INSTALLING, COPYING, OR OTHERWISE USING THE SOFTWARE, LICENSEE AGREES TO BE BOUND BY THE TERMS OF THIS LICENSE AGREEMENT.

**QALITA SAS**

**IMPORTANT:**

BEFORE USING THIS SOFTWARE, CAREFULLY READ THIS LICENSE AGREEMENT. BY USING THE SOFTWARE, YOU ARE AGREEING TO BE BOUND BY THE TERMS OF THIS LICENSE AGREEMENT. IF YOU DO NOT AGREE TO THE TERMS OF THIS LICENSE AGREEMENT, DO NOT USE THE SOFTWARE.

This is a legal agreement and should be treated as such. If you have any questions regarding this agreement, please contact QALITA at **[email protected].**


____

Dependency : [SQLAlchemy License](https://github.com/sqlalchemy/sqlalchemy/blob/main/LICENSE)

Copyright 2005-2024 SQLAlchemy authors and contributors <see AUTHORS file>.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

____

Dependency : [Pandas License](https://github.com/pandas-dev/pandas/blob/main/LICENSE)

BSD 3-Clause License

Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
All rights reserved.

Copyright (c) 2011-2024, Open source contributors.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
36 changes: 36 additions & 0 deletions accuracy_pack/README.md
@@ -0,0 +1,36 @@
# Accuracy

## Overview
This pack assesses the precision of float columns within a dataset, providing a granular view of data quality. The script computes the maximum number of decimal places for each float column and generates a normalized score representing the precision level of the data. The results are saved in `metrics.json`, with each float column's precision score detailed individually.

## Features
- **Precision Calculation**: Computes the maximum number of decimal places found in each float column.
- **Score Normalization**: For each column, computes the proportion of values that share the most common number of decimal places (a 0-1 score), then averages these proportions into a dataset-level score.
- **Metrics Generation**: Outputs a `metrics.json` file containing the precision metrics for each float column, enhancing the interpretability of data quality.
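
The per-column calculation can be sketched as follows. This is a minimal illustration that uses only the standard library rather than pandas, so `count_decimals` and `precision_summary` are hypothetical helper names, not functions shipped with the pack. It assumes decimals are counted from Python's default string form of a float, which means `2.0` counts as one decimal place and values printed in scientific notation (such as `1e-05`) count as zero.

```python
from collections import Counter
import math


def count_decimals(value: float) -> int:
    """Count the digits after the decimal point in str(value)."""
    text = str(value)
    return len(text.split(".")[1]) if "." in text else 0


def precision_summary(values):
    """Summarise decimal usage for one column, ignoring None and NaN."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    if not present:
        return {"max_decimals": 0, "most_common_decimals": 0, "proportion_score": 0.0}

    counts = [count_decimals(v) for v in present]
    tally = Counter(counts)
    top = max(tally.values())
    # On ties, keep the smallest decimal count
    most_common = min(k for k, n in tally.items() if n == top)

    return {
        "max_decimals": max(counts),
        "most_common_decimals": most_common,
        "proportion_score": round(top / len(counts), 2),
    }


print(precision_summary([1.5, 2.25, 3.75, 4.125, None]))
# {'max_decimals': 3, 'most_common_decimals': 2, 'proportion_score': 0.5}
```

Here two of the four non-null values use two decimal places, so the proportion score is 0.5.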

## Setup
Before running the script, ensure that the following files are properly configured:
- `source_conf.json`: Configuration file for the source data.
- `pack_conf.json`: Configuration file for the pack.
- Data file: The data to be analyzed, loaded using `opener.py`.
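
As a rough guide, the two configuration files for a file-based source might look like the output of the sketch below. Only the key names (`name`, `type`, `config.path`, `job.source.skiprows`) are taken from what `main.py` and `opener.py` read; the helper `build_configs`, the path, and the dataset name are hypothetical, and the real files may carry additional keys.

```python
import json


def build_configs(path, name, skiprows=None):
    """Build minimal source and pack configurations for a file source."""
    source_conf = {
        "name": name,              # used as the dataset-level scope value
        "type": "file",            # opener.py accepts "file" or "database"
        "config": {"path": path},  # a CSV/XLSX file, or a directory containing them
    }
    # 'skiprows' is optional; leave the pack configuration empty when unused
    pack_conf = {}
    if skiprows is not None:
        pack_conf = {"job": {"source": {"skiprows": skiprows}}}
    return source_conf, pack_conf


source_conf, pack_conf = build_configs("data/sales.csv", "sales", skiprows=1)
print(json.dumps(source_conf, indent=4))
print(json.dumps(pack_conf, indent=4))
```

Writing each dictionary with `json.dump` to `source_conf.json` and `pack_conf.json` in the pack directory would produce files in the shape the script expects.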

## Usage
To use this pack, follow these steps:
1. Ensure all prerequisite files (`source_conf.json`, `pack_conf.json`, and the data file) are in place.
2. Run the script with the appropriate Python interpreter.
3. Review the generated `metrics.json` for precision metrics of the dataset.
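
For step 3, one way to review the results is to group the entries of `metrics.json` by column. The sketch below assumes the entry shape written by `main.py` (`key`, `value` stored as a string, and a `scope` with `perimeter` and `value`); the helper names and the sample data are hypothetical, and the 0.9 threshold mirrors the one `main.py` uses for recommendations.

```python
from collections import defaultdict


def group_by_column(metrics):
    """Return {column: {metric_key: value}} for column-level entries."""
    grouped = defaultdict(dict)
    for item in metrics:
        scope = item["scope"]
        if scope["perimeter"] == "column":
            grouped[scope["value"]][item["key"]] = item["value"]
    return dict(grouped)


def inconsistent_columns(grouped, threshold=0.9):
    """List columns whose proportion_score falls below the threshold."""
    return [
        column
        for column, values in grouped.items()
        if float(values.get("proportion_score", 1)) < threshold
    ]


# Hypothetical sample in the shape main.py writes (values are strings)
sample = [
    {"key": "decimal_precision", "value": "3",
     "scope": {"perimeter": "column", "value": "price"}},
    {"key": "proportion_score", "value": "0.5",
     "scope": {"perimeter": "column", "value": "price"}},
    {"key": "score", "value": "0.5",
     "scope": {"perimeter": "dataset", "value": "sales"}},
]

grouped = group_by_column(sample)
print(grouped)
print(inconsistent_columns(grouped))
```

To apply this to a real run, load the list with `json.load` from `metrics.json` and pass it to `group_by_column` in place of `sample`.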

## Output
- `metrics.json`: Contains precision scores for each float column in the dataset. The structure of the output is as follows:
```json
[
    {
        "key": "decimal_precision",
        "value": "<precision_score>",
        "scope": {
            "perimeter": "column",
            "value": "<column_name>"
        }
    },
    ...
]
```
Binary file added accuracy_pack/icon.png
123 changes: 123 additions & 0 deletions accuracy_pack/main.py
@@ -0,0 +1,123 @@
import json
import utils

########################### Loading Data

# Load the configuration file
print("Load source_conf.json")
with open("source_conf.json", "r", encoding="utf-8") as file:
    source_config = json.load(file)

# Load the pack configuration file
print("Load pack_conf.json")
with open("pack_conf.json", "r", encoding="utf-8") as file:
    pack_config = json.load(file)

# Load data using the opener.py logic
from opener import load_data

df = load_data(source_config, pack_config)

############################ Compute Precision Score for Each Float Column


def compute_metrics(df):
    float_columns = df.select_dtypes(include=["float", "float64"]).columns

    # If there are no float columns, return an empty list
    if float_columns.empty:
        print("No float columns found. metrics.json will not be created.")
        return []

    # Compute precision score for each float column
    precision_data = []
    total_proportion_score = 0  # Initialize total proportion score

    for column in float_columns:
        # Count the digits after the decimal point in each non-null value
        decimals_count = (
            df[column]
            .dropna()
            .apply(lambda x: len(str(x).split(".")[1]) if "." in str(x) else 0)
        )
        max_decimals = decimals_count.max()
        most_common_decimals_series = decimals_count.mode()

        # Handle the scenario when the mode() returns an empty series
        if most_common_decimals_series.empty:
            print(f"No common decimal count found for column {column}.")
            most_common_decimals = 0
            proportion_score = 0
        else:
            most_common_decimals = most_common_decimals_series[
                0
            ]  # Get the most common decimals count
            proportion_score = decimals_count[
                decimals_count == most_common_decimals
            ].count() / len(decimals_count)

        total_proportion_score += proportion_score  # Add proportion score to the total

        precision_data.append(
            {
                "key": "decimal_precision",
                "value": str(max_decimals),  # Maximum number of decimals
                "scope": {"perimeter": "column", "value": column},
            }
        )

        precision_data.append(
            {
                "key": "proportion_score",
                "value": str(
                    round(proportion_score, 2)
                ),  # Proportion of values with the most common decimals count
                "scope": {"perimeter": "column", "value": column},
            }
        )

    # Calculate the mean of proportion scores
    # (float_columns is non-empty here because of the early return above)
    mean_proportion_score = total_proportion_score / len(float_columns)

    # Add the mean proportion score to the precision data
    precision_data.append(
        {
            "key": "score",
            "value": str(round(mean_proportion_score, 2)),  # Mean proportion score
            "scope": {"perimeter": "dataset", "value": source_config["name"]},
        }
    )

    return precision_data


# Compute metrics
precision_metrics = compute_metrics(df)

################### Recommendations
recommendations = []
for column in df.columns:
    for item in precision_metrics:
        if item["scope"]["value"] == column and item["key"] == "proportion_score":
            proportion_score = float(item["value"])
            if proportion_score < 0.9:
                recommendation = {
                    "content": f"Column '{column}' has {(1-proportion_score)*100:.2f}% of data that are not rounded to the same number of decimals.",
                    "type": "Duplicates",
                    "scope": {"perimeter": "column", "value": column},
                    "level": utils.determine_recommendation_level(1 - proportion_score),
                }
                recommendations.append(recommendation)

############################ Writing Metrics and Recommendations to Files

# compute_metrics() returns an empty list when there is nothing to report,
# so test for emptiness rather than for None
if precision_metrics:
    with open("metrics.json", "w", encoding="utf-8") as file:
        json.dump(precision_metrics, file, indent=4)
    print("metrics.json file created successfully.")

if recommendations:
    with open("recommendations.json", "w", encoding="utf-8") as f:
        json.dump(recommendations, f, indent=4)
    print("recommendations.json file created successfully.")
128 changes: 128 additions & 0 deletions accuracy_pack/opener.py
@@ -0,0 +1,128 @@
"""
The opener module contains functions to load data from files and databases.
"""

import os
import glob
import pandas as pd
from sqlalchemy import create_engine, inspect, text

# Mapping of default ports to database types
DEFAULT_PORTS = {
    "5432": "postgresql",
    "3306": "mysql",
    "1433": "mssql+pymssql",
}


def load_data_file(file_path, pack_config):
    # 'skiprows' is optional: default to None when the outer keys are missing
    skiprows = None
    if "job" in pack_config and "source" in pack_config["job"]:
        skiprows = pack_config["job"]["source"].get("skiprows")

    if skiprows is not None:  # Checking if 'skiprows' exists and is not None
        if file_path.endswith(".csv"):
            return pd.read_csv(
                file_path,
                low_memory=False,
                memory_map=True,
                skiprows=int(skiprows),
                on_bad_lines="warn",
            )
        elif file_path.endswith(".xlsx"):
            return pd.read_excel(
                file_path,
                engine="openpyxl",
                skiprows=int(skiprows),
            )
    else:
        # Logic when 'skiprows' is not specified
        if file_path.endswith(".csv"):
            return pd.read_csv(
                file_path, low_memory=False, memory_map=True, on_bad_lines="warn"
            )
        elif file_path.endswith(".xlsx"):
            return pd.read_excel(file_path, engine="openpyxl")


# Function to create database connection
def create_db_connection(config):
    user = config["username"]
    password = config["password"]
    host = config["host"]
    port = config["port"]
    db_type = config.get("type")  # optional; avoids shadowing the built-in `type`
    db = config["database"]

    if not db_type:
        # Deduce the database type from the port (DEFAULT_PORTS keys are strings)
        db_type = DEFAULT_PORTS.get(str(port), "unknown")
        if db_type == "unknown":
            raise ValueError(f"Unsupported or unknown database port: {port}")

    engine = create_engine(f"{db_type}://{user}:{password}@{host}:{port}/{db}")
    return engine


# Function to load data from database
def load_data_from_db(engine):
    with engine.connect() as connection:
        # Check liveness; SQLAlchemy 2.x requires raw SQL to be wrapped in text()
        try:
            connection.execute(text("SELECT 1"))
        except Exception as e:
            raise ConnectionError(f"Database connection failed: {e}") from e

    # Scan tables; Engine.table_names() was removed in SQLAlchemy 2.0
    tables = inspect(engine).get_table_names()
    if not tables:
        raise ValueError("No tables found in the database.")

    # Load each table into a DataFrame and return them
    dataframes = {}
    for table in tables:
        dataframes[table] = pd.read_sql_table(table, engine)

    return dataframes


# Function to load data based on the configuration
def load_data(source_config, pack_config):
    source_type = source_config["type"]

    if source_type == "file":
        path = source_config["config"]["path"]

        if os.path.isfile(path):
            if path.endswith(".csv") or path.endswith(".xlsx"):
                return load_data_file(path, pack_config)
            else:
                raise ValueError(
                    "Unsupported file type. Only CSV and XLSX are supported."
                )
        elif os.path.isdir(path):
            data_files = glob.glob(os.path.join(path, "*.csv")) + glob.glob(
                os.path.join(path, "*.xlsx")
            )
            if not data_files:
                raise FileNotFoundError(
                    "No CSV or XLSX files found in the provided path."
                )
            first_data_file = data_files[0]
            return load_data_file(first_data_file, pack_config)
        else:
            raise FileNotFoundError(
                f"The path {path} is neither a file nor a directory, or it cannot be reached."
            )

    elif source_type == "database":
        db_config = source_config["config"]
        engine = create_db_connection(db_config)
        return load_data_from_db(engine)

    else:
        raise ValueError(
            "Unsupported source type. Only 'file' and 'database' are supported."
        )