Commit
Open Sourcing Packs
armandleopold committed Jan 28, 2024
1 parent 109f81d commit b385c43
Showing 65 changed files with 4,443 additions and 0 deletions.
92 changes: 92 additions & 0 deletions accuracy_pack/LICENSE
QALITA SOFTWARE LICENSE AGREEMENT

THIS IS AN AGREEMENT BETWEEN YOU ("LICENSEE") AND QALITA SAS, A CORPORATION INCORPORATED UNDER THE LAWS OF FRANCE, COMPANY ID: 951 829 803 ("QALITA"). BY INSTALLING, COPYING, OR OTHERWISE USING THE QALITA SOFTWARE ("SOFTWARE"), YOU AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE AGREEMENT.

Copyright (c) - 2023-2024 - QALITA SAS - All Rights Reserved

1. GRANT OF LICENSE. Subject to the terms and conditions of this License Agreement, QALITA grants to Licensee a non-exclusive, non-transferable license to use the Software solely for Licensee's internal business purposes.

2. RESTRICTIONS. Licensee may not rent, lease, distribute, sublicense, transfer, or sell the Software, or any portion thereof. Licensee may not modify, translate, reverse engineer, decompile, disassemble, or create derivative works based on the Software, except to the extent that enforcement of the foregoing restriction is prohibited by applicable law.

3. COPYRIGHT AND OWNERSHIP. The Software is owned by QALITA and is protected by French copyright laws and international treaty provisions. QALITA retains all rights not expressly granted to Licensee in this License Agreement.

4. SOFTWARE DEPENDENCIES. The Software may include or depend on other software components which are licensed under terms and conditions different from this License Agreement. The licenses for these dependencies are included below, or in the documentation or files accompanying the Software.

5. NO WARRANTIES. The Software is provided "AS IS" and QALITA makes no warranty as to its use or performance. QALITA AND ITS SUPPLIERS DO NOT AND CANNOT WARRANT THE PERFORMANCE OR RESULTS YOU MAY OBTAIN BY USING THE SOFTWARE.

6. LIMITATION OF LIABILITY. In no event will QALITA or its suppliers be liable for any loss, damages or costs, whether direct, indirect, incidental, special, consequential, or punitive, arising out of Licensee's use of, or inability to use, the Software, even if QALITA has been advised of the possibility of such damages.

7. TERMINATION. QALITA may terminate this License Agreement if Licensee fails to comply with the terms and conditions of this License Agreement. In such event, Licensee must destroy all copies of the Software.

8. GENERAL. This License Agreement is governed by the laws of France. If any provision of this License Agreement is held to be void, invalid, unenforceable or illegal, the other provisions shall continue in full force and effect.

BY INSTALLING, COPYING, OR OTHERWISE USING THE SOFTWARE, LICENSEE AGREES TO BE BOUND BY THE TERMS OF THIS LICENSE AGREEMENT.

**QALITA SAS**

**IMPORTANT:**

BEFORE USING THIS SOFTWARE, CAREFULLY READ THIS LICENSE AGREEMENT. BY USING THE SOFTWARE, YOU ARE AGREEING TO BE BOUND BY THE TERMS OF THIS LICENSE AGREEMENT. IF YOU DO NOT AGREE TO THE TERMS OF THIS LICENSE AGREEMENT, DO NOT USE THE SOFTWARE.

This is a legal agreement and should be treated as such. If you have any questions regarding this agreement, please contact QALITA at **[email protected].**


____

Dependency : [SQLAlchemy License](https://github.com/sqlalchemy/sqlalchemy/blob/main/LICENSE)

Copyright 2005-2024 SQLAlchemy authors and contributors <see AUTHORS file>.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

____

Dependency : [Pandas License](https://github.com/pandas-dev/pandas/blob/main/LICENSE)

BSD 3-Clause License

Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
All rights reserved.

Copyright (c) 2011-2024, Open source contributors.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
36 changes: 36 additions & 0 deletions accuracy_pack/README.md
# Accuracy

## Overview
This pack assesses the precision of float columns within a dataset, providing a granular view of data quality. The script computes the maximum number of decimal places for each float column and generates a normalized score representing the precision level of the data. The results are saved in `metrics.json`, with each float column's precision score detailed individually.
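The core of the precision check can be sketched as follows (a minimal illustration of the counting logic, not the pack's exact code; the column name is hypothetical):

```python
import pandas as pd

def decimal_places(value: float) -> int:
    """Count decimal places in the string form of a float."""
    s = str(value)
    return len(s.split(".")[1]) if "." in s else 0

# Hypothetical sample column
df = pd.DataFrame({"price": [1.25, 3.5, 2.125]})
max_decimals = df["price"].dropna().apply(decimal_places).max()
print(max_decimals)  # 3
```

The maximum is reported per column; the proportion of values sharing the most common decimal count is what feeds the normalized score.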

## Features
- **Precision Calculation**: Computes the maximum number of decimal places for each float value in float columns.
- **Score Normalization**: Normalizes the precision values to a 0-1 scale, providing a standardized precision score for each column.
- **Metrics Generation**: Outputs a `metrics.json` file containing precision scores for each float column, enhancing the interpretability of data quality.

## Setup
Before running the script, ensure that the following files are properly configured:
- `source_conf.json`: Configuration file for the source data.
- `pack_conf.json`: Configuration file for the pack.
- Data file: The data to be analyzed, loaded using `opener.py`.

## Usage
To use this pack, follow these steps:
1. Ensure all prerequisite files (`source_conf.json`, `pack_conf.json`, and the data file) are in place.
2. Run the script with the appropriate Python interpreter.
3. Review the generated `metrics.json` for precision metrics of the dataset.

## Output
- `metrics.json`: Contains precision scores for each float column in the dataset. The structure of the output is as follows:
```json
[
    {
        "key": "decimal_precision",
        "value": "<precision_score>",
        "scope": {
            "perimeter": "column",
            "value": "<column_name>"
        }
    },
    ...
]
```
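A consumer of `metrics.json` can extract the dataset-level score like this (an illustrative sketch; the sample entries below are hypothetical, with key names following the structure above):

```python
import json

# Hypothetical metrics, as the pack would write them to metrics.json
metrics = [
    {"key": "decimal_precision", "value": "3",
     "scope": {"perimeter": "column", "value": "price"}},
    {"key": "score", "value": "0.92",
     "scope": {"perimeter": "dataset", "value": "my_source"}},
]

# Pick out the single dataset-scoped score entry
dataset_score = float(next(
    m["value"] for m in metrics
    if m["key"] == "score" and m["scope"]["perimeter"] == "dataset"
))
print(dataset_score)  # 0.92
```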
Binary file added accuracy_pack/icon.png
123 changes: 123 additions & 0 deletions accuracy_pack/main.py
import json
import utils

########################### Loading Data

# Load the configuration file
print("Load source_conf.json")
with open("source_conf.json", "r", encoding="utf-8") as file:
source_config = json.load(file)

# Load the pack configuration file
print("Load pack_conf.json")
with open("pack_conf.json", "r", encoding="utf-8") as file:
pack_config = json.load(file)

# Load data using the opener.py logic
from opener import load_data

df = load_data(source_config, pack_config)

############################ Compute Precision Score for Each Float Column


def compute_metrics(df):
float_columns = df.select_dtypes(include=["float", "float64"]).columns

# If there are no float columns, return None
if not float_columns.any():
print("No float columns found. metrics.json will not be created.")
return []

# Compute precision score for each float column
precision_data = []
total_proportion_score = 0 # Initialize total proportion score

for column in float_columns:
decimals_count = (
df[column]
.dropna()
.apply(lambda x: len(str(x).split(".")[1]) if "." in str(x) else 0)
)
max_decimals = decimals_count.max()
most_common_decimals_series = decimals_count.mode()

# Handle the scenario when the mode() returns an empty series
if most_common_decimals_series.empty:
print(f"No common decimal count found for column {column}.")
most_common_decimals = 0
proportion_score = 0
else:
most_common_decimals = most_common_decimals_series[
0
] # Get the most common decimals count
proportion_score = decimals_count[
decimals_count == most_common_decimals
].count() / len(decimals_count)

total_proportion_score += proportion_score # Add proportion score to the total

precision_data.append(
{
"key": "decimal_precision",
"value": str(max_decimals), # Maximum number of decimals
"scope": {"perimeter": "column", "value": column},
}
)

precision_data.append(
{
"key": "proportion_score",
"value": str(
round(proportion_score, 2)
), # Proportion of values with the most common decimals count
"scope": {"perimeter": "column", "value": column},
}
)

# Calculate the mean of proportion scores
mean_proportion_score = (
total_proportion_score / len(float_columns) if float_columns.any() else 0
)

# Add the mean proportion score to the precision data
precision_data.append(
{
"key": "score",
"value": str(round(mean_proportion_score, 2)), # Mean proportion score
"scope": {"perimeter": "dataset", "value": source_config["name"]},
}
)

return precision_data


# Compute metrics
precision_metrics = compute_metrics(df)

################### Recommendations
recommendations = []
for column in df.columns:
for item in precision_metrics:
if item["scope"]["value"] == column and item["key"] == "proportion_score":
proportion_score = float(item["value"])
if proportion_score < 0.9:
recommendation = {
"content": f"Column '{column}' has {(1-proportion_score)*100:.2f}% of data that are not rounded to the same number of decimals.",
"type": "Duplicates",
"scope": {"perimeter": "column", "value": column},
"level": utils.determine_recommendation_level(1 - proportion_score),
}
recommendations.append(recommendation)

############################ Writing Metrics and Recommendations to Files

# compute_metrics returns an empty list (never None) when there are no
# float columns, so check truthiness to avoid writing an empty metrics.json
if precision_metrics:
    with open("metrics.json", "w", encoding="utf-8") as file:
        json.dump(precision_metrics, file, indent=4)
    print("metrics.json file created successfully.")

if recommendations:
with open("recommendations.json", "w", encoding="utf-8") as f:
json.dump(recommendations, f, indent=4)
print("recommendations.json file created successfully.")
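`main.py` imports `utils.determine_recommendation_level`, which is not part of this diff. A plausible sketch of what it does (hypothetical — the actual thresholds and level names live in the pack's `utils.py`) might look like:

```python
def determine_recommendation_level(deviation):
    """Hypothetical mapping from a deviation ratio (0-1) to a severity level."""
    if deviation > 0.5:
        return "high"
    elif deviation > 0.2:
        return "warning"
    return "info"
```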
128 changes: 128 additions & 0 deletions accuracy_pack/opener.py
"""
The opener module contains functions to load data from files and databases.
"""

import os
import glob
import pandas as pd
from sqlalchemy import create_engine, inspect, text

# Mapping of default ports to database types
DEFAULT_PORTS = {
"5432": "postgresql",
"3306": "mysql",
"1433": "mssql+pymssql",
}

def load_data_file(file_path, pack_config):
    # 'skiprows' is optional; read it safely from the nested pack config
    skiprows = 0
    if "job" in pack_config and "source" in pack_config["job"]:
        skiprows = int(pack_config["job"]["source"].get("skiprows") or 0)

    if file_path.endswith(".csv"):
        return pd.read_csv(
            file_path,
            low_memory=False,
            memory_map=True,
            skiprows=skiprows,
            on_bad_lines="warn",
        )
    elif file_path.endswith(".xlsx"):
        return pd.read_excel(file_path, engine="openpyxl", skiprows=skiprows)

# Function to create database connection
def create_db_connection(config):
    user = config["username"]
    password = config["password"]
    host = config["host"]
    port = config["port"]
    db = config["database"]
    db_type = config.get("type")  # avoid shadowing the builtin 'type'

    if not db_type:
        # Deduce the database type from the port
        db_type = DEFAULT_PORTS.get(port, "unknown")
        if db_type == "unknown":
            raise ValueError(f"Unsupported or unknown database port: {port}")

    engine = create_engine(f"{db_type}://{user}:{password}@{host}:{port}/{db}")
    return engine


# Function to load data from database
def load_data_from_db(engine):
    with engine.connect() as connection:
        # Check liveness
        try:
            connection.execute(text("SELECT 1"))
        except Exception as e:
            raise ConnectionError(f"Database connection failed: {e}")

    # Scan tables (inspect() replaces the deprecated engine.table_names())
    tables = inspect(engine).get_table_names()
    if not tables:
        raise ValueError("No tables found in the database.")

    # Load each table into a DataFrame and return them
    dataframes = {}
    for table in tables:
        dataframes[table] = pd.read_sql_table(table, engine)

    return dataframes


# Function to load data based on the configuration
def load_data(source_config, pack_config):
source_type = source_config["type"]

if source_type == "file":
path = source_config["config"]["path"]

if os.path.isfile(path):
if path.endswith(".csv") or path.endswith(".xlsx"):
return load_data_file(path, pack_config)
else:
raise ValueError(
"Unsupported file type. Only CSV and XLSX are supported."
)
elif os.path.isdir(path):
data_files = glob.glob(os.path.join(path, "*.csv")) + glob.glob(
os.path.join(path, "*.xlsx")
)
if not data_files:
raise FileNotFoundError(
"No CSV or XLSX files found in the provided path."
)
first_data_file = data_files[0]
return load_data_file(first_data_file, pack_config)
        else:
            raise FileNotFoundError(
                f"The path {path} is neither a file nor a directory, or cannot be reached."
            )

elif source_type == "database":
db_config = source_config["config"]
engine = create_db_connection(db_config)
return load_data_from_db(engine)

else:
raise ValueError(
"Unsupported source type. Only 'file' and 'database' are supported."
)
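The port-to-dialect fallback in `create_db_connection` can be isolated as a small helper (an illustrative sketch; the function name `deduce_db_type` is hypothetical, but the mapping mirrors `DEFAULT_PORTS` above):

```python
DEFAULT_PORTS = {
    "5432": "postgresql",
    "3306": "mysql",
    "1433": "mssql+pymssql",
}

def deduce_db_type(port, explicit_type=None):
    """Prefer an explicit type from the config; otherwise fall back to the port."""
    if explicit_type:
        return explicit_type
    db_type = DEFAULT_PORTS.get(port, "unknown")
    if db_type == "unknown":
        raise ValueError(f"Unsupported or unknown database port: {port}")
    return db_type

print(deduce_db_type("5432"))  # postgresql
```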