Used by the gen-excel
Makefile target. No tool.poetry.scripts
alias in the pyproject.toml
file.
This Python script is designed to organize Excel files based on the categorization derived from a MIxS schema. It sorts files into directories corresponding to different schema classifications such as Checklists and Extensions. Here's a breakdown of the script's operations and structure:
-
Imports and Setup:
- Imports necessary modules including
os
for directory manipulations,shutil
for file operations,defaultdict
fromcollections
for data aggregation,SchemaView
fromlinkml_runtime
for schema operations, andlogging
andargparse
for logging and command-line interface management.
- Imports necessary modules including
-
Class Definition: MIxSExcelFileOrganizer:
- A class is defined to encapsulate all functionality related to file organization based on a MIxS schema.
- Constructor (
__init__
):- Takes paths to the MIxS schema file, the source directory containing Excel files, and the base destination folder.
- Initializes a logger for the class.
- Method: setup_logger:
- Configures and returns a logger object for logging information and warnings.
- Method: organize_files:
- Loads the schema using
SchemaView
and identifies classes categorized as "Checklist" and "Extension". - Organizes classes by their type and associations into a results dictionary.
- For each extension class, attempts to copy its corresponding Excel file from the source directory to a new ' extensions_only' directory.
- For each checklist class, copies its associated Excel files and those of any related classes into categorized directories.
- Logs warnings for any Excel files that are expected but not found in the source directory.
- Loads the schema using
-
Command-Line Interface Setup:
- Uses
argparse
to create a CLI that requires users to specify paths for the schema file, source directory, and destination folder. - Provides a default path for the schema file and makes the source and destination folder paths required arguments, ensuring that the user provides these essential inputs.
- Uses
-
Script Execution:
- The script is designed to be run from the command line. It parses arguments provided by the user and creates an
instance of
MIxSExcelFileOrganizer
with these inputs. - Calls the
organize_files
method to perform the file organization based on the schema-defined classifications.
- The script is designed to be run from the command line. It parses arguments provided by the user and creates an
instance of
- Effective Use of Libraries: The script effectively utilizes Python's built-in libraries and external modules to handle file and directory operations, logging, and command-line interactions.
- Structured and Modular Code: By encapsulating functionality within a class and using methods for specific tasks, the script maintains a clean and modular code structure that enhances readability and maintainability.
- Advanced File Management: Demonstrates complex file management tasks including conditional file copying based on schema-derived logic, and dynamic directory creation based on schema categories.
- The script is well-organized and serves a specific functional purpose of categorizing and organizing Excel files based on a schema classification. It provides robust error handling and user interaction through command-line arguments, making it a practical tool for managing dataset files according to a predefined schema structure.