no cache (teaxyz#19)

wisdomdyn · Oct 25, 2024 · f054025 · f054025
1 parent 6749d61
commit f054025

Showing 16 changed files with 597 additions and 211 deletions.
diff --git a/README.md b/README.md
@@ -12,30 +12,32 @@ Use [Docker](https://docker.com)
 2. Then, run `docker compose up` to launch.
 
 > [!NOTE]
+>
 > This will run CHAI for all package managers. As an example, crates by
 > itself will take over an hour and consume >5GB storage.
 >
-> To run only a specific backend, comment out the others in `docker-compose.yml`.
-
-<!-- we'd like to change the above to be more friendly to users trying to run a specific
-pipeline -->
-
-> [!NOTE]
 > Currently, we support only two package managers:
 >
 > - crates
 > - Homebrew
 >
-> We are planning on supporting `NPM`, `PyPI`, and `rubygems`
+> You can run a single package manager by running
+> `docker compose run --rm -e ... <package_manager>`
+>
+> We are planning on supporting `NPM`, `PyPI`, and `rubygems` next.
 
 ### Arguments
 
 Specify these eg. `docker compose -e FOO=bar up`:
 
-- `FREQUENCY`: how frequently **(in hours)** the pipeline will run
-  (defaults to `24`)
-- `FETCH`: whether the pipeline will fetch the data. Defaults to `true`
-- `DEBUG`: whether the pipeline will run in debug mode. Defaults to `true`
+- `FREQUENCY`: Sets how often (in hours) the pipeline should run.
+- `TEST`: Runs the loader in test mode when set to true, skipping certain data insertions.
+- `FETCH`: Determines whether to fetch new data from the source when set to true.
+- `NO_CACHE`: When set to true, deletes temporary files after processing.
+
+> [!NOTE]
+> The flag `NO_CACHE` does not mean that files will not get downloaded to your local
+> storage, just that we'll delete the files once we're done with them.
 
 These arguments are all configurable in the `docker-compose.yml` file.
 
@@ -66,6 +68,8 @@ Our goal is to build a data schema that looks like this:
 
 ![db/CHAI_ERD.png](db/CHAI_ERD.png)
 
+You can read more about specific data models in the db [README](db/README.md).
+
 Our specific application extracts the dependency graph to understand what the
 critical pieces of the open-source graph are. We also built a simple example that displays
 [sbom-metadata](examples/sbom-meta) for your repository.

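Note on `NO_CACHE`: the README text above says files are still downloaded locally and only deleted once processing is done. A minimal sketch of that behavior follows, assuming a temporary working directory and a stand-in processing step. The function name `process_with_optional_cleanup` and the file name `dump.txt` are hypothetical and are not taken from the CHAI codebase.

```python
import shutil
import tempfile
from pathlib import Path


def process_with_optional_cleanup(no_cache: bool) -> Path:
    """Stage a file locally, 'process' it, and clean up when no_cache is set."""
    # Files are always written to local storage first, regardless of no_cache.
    workdir = Path(tempfile.mkdtemp(prefix="chai-sketch-"))
    staged = workdir / "dump.txt"  # hypothetical file name
    staged.write_text("pretend this was downloaded")

    # Stand-in for the real transform/load step.
    _ = staged.read_text().upper()

    # Only after processing do we decide whether to keep the files.
    if no_cache:
        shutil.rmtree(workdir)
    return workdir
```

With `no_cache=True` the returned path no longer exists after the call. With `no_cache=False` the staged files stay on disk for a later run to reuse.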
diff --git a/alembic/README.md b/alembic/README.md
@@ -0,0 +1,56 @@
+# CHAI Data Migrations
+
+This directory contains the Alembic configuration and migration scripts for managing the
+database schema of the CHAI project. Alembic is used to handle database migrations,
+allowing for version control of our database schema.
+
+### About Alembic
+
+Alembic is a database migration tool for SQLAlchemy. It allows us to:
+
+- Track changes to our database schema over time
+- Apply and revert these changes in a controlled manner
+- Generate migration scripts automatically based on model changes
+
+> [!NOTE]
+> It's important to note that while `alembic` serves our current needs, it may not be
+> our long-term solution. As the CHAI project evolves, we might explore other database
+> migration tools or strategies that better fit our growing requirements. We're open to
+> reassessing our approach to schema management as needed.
+
+## Entrypoint
+
+The main entrypoint for running migrations is the
+[run migrations script](run_migrations.sh). This script orchestrates the initialization
+and migration process.
+
+## Steps
+
+1. [Initialize](init-script.sql)
+
+The initialization script creates the database `chai`, and loads it up with any
+extensions that we'd need, so we've got a clean slate for our db structures.
+
+2. [Load](load-values.sql)
+
+The load script pre-populates some of the tables with `enum`-like values, specifically
+for:
+
+- `url_types`: defines different types of URLs (e.g., source, homepage, documentation)
+- `depends_on_types`: defines different types of dependencies (e.g., runtime,
+  development)
+- `sources` and `package_managers`: defines different package managers (e.g., npm, pypi)
+
+3. Run Alembic Migrations
+
+After initialization and loading initial data, the script runs Alembic migrations to apply any pending database schema changes.
+
+## Contributing
+
+To contribute to the database schema:
+
+1. Make a change in the [models](../core/models/__init__.py) file
+2. Generate a new migration script: `alembic revision --autogenerate -m "Description"`
+3. Review the generated migration script in the [versions](versions/) directory. The
+   auto-generation is powerful but not perfect, so please review the script carefully.
+4. Test the migration by running `alembic upgrade head`.
diff --git a/core/README.md b/core/README.md
@@ -9,12 +9,18 @@ into the database.
 
 ### 1. [Config](config.py)
 
-The Config module provides configuration management for loaders. It includes:
+Config always runs first, and is the entrypoint for all loaders. It includes:
 
-- `PackageManager` enum for supported package managers
-- `Config` class for storing loader-specific configurations
-- Functions for initializing configurations and loading various types (URL types,
-  user types, package manager IDs, dependency types)
+- Execution flags:
+  - `FETCH` determines whether we request the data from source
+  - `TEST` enables a test mode, to test specific portions of the pipeline
+  - `NO_CACHE` determines whether we keep the intermediate pipeline files
+- Package Manager flags:
+  - `pm_id` is the package manager id, read from the db, that we run the pipeline for
+  - `source` is the data source for that package manager. `SOURCES` defines the map.
+
+The next 3 configuration classes retrieve the IDs for URL types (homepage, documentation,
+etc.), dependency types (build, runtime, etc.), and user types (crates user, github user).
 
 ### 2. [Database](db.py)
 
@@ -31,6 +37,7 @@ package manager sources. It supports:
 
 - Downloading tarball files
 - Extracting contents to a specified directory
+- Maintaining a "latest" symlink so we always know where to look
 
 ### 4. [Logger](logger.py)
 
@@ -72,12 +79,3 @@ To create a new loader for a package manager:
    Transformer, Scheduler) to fetch, transform, and load data.
 
 Example usage can be found in the [crates](../package_managers/crates) loader.
-
-## Contributing
-
-When adding new functionality or modifying existing core components, please ensure that
-changes are compatible with all existing loaders and follow the established patterns
-and conventions.
-
-For more detailed information on each component, refer to the individual files and their
-docstrings.
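Note on the "latest" symlink: the fetcher bullet added above says a `latest` link is maintained so loaders always know where to look. One way this could work is sketched below, assuming each run extracts into its own subdirectory of a common base directory on a POSIX filesystem. `update_latest_symlink` and the temporary link name are hypothetical and not taken from `core/fetcher.py`.

```python
import os
import tempfile
from pathlib import Path


def update_latest_symlink(base_dir: Path, run_dir: Path) -> Path:
    """Point base_dir/latest at run_dir, replacing any previous link."""
    latest = base_dir / "latest"
    tmp_link = base_dir / "latest.tmp"  # hypothetical scratch name

    # Clear out a leftover scratch link from an interrupted earlier run.
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()

    # Create the new link under a scratch name, then rename it over the old
    # one, so readers never see a moment where "latest" is missing.
    os.symlink(run_dir.name, tmp_link, target_is_directory=True)
    os.replace(tmp_link, latest)
    return latest
```

The link target is stored as a relative name, so the base directory can be moved or mounted elsewhere without breaking `latest`.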
diff --git a/core/config.py b/core/config.py
@@ -1,127 +1,126 @@
-from dataclasses import dataclass
-from os import getenv
+from enum import Enum
+
+from sqlalchemy import UUID
 
 from core.db import DB
 from core.logger import Logger
-from core.structs import (
-    DependencyTypes,
-    PackageManager,
-    PackageManagerIDs,
-    Sources,
-    URLTypes,
-    UserTypes,
-)
+from core.utils import env_vars
 
 logger = Logger("config")
 
-TEST = getenv("TEST", "false").lower() == "true"
-FETCH = getenv("FETCH", "true").lower() == "true"
 
+class PackageManager(Enum):
+    CRATES = "crates"
+    HOMEBREW = "homebrew"
 
-@dataclass
-class Config:
-    file_location: str
+
+TEST = env_vars("TEST", "false")
+FETCH = env_vars("FETCH", "true")
+NO_CACHE = env_vars("NO_CACHE", "true")
+SOURCES = {
+    PackageManager.CRATES: "https://static.crates.io/db-dump.tar.gz",
+    PackageManager.HOMEBREW: "https://github.com/Homebrew/homebrew-core/tree/master/Formula",  # noqa
+}
+
+# The three configuration values URLTypes, DependencyTypes, and UserTypes will query the
+# DB to get the respective IDs. If the values don't exist in the database, they will
+# raise an AttributeError (None has no attribute id) at the start
+
+
+class ExecConf:
     test: bool
     fetch: bool
-    package_manager_id: str
+    no_cache: bool
+
+    def __init__(self) -> None:
+        self.test = TEST
+        self.fetch = FETCH
+        self.no_cache = NO_CACHE
+
+    def __str__(self):
+        return f"ExecConf(test={self.test},fetch={self.fetch},no_cache={self.no_cache})"
+
+
+class PMConf:
+    pm_id: str
+    source: str
+
+    def __init__(self, pm: PackageManager, db: DB):
+        self.pm_id = db.select_package_manager_by_name(pm.value).id
+        self.source = SOURCES[pm]
+
+    def __str__(self):
+        return f"PMConf(pm_id={self.pm_id},source={self.source})"
+
+
+class URLTypes:
+    homepage: UUID
+    repository: UUID
+    documentation: UUID
+    source: UUID
+
+    def __init__(self, db: DB):
+        self.load_url_types(db)
+
+    def load_url_types(self, db: DB) -> None:
+        self.homepage = db.select_url_types_homepage().id
+        self.repository = db.select_url_types_repository().id
+        self.documentation = db.select_url_types_documentation().id
+        self.source = db.select_url_types_source().id
+
+    def __str__(self) -> str:
+        return f"URLs(homepage={self.homepage},repo={self.repository},docs={self.documentation},src={self.source})"  # noqa
+
+
+class UserTypes:
+    crates: UUID
+    github: UUID
+
+    def __init__(self, db: DB):
+        self.crates = db.select_source_by_name("crates").id
+        self.github = db.select_source_by_name("github").id
+
+    def __str__(self) -> str:
+        return f"UserTypes(crates={self.crates},github={self.github})"
+
+
+class DependencyTypes:
+    build: UUID
+    development: UUID
+    runtime: UUID
+    test: UUID
+    optional: UUID
+    recommended: UUID
+
+    def __init__(self, db: DB):
+        self.build = db.select_dependency_type_by_name("build").id
+        self.development = db.select_dependency_type_by_name("development").id
+        self.runtime = db.select_dependency_type_by_name("runtime").id
+        self.test = db.select_dependency_type_by_name("test").id
+        self.optional = db.select_dependency_type_by_name("optional").id
+        self.recommended = db.select_dependency_type_by_name("recommended").id
+
+    def __str__(self) -> str:
+        return f"DependencyTypes(build={self.build},development={self.development},runtime={self.runtime},test={self.test},optional={self.optional},recommended={self.recommended})"  # noqa
+
+
+class Config:
+    exec_config: ExecConf
+    pm_config: PMConf
     url_types: URLTypes
     user_types: UserTypes
     dependency_types: DependencyTypes
 
+    def __init__(self, pm: PackageManager, db: DB) -> None:
+        self.exec_config = ExecConf()
+        self.pm_config = PMConf(pm, db)
+        self.url_types = URLTypes(db)
+        self.user_types = UserTypes(db)
+        self.dependency_types = DependencyTypes(db)
+
     def __str__(self):
-        return f"Config(file_location={self.file_location}, test={self.test}, \
-            fetch={self.fetch}, package_manager_id={self.package_manager_id}, \
-            url_types={self.url_types}, user_types={self.user_types}, \
-            dependency_types={self.dependency_types})"
-
-
-def load_url_types(db: DB) -> URLTypes:
-    logger.debug("loading url types, and creating if not exists")
-    homepage_url = db.select_url_types_homepage(create=True)
-    repository_url = db.select_url_types_repository(create=True)
-    documentation_url = db.select_url_types_documentation(create=True)
-    source_url = db.select_url_types_source(create=True)
-    return URLTypes(
-        homepage=homepage_url.id,
-        repository=repository_url.id,
-        documentation=documentation_url.id,
-        source=source_url.id,
-    )
-
-
-def load_user_types(db: DB) -> UserTypes:
-    logger.debug("loading user types, and creating if not exists")
-    crates_source = db.select_source_by_name("crates", create=True)
-    github_source = db.select_source_by_name("github", create=True)
-    return UserTypes(
-        crates=crates_source.id,
-        github=github_source.id,
-    )
-
-
-def load_package_manager_ids(db: DB) -> PackageManagerIDs:
-    logger.debug("loading package manager ids, and creating if not exists")
-    crates_package_manager = db.select_package_manager_by_name("crates", create=True)
-    homebrew_package_manager = db.select_package_manager_by_name(
-        "homebrew", create=True
-    )
-    return {
-        PackageManager.CRATES: crates_package_manager.id,
-        PackageManager.HOMEBREW: homebrew_package_manager.id,
-    }
-
-
-def load_dependency_types(db: DB) -> DependencyTypes:
-    logger.debug("loading dependency types, and creating if not exists")
-    build_dep_type = db.select_dependency_type_by_name("build", create=True)
-    dev_dep_type = db.select_dependency_type_by_name("development", create=True)
-    runtime_dep_type = db.select_dependency_type_by_name("runtime", create=True)
-    test_dep_type = db.select_dependency_type_by_name("test", create=True)
-    optional_dep_type = db.select_dependency_type_by_name("optional", create=True)
-    recommended_dep_type = db.select_dependency_type_by_name("recommended", create=True)
-    return DependencyTypes(
-        build=build_dep_type.id,
-        development=dev_dep_type.id,
-        runtime=runtime_dep_type.id,
-        test=test_dep_type.id,
-        optional=optional_dep_type.id,
-        recommended=recommended_dep_type.id,
-    )
-
-
-def load_sources() -> Sources:
-    return {
-        PackageManager.CRATES: "https://static.crates.io/db-dump.tar.gz",
-        PackageManager.HOMEBREW: (
-            "https://github.com/Homebrew/homebrew-core/tree/master/Formula"
-        ),
-    }
-
-
-def initialize(package_manager: PackageManager, db: DB) -> Config:
-    url_types = load_url_types(db)
-    user_types = load_user_types(db)
-    package_manager_ids = load_package_manager_ids(db)
-    dependency_types = load_dependency_types(db)
-    sources = load_sources()
-
-    if package_manager == PackageManager.CRATES:
-        return Config(
-            file_location=sources[PackageManager.CRATES],
-            test=False,
-            fetch=True,
-            package_manager_id=package_manager_ids[PackageManager.CRATES],
-            url_types=url_types,
-            user_types=user_types,
-            dependency_types=dependency_types,
-        )
-    elif package_manager == PackageManager.HOMEBREW:
-        return Config(
-            file_location=sources[PackageManager.HOMEBREW],
-            test=False,
-            fetch=True,
-            package_manager_id=package_manager_ids[PackageManager.HOMEBREW],
-            url_types=url_types,
-            user_types=user_types,
-            dependency_types=dependency_types,
-        )
+        return f"Config(exec_config={self.exec_config}, pm_config={self.pm_config}, url_types={self.url_types}, user_types={self.user_types}, dependency_types={self.dependency_types})"  # noqa
+
+
+if __name__ == "__main__":
+    print(PackageManager.CRATES.value)
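Note on `env_vars`: the helper imported from `core.utils` is not part of this diff. Judging by the removed `getenv("TEST", "false").lower() == "true"` lines, it presumably turns an environment variable into a boolean. A minimal sketch under that assumption follows, using the hypothetical name `env_flag` to avoid implying this is the real implementation.

```python
import os


def env_flag(name: str, default: str) -> bool:
    """Read an environment variable and interpret it as a boolean flag.

    Only the string "true" (any casing, surrounding whitespace ignored)
    counts as True; every other value, including "1" or "yes", is False.
    """
    return os.getenv(name, default).strip().lower() == "true"
```

Under this reading, `NO_CACHE=True`, `NO_CACHE=true`, and an unset `NO_CACHE` with default `"true"` all enable cleanup, while `NO_CACHE=1` would not.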