no cache (teaxyz#19)
sanchitram1 authored Oct 25, 2024
1 parent 6749d61 commit f054025
Showing 16 changed files with 597 additions and 211 deletions.
26 changes: 15 additions & 11 deletions README.md
@@ -12,30 +12,32 @@ Use [Docker](https://docker.com)
2. Then, run `docker compose up` to launch.

> [!NOTE]
>
> This will run CHAI for all package managers. As an example, crates by
> itself will take over an hour and consume >5GB storage.
>
> To run only a specific backend, comment out the others in `docker-compose.yml`.
<!-- we'd like to change the above to be more friendly to users trying to run a specific
pipeline -->

> [!NOTE]
> Currently, we support only two package managers:
>
> - crates
> - Homebrew
>
> We are planning on supporting `NPM`, `PyPI`, and `rubygems`
> You can run a single package manager by running
> `docker compose run --rm -e ... <package_manager>`
>
> We are planning on supporting `NPM`, `PyPI`, and `rubygems` next.
### Arguments

Specify these, e.g. `docker compose -e FOO=bar up`:

- `FREQUENCY`: how frequently **(in hours)** the pipeline will run
(defaults to `24`)
- `FETCH`: whether the pipeline will fetch the data. Defaults to `true`
- `DEBUG`: whether the pipeline will run in debug mode. Defaults to `true`
- `FREQUENCY`: Sets how often (in hours) the pipeline should run.
- `TEST`: Runs the loader in test mode when set to true, skipping certain data insertions.
- `FETCH`: Determines whether to fetch new data from the source when set to true.
- `NO_CACHE`: When set to true, deletes temporary files after processing.

> [!NOTE]
> The flag `NO_CACHE` does not mean that files will not get downloaded to your local
> storage, just that we'll delete the files once we're done with them
These arguments are all configurable in the `docker-compose.yml` file.
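
As a rough illustration of how boolean flags such as `TEST`, `FETCH`, and `NO_CACHE` could be read from the environment, here is a minimal sketch. The helper name `env_flag` is hypothetical; the project's own `core.utils.env_vars` is not shown in this commit, so this only mirrors the `getenv(...).lower() == "true"` pattern visible in `core/config.py`.

```python
import os


def env_flag(name: str, default: str = "false") -> bool:
    """Return True only when the variable is set to 'true' (case-insensitive)."""
    # Hypothetical helper: assumes flags are passed as the strings "true"/"false".
    return os.getenv(name, default).strip().lower() == "true"


if __name__ == "__main__":
    # Defaults here follow the ones visible in core/config.py in this commit.
    print("TEST:", env_flag("TEST", "false"))
    print("FETCH:", env_flag("FETCH", "true"))
    print("NO_CACHE:", env_flag("NO_CACHE", "true"))
```

Any value other than `true` (in any letter case) is treated as false in this sketch, so `NO_CACHE=1` would not enable the flag.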

@@ -66,6 +68,8 @@ Our goal is to build a data schema that looks like this:

![db/CHAI_ERD.png](db/CHAI_ERD.png)

You can read more about specific data models in the db's [readme](db/README.md).

Our specific application extracts the dependency graph to understand which pieces of
the open-source graph are critical. We also built a simple example that displays
[sbom-metadata](examples/sbom-meta) for your repository.
56 changes: 56 additions & 0 deletions alembic/README.md
@@ -0,0 +1,56 @@
# CHAI Data Migrations

This directory contains the Alembic configuration and migration scripts for managing the
database schema of the CHAI project. Alembic is used to handle database migrations,
allowing for version control of our database schema.

### About Alembic

Alembic is a database migration tool for SQLAlchemy. It allows us to:

- Track changes to our database schema over time
- Apply and revert these changes in a controlled manner
- Generate migration scripts automatically based on model changes

> [!NOTE]
> It's important to note that while `alembic` serves our current needs, it may not be
> our long-term solution. As the CHAI project evolves, we might explore other database
> migration tools or strategies that better fit our growing requirements. We're open to
> reassessing our approach to schema management as needed.
## Entrypoint

The main entrypoint for running migrations is the
[run migrations script](run_migrations.sh). This script orchestrates the initialization
and migration process.

## Steps

1. [Initialize](init-script.sql)

The initialization script creates the database `chai`, and loads it up with any
extensions that we'd need, so we've got a clean slate for our db structures.

2. [Load](load-values.sql)

The load script pre-populates some of the tables with `enum`-like values, specifically
for:

- `url_types`: defines different types of URLs (e.g., source, homepage, documentation)
- `depends_on_types`: defines different types of dependencies (e.g., runtime,
development)
- `sources` and `package_managers`: defines different package managers (e.g., npm, pypi)

3. Run Alembic Migrations

After initialization and loading initial data, the script runs Alembic migrations to apply any pending database schema changes.
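
To make the load step above more concrete, here is a minimal sketch of seeding `enum`-like lookup tables idempotently. It is an illustration only: it uses the standard-library `sqlite3` module rather than the project's actual database, the `id`/`name` columns are assumed, and the real `load-values.sql` may look quite different.

```python
import sqlite3

# Table names and example values come from the list above; columns are assumed.
SEED = {
    "url_types": ["source", "homepage", "documentation"],
    "depends_on_types": ["runtime", "development"],
}


def seed_lookup_tables(conn: sqlite3.Connection) -> None:
    """Create each lookup table if needed and insert any missing values."""
    for table, names in SEED.items():
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)"
        )
        # INSERT OR IGNORE skips rows that already exist, so reruns are safe.
        conn.executemany(
            f"INSERT OR IGNORE INTO {table} (name) VALUES (?)",
            [(name,) for name in names],
        )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    seed_lookup_tables(connection)
    print(connection.execute("SELECT name FROM url_types").fetchall())
```

Because the inserts ignore existing rows, running the seeding twice leaves the tables unchanged, which is the property a load step like this usually wants.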

## Contributing

To contribute to the database schema:

1. Make a change in the [models](../core/models/__init__.py) file
2. Generate a new migration script: `alembic revision --autogenerate -m "Description"`
3. Review the generated migration script in the [versions](versions/) directory. The
auto-generation is powerful but not perfect, so please review the script carefully.
4. Test the migration by running `alembic upgrade head`.
26 changes: 12 additions & 14 deletions core/README.md
@@ -9,12 +9,18 @@ into the database.

### 1. [Config](config.py)

The Config module provides configuration management for loaders. It includes:
Config always runs first, and is the entrypoint for all loaders. It includes:

- `PackageManager` enum for supported package managers
- `Config` class for storing loader-specific configurations
- Functions for initializing configurations and loading various types (URL types,
user types, package manager IDs, dependency types)
- Execution flags:
- `FETCH` determines whether we request the data from source
- `TEST` enables a test mode, to test specific portions of the pipeline
- `NO_CACHE` to determine whether we save the intermediate pipeline files
- Package Manager flags
- `pm_id` is the ID, read from the db, of the package manager the pipeline runs for
- `source` is the data source for that package manager. `SOURCES` defines the map.

The next 3 configuration classes retrieve the IDs for url types (homepage, documentation,
etc.), dependency types (build, runtime, etc.) and user types (crates user, github user).
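
The sketch below illustrates the fail-fast behaviour these classes rely on: if a lookup value is missing from the database, accessing `.id` on `None` raises `AttributeError` as soon as the config is built. `StubDB` and `Row` are hypothetical stand-ins for `core.db.DB` and its result rows, and only two of the dependency types are shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Row:
    """Hypothetical result row; only the id attribute matters here."""

    id: str


class StubDB:
    """Hypothetical stand-in for core.db.DB; returns None for unknown names."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self._rows = rows

    def select_dependency_type_by_name(self, name: str) -> Optional[Row]:
        row_id = self._rows.get(name)
        return Row(row_id) if row_id is not None else None


class DependencyTypes:
    """Trimmed-down version of the config class, for illustration only."""

    def __init__(self, db: StubDB) -> None:
        # A missing row makes the lookup return None, so .id raises AttributeError.
        self.build = db.select_dependency_type_by_name("build").id
        self.runtime = db.select_dependency_type_by_name("runtime").id


if __name__ == "__main__":
    types = DependencyTypes(StubDB({"build": "b-1", "runtime": "r-1"}))
    print(types.build, types.runtime)
```

Failing at construction time means a loader never starts with a partially populated configuration; the trade-off is that the error message does not say which value was missing.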

### 2. [Database](db.py)

@@ -31,6 +37,7 @@ package manager sources. It supports:

- Downloading tarball files
- Extracting contents to a specified directory
- Maintaining a "latest" symlink so we always know where to look
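
One way a "latest" symlink could be maintained is sketched below. This is not the fetcher's actual code, which this diff does not show; the function name `update_latest_symlink` and the dated directory names are assumptions, and the atomic replace relies on POSIX rename semantics.

```python
import os
from pathlib import Path


def update_latest_symlink(base_dir: Path, target_name: str) -> Path:
    """Point base_dir/latest at base_dir/target_name, replacing any old link."""
    link = base_dir / "latest"
    tmp = base_dir / "latest.tmp"
    # Clear a leftover temporary link from an interrupted earlier run.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    # Create the new link under a temporary name, using a relative target.
    os.symlink(target_name, tmp)
    # Renaming over the old link swaps it in a single step on POSIX systems.
    os.replace(tmp, link)
    return link


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as workdir:
        base = Path(workdir)
        (base / "2024-10-25").mkdir()
        print(os.readlink(update_latest_symlink(base, "2024-10-25")))
```

Creating the link under a temporary name and renaming it avoids a window in which `latest` does not exist, which matters if another process reads the directory while a fetch is finishing.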

### 4. [Logger](logger.py)

@@ -72,12 +79,3 @@ To create a new loader for a package manager:
Transformer, Scheduler) to fetch, transform, and load data.

Example usage can be found in the [crates](../package_managers/crates) loader.

## Contributing

When adding new functionality or modifying existing core components, please ensure that
changes are compatible with all existing loaders and follow the established patterns
and conventions.

For more detailed information on each component, refer to the individual files and their
docstrings.
223 changes: 111 additions & 112 deletions core/config.py
@@ -1,127 +1,126 @@
from dataclasses import dataclass
from os import getenv
from enum import Enum

from sqlalchemy import UUID

from core.db import DB
from core.logger import Logger
from core.structs import (
DependencyTypes,
PackageManager,
PackageManagerIDs,
Sources,
URLTypes,
UserTypes,
)
from core.utils import env_vars

logger = Logger("config")

TEST = getenv("TEST", "false").lower() == "true"
FETCH = getenv("FETCH", "true").lower() == "true"

class PackageManager(Enum):
CRATES = "crates"
HOMEBREW = "homebrew"

@dataclass
class Config:
file_location: str

TEST = env_vars("TEST", "false")
FETCH = env_vars("FETCH", "true")
NO_CACHE = env_vars("NO_CACHE", "true")
SOURCES = {
PackageManager.CRATES: "https://static.crates.io/db-dump.tar.gz",
PackageManager.HOMEBREW: "https://github.com/Homebrew/homebrew-core/tree/master/Formula", # noqa
}

# The three configuration values URLTypes, DependencyTypes, and UserTypes will query the
# DB to get the respective IDs. If the values don't exist in the database, they will
# raise an AttributeError (None has no attribute id) at the start


class ExecConf:
test: bool
fetch: bool
package_manager_id: str
no_cache: bool

def __init__(self) -> None:
self.test = TEST
self.fetch = FETCH
self.no_cache = NO_CACHE

def __str__(self):
return f"ExecConf(test={self.test},fetch={self.fetch},no_cache={self.no_cache})"


class PMConf:
pm_id: str
source: str

def __init__(self, pm: PackageManager, db: DB):
self.pm_id = db.select_package_manager_by_name(pm.value).id
self.source = SOURCES[pm]

def __str__(self):
return f"PMConf(pm_id={self.pm_id},source={self.source})"


class URLTypes:
homepage: UUID
repository: UUID
documentation: UUID
source: UUID

def __init__(self, db: DB):
self.load_url_types(db)

def load_url_types(self, db: DB) -> None:
self.homepage = db.select_url_types_homepage().id
self.repository = db.select_url_types_repository().id
self.documentation = db.select_url_types_documentation().id
self.source = db.select_url_types_source().id

def __str__(self) -> str:
return f"URLs(homepage={self.homepage},repo={self.repository},docs={self.documentation},src={self.source})" # noqa


class UserTypes:
crates: UUID
github: UUID

def __init__(self, db: DB):
self.crates = db.select_source_by_name("crates").id
self.github = db.select_source_by_name("github").id

def __str__(self) -> str:
return f"UserTypes(crates={self.crates},github={self.github})"


class DependencyTypes:
build: UUID
development: UUID
runtime: UUID
test: UUID
optional: UUID
recommended: UUID

def __init__(self, db: DB):
self.build = db.select_dependency_type_by_name("build").id
self.development = db.select_dependency_type_by_name("development").id
self.runtime = db.select_dependency_type_by_name("runtime").id
self.test = db.select_dependency_type_by_name("test").id
self.optional = db.select_dependency_type_by_name("optional").id
self.recommended = db.select_dependency_type_by_name("recommended").id

def __str__(self) -> str:
return f"DependencyTypes(build={self.build},development={self.development},runtime={self.runtime},test={self.test},optional={self.optional},recommended={self.recommended})" # noqa


class Config:
exec_config: ExecConf
pm_config: PMConf
url_types: URLTypes
user_types: UserTypes
dependency_types: DependencyTypes

def __init__(self, pm: PackageManager, db: DB) -> None:
self.exec_config = ExecConf()
self.pm_config = PMConf(pm, db)
self.url_types = URLTypes(db)
self.user_types = UserTypes(db)
self.dependency_types = DependencyTypes(db)

def __str__(self):
return f"Config(file_location={self.file_location}, test={self.test}, \
fetch={self.fetch}, package_manager_id={self.package_manager_id}, \
url_types={self.url_types}, user_types={self.user_types}, \
dependency_types={self.dependency_types})"


def load_url_types(db: DB) -> URLTypes:
logger.debug("loading url types, and creating if not exists")
homepage_url = db.select_url_types_homepage(create=True)
repository_url = db.select_url_types_repository(create=True)
documentation_url = db.select_url_types_documentation(create=True)
source_url = db.select_url_types_source(create=True)
return URLTypes(
homepage=homepage_url.id,
repository=repository_url.id,
documentation=documentation_url.id,
source=source_url.id,
)


def load_user_types(db: DB) -> UserTypes:
logger.debug("loading user types, and creating if not exists")
crates_source = db.select_source_by_name("crates", create=True)
github_source = db.select_source_by_name("github", create=True)
return UserTypes(
crates=crates_source.id,
github=github_source.id,
)


def load_package_manager_ids(db: DB) -> PackageManagerIDs:
logger.debug("loading package manager ids, and creating if not exists")
crates_package_manager = db.select_package_manager_by_name("crates", create=True)
homebrew_package_manager = db.select_package_manager_by_name(
"homebrew", create=True
)
return {
PackageManager.CRATES: crates_package_manager.id,
PackageManager.HOMEBREW: homebrew_package_manager.id,
}


def load_dependency_types(db: DB) -> DependencyTypes:
logger.debug("loading dependency types, and creating if not exists")
build_dep_type = db.select_dependency_type_by_name("build", create=True)
dev_dep_type = db.select_dependency_type_by_name("development", create=True)
runtime_dep_type = db.select_dependency_type_by_name("runtime", create=True)
test_dep_type = db.select_dependency_type_by_name("test", create=True)
optional_dep_type = db.select_dependency_type_by_name("optional", create=True)
recommended_dep_type = db.select_dependency_type_by_name("recommended", create=True)
return DependencyTypes(
build=build_dep_type.id,
development=dev_dep_type.id,
runtime=runtime_dep_type.id,
test=test_dep_type.id,
optional=optional_dep_type.id,
recommended=recommended_dep_type.id,
)


def load_sources() -> Sources:
return {
PackageManager.CRATES: "https://static.crates.io/db-dump.tar.gz",
PackageManager.HOMEBREW: (
"https://github.com/Homebrew/homebrew-core/tree/master/Formula"
),
}


def initialize(package_manager: PackageManager, db: DB) -> Config:
url_types = load_url_types(db)
user_types = load_user_types(db)
package_manager_ids = load_package_manager_ids(db)
dependency_types = load_dependency_types(db)
sources = load_sources()

if package_manager == PackageManager.CRATES:
return Config(
file_location=sources[PackageManager.CRATES],
test=False,
fetch=True,
package_manager_id=package_manager_ids[PackageManager.CRATES],
url_types=url_types,
user_types=user_types,
dependency_types=dependency_types,
)
elif package_manager == PackageManager.HOMEBREW:
return Config(
file_location=sources[PackageManager.HOMEBREW],
test=False,
fetch=True,
package_manager_id=package_manager_ids[PackageManager.HOMEBREW],
url_types=url_types,
user_types=user_types,
dependency_types=dependency_types,
)
return f"Config(exec_config={self.exec_config}, pm_config={self.pm_config}, url_types={self.url_types}, user_types={self.user_types}, dependency_types={self.dependency_types})" # noqa


if __name__ == "__main__":
print(PackageManager.CRATES.value)