forked from teaxyz/chai
copy over existing data pipeline from chai
1 parent d186898 · commit be1c690
Showing 28 changed files with 1,702 additions and 2 deletions.

@@ -0,0 +1,8 @@
# directories
data/
db/
.venv/

# other files
.gitignore
docker-compose.yml

@@ -1,2 +1,192 @@
# chai-oss
tea's CHAI package ranker
# data pipeline

inspiration [here](https://github.com/vbelz/Data-pipeline-twitter)

this is an attempt at an open-source data pipeline, from which we can build our app that
ranks open-source projects. all pieces for this are managed by `docker-compose`. there
are 3 services (a quick way to verify them is sketched after the list):

1. db: postgres to store package-specific data
1. alembic: for running migrations
1. pipeline: which fetches and writes data
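
once the stack is up, a quick way to confirm all three services are wired in is to ask
compose for its service names (a sketch, assuming the compose file in the repo root):

```sh
# should print the three services described above: db, alembic, pipeline
docker compose config --services
```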

first run `mkdir -p data/{crates,pkgx,homebrew,npm,pypi,rubys}`, to set up the data
directory where the fetchers will store the data.

then, running `docker compose up` will set up the db and run the pipeline. a successful
run will look something like this:

```
db-1        | 2024-09-23 18:33:31.199 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
db-1        | 2024-09-23 18:33:31.199 UTC [1] LOG:  listening on IPv6 address "::", port 5432
db-1        | 2024-09-23 18:33:31.202 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db-1        | 2024-09-23 18:33:31.230 UTC [30] LOG:  database system was shut down at 2024-09-23 18:04:05 UTC
db-1        | 2024-09-23 18:33:31.242 UTC [1] LOG:  database system is ready to accept connections
alembic-1   | db:5432 - accepting connections
alembic-1   | INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
alembic-1   | INFO  [alembic.runtime.migration] Will assume transactional DDL.
alembic-1   | db currently at 0db06140525f (head)
alembic-1   | INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
alembic-1   | INFO  [alembic.runtime.migration] Will assume transactional DDL.
alembic-1   | migrations run
alembic-1 exited with code 0
alembic-1   | postgresql://postgres:s3cr3t@db:5432/chai
alembic-1   | s3cr3t
pipeline-1  | 0.01: [crates_orchestrator]: [DEBUG]: logging is working
pipeline-1  | 0.01: [main_pipeline]: [DEBUG]: logging is working
pipeline-1  | 0.01: [DB]: [DEBUG]: logging is working
pipeline-1  | 0.03: [DB]: [DEBUG]: created engine
pipeline-1  | 0.03: [DB]: [DEBUG]: created session
pipeline-1  | 0.03: [DB]: [DEBUG]: connected to postgresql://postgres:s3cr3t@db:5432/chai
pipeline-1  | 0.03: [crates_orchestrator]: fetching crates packages
pipeline-1  | 0.03: [crates_fetcher]: [DEBUG]: logging is working
pipeline-1  | 0.03: [crates_fetcher]: [DEBUG]: adding package manager crates
```

> [!TIP]
>
> to force a full rebuild, `docker-compose up --force-recreate --build`

## Hard Reset

if you ever need to do a hard reset, here are the steps (a one-shot equivalent is
sketched after the warning):

1. `rm -rf db/data`: removes all the data that was loaded into the db
1. `rm -rf .venv`: if you created a virtual environment for local dev, this removes it
1. `rm -rf data`: removes all the data the fetchers have downloaded
1. `docker system prune -a -f --volumes`: removes **everything** docker-related

> [!WARNING]
>
> step 4 deletes all your docker stuff...be careful
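
if you'd rather run it in one shot, the equivalent is roughly the following (same caveat
as the warning above):

```sh
# steps 1-3: wipe the postgres data, the local venv, and the fetched package data
rm -rf db/data .venv data
# step 4: nuke all docker containers, images, and volumes on this machine
docker system prune -a -f --volumes
```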

<!-- this is handled now that alembic/psycopg2 are in pkgx -->
<!--
## Alembic Alternatives
- sqlx command line tool to manage migrations, alongside models for sqlx in rust
- vapor's migrations are written in swift
-->

## FAQs / common issues

1. the database url is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and is used
   as `CHAI_DATABASE_URL` in the environment.
1. the command `./run_migrations.sh` is used to run migrations, and you might need to
   `chmod +x alembic/run_migrations.sh` so that it can be executed
1. the command `./run_pipeline.sh` is used to run the pipeline, and you might need to
   `chmod +x src/run_pipeline.sh` so that it can be executed
1. migrations sometimes don't apply before the service starts, so you might need to
   apply them manually (a host-side variant is sketched after this list):

   ```sh
   cd alembic
   alembic upgrade head
   ```
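
if you're applying migrations from your host machine rather than inside the container,
the alembic config picks up `CHAI_DATABASE_URL` from the environment, so you'll likely
need it set first. a sketch, assuming the port mapping from item 1:

```sh
export CHAI_DATABASE_URL="postgresql://postgres:s3cr3t@localhost:5435/chai"
cd alembic
alembic upgrade head
```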

## tasks

these are tasks that can be run using xcfile.dev. if you have pkgx, just run `dev` to
inject the tools into your environment. if you don't...go get it.
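
assuming you run these through `xc` (the task runner behind xcfile.dev), a typical
session might look like the following, where each task name is a heading in this
section:

```sh
xc setup            # create the data/ subdirectories the fetchers expect
xc chai-start       # bring up the db, run migrations, start the pipeline
xc db-list-packages # count the packages loaded so far
xc chai-stop        # tear the stack back down
```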

### reset

```sh
rm -rf db/data data .venv
```

### setup

```sh
mkdir -p data/{crates,pkgx,homebrew,npm,pypi,rubys}
```

### local-dev

```sh
uv venv
cd src
uv pip install -r requirements.txt
```

### chai-start

Requires: setup
Inputs: FORCE
Env: FORCE=not-force

```sh
if [ "$FORCE" = "force" ]; then
  docker-compose up --force-recreate --build -d
else
  docker-compose up -d
fi
export CHAI_DATABASE_URL="postgresql://postgres:s3cr3t@localhost:5435/chai"
```
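
after chai-start, a quick sanity check that postgres is actually reachable on the host
port (5435, per the FAQ above) could look like this, assuming `psql` is installed
locally:

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT 1;"
```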

### chai-stop

```sh
docker-compose down
```

### db-reset

Requires: chai-stop

```sh
rm -rf db/data
```

### db-logs

```sh
docker-compose logs db
```

### db-generate-migration

Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
```

### db-upgrade

Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic upgrade head
```

### db-downgrade

Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic downgrade -$STEP
```

### db

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
```

### db-list-packages

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
```

### db-list-history

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
```

@@ -0,0 +1,6 @@
# this .pkgx.yaml file is only for alembic

dependencies:
  postgresql.org: 16
  alembic.sqlalchemy.org: 1
  psycopg.org/psycopg2: 2

@@ -0,0 +1,8 @@
FROM pkgxdev/pkgx:latest
# WORKDIR /app

# # install alembic
# COPY .pkgx.yaml .
# RUN dev

RUN pkgx install alembic.sqlalchemy.org^1 psycopg.org/psycopg2^2 postgresql.org^16

@@ -0,0 +1,53 @@
[alembic]
script_location = .
file_template = %%(year)d%%(month).2d%%(day).2d_%%(hour).2d%%(minute).2d-%%(slug)s

prepend_sys_path = ..
version_path_separator = os # Use os.pathsep. Default configuration used for new projects.

# URL
sqlalchemy.url = ${env:CHAI_DATABASE_URL}


[post_write_hooks]
# lint with attempts to fix using "ruff" - use the exec runner, execute a binary
# TODO: this doesn't work rn
# hooks = ruff
# ruff.type = exec
# ruff.executable = %(here)s/.venv/bin/ruff
# ruff.options = --fix REVISION_SCRIPT_FILENAME

# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S

@@ -0,0 +1,70 @@
import os
from logging.config import fileConfig

from alembic import context
from sqlalchemy import engine_from_config, pool
from src.pipeline.models import Base

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# interpret the config file for Python logging.
if config.config_file_name is not None:
    fileConfig(config.config_file_name)

# metadata for all models
target_metadata = Base.metadata

# get database url
database_url = os.getenv("CHAI_DATABASE_URL")
if database_url:
    config.set_main_option("sqlalchemy.url", database_url)


def run_migrations_offline() -> None:
    """Run migrations in 'offline' mode.

    This configures the context with just a URL
    and not an Engine, though an Engine is acceptable
    here as well. By skipping the Engine creation
    we don't even need a DBAPI to be available.

    Calls to context.execute() here emit the given string to the
    script output.
    """
    url = config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online() -> None:
    """Run migrations in 'online' mode.

    In this scenario we need to create an Engine
    and associate a connection with the context.
    """
    connectable = engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )

    with connectable.connect() as connection:
        context.configure(connection=connection, target_metadata=target_metadata)

        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()

@@ -0,0 +1,12 @@
#!/bin/bash

# wait for db to be ready
until pg_isready -h db -p 5432 -U postgres; do
  echo "waiting for database..."
  sleep 2
done

# migrate
echo "db currently at $(pkgx +alembic +psycopg.org/psycopg2 alembic current)"
pkgx +alembic +psycopg.org/psycopg2 alembic upgrade head
echo "migrations run"

@@ -0,0 +1,26 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    ${downgrades if downgrades else "pass"}