Commit

copy over existing data pipeline from chai
sanchitram1 committed Sep 27, 2024
1 parent d186898 commit be1c690
Showing 28 changed files with 1,702 additions and 2 deletions.
8 changes: 8 additions & 0 deletions .dockerignore
@@ -0,0 +1,8 @@
# directories
data/
db/
.venv/

# other files
.gitignore
docker-compose.yml
4 changes: 4 additions & 0 deletions .gitignore
@@ -160,3 +160,7 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# data files
data
db/data
194 changes: 192 additions & 2 deletions README.md
@@ -1,2 +1,192 @@
# chai-oss
tea's CHAI package ranker
# data pipeline

inspiration [here](https://github.com/vbelz/Data-pipeline-twitter)

this is an attempt at an open-source data pipeline, on top of which we can build our app
that ranks open-source projects. all the pieces are managed by `docker-compose`, and
there are 3 services:

1. db: postgres, to store package-specific data
1. alembic: to run the schema migrations
1. pipeline: which fetches and writes the data

first, run `mkdir -p data/{crates,pkgx,homebrew,npm,pypi,rubys}` to set up the data
directories where the fetchers will store what they download.
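if you'd rather do this from python (say, in a setup script), the same directories can
be created with `pathlib`; the helper below is ours, not something the repo ships:

```python
from pathlib import Path

# package managers the fetchers currently cover (from the mkdir command above)
MANAGERS = ["crates", "pkgx", "homebrew", "npm", "pypi", "rubys"]

def setup_data_dirs(root: str = "data") -> list[Path]:
    """Create data/<manager> for each fetcher; equivalent to mkdir -p."""
    dirs = [Path(root) / m for m in MANAGERS]
    for d in dirs:
        d.mkdir(parents=True, exist_ok=True)
    return dirs
```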

then, running `docker compose up` will set up the db and run the pipeline. a successful
run will look something like this:

```
db-1 | 2024-09-23 18:33:31.199 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
db-1 | 2024-09-23 18:33:31.199 UTC [1] LOG: listening on IPv6 address "::", port 5432
db-1 | 2024-09-23 18:33:31.202 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db-1 | 2024-09-23 18:33:31.230 UTC [30] LOG: database system was shut down at 2024-09-23 18:04:05 UTC
db-1 | 2024-09-23 18:33:31.242 UTC [1] LOG: database system is ready to accept connections
alembic-1 | db:5432 - accepting connections
alembic-1 | INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
alembic-1 | INFO [alembic.runtime.migration] Will assume transactional DDL.
alembic-1 | db currently at 0db06140525f (head)
alembic-1 | INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
alembic-1 | INFO [alembic.runtime.migration] Will assume transactional DDL.
alembic-1 | migrations run
alembic-1 exited with code 0
alembic-1 | postgresql://postgres:s3cr3t@db:5432/chai
alembic-1 | s3cr3t
pipeline-1 | 0.01: [crates_orchestrator]: [DEBUG]: logging is working
pipeline-1 | 0.01: [main_pipeline]: [DEBUG]: logging is working
pipeline-1 | 0.01: [DB]: [DEBUG]: logging is working
pipeline-1 | 0.03: [DB]: [DEBUG]: created engine
pipeline-1 | 0.03: [DB]: [DEBUG]: created session
pipeline-1 | 0.03: [DB]: [DEBUG]: connected to postgresql://postgres:s3cr3t@db:5432/chai
pipeline-1 | 0.03: [crates_orchestrator]: fetching crates packages
pipeline-1 | 0.03: [crates_fetcher]: [DEBUG]: logging is working
pipeline-1 | 0.03: [crates_fetcher]: [DEBUG]: adding package manager crates
```

> [!TIP]
>
> to force a full rebuild, `docker-compose up --force-recreate --build`

## Hard Reset

if you ever need to do a hard reset, here are the steps:

1. `rm -rf db/data`: removes all the data that was loaded into the db
1. `rm -rf .venv`: if you created a virtual environment for local dev, this removes it
1. `rm -rf data`: removes all the data the fetchers have downloaded
1. `docker system prune -a -f --volumes`: removes **everything** docker-related

> [!WARNING]
>
> step 4 deletes all your docker stuff...be careful
<!-- this is handled now that alembic/psycopg2 are in pkgx -->
<!--
## Alembic Alternatives
- sqlx command line tool to manage migrations, alongside models for sqlx in rust
- vapor's migrations are written in swift
-->

## FAQs / common issues

1. the database url is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and is used
as `CHAI_DATABASE_URL` in the environment.
1. the command `./run_migrations.sh` is used to run migrations, and you might need to
`chmod +x alembic/run_migrations.sh` so that it can be executed
1. the command `./run_pipeline.sh` is used to run the pipeline, and you might need to
`chmod +x src/run_pipeline.sh` so that it can be executed
1. migrations sometimes don't apply before the service starts, so you might need to
manually apply them:

```sh
cd alembic
alembic upgrade head
```
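for local scripts that need pieces of that connection string, the standard library can
take it apart; this is a sketch using the URL from FAQ item 1, not code that ships with
the pipeline:

```python
import os
from urllib.parse import urlparse

# fall back to the URL from the FAQ when CHAI_DATABASE_URL is unset
url = os.getenv("CHAI_DATABASE_URL", "postgresql://postgres:s3cr3t@localhost:5435/chai")
parts = urlparse(url)

host, port = parts.hostname, parts.port  # localhost, 5435
dbname = parts.path.lstrip("/")          # chai
user, password = parts.username, parts.password
```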

## tasks

these are tasks that can be run using xcfile.dev. if you have pkgx, just run `dev` to
inject the needed tools into your environment. if you don't...go get it.

### reset

```sh
rm -rf db/data data .venv
```

### setup

```sh
mkdir -p data/{crates,pkgx,homebrew,npm,pypi,rubys}
```

### local-dev

```sh
uv venv
cd src
uv pip install -r requirements.txt
```

### chai-start

Requires: setup
Inputs: FORCE
Env: FORCE=not-force

```sh
if [ "$FORCE" = "force" ]; then
docker-compose up --force-recreate --build -d
else
docker-compose up -d
fi
export CHAI_DATABASE_URL="postgresql://postgres:s3cr3t@localhost:5435/chai"
```

### chai-stop

```sh
docker-compose down
```

### db-reset

Requires: chai-stop

```sh
rm -rf db/data
```

### db-logs

```sh
docker-compose logs db
```

### db-generate-migration

Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
```

### db-upgrade

Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic upgrade head
```

### db-downgrade

Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic downgrade -$STEP
```

### db

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
```

### db-list-packages

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
```

### db-list-history

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
```
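the psql one-liners above can also be driven from python via subprocess; a small sketch
(the helper is hypothetical, and assumes `psql` is on your PATH):

```python
import subprocess

# the database url from the FAQ section
DB_URL = "postgresql://postgres:s3cr3t@localhost:5435/chai"

def psql_command(sql: str) -> list[str]:
    """Build the argv for a one-off psql query, as in the tasks above."""
    return ["psql", DB_URL, "-c", sql]

# e.g. subprocess.run(psql_command("SELECT count(id) FROM packages;"), check=True)
```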
6 changes: 6 additions & 0 deletions alembic/.pkgx.yaml
@@ -0,0 +1,6 @@
# this .pkgx.yaml file is only for alembic

dependencies:
postgresql.org: 16
alembic.sqlalchemy.org: 1
psycopg.org/psycopg2: 2
8 changes: 8 additions & 0 deletions alembic/Dockerfile
@@ -0,0 +1,8 @@
FROM pkgxdev/pkgx:latest
# WORKDIR /app

# # install alembic
# COPY .pkgx.yaml .
# RUN dev

RUN pkgx install alembic.sqlalchemy.org^1 psycopg.org/psycopg2^2 postgresql.org^16
53 changes: 53 additions & 0 deletions alembic/alembic.ini
@@ -0,0 +1,53 @@
[alembic]
script_location = .
file_template = %%(year)d%%(month).2d%%(day).2d_%%(hour).2d%%(minute).2d-%%(slug)s

prepend_sys_path = ..
version_path_separator = os # Use os.pathsep. Default configuration used for new projects.

# URL
sqlalchemy.url = ${env:CHAI_DATABASE_URL}


[post_write_hooks]
# lint with attempts to fix using "ruff" - use the exec runner, execute a binary
# TODO: this doesn't work rn
# hooks = ruff
# ruff.type = exec
# ruff.executable = %(here)s/.venv/bin/ruff
# ruff.options = --fix REVISION_SCRIPT_FILENAME

# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
70 changes: 70 additions & 0 deletions alembic/env.py
@@ -0,0 +1,70 @@
import os
from logging.config import fileConfig

from alembic import context
from sqlalchemy import engine_from_config, pool
from src.pipeline.models import Base

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# interpret the config file for Python logging.
if config.config_file_name is not None:
fileConfig(config.config_file_name)

# metadata for all models
target_metadata = Base.metadata

# get database url
database_url = os.getenv("CHAI_DATABASE_URL")
if database_url:
config.set_main_option("sqlalchemy.url", database_url)


def run_migrations_offline() -> None:
"""Run migrations in 'offline' mode.
This configures the context with just a URL
and not an Engine, though an Engine is acceptable
here as well. By skipping the Engine creation
we don't even need a DBAPI to be available.
Calls to context.execute() here emit the given string to the
script output.
"""
url = config.get_main_option("sqlalchemy.url")
context.configure(
url=url,
target_metadata=target_metadata,
literal_binds=True,
dialect_opts={"paramstyle": "named"},
)

with context.begin_transaction():
context.run_migrations()


def run_migrations_online() -> None:
"""Run migrations in 'online' mode.
In this scenario we need to create an Engine
and associate a connection with the context.
"""
connectable = engine_from_config(
config.get_section(config.config_ini_section, {}),
prefix="sqlalchemy.",
poolclass=pool.NullPool,
)

with connectable.connect() as connection:
context.configure(connection=connection, target_metadata=target_metadata)

with context.begin_transaction():
context.run_migrations()


if context.is_offline_mode():
run_migrations_offline()
else:
run_migrations_online()
12 changes: 12 additions & 0 deletions alembic/run_migrations.sh
@@ -0,0 +1,12 @@
#!/bin/bash

# wait for db to be ready
until pg_isready -h db -p 5432 -U postgres; do
echo "waiting for database..."
sleep 2
done

# migrate
echo "db currently at $(pkgx +alembic +psycopg.org/psycopg2 alembic current)"
pkgx +alembic +psycopg.org/psycopg2 alembic upgrade head
echo "migrations run"
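the `pg_isready` loop in `run_migrations.sh` can be approximated in python with a plain
TCP check; a sketch (the function name and defaults are ours), useful if a service can't
rely on the postgres client tools being installed:

```python
import socket
import time

def wait_for_db(host: str, port: int, retries: int = 30, delay: float = 2.0) -> bool:
    """Poll until a TCP connection succeeds, mirroring the pg_isready loop."""
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            print("waiting for database...")
            time.sleep(delay)
    return False
```

note this only proves the port is open, not that postgres is ready to accept queries,
which is why the script uses `pg_isready`.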
26 changes: 26 additions & 0 deletions alembic/script.py.mako
@@ -0,0 +1,26 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
${upgrades if upgrades else "pass"}


def downgrade() -> None:
${downgrades if downgrades else "pass"}