feat(kevel-metadata): adding new etl for extracting kevel metadata from kevel api into ads_derived table #242

Draft · wants to merge 6 commits into base: main
28 changes: 27 additions & 1 deletion .circleci/config.yml
@@ -234,6 +234,19 @@ jobs:
command: docker build -t app:build jobs/influxdb-to-bigquery/


build-job-kevel-metadata:
docker:
- image: << pipeline.parameters.git-image >>
steps:
- checkout
- compare-branch:
pattern: ^jobs/kevel-metadata/
- setup_remote_docker:
version: << pipeline.parameters.docker-version >>
- run:
name: Build Docker image
command: docker build -t app:build jobs/kevel-metadata/

build-job-kpi-forecasting:
docker:
- image: << pipeline.parameters.git-image >>
@@ -492,7 +505,6 @@ workflows:
branches:
only: main


job-fxci-taskcluster-export:
jobs:
- build-job-fxci-taskcluster-export
@@ -523,6 +535,20 @@
only: main


kevel-metadata:
jobs:
- build-job-kevel-metadata
- gcp-gcr/build-and-push-image:
context: data-eng-airflow-gcr
docker-context: jobs/kevel-metadata/
path: jobs/kevel-metadata/
image: kevel-metadata_docker_etl
requires:
- build-job-kevel-metadata
filters:
branches:
only: main

job-kpi-forecasting:
jobs:
- build-job-kpi-forecasting
7 changes: 7 additions & 0 deletions jobs/kevel-metadata/.dockerignore
@@ -0,0 +1,7 @@
.ci_job.yaml
.ci_workflow.yaml
.DS_Store
*.pyc
.pytest_cache/
__pycache__/
venv/
2 changes: 2 additions & 0 deletions jobs/kevel-metadata/.flake8
@@ -0,0 +1,2 @@
[flake8]
max-line-length = 88
118 changes: 118 additions & 0 deletions jobs/kevel-metadata/.gitignore
@@ -0,0 +1,118 @@
# PyCharm
.idea/*

# Mac OS
.DS_Store

# Packages
adzerk_lambda.zip
adzerk_lambda_ads*.zip
package/*

# Not relevant
runner.py

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
24 changes: 24 additions & 0 deletions jobs/kevel-metadata/Dockerfile
@@ -0,0 +1,24 @@
FROM python:3.11

ARG USER_ID="10001"
ARG GROUP_ID="app"
ARG HOME="/app"

ENV HOME=${HOME}
RUN groupadd --gid ${USER_ID} ${GROUP_ID} && \
useradd --create-home --uid ${USER_ID} --gid ${GROUP_ID} --home-dir ${HOME} ${GROUP_ID}

WORKDIR ${HOME}

RUN pip install --upgrade pip

COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

COPY . .

RUN pip install .

# Drop root and change ownership of the application folder to the user
RUN chown -R ${USER_ID}:${GROUP_ID} ${HOME}
USER ${USER_ID}
70 changes: 70 additions & 0 deletions jobs/kevel-metadata/README.md
@@ -0,0 +1,70 @@
# Kevel Metadata Extraction Job

This job:

- Extracts data from the Kevel API into memory.
- Transforms the data into a single dataset saved to a local file.
- Loads that data into a file in Google Cloud Storage.
- Finally, merges the data to create daily partitions.
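The extract and transform steps above can be sketched in Python. This is an illustrative sketch only: the function names, field names, and response shape are assumptions, and the real logic lives in `src/handler.py`.

```python
import json
from typing import Any

# Illustrative sketch of the extract/transform steps; the actual job's
# module layout and the Kevel API response shape may differ.

def extract_flights(api_response: dict[str, Any]) -> list[dict[str, Any]]:
    """Pull the flight records out of a (mocked) Kevel API response."""
    return api_response.get("items", [])

def transform(flights: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only active flights and normalise field names."""
    return [
        {"flight_id": f["Id"], "name": f["Name"], "is_active": f["IsActive"]}
        for f in flights
        if f.get("IsActive")
    ]

def to_ndjson(rows: list[dict[str, Any]]) -> str:
    """Serialise rows as newline-delimited JSON for loading into GCS/BigQuery."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)

response = {"items": [
    {"Id": 1, "Name": "spring-promo", "IsActive": True},
    {"Id": 2, "Name": "retired", "IsActive": False},
]}
rows = transform(extract_flights(response))
print(to_ndjson(rows))
```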

This job is a migration of the logic that lives here: https://github.com/Pocket/lambda-adzerk

Kevel used to be called Adzerk.

The API documentation is here: https://dev.kevel.com/reference/getting-started-with-kevel

The extract includes only 'active' flights, which means we need to include existing data when doing the partition replacement.

TO DO: **Decide on proper merge logic to maintain proper history of inactive and active flights**.
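One possible merge shape, under the assumption that the real merge will happen in BigQuery and that flights absent from today's extract should be kept but flagged inactive (hypothetical field names):

```python
# Illustrative merge sketch: active flights from today's extract overwrite
# matching rows, while flights that disappeared from the extract are kept
# but marked inactive. The real merge logic is still to be decided.

def merge_flights(existing: list[dict], extracted: list[dict]) -> list[dict]:
    # Start from history, defaulting every previously-seen flight to inactive.
    merged = {row["flight_id"]: {**row, "is_active": False} for row in existing}
    # Today's extract contains only active flights; they win on conflict.
    for row in extracted:
        merged[row["flight_id"]] = {**row, "is_active": True}
    return sorted(merged.values(), key=lambda r: r["flight_id"])

existing = [{"flight_id": 1, "name": "promo"}, {"flight_id": 2, "name": "gone"}]
extracted = [{"flight_id": 1, "name": "promo"}]
print(merge_flights(existing, extracted))
```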

## Usage

This script is intended to be run in a Docker container.
Build the Docker image with:

```sh
docker build -t kevel-metadata .
```

To set up a local environment:

```sh
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

To run unit tests locally (from within venv):

```sh
cd kevel_metadata
python -m pytest -svv --cov=src --cov-report term-missing --cov-fail-under=100 --cache-clear tests
```

Run the script locally (requires gcloud auth):

```sh
python src/handler.py --project test-project --bucket test-bucket --env dev --api-key <kevel-api-key>
```

Run the script from Docker locally (requires gcloud auth):

```sh
docker run -t kevel python kevel_metadata/src/handler.py --project test-project --bucket test-bucket --env dev --api-key <kevel-api-key>
```

Python code will need to be formatted with `black` by running:

```sh
cd kevel_metadata
black --exclude .venv .
```

Production execution of this ETL is expected to be managed by [WTMO](https://workflow.telemetry.mozilla.org/home) with the following command:

```sh
python src/handler.py --env production --api-key <kevel-api-key>
```

In production, the project and bucket values are static.
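A minimal sketch of how that CLI could be wired up with `argparse`, assuming static production defaults (the project and bucket names below are placeholders, not the real production values):

```python
import argparse

# Hypothetical production defaults; the real static values live in the job.
PROD_DEFAULTS = {"project": "prod-project", "bucket": "prod-bucket"}

def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Kevel metadata ETL")
    parser.add_argument("--env", choices=["dev", "production"], required=True)
    parser.add_argument("--project")
    parser.add_argument("--bucket")
    parser.add_argument("--api-key", required=True)
    args = parser.parse_args(argv)
    if args.env == "production":
        # In production, the project and bucket values are static.
        args.project = PROD_DEFAULTS["project"]
        args.bucket = PROD_DEFAULTS["bucket"]
    return args

args = parse_args(["--env", "production", "--api-key", "secret"])
print(args.project, args.bucket)
```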
12 changes: 12 additions & 0 deletions jobs/kevel-metadata/ci_job.yaml
@@ -0,0 +1,12 @@
build-job-kevel-metadata:
docker:
- image: << pipeline.parameters.git-image >>
steps:
- checkout
- compare-branch:
pattern: ^jobs/kevel-metadata/
- setup_remote_docker:
version: << pipeline.parameters.docker-version >>
- run:
name: Build Docker image
command: docker build -t app:build jobs/kevel-metadata/
13 changes: 13 additions & 0 deletions jobs/kevel-metadata/ci_workflow.yaml
@@ -0,0 +1,13 @@
kevel-metadata:
jobs:
- build-job-kevel-metadata
- gcp-gcr/build-and-push-image:
context: data-eng-airflow-gcr
docker-context: jobs/kevel-metadata/
path: jobs/kevel-metadata/
image: kevel-metadata_docker_etl
requires:
- build-job-kevel-metadata
filters:
branches:
only: main
9 changes: 9 additions & 0 deletions jobs/kevel-metadata/kevel_metadata/.coveragerc
@@ -0,0 +1,9 @@
[run]
omit =
tests/*
.venv/*

[report]
exclude_also =
if __name__ == "__main__":
# pragma: no cover