Merge pull request #13 from DataRecce/feature/drc-377-doc-document-about-the-recce-ci

[Document] Add start with dbt cloud document
kentwelcome authored Apr 19, 2024
2 parents e086d0e + 8f1b8ee commit ac99231
Showing 7 changed files with 326 additions and 30 deletions.
Binary file added docs/assets/images/dbt-cloud/dev-artifacts.png
Binary file added docs/assets/images/dbt-cloud/login-dbt-cloud.png
Binary file added docs/assets/images/dbt-cloud/prod-artifacts.png
183 changes: 182 additions & 1 deletion docs/docs/guides/scenario-ci.md
@@ -3,4 +3,185 @@ title: Continue Integration
icon: octicons/play-16
---

# Integrate Recce CI with GitHub Actions

Recce provides the `recce run` command for CI/CD pipelines. You can integrate Recce with GitHub Actions to compare the data models between two environments whenever a pull request is created.

## Prerequisites

Before integrating Recce with GitHub Actions, make sure the following are in place:

- Two environments in your data warehouse, for example one for production and another for development.

- A credentials profile for both environments in your `profiles.yml` file so that Recce can access your data warehouse. You can put the credentials directly in `profiles.yml` or supply them through environment variables.

- The data warehouse credentials stored as GitHub repository secrets. See the [GitHub documentation](https://docs.github.com/en/actions/reference/encrypted-secrets) for the steps.

## Set up Recce with GitHub Actions

We suggest setting up two GitHub Actions workflows in your repository: one for the production environment and another for the development environment.

The production workflow is triggered on every merge to the main branch.

The development workflow is triggered on every commit pushed to the pull-request branch.

### Base Workflow (Main Branch)

This workflow runs the dbt commands for the production environment, then packages the dbt artifacts and uploads them to a third-party storage system outside GitHub. This example uses an AWS S3 bucket to store the dbt artifacts.

```yaml
name: Recce CI Base Branch

on:
  push:
    branches:
      - main

concurrency:
  group: recce-ci-base
  cancel-in-progress: true

jobs:
  build:
    name: DBT Runner
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.10.x"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run DBT
        run: |
          dbt deps
          dbt seed --target ${{ env.DBT_BASE_TARGET }} --target-path target-base
          dbt run --target ${{ env.DBT_BASE_TARGET }} --target-path target-base
          dbt docs generate --target ${{ env.DBT_BASE_TARGET }} --target-path target-base
        env:
          # Set the dbt target name of the base environment
          DBT_BASE_TARGET: prod

      - name: Package DBT artifacts
        run: |
          tar -czvf dbt-artifacts.tar.gz target-base
          mv dbt-artifacts.tar.gz $GITHUB_WORKSPACE/${{ github.sha }}.tar.gz

      - name: Upload to S3
        run: |
          aws s3 cp $GITHUB_WORKSPACE/${{ github.sha }}.tar.gz s3://${{ env.AWS_S3_BUCKET }}/${{ github.sha }}.tar.gz
        env:
          # Set these in your repository secrets
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: ${{ secrets.AWS_REGION }}
          AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
```
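
To try the packaging step locally before relying on the workflow, the archive round trip can be sketched as below. The helper names `package_artifacts` and `restore_artifacts` are hypothetical and not part of Recce or dbt, and the commit SHA is passed in by hand instead of coming from `github.sha`.

```shell
# Sketch only: mirrors the "Package DBT artifacts" step and the matching
# extraction in the pull-request workflow. Helper names are hypothetical.

# Archive ./target-base as <output dir>/<sha>.tar.gz
package_artifacts() {
  # $1: commit SHA used as the archive name, $2: output directory
  tar -czf "$2/$1.tar.gz" target-base
}

# Extract an archive, recreating the target-base folder under the destination
restore_artifacts() {
  # $1: archive path, $2: destination directory
  mkdir -p "$2"
  tar -xzf "$1" -C "$2"
}
```

For example, running `package_artifacts "$(git rev-parse HEAD)" /tmp` from the dbt project root and then `restore_artifacts "/tmp/$(git rev-parse HEAD).tar.gz" ./restored` should leave the artifacts under `./restored/target-base`.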

### Current Workflow (Pull Request Branch)

This workflow first downloads the dbt artifacts built for the base environment from the third-party storage system, or builds the base environment itself if no archive exists for the base commit. It then runs the dbt commands for the development environment and uses Recce to compare the data models between the base and current environments.

```yaml
name: Recce CI Current Branch

on:
  pull_request:
    branches: [main]

jobs:
  check-pull-request:
    name: Check pull request by Recce CI
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10.x"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Install Recce
        run: |
          pip install recce

      - name: Prepare DBT Base environment
        run: |
          if aws s3 cp s3://$AWS_S3_BUCKET/${{ github.event.pull_request.base.sha }}.tar.gz .; then
            echo "Base environment found in S3"
            tar -xvf ${{ github.event.pull_request.base.sha }}.tar.gz
          else
            echo "Base environment not found in S3. Running dbt to create base environment"
            git checkout ${{ github.event.pull_request.base.sha }}
            dbt deps
            dbt seed --target ${{ env.DBT_BASE_TARGET }} --target-path target-base
            dbt run --target ${{ env.DBT_BASE_TARGET }} --target-path target-base
            dbt docs generate --target ${{ env.DBT_BASE_TARGET }} --target-path target-base
          fi
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: ${{ secrets.AWS_REGION }}
          AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
          # Set the dbt target name of the base environment
          DBT_BASE_TARGET: prod

      - name: Prepare DBT Current environment
        run: |
          git checkout ${{ github.event.pull_request.head.sha }}
          dbt deps
          dbt seed --target ${{ env.DBT_CURRENT_TARGET }}
          dbt run --target ${{ env.DBT_CURRENT_TARGET }}
          dbt docs generate --target ${{ env.DBT_CURRENT_TARGET }}
        env:
          # Set the dbt target name of the current environment
          DBT_CURRENT_TARGET: dev

      - name: Run Recce CI
        run: |
          recce run --github-pull-request-url ${{ github.event.pull_request.html_url }}

      - name: Archive Recce State File
        uses: actions/upload-artifact@v4
        id: recce-artifact-uploader
        with:
          name: recce-state-file
          path: recce_state.json

      - name: Comment on pull request
        uses: thollander/actions-comment-pull-request@v2
        with:
          message: |
            Recce `run` successfully completed.
            Please download the [artifact](${{ env.ARTIFACT_URL }}) for the state file.
        env:
          ARTIFACT_URL: ${{ steps.recce-artifact-uploader.outputs.artifact-url }}
```

## Review the Recce State File

Once the Recce CI workflow completes, you can download the Recce state file from the artifact link in the pull-request comment. The state file contains the comparison results between the base and current environments. To review it, start the Recce server in review mode:

```bash
recce server --review recce_state.json
```

In review mode, the Recce server shows the comparison results between the base and current environments, including the row counts of modified data models and the query results of the Recce Preset Checks.
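
As a small convenience, the review command can be wrapped so that it fails early when the state file is missing. This is a sketch only: `review_state` is a hypothetical wrapper name, and it assumes `recce_state.json` has already been extracted from the downloaded artifact, which GitHub typically delivers as a zip archive.

```shell
# Sketch only: start the review server once the downloaded state file is in
# place. `review_state` is a hypothetical wrapper, not a Recce command.
review_state() {
  # $1: path to the state file (defaults to recce_state.json)
  state_file="${1:-recce_state.json}"
  if [ ! -s "$state_file" ]; then
    echo "state file not found or empty: $state_file" >&2
    return 1
  fi
  recce server --review "$state_file"
}
```

Calling `review_state` with no argument looks for `recce_state.json` in the current directory.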
115 changes: 115 additions & 0 deletions docs/docs/start-with-dbt-cloud.md
@@ -0,0 +1,115 @@
---
title: Start with dbt Cloud
icon: material/cloud
---

# Start with dbt Cloud

dbt Cloud is a hosted service from [dbt Labs](https://docs.getdbt.com/docs/cloud/about-cloud/dbt-cloud-features) that provides a managed environment for running dbt projects. This document is a step-by-step guide to getting started with `recce` and dbt Cloud.

## Prerequisites

`Recce` compares the data models between two environments, so your dbt Cloud project needs two environments, for example one for production and another for development.
You also need to provide the credentials profile for both environments in your `profiles.yml` file so that `Recce` can access your data warehouse.

### Suggestions for setting up dbt Cloud

To integrate dbt Cloud with Recce, we suggest setting up two run jobs in your dbt Cloud project.

#### Production Run Job

The production run job should build the main branch of your dbt project. You can trigger the dbt Cloud job on every merge to the main branch or schedule it to run at a specific time each day.

#### Development Run Job

The development run job should build a separate branch of your dbt project. You can trigger the dbt Cloud job on every push to the pull-request branch.

### Set up dbt profiles with credentials

You need to provide the credentials profile for both environments in your `profiles.yml` file. Here is an example of what your `profiles.yml` file might look like:

```yaml
dbt-example-project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"

      # User/password auth
      user: "{{ env_var('SNOWFLAKE_USER') | as_text }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') | as_text }}"

      role: DEVELOPER
      database: cloud_database
      warehouse: LOAD_WH
      schema: "{{ env_var('SNOWFLAKE_SCHEMA') | as_text }}"
      threads: 4
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"

      # User/password auth
      user: "{{ env_var('SNOWFLAKE_USER') | as_text }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') | as_text }}"

      role: DEVELOPER
      database: cloud_database
      warehouse: LOAD_WH
      schema: PUBLIC
      threads: 4
```
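
Because the example profile reads its credentials through `env_var`, dbt fails if any of those variables is unset. A minimal sketch of a preflight check follows. The helper name `missing_env_vars` is hypothetical, and the variable names are the ones used in the example profile above.

```shell
# Sketch only: report which of the named environment variables are unset or
# empty. `missing_env_vars` is a hypothetical helper, not part of dbt or Recce.
missing_env_vars() {
  missing=""
  for name in "$@"; do
    # printenv only sees exported variables, which is also all dbt can see
    if [ -z "$(printenv "$name")" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  # Print the missing names on one line and fail if any were found
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
}
```

For example, `missing_env_vars SNOWFLAKE_ACCOUNT SNOWFLAKE_USER SNOWFLAKE_PASSWORD SNOWFLAKE_SCHEMA` prints the names that still need to be exported and returns a non-zero status if there are any.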

## Install `Recce`

Install Recce using `pip`:

```shell
pip install -U recce
```

## Execute Recce with dbt Cloud

To compare the data models between two environments, you need to download the dbt Cloud artifacts for both environments. The artifacts include the `manifest.json` and `catalog.json` files. You can download them from the dbt Cloud UI.

### Log in to your dbt Cloud account

![dbt Cloud login](../assets/images/dbt-cloud/login-dbt-cloud.png)

### Go to the project you want to compare

![dbt Cloud run job selection](../assets/images/dbt-cloud/select-run-job.png)

### Download the dbt artifacts

Download the artifacts from the latest run of both run jobs. You can download the artifacts from the `Artifacts` tab.

![dbt Cloud production artifacts](../assets/images/dbt-cloud/prod-artifacts.png)
![dbt Cloud development artifacts](../assets/images/dbt-cloud/dev-artifacts.png)

### Set up the dbt artifacts folders

Extract the downloaded artifacts into separate folders. The production artifacts should be in the `target-base` folder and the development artifacts in the `target` folder.

```bash
$ tree target target-base
target
├── catalog.json
└── manifest.json
target-base
├── catalog.json
└── manifest.json
```
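
Before starting Recce, it can help to confirm that both folders contain the two artifact files listed above. The following is a minimal sketch, and `check_artifacts` is a hypothetical helper name.

```shell
# Sketch only: confirm that a dbt artifacts folder holds manifest.json and
# catalog.json. `check_artifacts` is a hypothetical helper name.
check_artifacts() {
  # $1: artifacts folder, e.g. target or target-base
  for file in manifest.json catalog.json; do
    if [ ! -f "$1/$file" ]; then
      echo "missing: $1/$file"
      return 1
    fi
  done
  echo "ok: $1"
}
```

Running `check_artifacts target && check_artifacts target-base` from the dbt project root reports the first missing file, if any.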

### Set up the dbt project

Move the `target` and `target-base` folders to the root of your dbt project.
You should also have the `profiles.yml` file in the root of your dbt project with the credentials profile for both environments.

### Start the `Recce` server

Run the `recce server` command to start the server and compare the data models between the two environments.

```shell
recce server
```