This repository makes it easy to deploy open-source tools that together form a modern data stack.
There is no real need for Airflow here: this repository deploys dbt in serverless mode (Cloud Workflows + Cloud Scheduler). This is an opinionated choice; I dislike the way data teams currently use Airflow. Support for it will still be added later.
It only supports GCP for now.
As Airbyte and Lightdash need multiple containers to run, they can't be deployed serverless. They are therefore each deployed on a Compute Engine VM.
Planning to later add support for:
- Airflow, for people who don't want to stay serverless and cut costs.
- Snowflake, because it's as popular as BigQuery, so why not.
- AWS, because it's the most popular cloud provider.
- Metabase, because it was (and still is) the go-to visualization tool for early-stage projects.
- DuckDB, because it's a trending database for analytics workloads.
- GCP: IAM, APIs, etc.
- Airbyte: Extract and Load
- dbt: Transform
- BigQuery: Warehouse
- Cloud Workflows / Cloud Scheduler: Schedule
- Lightdash: Visualize
- Streamlit: Machine Learning UI
To use the Google Cloud Platform with this project, you will need to create a Google Cloud account and enable billing. In the billing page, you will find a billing ID in the format ######-######-######. Make sure to note this value, as it will be required in the next step.
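If you prefer the command line and already have the gcloud CLI installed (see the SDK setup step below), you can also list your billing accounts and their IDs directly:

gcloud billing accounts list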
You can even set up an organization if you need one and have a professional DNS. Follow the instructions at the following link.
If you want to skip the next few steps, you can use the Docker image already created for you.
It contains all the required tools: the gcloud CLI, Terraform, etc.
To set up the Google Cloud SDK on your computer, follow the instructions provided for your specific operating system. Once you have installed the gcloud command-line interface (CLI), open a terminal window and run the following command. This will allow Terraform to use your default credentials for authentication.
gcloud auth application-default login
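To quickly verify that the application default credentials are in place, you can ask gcloud to print an access token (it fails if the login above didn't work):

gcloud auth application-default print-access-token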
To use Terraform, you will first need to install the Terraform CLI on your local machine. Follow the instructions provided at the following link to complete the installation.
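Once installed, you can confirm the CLI is on your PATH:

terraform -version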
You can simply fork through the GitHub UI and then clone the fork to your local machine.
Alternatively, you can fork the repository with the following command:
gh repo fork REPOSITORY --org ORGANIZATION --clone=true
This requires the GitHub CLI, which you can install by following these instructions.
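As an illustration, on macOS with Homebrew (adapt to your own OS and package manager), installing and authenticating boils down to:

brew install gh
gh auth login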
TODO: document GitHub PAT
To create the resources on Google Cloud, you will first have to fill in your .env file. We provide a template; just copy it and rename it to .env.
Then fill in what you need: at a minimum, set BILLING_ID (which you already have from step 1), PROJECT, REGION, and ZONE. You can keep the default REGION and ZONE set in the template file. Make sure your PROJECT value is 6 to 30 characters long and contains only lowercase letters, numbers, and hyphens.
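As an illustration, a minimal .env could look like this (every value below is a placeholder, not a real ID):

BILLING_ID=XXXXXX-XXXXXX-XXXXXX
PROJECT=my-modern-data-stack
REGION=europe-west1
ZONE=europe-west1-b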
If you set up an organization (optional), you can also create a folder (optional). For this optional step, fill in the FOLDER_NAME value in your .env file and then run:
make create-folder
Check the FOLDER_ID value in the CLI output (here is an example; the value is the one under the green rectangle), then fill in your .env file with it.
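If the FOLDER_ID doesn't show up in the command output, you can also retrieve it with gcloud (ORGANIZATION_ID below is a placeholder for your own organization's numeric ID):

gcloud resource-manager folders list --organization=ORGANIZATION_ID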
Finally run the following command in a terminal window:
make all
This will create a Google Cloud project and a GCS bucket for Terraform state storage, then deploy the infrastructure as code. That's it.
You can enable IAP and HTTPS endpoints for the Airbyte and Lightdash instances. This is configured in the load_balancer files and through the DNS variable that you can optionally set in your .env
file.
Your two instances will then be reachable only through your DNS, restricted to IAP-authenticated users: airbyte.yourdns.com / lightdash.yourdns.com.
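For illustration, assuming the variable is literally named DNS in the .env template, the entry is a single line with your own domain (placeholder below):

DNS=yourdns.com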
If you want to access your Airbyte instance directly, you can tunnel it to your localhost with this command:
make airbyte-tunnel
You can now create connections between your sources and destinations at localhost:8002.
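For the curious: a target like this typically wraps gcloud's IAP tunneling. Here is a minimal sketch of what make airbyte-tunnel might run; the instance name and ports are assumptions, so check the Makefile for the real values:

gcloud compute start-iap-tunnel airbyte-instance 8000 --local-host-port=localhost:8002 --zone=$ZONE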
When you no longer need to connect to the instance, just run:
make airbyte-fuser
You can initialize a dbt project with the command:
make dbt-init
It is based on three environment variables in your .env file: PROJECT, DBT_PROJECT, and DBT_DATASET.
Then you can run your models, views, etc. locally with the following command:
make dbt-run
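For reference, a dbt profile targeting BigQuery through your gcloud application default credentials generally looks like the sketch below; the profile name and dataset are placeholders, and make dbt-init should generate the real profiles.yml from your .env values:

my_dbt_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-modern-data-stack
      dataset: my_dbt_dataset
      threads: 4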
TODO: develop serverless dbt
If you want to access your Lightdash instance directly, you can tunnel it to your localhost with this command:
make lightdash-tunnel
You can now connect at localhost:8003. Sadly, Lightdash isn't very Terraform-friendly, so a few steps must be done in the UI. For now I don't know how to automate this; I will need to dig into the CLI (or the API, if there is one).
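One possible lead: Lightdash ships a CLI on npm that can authenticate against a self-hosted instance, which may eventually cover part of this setup. A hedged sketch, assuming a recent @lightdash/cli release and the tunnel above:

npm install -g @lightdash/cli
lightdash login http://localhost:8003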
See the Lightdash initial project setup tutorial in our docs here.
When you no longer need to connect to the instance, just run:
make lightdash-fuser