Skip to content

Snowplow offically-supported SQL data models for working with Snowplow web and mobile behavioral data. Supports Redshift, Snowflake and BigQuery.

Notifications You must be signed in to change notification settings

aidanradford/data-models

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

88 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

actively-maintained License

snowplow-logo

Snowplow is a scalable open-source platform for rich, high quality, low-latency data collection. It is designed to collect high quality, complete behavioral data for enterprise business.

Snowplow Pipeline Overview

snowplow-pipeline

The Snowplow trackers enable highly customizable collection of raw, unopinionated event data. The pipeline validates these events against a JSONSchema - to guarantee a high quality dataset - and adds information via both standard and custom enrichments.

This data is then made available in-stream for real-time processing, and can also be loaded to blob storage and data warehouse for analysis.

The Snowplow atomic data acts as an immutable log of all the actions that occurred across your digital products. The data model takes that data and transforms it into a set of derived tables optimized for analysis. Visit our documentation site for further explanation on the data modeling process.

Try Snowplow

This repo contains data models which are relevant to users who already have a full Snowplow pipeline running (which can be done Open Source or via Snowplow BDP).

If you don't have a pipeline yet, you might be interested in finding out what Snowplow can do by setting up Try Snowplow.

Repo Contents

Documentation for the data models can be found on our documentation site.

Prerequisites

These models are written in a format that is runnable via SQL-runner - available for download as a zip file from Github Releases. The BigQuery model requires >= v0.9.2, and the Snowflake model requires >= v0.9.3 of sql-runner.

Those who don't wish to use sql-runner to run models can use the -t and -o flags of the run_config.sh script to output the pure sql for a model according to how it has been configured for sql-runner.

They each also require a dataset of Snowplow events, generated by one of the tracking SDKs, passed through the validation and enrichment steps of the pipeline, and loaded to a database.

For the testing framework, Python3 is required. Install requirements with:

cd .tests
pip3 install -r requirements.txt

Quick start

To run a model and tests end to end, run the .scripts/e2e.sh bash script.

end-to-end

For a quick start guide to each individual model, and specific details on each module, see the README in the model's database-specific folder (eg. web/v1/redshift).

For detail on the structure of a model, see the README in the model's main folder (eg. web/v1).

For detail on using the helper scripts, see the README in .scripts/

Running models in production

Using SQL-runner

Snowplow BDP

Snowplow BDP customers can configure jobs for SQL-runner in production via configuration files. See our docs site for details on doing so. The configs/datamodeling.json file in each model is an example configuration for the standard model. The configs/example_with_custom.json file is an example configuration with a customization.

Open Source

For open-source users, the JSON files in configs folders can't be directly used, but serve as a representation of the dependencies for what to run. Open Source users using SQL-runner should instrument their jobs to run playbooks individually according to the dependencies specified.

For local use, the .scripts/run_config.sh script can be used to run a config - note that it does not resolve dependencies but runs the playbooks in order of appearance.

dbt

For users using dbt we have the snowplow-web dbt package, allowing you to run the web model via dbt. Currently the package only supports Redshift, with BigQuery and Snowflake support to follow soon. Similarly, the mobile model will soon be supported in dbt.

Using other tools

For those who wish to use other tools, one may configure playbooks and config JSON files for the desired model, then use the .scripts/run_configs.sh script's -p and -o flags to fill templates and output pure SQL to file:

bash .scripts/run_config.sh -b ~/pathTo/sql-runner -c web/v1/bigquery/sql-runner/configs/example_with_custom.json -p -o tmp/sql;

This SQL can then be used directly or amended to suit the relevant tool.

Copyright and license

The Snowplow Data Models project is Copyright 2020-2021 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

About

Snowplow offically-supported SQL data models for working with Snowplow web and mobile behavioral data. Supports Redshift, Snowflake and BigQuery.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • PLpgSQL 81.8%
  • Shell 18.2%