tomkdefra/CDAP_training

Training materials for using the DASH Platform

DASH Platform training

Here you can find training materials for using the DASH Platform. These materials were created by members of the Data Analytics and Science Hub (DASH) in core Defra. Feedback is welcome! Please raise any issues in the Issues tab.

The training is structured around the technologies available on the platform. Each folder contains a description of the technology and how to access it. Sub-folders are modular and task-based, so they serve both as an induction and as a reference guide.

DASH team SharePoint

For an overview of the DASH platform, visit the DASH Platform SharePoint site, where you can watch a video on accessing the DASH Platform and combining data.

  • This video talks through accessing Databricks notebooks
  • Accessing RStudio on the DASH Platform
  • Databricks Filestore

The DASH Platform playbook

The DASH Platform playbook gives a detailed account of how the platform will operate over the next few years, detailing all the features as well as governance around the platform.
It is available on the Posit server: DASH Platform playbook

Datalake

  • The Data catalogue is a Power BI dashboard that lets you search for the location of the governed datasets available.

  • Importing your own data to the DASH Platform

    • Learn to load data onto the platform
    • Moving data to your folder in the lab zone

There is a 2 GB limit on files imported by this method. Datasets above this limit, or datasets of potential use to others, can be requested for addition through the issue tracker on Teams.
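
If you script your uploads, it can help to check file sizes against that limit first. A minimal sketch in Python (the helper name and the exact byte interpretation of "2 GB" are our assumptions, not part of the platform):

```python
from pathlib import Path

# The lab-zone upload method has a 2 GB per-file limit (see above).
# We assume here that "2 GB" means 2 * 1024**3 bytes.
UPLOAD_LIMIT_BYTES = 2 * 1024**3

def within_upload_limit(path: str) -> bool:
    """Return True if the file at `path` is small enough to upload."""
    return Path(path).stat().st_size <= UPLOAD_LIMIT_BYTES
```

Files over the limit would instead go through the Teams issue tracker described above.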

Databricks notebooks

Databricks notebooks can be used for writing Python, SQL, and R code, as well as a combination of the languages all within the same notebook.
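
As a sketch of what that mixing looks like, Databricks notebooks switch language per cell with a magic command on the cell's first line (the table name below is a placeholder):

```
# Cell 1 — the notebook's default language (Python here)
df = spark.read.table("my_catalog.my_schema.example_table")
display(df)

%sql
-- Cell 2 — the %sql magic switches this cell to SQL
SELECT COUNT(*) FROM my_catalog.my_schema.example_table

%r
# Cell 3 — the %r magic switches this cell to R
head(iris)
```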

  • Example workflow in Databricks: This video gives you an idea of what is possible

    • This example uses geospatial data from the DASH Platform
    • See how to create a dashboard using Python
  • Databricks documentation offers lots of information easily accessible in one place, for a variety of tasks.

    • The documentation is based on a general set-up of Databricks, not all of which can be done on the DASH Platform; for example, you can use existing clusters, but not create your own.

We have created training notebooks that you can load from this GitHub page into your DASH Platform workspace for practice. Click the heading for a guide to loading this repo into Databricks, so you can open the notebooks on the platform and run the code.

The following notebooks are available:

  • DASH Platform Demo - Data Combine (this is the notebook used in the videos linked above)
  • Data access user guide
  • MLflow and TensorBoard user guide

Databricks Training

Databricks offers lots of training materials, free to Defra employees, to help you utilise the Databricks workspace. Most of these courses are Python and SQL based and will require some prior knowledge, so we recommend starting with the beginners' courses highlighted below.

To access Databricks Learning, you can sign up here. When using the notebooks provided by Databricks, it is important to attach each notebook to the training cluster; the other clusters do not support the notebooks' set-up.

Python on Databricks

  • Apache Spark™ Programming with Databricks: Because Databricks is built on Spark clusters, this is a good course to start with in order to make use of the Databricks workspace. Spark can optimize queries, especially for big data.

    • Identify core features of Spark & Databricks
    • Apply the DataFrame transformation API to process and analyse data
    • Apply Delta & Structured streaming to process streaming data
  • Scalable Machine Learning with Apache Spark: This course navigates the process of building machine learning solutions using Spark. You will build and tune ML models with SparkML using transformers, estimators, and pipelines. This course highlights some of the key differences between SparkML and single-node libraries such as scikit-learn. You will also reproduce your experiments and version your models using MLflow.

    • Create data processing pipelines with Spark.
    • Build and tune machine learning models with Spark ML.
    • Track, version, and deploy models with MLflow.
    • Perform distributed hyperparameter tuning with Hyperopt.
    • Use Spark to scale the inference of single-node models.
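
To give a flavour of the DataFrame transformation API the first course covers, here is a minimal PySpark sketch. It assumes you are on Databricks, where a SparkSession is predefined as `spark` and `display` is available; the table and column names are placeholders:

```python
from pyspark.sql import functions as F

df = spark.read.table("my_catalog.my_schema.sales")

summary = (
    df.filter(F.col("year") == 2023)           # narrow the rows
      .groupBy("region")                       # aggregate per region
      .agg(F.sum("amount").alias("total"))     # total sales per region
      .orderBy(F.desc("total"))
)
display(summary)  # Databricks helper for tabular and chart output
```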

SQL on Databricks

  • Data Analysis with Databricks SQL: This course provides a comprehensive introduction to Databricks SQL. Learners will ingest data, write queries, produce visualizations and dashboards, and configure alerts.
    • Import data and persist it in Databricks SQL as tables and views
    • Query data in Databricks SQL
    • Use Databricks SQL to create visualizations and dashboards
    • Create alerts to notify stakeholders of specific events
    • Share queries and dashboards with others
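
A hedged sketch of the kind of query the course has you write in the Databricks SQL editor — the catalog, schema, table, and column names are placeholders:

```sql
-- Example query: total orders and spend per region, most recent first
SELECT region,
       COUNT(*)    AS orders,
       SUM(amount) AS total_amount
FROM my_catalog.my_schema.sales
WHERE order_date >= '2023-01-01'
GROUP BY region
ORDER BY total_amount DESC;
```

Results of a query like this can then be turned into a visualization, added to a dashboard, or wired to an alert.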

R and RStudio on the DASH Platform

We have created training materials for working in RStudio on the DASH Platform.

  • Getting started with RStudio on the DASH Platform: This is a guide for using RStudio within the DASH platform, for those familiar with R and RStudio, but new to the DASH platform. It contains:

    • Opening and closing RStudio
    • RStudio workspace
    • Accessing and working with DASH Platform data from RStudio
    • Uploading files into your workspace in RStudio
    • This also includes instructions on how to install packages and upload data, which you will need in order to do the Introduction to R course
  • Connecting RStudio to git and GitHub: This course is a short version of getting connected as described in Jenny Bryan’s book, Happy Git with R, adapted to work for the DASH Platform. See the book here: Let’s Git started. Happy Git and GitHub for the useR. It contains:

    • Connecting everything using GitHub's Personal Access Tokens
    • Adding your GitHub repo to RStudio
    • Working with GitHub and your RStudio project
    • For more about how to use GitHub, you can sign up for a Government Analysis Function course: Intro to Git.
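
As a sketch of the connection steps the Happy Git with R approach uses (run once in the R console; the name and email are placeholders):

```r
# Install the helper packages if you don't already have them
install.packages(c("usethis", "gitcreds"))

# Tell git who you are
usethis::use_git_config(
  user.name  = "Your Name",          # placeholder
  user.email = "you@example.com"     # placeholder
)

usethis::create_github_token()  # opens GitHub to generate a Personal Access Token
gitcreds::gitcreds_set()        # paste the token here when prompted
```

The course itself covers the DASH Platform-specific parts, such as adding a repo to an RStudio project.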

Hosting dashboards

  • Dashboards with Shiny and RStudio Connect: This is a guide for creating a dashboard through Shiny in RStudio and publishing it on the Posit Connect Server (formerly RStudio Connect server). It contains:
    • Creating dashboards
    • Publishing dashboards
    • For more detailed information on building Shiny apps, see the Shiny RStudio pages.
    • For more information on Posit Connect, see the Posit Connect pages.
  • Python dashboards: This is a guide for hosting Python dashboards on the Posit Connect server. This is currently possible via VS Code and the Azure Virtual Desktop (AVD).
  • Plotly dashboards: This is a guide for hosting plotly dashboards using Python on the Posit Connect server. This is currently possible via VS Code and the Azure Virtual Desktop (AVD).
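
For orientation, a minimal Shiny app of the kind the dashboards guide builds on looks like this (the app content is our own illustrative example; publishing to Posit Connect is then done from RStudio):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Hello DASH"),
  sliderInput("n", "Number of points", min = 10, max = 100, value = 50),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(runif(input$n), runif(input$n), xlab = "x", ylab = "y")
  })
}

shinyApp(ui, server)
```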

Azure Virtual Desktop (AVD) on the DASH platform

  • VS Code guide: This user guide walks you through the steps to get started with VS Code: how to access it, and how to connect it to Databricks compute.
  • Power BI guide: This is a guide on how to access data on the DASH platform in Power BI.

Training for new workspaces

  • BusAdmin guide: This guide talks you through what business admins (busadmin users) are permitted to do in the new workspaces, and how to create clusters.

This page details more courses that are available to you. These courses are not specific to the DASH Platform, but are useful for those new to programming in R, Python, or SQL; they also link to more advanced courses, such as Spark, to help you make the most of the DASH Platform's capabilities.
