Here you can find training materials for using the DASH Platform. These materials were created by members of the Data Analytics and Science Hub (DASH) in core Defra. Feedback is welcome! Please raise any issues in the Issues tab.
The training is structured around the technologies available on the platform. Each folder has a description of the technology as well as how to access it. Sub-folders are designed to be modular and task-based, serving as both an induction and a reference guide.
For an overview of the DASH Platform, visit the DASH Platform SharePoint site, where you can watch:
- Video on accessing the DASH Platform and combining data
- This video talks through accessing Databricks notebooks
- Accessing RStudio on the DASH Platform
- Databricks Filestore
The DASH Platform playbook gives a detailed account of how the platform will operate over the next few years, detailing all the features as well as governance around the platform.
It is available on the Posit server:
DASH Platform playbook
- The Data catalogue is a Power BI dashboard that lets you search for the governed datasets available on the platform and where to find them.
- Importing your own data to the DASH Platform
- Learn to load data onto the platform
- Moving data to your folder in the lab zone
There is a 2 GB limit on files imported by this method. Datasets above this limit, or datasets of potential use to others, can be requested through the issue tracker on Teams.
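If you are unsure whether a file falls under the limit, you can check its size before attempting an upload. The sketch below is illustrative only: the helper function and the interpretation of "2 GB" as 2 GiB are assumptions, not part of the platform.

```python
import os

# The direct-import route rejects files above roughly 2 GB, so it is
# worth checking sizes locally before uploading.
SIZE_LIMIT_BYTES = 2 * 1024 ** 3  # 2 GiB; assumed interpretation of the "2 GB" limit

def can_upload_directly(path: str) -> bool:
    """Return True if the file is under the direct-import size limit."""
    return os.path.getsize(path) < SIZE_LIMIT_BYTES

# Example: write a tiny file and confirm it is under the limit.
with open("sample.csv", "w") as f:
    f.write("id,value\n1,42\n")

print(can_upload_directly("sample.csv"))  # True
```

Files over the limit should go through the Teams issue tracker instead.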
Databricks notebooks can be used for writing Python, SQL, and R code, as well as a combination of the languages all within the same notebook.
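As a brief sketch of how language mixing looks in practice: the `%sql` and `%r` magics shown in the comments below are standard Databricks notebook commands, while the table and variable names in them are made up for illustration. The Python lines are ordinary code.

```python
# In a Databricks notebook whose default language is Python, a cell can
# switch language by starting with a "magic" command, for example:
#
#   %sql
#   SELECT * FROM my_table LIMIT 10    -- this cell runs as SQL
#
#   %r
#   summary(my_data)                   # this cell runs as R
#
# Cells in the notebook's default language need no magic at all:
values = [1, 2, 3]
print(sum(values))  # 6
```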
- Example workflow in Databricks: This video gives you an idea of what is possible
- This example uses geospatial data from the DASH Platform
- See how to create a dashboard using Python
- Databricks documentation offers lots of information easily accessible in one place, for a variety of tasks.
- The documentation is based on a general set-up of Databricks, not all of which can be done on the DASH Platform; for example, you can use existing clusters, but not create your own.
We have created training notebooks that you can load from this GitHub page into your DASH Platform workspace for practicing. Click the heading to access a guide on accessing this repo from Databricks, so you can load the notebooks into the platform and run the code.
The following notebooks are available:
- DASH Platform Demo - Data Combine (this is the notebook used in the videos linked above)
- Data access user guide
- MLflow and TensorBoard user guide
Databricks offers lots of training materials, free to Defra employees, to help you make the most of the Databricks workspace. Most of these courses are Python- and SQL-based and require some prior knowledge, so we recommend starting with the beginners' courses highlighted further on.
To access Databricks Learning, you can sign up here. When using the notebooks provided by Databricks, it is important to attach the notebook to the training cluster; the other clusters do not support the set-up of these notebooks.
- Apache Spark™ Programming with Databricks: Because Databricks is built upon Spark clusters, this is a good course to start with in order to make use of the Databricks workspace. Spark can optimize queries, especially for big data.
- Identify core features of Spark & Databricks
- Apply the DataFrame transformation API to process and analyse data
- Apply Delta & Structured streaming to process streaming data
- Scalable Machine Learning with Apache Spark: This course navigates the process of building machine learning solutions using Spark. You will build and tune ML models with SparkML using transformers, estimators, and pipelines. This course highlights some of the key differences between SparkML and single-node libraries such as scikit-learn. You will also reproduce your experiments and version your models using MLflow.
- Create data processing pipelines with Spark.
- Build and tune machine learning models with Spark ML.
- Track, version, and deploy models with MLflow.
- Perform distributed hyperparameter tuning with Hyperopt.
- Use Spark to scale the inference of single-node models.
- Data Analysis with Databricks SQL: This course provides a comprehensive introduction to Databricks SQL. Learners will ingest data, write queries, produce visualizations and dashboards, and configure alerts.
- Import data and persist it in Databricks SQL as tables and views
- Query data in Databricks SQL
- Use Databricks SQL to create visualizations and dashboards
- Create alerts to notify stakeholders of specific events
- Share queries and dashboards with others
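The course above uses Databricks SQL itself, but the ingest-then-query pattern it teaches can be sketched with Python's built-in sqlite3 module. The table, view, and column names here are invented for illustration, and Databricks SQL syntax differs in places.

```python
import sqlite3

# An in-memory database standing in for a Databricks SQL warehouse.
conn = sqlite3.connect(":memory:")

# "Import data and persist it as tables and views":
conn.execute("CREATE TABLE sightings (species TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO sightings VALUES (?, ?)",
    [("otter", 3), ("heron", 5), ("otter", 2)],
)
conn.execute(
    "CREATE VIEW totals AS "
    "SELECT species, SUM(count) AS total FROM sightings GROUP BY species"
)

# "Query data": views can be queried like tables.
rows = conn.execute("SELECT species, total FROM totals ORDER BY species").fetchall()
print(rows)  # [('heron', 5), ('otter', 5)]
conn.close()
```

Visualizations, dashboards, and alerts are built on top of queries like these in the Databricks SQL interface.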
We have created training materials for working in RStudio on the DASH Platform.
- Getting started with RStudio on the DASH Platform: This is a guide for using RStudio within the DASH Platform, for those familiar with R and RStudio, but new to the DASH Platform. It contains:
- Opening and closing RStudio
- RStudio workspace
- Accessing and working with DASH Platform data from RStudio
- Uploading files into your workspace in RStudio
- This also includes instructions on how to install packages and upload data, which you will need for the Introduction to R course
- Connecting RStudio to git and GitHub: This course is a short version of getting connected as described in Jenny Bryan’s book, Happy Git with R, adapted to work for the DASH Platform. See the book here: Let’s Git started. Happy Git and GitHub for the useR. It contains:
- Connecting everything using GitHub's Personal Access Tokens
- Adding your GitHub repo to RStudio
- Working with GitHub and your RStudio project
- For more about how to use GitHub, you can sign up for a Government Analysis Function course: Intro to Git.
- Dashboards with Shiny and RStudio Connect: This is a guide for creating a dashboard through Shiny in RStudio and publishing it on the Posit Connect Server (formerly RStudio Connect server). It contains:
- Creating dashboards
- Publishing dashboards
- For more detailed information on building Shiny apps, see the Shiny RStudio pages.
- For more information on Posit Connect, see the Posit Connect pages.
- Python dashboards: This is a guide for hosting Python dashboards on the Posit Connect server. This is currently possible via VS Code and the Azure Virtual Desktop (AVD).
- Plotly dashboards: This is a guide for hosting plotly dashboards using Python on the Posit Connect server. This is currently possible via VS Code and the Azure Virtual Desktop (AVD).
- VS Code guide: This user guide walks you through all the relevant steps to get started with VS Code: how to access it and then connect it to the Databricks Compute.
- Power BI guide: This is a guide on how to access data on the DASH platform in Power BI.
- BusAdmin guide: This guide talks you through what business admins (busadmin users) are permitted to do in the new workspaces, and how to create clusters.
This page details more courses that are available to you. These courses are not specific to the DASH Platform, but are useful for those new to programming in R, Python, or SQL, and also link to more advanced courses to make the most of the capabilities of the DASH Platform, such as Spark:
- Training for beginners
- R and RStudio training
- Python training
- SQL training
- More advanced training
- Government Analysis Function courses
- Geospatial courses
- Resources for good practice
- Table of links