
Module 07 - Medallion Architecture: Bronze Layer

< Previous Module - Home - Next Module >

📢 Scenario

In this scenario, you need to produce cleansed datasets from two inputs: an Excel spreadsheet listing employees and an Excel spreadsheet listing the valid regions that the employees report under. The outcome of the data operations should be two cleansed and transformed datasets:

  1. An operations dataset that includes confidential information
  2. An analytical dataset without confidential information

In this lab, both spreadsheets are ingested using pipeline activities due to their small size. Then, the data is cleansed, transformed, and published to its target location using data flows.

The pattern followed is the medallion architecture pattern. However, the demonstration includes broad concepts that apply to other architectural patterns as well.

Introduction

This module shows what a bronze layer within the medallion pattern might look like when implemented in Azure Data Factory. The bronze layer typically involves the ingestion of raw data. In this lab, it will also perform data retention (cleanup of old data).

Key capabilities of the bronze layer are listed below:

  • Contains the entire historical archive in a raw or nearly raw state
  • Enables reloading of source data without having to re-query the source system
  • Allows auditability and lineage tracking
  • Can hold any combination of streaming and batch transactions

Typically, the bronze layer is loaded incrementally and grows over time. Ingested full datasets should be dated to maintain history.
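
A minimal sketch of the dated layout this lab builds, assuming the inbound/employees_bronze convention configured later in the pipeline sink (timestamps and file names are illustrative):

```
inbound/
  employees_bronze/
    2023/
      202311020800/
    2024/
      202401151330/
        employees.xlsx
        regions.xlsx
```

Each dated run directory holds a full snapshot, so history accumulates without overwriting earlier loads.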

📑 Table of Contents

| # | Section |
| --- | --- |
| 1 | Stage data in the data lake |
| 2 | Create the pipeline |

1. Stage data in the data lake

Data was staged as part of the setup for this lab. See Module 7 staging instructions for more information.

2. Create the pipeline

  1. In the factory resources pane, select the plus icon to open the new resource menu. Select Pipeline.

    new pipeline

  2. In the General panel under Properties, add a Name and Description.

    | Attribute | Value |
    | --- | --- |
    | Name | pl_medallion_bronze |
    | Description | Ingest and apply retention policy |
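
    Behind the designer, Azure Data Factory stores the pipeline as JSON. A minimal sketch of what the pipeline's JSON view might show at this point (the activities array is filled in by the following steps):

    ```json
    {
        "name": "pl_medallion_bronze",
        "properties": {
            "description": "Ingest and apply retention policy",
            "activities": []
        }
    }
    ```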
  3. From the Activities panel, open the Move & transform accordion and drag the Copy data activity onto the canvas. Complete the below attributes in the General tab:

    | Attribute | Value |
    | --- | --- |
    | Name | Copy to incoming |
    | Description | Copy any excel file found in the source directory to a dated incoming directory. |
    | Timeout | 0.00:30:00 (30 minutes) |
    | Retry | 1 |
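
    In the pipeline's JSON view, these General-tab settings land on the activity's policy block. A rough sketch, assuming the default retry interval (the typeProperties are filled in by the Source and Sink steps below):

    ```json
    {
        "name": "Copy to incoming",
        "description": "Copy any excel file found in the source directory to a dated incoming directory.",
        "type": "Copy",
        "policy": {
            "timeout": "0.00:30:00",
            "retry": 1,
            "retryIntervalInSeconds": 30
        },
        "typeProperties": {}
    }
    ```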
  4. On the Source tab of the activity, click the + New button. The next steps will create a dataset that uses the managed VNet integration runtime (as opposed to the AutoResolve integration runtime).

    New dataset

  5. In the New dataset panel, find Azure Data Lake Storage Gen 2 and click the Continue button.

  6. Click the Binary option and click the Continue button.

  7. Enter the properties and click the OK button. The New dataset pane will close.

    | Attribute | Value |
    | --- | --- |
    | Name | ds_vnmed_adls_binary |
    | Linked Service | ls_adls_irvnetmedium |
  8. Click the Open button from the Source tab of the Copy data activity properties to open the newly created dataset.

    open dataset

  9. On the Parameters tab of the new dataset, click the + New button and add the below parameters.

    | Name | Type | Value |
    | --- | --- | --- |
    | container | String | leave blank |
    | directory | String | leave blank |
    | filename | String | leave blank |

    dataset params

  10. On the Connection tab of the new dataset, hover over each of the 3 File path attribute fields and click the Add dynamic content link that appears under the field. Add the parameters to their corresponding fields: the first field takes the container parameter, then directory, then filename. This dataset takes parameters down to the file name (as opposed to the directory level).

    New dataset adls 4
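
    Putting steps 9 and 10 together, the dataset's JSON view should look roughly like the sketch below, with each location attribute bound to a dataset parameter via the standard @dataset() expression:

    ```json
    {
        "name": "ds_vnmed_adls_binary",
        "properties": {
            "linkedServiceName": {
                "referenceName": "ls_adls_irvnetmedium",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "container": { "type": "String" },
                "directory": { "type": "String" },
                "filename": { "type": "String" }
            },
            "type": "Binary",
            "typeProperties": {
                "location": {
                    "type": "AzureBlobFSLocation",
                    "fileSystem": { "value": "@dataset().container", "type": "Expression" },
                    "folderPath": { "value": "@dataset().directory", "type": "Expression" },
                    "fileName": { "value": "@dataset().filename", "type": "Expression" }
                }
            }
        }
    }
    ```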

  11. From the Factory Resources pane, click the ds_vnmed_adls_binary dataset ellipsis menu item and select Clone.

    Clone dataset

  12. On the Properties pane of the cloned dataset, replace the text "copy1" in the Name with directory. The name of the dataset should now be ds_vnmed_adls_binary_directory.

  13. Hover over the file name part of the File path and click the garbage bin icon. On the Parameters tab, check the filename parameter and click the Delete button.

    New dataset adls 6

  14. Now that the datasets are complete, open the pl_medallion_bronze pipeline in the canvas and make the following selections in the Source tab of the activity named Copy to incoming.

    | Attribute | Value |
    | --- | --- |
    | Source dataset | ds_vnmed_adls_binary |
    | Dataset properties / container | source |
    | Dataset properties / directory | employees |
    | Dataset properties / filename | *.xlsx* |
    | File path type | Wildcard file path |
    | Wildcard paths / [directory] | employees |
    | Wildcard paths / [filename] | *.xlsx* |
    | Filter by last modified / Start time (UTC) | Add dynamic content expression: @addDays(convertFromUtc(utcNow(),'Central Standard Time'),-3) |
    | Filter by last modified / End time (UTC) | Add dynamic content expression: @utcNow() |

    Note that the Delete files after completion option is un-checked. Following industry-accepted medallion architecture bronze layer patterns, you want to be able to reload without having to re-query the source as much as possible. Leaving this option un-checked ensures that if you need to reload the silver layer, the most recent data is still available.

    copy source
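
    In the activity's JSON view, these Source-tab selections translate to a source block roughly like the sketch below (the container/directory/filename values are carried on the activity's input dataset reference, omitted here for brevity):

    ```json
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": true,
            "wildcardFolderPath": "employees",
            "wildcardFileName": "*.xlsx*",
            "modifiedDatetimeStart": {
                "value": "@addDays(convertFromUtc(utcNow(),'Central Standard Time'),-3)",
                "type": "Expression"
            },
            "modifiedDatetimeEnd": {
                "value": "@utcNow()",
                "type": "Expression"
            },
            "deleteFilesAfterCompletion": false
        }
    }
    ```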

  15. Make the following selections in the Sink tab of the activity Copy to incoming.

    | Attribute | Value |
    | --- | --- |
    | Sink dataset | ds_vnmed_adls_binary_directory |
    | Dataset properties / container | inbound |
    | Dataset properties / directory | Add dynamic content expression to sink files into a yyyy/yyyyMMddHHmm directory: @concat('employees_bronze/',formatDateTime(convertFromUtc(utcNow(),'Central Standard Time'),'yyyy'),'/',formatDateTime(convertFromUtc(utcNow(),'Central Standard Time'),'yyyyMMddHHmm')) |

    copy sink
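
    At run time, the directory expression evaluates to a dated path. For a hypothetical debug run at 13:30 Central time on 15 January 2024, the files would land in:

    ```
    inbound/employees_bronze/2024/202401151330
    ```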

  16. From the Activities panel, open the General accordion and drag the Delete activity onto the canvas. Click the on success condition (green check) of the Copy to incoming activity and connect it to the Delete activity. Complete the below attributes in the Delete activity:

    | Tab | Attribute | Value |
    | --- | --- | --- |
    | General | Name | Inbound Data Retention Policy |
    | General | Description | Delete incoming folders older than 1 year. |
    | General | Timeout | 0.00:30:00 (30 minutes) |
    | General | Retry | 1 |
    | Source | Dataset | ds_vnmed_adls_binary_directory |
    | Source | Dataset properties / container | inbound |
    | Source | Dataset properties / directory | Add dynamic content expression: @concat('employees_bronze/',formatDateTime(getPastTime(1,'Year'),'yyyy')) |
    | Logging settings | Enable logging | Clear the check box |

    Note that this activity only deletes the directory for the previous year. If older directories exist, consider a one-time cleanup operation.
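
    The getPastTime(1,'Year') function returns the timestamp exactly one year before the current time, so formatting it with 'yyyy' yields the previous year. For a hypothetical run in 2024, the Source directory expression resolves to:

    ```
    employees_bronze/2023
    ```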

  17. From the Activities panel, open the General accordion and drag the Fail activity onto the canvas. Click the on fail condition (red x) for the Copy to incoming activity and connect it to the Fail activity. Complete the below attributes in the Fail activity.

    | Tab | Attribute | Value |
    | --- | --- | --- |
    | General | Name | Fail |
    | General | Description | Set and throw first error code and message |
    | Settings | Fail message | Add dynamic content expression: @activity('Copy to incoming').output.errors[0].Message |
    | Settings | Error code | Add dynamic content expression: @activity('Copy to incoming').output.errors[0].Code |
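
    These expressions read the Copy activity's run output, which on failure includes an errors array. A hypothetical, abbreviated sketch of that output (actual codes and messages vary by failure type):

    ```json
    {
        "errors": [
            {
                "Code": 2200,
                "Message": "Failure happened on 'Source' side. ..."
            }
        ]
    }
    ```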
  18. (Optional) Open the Azure Storage Account lab resource named dfmdf< Random string for your lab environment resources >adls.

    Create a directory named ####, where #### is one less than the current year, in the inbound/employees_bronze directory so that you can see the retention policy working. Then, create a directory for the current year.

    copy sink

  19. In the Azure Data Factory Studio, click the Debug button.

    copy sink

  20. Ensure the debug succeeds.

    copy sink

    When asked if you want to use the dataflow debug session, click Use activity runtime.

  21. In the Azure Storage Account lab resource named dfmdf< Random string for your lab environment resources >adls, ensure the inbound/employees_bronze/[current year]/[yyyyMMddHHmm] directory exists and contains the 2 Excel files.

    copy sink copy sink

  22. Click the Publish all button, then click the Publish button.

  23. :important: Click Debug again to run the pl_medallion_bronze pipeline a second time. This is necessary to demonstrate the de-duplication logic in the next module.

  24. Close all open tabs.

🎉 Summary

You have now completed this module. Your medallion architecture bronze layer pipeline is working as expected and you can continue to the silver layer.

Continue >