A data dashboard application that collects and analyzes real-time data about dams in NSW. Created with a Flask API, TypeScript React, PySpark, AWS RDS, S3, Lambda, and Docker.
-
This project aims to support water management efforts and enhance public awareness about water resource trends and statuses.
-
The MVP was to collect live and historic data about dams in NSW using the WaterNSW API and display it to the user in a responsive data dashboard.
-
One major focus was to integrate cloud and data tools to create a live data pipeline directly from the public API into an AWS RDS instance, where the data could then be cleaned, processed, and analyzed with PySpark.
-
First Stage - Python scripting to collect all available data from the WaterNSW API, process it with Pandas, and then seed it into a local MySQL database.
-
Second Stage - Building a Flask API on top of the local database, then connecting a React UI to display the data in various ways, including graphically with the Chart.js package.
-
Third Stage - Attaching PySpark to the database to create data-driven endpoints that could perform live analysis on the entire dataset to provide historical insights.
-
Fourth Stage - Creating a live data pipeline with AWS services and connecting this live-update database to the Flask backend to create a real-time data experience.
- React
- Chart.js
- TypeScript
- The frontend was designed primarily as a single-page application (SPA), with additional search functionality to fetch pages about specific resources.
- Designed with the objective of creating an aesthetically appealing and interactive interface that displays useful data for an engaging user experience.
- Use the search bar, or open the list of dams, to find specific insights on a particular dam.
- Clicking the 'dam-group' button automatically populates a new grouping and re-renders the associated graph.
- A variety of graphs and statistics display useful information to the user.
- Chart.js integrated to provide graphical insights
- Search feature allowing users to find specific dams
- Individual pages about each dam that provide specific insights and analysis
- Google Maps API integration for dynamically displaying each dam's location
- Flask
- Python
- PySpark
- The aim of this application was to create a lightweight Flask API that can easily switch between databases and integrates Python data tools for quick and efficient analysis of the underlying dataset (a minimal sketch follows this list).
- The frontend app provides a user interface that interacts with the backend automatically; the endpoints can also be accessed and tested through tools such as Postman or curl.
- PySpark for data cleaning, processing and analysis
- A collection of data-driven endpoints
- A simple and lightweight API to access the dataset
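
As a rough illustration of this setup, the sketch below shows a minimal Flask API with one data-driven endpoint. The `DATABASE_URI` environment variable, the `dams` table, and the column names are assumptions for demonstration, not the project's actual code.

```python
# Minimal Flask API sketch (illustrative only, not the project's actual code).
# A DATABASE_URI environment variable lets the same app point at either the
# local MySQL database or the AWS RDS instance.
import os

from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = os.environ.get(
    "DATABASE_URI", "mysql+pymysql://user:password@localhost/dams"
)
db = SQLAlchemy(app)


class Dam(db.Model):
    __tablename__ = "dams"  # hypothetical table name
    dam_id = db.Column(db.String(20), primary_key=True)
    dam_name = db.Column(db.String(100))
    full_volume = db.Column(db.Float)


@app.route("/dams")
def list_dams():
    # Simple data-driven endpoint: return every dam as JSON.
    dams = Dam.query.all()
    return jsonify(
        [
            {"dam_id": d.dam_id, "dam_name": d.dam_name, "full_volume": d.full_volume}
            for d in dams
        ]
    )


if __name__ == "__main__":
    app.run(debug=True)
```

Keeping the connection string in configuration rather than code is what makes the local-to-RDS switch a one-line change.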
- Pandas
- PySpark
- WaterNSW API
- AWS RDS
- AWS S3 bucket
- AWS Lambda
There are three major data components in this project:
- A series of Python scripts was written to collect all data from the WaterNSW API and automate the database seeding process. These files can be found in the database-prep folder; a simplified sketch is shown below.
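
The sketch below is a heavily simplified version of that collection-and-seeding flow; the endpoint path, response shape, credentials, and table name are illustrative assumptions rather than the actual contents of the database-prep scripts.

```python
# Illustrative collection-and-seeding sketch, not the actual database-prep scripts.
# The endpoint path, field names, and table name are assumptions for demonstration.
import pandas as pd
import requests
from sqlalchemy import create_engine

API_BASE = "https://api.waternsw.com.au"          # assumed base URL
ENDPOINT = f"{API_BASE}/water/v1/dams/resources"  # hypothetical endpoint


def fetch_dam_data(token: str) -> pd.DataFrame:
    """Pull raw dam records from the API and load them into a DataFrame."""
    response = requests.get(
        ENDPOINT, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
    response.raise_for_status()
    return pd.DataFrame(response.json()["records"])  # assumed response shape


def seed_database(df: pd.DataFrame) -> None:
    """Clean the DataFrame with Pandas and write it into a local MySQL table."""
    df = df.dropna(subset=["dam_id"]).drop_duplicates(subset=["dam_id", "date"])
    engine = create_engine("mysql+pymysql://user:password@localhost/dams")
    df.to_sql("dam_resources", engine, if_exists="append", index=False)


if __name__ == "__main__":
    seed_database(fetch_dam_data(token="YOUR_API_TOKEN"))
```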
-
PySpark was attached to the local database during development to perform a series of real-time calculations on the dataset, accessible through endpoints in the Flask API.
-
The analysis focuses specifically on how the average water level of an individual dam, or of an aggregation of dams within the dataset, has changed over set time periods (12 months, 5 years, and 20 years).
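
A simplified sketch of this kind of PySpark calculation is shown below. It assumes a hypothetical `dam_resources` table with `dam_id`, `date`, and `storage_volume` columns read over JDBC; it is not the project's actual analysis code.

```python
# Sketch of a PySpark aggregation over set time windows (illustrative only).
# Assumes a dam_resources table with dam_id, date and storage_volume columns,
# and that the MySQL JDBC driver is available on the Spark classpath.
from datetime import date, timedelta

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dam-analysis").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/dams")  # or the RDS endpoint
    .option("dbtable", "dam_resources")
    .option("user", "user")
    .option("password", "password")
    .load()
)


def average_storage(dam_id: str, years: int) -> float:
    """Average storage volume for one dam over the last N years."""
    cutoff = date.today() - timedelta(days=365 * years)
    return (
        df.filter((F.col("dam_id") == dam_id) & (F.col("date") >= F.lit(str(cutoff))))
        .agg(F.avg("storage_volume").alias("avg_storage"))
        .first()["avg_storage"]
    )


# e.g. averages over the three windows used in the analysis
for years in (1, 5, 20):
    print(years, average_storage("203042", years))  # "203042" is a placeholder ID
```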
-
The WaterNSW API provides new data about each dam in the dataset on the first day of each month.
-
The live data pipeline begins with an AWS Lambda function that collects an OAuth2 key (valid for 12 hours) from the WaterNSW API on the first of each month and stores it in an AWS S3 bucket.
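
A stripped-down sketch of what that first function could look like follows; the token endpoint, bucket name, and credential handling are assumptions for illustration only.

```python
# Illustrative Lambda handler: fetch an OAuth2 token and store it in S3.
# The token URL, bucket name, and secret handling are assumptions, not the
# project's actual configuration. requests is assumed to be packaged with
# the deployment (e.g. as a Lambda layer).
import json
import os

import boto3
import requests

s3 = boto3.client("s3")

TOKEN_URL = "https://api.waternsw.com.au/oauth2/token"  # assumed endpoint
BUCKET = os.environ.get("TOKEN_BUCKET", "dam-pipeline-tokens")  # hypothetical bucket


def lambda_handler(event, context):
    # Exchange client credentials for a short-lived (12-hour) access token.
    response = requests.post(
        TOKEN_URL,
        data={"grant_type": "client_credentials"},
        auth=(os.environ["CLIENT_ID"], os.environ["CLIENT_SECRET"]),
        timeout=30,
    )
    response.raise_for_status()

    # Persist the token so the second Lambda function can reuse it.
    s3.put_object(Bucket=BUCKET, Key="oauth_token.json", Body=json.dumps(response.json()))
    return {"statusCode": 200}
```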
-
A second Lambda function then uses this key to call the endpoint that provides the latest data for each dam, and stores this recent data in the AWS S3 bucket.
-
Finally, this recent data is written into the historical and latest-data tables in the associated AWS RDS instance, providing an access point for the Flask API.
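
The second function and the RDS write could be sketched roughly as below, again with assumed endpoint, bucket, table, and column names.

```python
# Illustrative second Lambda handler: reuse the stored token, fetch the latest
# dam data, archive it in S3, and write it into the RDS tables.
# Endpoint, bucket, table, and column names are assumptions for illustration;
# pandas, requests and SQLAlchemy are assumed to be packaged as dependencies.
import json
import os

import boto3
import pandas as pd
import requests
from sqlalchemy import create_engine

s3 = boto3.client("s3")
BUCKET = os.environ.get("TOKEN_BUCKET", "dam-pipeline-tokens")   # hypothetical
LATEST_URL = "https://api.waternsw.com.au/water/v1/dams/latest"  # assumed endpoint


def lambda_handler(event, context):
    # Read the OAuth2 token stored by the first Lambda function.
    token_obj = s3.get_object(Bucket=BUCKET, Key="oauth_token.json")
    token = json.loads(token_obj["Body"].read())["access_token"]

    # Fetch the latest reading for each dam and archive the raw payload in S3.
    response = requests.get(
        LATEST_URL, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
    response.raise_for_status()
    records = response.json()["records"]  # assumed response shape
    s3.put_object(Bucket=BUCKET, Key="latest_data.json", Body=json.dumps(records))

    # Append to the historical table and refresh the latest-data table in RDS.
    df = pd.DataFrame(records)
    engine = create_engine(os.environ["RDS_URI"])  # e.g. mysql+pymysql://...@rds-host/dams
    df.to_sql("historical_data", engine, if_exists="append", index=False)
    df.to_sql("latest_data", engine, if_exists="replace", index=False)
    return {"statusCode": 200, "body": f"{len(df)} records written"}
```

Splitting token retrieval and data collection into two functions keeps each Lambda small and lets the stored token be reused if the data fetch needs to be retried.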
- AWS Lambda, AWS S3 Bucket and AWS RDS to create a live data pipeline
- Pandas for data handling and transfer
- Live data cleaning, processing and analysis with PySpark
- Scripting for API data collection and database seeding
- Deployed using Docker by tagging and pushing images to AWS ECR, then creating a service in AWS ECS
- This project uses AWS Fargate to spin up a serverless compute engine when the deployment URL is accessed.
- Investigate cached storage for monthly calculation results
- Fix bug with dynamically updating months on graphs
- Fix bug with button click in 'Dam Capacity Percentage Over Last 12 Months' graph
- Create testing for frontend and backend
- Provide more complex analysis with PySpark (time-series, seasonal trends, etc.)
- Add distributed computing for data processing with Spark
- Building a cloud-based live update data pipeline
- Integrating new data tools such as Pandas and PySpark
- Gaining hands-on experience with various AWS services
- Creating a React-based data dashboard to display insights to users
- Deploying with Docker and serverless computing resources
Many aspects of this application were challenging and provided experience in new domains, including creating a live data pipeline, deploying with Docker, and learning new data tools.
- Visit my LinkedIn for more details.
- Check out my GitHub for more projects.
- Or send me an email at [email protected]
Thanks for your interest in this project. Feel free to reach out with any thoughts or questions.
Oliver Jenkins © 2024