Notes from learning about distributed systems in GW CS 6421 with Prof. Wood
- AWS Tutorial: Launch a VM (took ~15 min)
- Used the EC2 Console to create an EC2 instance of type t2.micro
- Used `ssh -i '<path-to-ssh-pem-file>' ec2-user@<ip-address>` to connect
- After terminating the instance, the connection timed out when attempting to connect
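- As a rough sketch, the same launch-and-terminate flow could be done with boto3 instead of the EC2 Console; the AMI ID, key pair, and security group below are placeholders, not values from the tutorial

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single t2.micro instance (AMI ID, key pair, and security group are placeholders)
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",            # e.g. an Amazon Linux AMI for the chosen region
    InstanceType="t2.micro",
    KeyName="my-key-pair",             # the downloaded .pem for this key is what ssh -i uses
    SecurityGroupIds=["sg-xxxxxxxx"],  # must allow inbound SSH (port 22)
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]

# Wait for the instance to be running, then print the public IP used in the ssh command
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
desc = ec2.describe_instances(InstanceIds=[instance_id])
print(desc["Reservations"][0]["Instances"][0].get("PublicIpAddress"))

# Terminating the instance is why later connection attempts time out
ec2.terminate_instances(InstanceIds=[instance_id])
```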
- QwikLab: Intro to S3 (took ~20 min)
- S3 stands for Simple Storage Service, which stores objects in buckets; buckets support features such as backups and versioning
- When uploading objects, there are different storage class options based on how long-lived and how frequently accessed the data will be
- Public read access can be granted to specific objects
- ARN is Amazon Resource Name and in this case refers to a bucket
- The policy generator can be used to create policies to put in the policy editor
- New versions of a file can be uploaded and will display by default; additional permissions must be given to view old versions
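- A minimal boto3 sketch of the same S3 ideas (storage classes, versioning, old versions needing an explicit listing); the bucket and key names are made up for illustration

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"   # placeholder; bucket names must be globally unique

s3.create_bucket(Bucket=bucket)   # outside us-east-1, a CreateBucketConfiguration is also required

# Enable versioning so re-uploading the same key creates a new version
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Pick a storage class based on how long-lived and how frequently accessed the data will be
s3.put_object(
    Bucket=bucket,
    Key="reports/report.txt",
    Body=b"first version",
    StorageClass="STANDARD_IA",   # infrequent-access tier
)

# Upload a new version; this is what gets displayed by default
s3.put_object(Bucket=bucket, Key="reports/report.txt", Body=b"second version")

# Old versions still exist but require an explicit version listing (and permissions) to see
for v in s3.list_object_versions(Bucket=bucket, Prefix="reports/report.txt").get("Versions", []):
    print(v["VersionId"], v["IsLatest"])
```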
- Video: Hadoop Intro (took ~15 minutes)
- Distributed systems are used to deal with the data explosion
- Hadoop is a framework inspired by a Google technical document to make distributed data processing easier
- Hadoop is economical, reliable, scalable, and flexible
- Traditional systems have data going to a program, while with Hadoop the program goes to the data
- Hadoop has 12 components:
- HDFS (Hadoop Distributed File System) is a storage layer for Hadoop
- HBase is a NoSQL database that stores data in HDFS
- Sqoop transfers data between Hadoop and relational database servers
- Flume ingests streaming data
- Spark is a cluster computing framework
- Hadoop MapReduce is the original processing engine but is losing ground to Spark (a toy MapReduce-style sketch follows after this list)
- Pig is a dataflow system primarily used for analytics that converts Pig scripts to MapReduce code
- Impala is a high-performance SQL engine with low latency
- Hive is similar to Impala but is better suited to batch data processing
- Cloudera Search enables non-technical users to search and explore data
- Oozie is a workflow or coordination system used to manage Hadoop jobs
- Hue (Hadoop User Experience) is a web interface for Hadoop
- These components help with the four stages of big data processing: ingest, process, analyze, and access
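- To make the MapReduce model above concrete, here is a toy single-machine word count written in the map/shuffle/reduce style; real Hadoop MapReduce runs the same phases in parallel across the cluster, sending the program to the nodes that hold the data blocks

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one line of input
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between the map and reduce phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts emitted for one word
    return key, sum(values)

lines = ["the program goes to the data", "the data does not go to the program"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)   # word -> count for the two sample lines
```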
- QwikLab: Analyze Big Data with Hadoop (took ~20 minutes)
- Set up an EMR cluster and an S3 bucket for output
- The default EMR cluster configuration includes Apache Hadoop, Ganglia (performance monitor), Apache Tez (framework for creating a complex directed acyclic graph of tasks), Hive (data warehouse and analytics package), Hue (GUI), and Pig (converts Pig Latin to Tez jobs)
- Amazon CloudFront speeds up web content distribution through worldwide network of edge location data centers
- A step is a unit of work with one or more Hadoop jobs
- Created a step with a Hive script that reads and parses CloudFront log files and then submits a HiveQL query to analyze requests per OS
- Script took about a minute to process 5,000 rows
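- A hedged sketch of how a Hive step like this could be submitted with boto3 instead of the console; the cluster ID, script location, and input/output buckets are placeholders, and the exact command-runner arguments may differ from what the lab's wizard generates

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a HiveQL script as a step on an existing EMR cluster (IDs and S3 paths are placeholders)
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "CloudFront log analysis",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", "s3://my-scripts/Hive_CloudFront.q",     # parses the logs and counts requests per OS
                "-d", "INPUT=s3://my-input-bucket",
                "-d", "OUTPUT=s3://my-output-bucket/results/",
            ],
        },
    }],
)
print(response["StepIds"])
```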
- QwikLab: Intro to S3 - already completed in Area 1
- QwikLab: Intro to Amazon Redshift (took ~20 minutes)
- Amazon Redshift is a data warehouse allowing data analysis using SQL
- Pgweb is a web-based SQL client for PostgreSQL, which Redshift is compatible with
- Created a users table, then copied 49,990 rows from an S3 text file
- Queried the data using SQL
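- A rough sketch of the same create/copy/query sequence from Python with psycopg2 (the lab used Pgweb instead); the connection settings, table columns, S3 path, and IAM role are illustrative placeholders

```python
import psycopg2

# Redshift speaks the PostgreSQL wire protocol, so psycopg2 can connect (values are placeholders)
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="my-password",
)
cur = conn.cursor()

# 1. Create the users table (columns are a guessed minimal schema, not the lab's exact one)
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (
        userid   INTEGER NOT NULL,
        username VARCHAR(32),
        city     VARCHAR(64)
    )
""")

# 2. Bulk-load the rows from a delimited text file in S3
cur.execute("""
    COPY users
    FROM 's3://my-bucket/users_data.txt'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER '|'
""")

# 3. Query the loaded data with ordinary SQL
cur.execute("SELECT city, COUNT(*) FROM users GROUP BY city ORDER BY COUNT(*) DESC LIMIT 10")
for row in cur.fetchall():
    print(row)

conn.commit()
cur.close()
conn.close()
```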
- Video: Short AWS Machine Learning Overview (took ~3 minutes)
- There are three layers of the machine learning stack:
- Framework and Interfaces: used by companies with experts who build and train ML models
- ML Platforms: make it easy to build, train, tune, and deploy models without requiring deep ML expertise
- Application Services: APIs for computer vision, speech processing, etc.
- AWS Tutorial: Analyze Big Data with Hadoop (took ~25 minutes)
- Created an S3 bucket and an EC2 key pair
- Launched an EMR cluster
- Allowed SSH connections to cluster
- Steps can be submitted to a cluster, configured when it is set up, or run on the command line after connecting to the master node
- The script also analyzes CloudFront log files and determines the number of requests for each OS
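- For the cluster-creation part, a minimal boto3 equivalent might look like the following; the release label, instance types and counts, roles, and names are assumptions rather than the tutorial's exact settings

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small EMR cluster with Hadoop and Hive (settings are illustrative only)
cluster = emr.run_job_flow(
    Name="hadoop-tutorial-cluster",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",          # the EC2 key pair that allows SSH to the master node
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster running so steps can be added later
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-output-bucket/logs/",
)
print(cluster["JobFlowId"])   # the cluster ID that steps are submitted against
```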
- QwikLab: Intro to Amazon Machine Learning (took ~35 minutes)
- Amazon ML allows software developers to build predictive models from supervised data sets
- Docs: AWS Machine Learning
- A datasource contains metadata about input data but not the data itself
- An ML model generates predictions
- An evaluation measures ML model quality
- Batch predictions generate predictions for a set of observations
- Real-time predictions generate predictions for single input observations with low latency
- Started by adding data to an S3 bucket
- Imported the data into Amazon ML and identified the target variable
- Amazon ML by default reserves 70% of data for training and 30% for evaluation
- The confusion matrix cross-tabulates true and predicted values, showing the frequency of each combination
- Real-time predictions show predicted scores (adding up to 1) for each category
- Real-time predictions can be done without submitting all of the input data values
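- A sketch of what one real-time prediction looks like through the boto3 machinelearning client (the lab did this through the console); the model ID and record fields are hypothetical placeholders

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")
model_id = "ml-XXXXXXXXXXXX"   # placeholder ML model ID

# A real-time endpoint must exist before Predict can be called
endpoint = ml.create_realtime_endpoint(MLModelId=model_id)
endpoint_url = endpoint["RealtimeEndpointInfo"]["EndpointUrl"]

# Only some of the input fields need to be supplied; the rest may be left out
prediction = ml.predict(
    MLModelId=model_id,
    Record={"age": "35", "job": "services", "duration": "180"},   # hypothetical feature values
    PredictEndpoint=endpoint_url,
)
print(prediction["Prediction"])   # contains the predicted label/value and the per-class scores
```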
- AWS Tutorial: Build a Machine Learning Model (took ~35 minutes)
- The ML model creation roughly follows the previous tutorial except with different data
- AUC (Area Under the ROC Curve) is a quality metric for ML models
- The score threshold can be adjusted to choose a balance of false positives and false negatives
- Batch predictions can be done by linking to a new spreadsheet of input data; they cost $0.10 per 1,000 predictions
- The results spreadsheet gives the best answer and the score; the score threshold determines the best answer based on the score
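- Likewise, the score threshold and batch prediction from this tutorial map onto two boto3 calls; the IDs and the S3 output path below are made-up placeholders

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")
model_id = "ml-XXXXXXXXXXXX"        # placeholder model ID
datasource_id = "ds-XXXXXXXXXXXX"   # placeholder datasource created from the spreadsheet of inputs

# Move the cutoff between the two answers to trade false positives against false negatives
ml.update_ml_model(MLModelId=model_id, ScoreThreshold=0.3)

# Score every observation in the datasource and write the results spreadsheet to S3
ml.create_batch_prediction(
    BatchPredictionId="bp-example-1",
    BatchPredictionName="example-batch-prediction",
    MLModelId=model_id,
    BatchPredictionDataSourceId=datasource_id,
    OutputUri="s3://my-output-bucket/batch-predictions/",
)
```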
- Video Tutorial: Overview of AWS SageMaker (took ~30 minutes)
- SageMaker is a machine learning service with four parts: notebook instances, jobs, models, and endpoints
- The parts can be used independently or together
- The notebook instances run Jupyter notebooks
- SageMaker has implementations of algorithms
- In the xgboost_mnist notebook shown, the steps included configuration, data ingestion, and data conversion before training
- The algorithms are hosted in Docker containers; the user defines the parameters
- Training can be done on a single instance or on multiple instances to speed it up; training is done as jobs
- After training the model, it is hosted in a container so it can be used for predicting
- SageMaker allows multiple models behind the same endpoint, which allows A/B testing
- After setting up the endpoint, predictions can be done and a confusion matrix can be generated
- A user-generated algorithm in a separate Python file can also be used
- SageMaker can also host already trained models, but they may need to be converted
- You can also build your own algorithm Docker container
- AWS Tutorial: AWS SageMaker (took ~45 minutes)
- Started by creating S3 bucket and SageMaker notebook instance
- In the notebook, loaded the MNIST dataset of handwritten digit images
- Configured the built-in k-means algorithm with data and output locations, etc.
- Used the `fit` method to begin training
- Used `deploy` to create a model, an endpoint configuration, and an endpoint
- Used `predict` to get inferences for 100 images; they were mostly accurate
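- A condensed sketch of that fit/deploy/predict flow with the SageMaker Python SDK's built-in KMeans estimator; the role, bucket, instance types, and random data are placeholders, and the argument names vary somewhat between SDK versions

```python
import numpy as np
from sagemaker import KMeans, Session

session = Session()
role = "arn:aws:iam::123456789012:role/my-sagemaker-role"   # placeholder execution role
bucket = "my-sagemaker-bucket"                              # placeholder output bucket

# Configure the built-in k-means algorithm (runs in a SageMaker-managed container)
kmeans = KMeans(
    role=role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path=f"s3://{bucket}/kmeans-output",
    k=10,                                 # one cluster per digit
    sagemaker_session=session,
)

# Stand-in for the flattened 28x28 images; the tutorial used the real MNIST training set
train_data = np.random.rand(1000, 784).astype("float32")

# fit() launches a training job; deploy() creates the model, endpoint configuration, and endpoint
kmeans.fit(kmeans.record_set(train_data))
predictor = kmeans.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

# predict() calls the endpoint and returns the closest cluster for each input image
results = predictor.predict(train_data[:100])
print(results[0])

predictor.delete_endpoint()   # tear down the endpoint when finished
```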
- Build a Serverless Real-Time Data Processing App (took ~90 minutes)
- Intro
- Set up the AWS Cloud9 IDE, which runs in the browser
- Module 1: Build a data stream (Kinesis, Cognito)
- Created a Kinesis stream with 1 shard (unit of throughput capacity)
- Started the producer and consumer in separate terminals
- Used Cognito to create an identity pool to grant unauthenticated users access to stream
- Opened a web app that showed a live map of the coordinates being transmitted in the stream
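- A minimal boto3 sketch of the producer and consumer halves of this module; the stream name and record fields are placeholders for what the workshop's producer actually sends

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "wildrydes"   # placeholder stream name

# One shard = one unit of read/write throughput capacity
kinesis.create_stream(StreamName=stream, ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName=stream)

# Producer: put one position record onto the stream (fields are illustrative)
record = {"Name": "Shadowfax", "Latitude": 40.75, "Longitude": -73.99}
kinesis.put_record(
    StreamName=stream,
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["Name"],
)
time.sleep(1)

# Consumer: read the single shard from the beginning of the stream
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for rec in kinesis.get_records(ShardIterator=iterator)["Records"]:
    print(json.loads(rec["Data"]))
```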
- Module 2: Aggregate data (Kinesis Data Analytics)
- Created another stream with 1 shard for aggregated statistics
- Started the producer so that Kinesis Data Analytics would automatically detect the schema
- Created a Kinesis Analytics application and added a SQL query to aggregate stream data
- A table with data from the SQL query updated every minute automatically
- Connected the application to an output Kinesis stream
- Used a consumer to view the aggregate stream in a terminal
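- A heavily hedged sketch of how the aggregation application could be created with boto3; the pump/stream SQL is a generic tumbling-window aggregate rather than the workshop's exact query, and wiring up the input and output streams (done through the console in the lab) is omitted

```python
import boto3

ka = boto3.client("kinesisanalytics", region_name="us-east-1")

# Generic tumbling-window SQL in the Kinesis Data Analytics dialect (not the workshop's exact query)
application_code = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "Name"     VARCHAR(16),
    "Distance" DOUBLE
);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "Name", SUM("Distance")
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY "Name", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE);
"""

# Creates the application shell; the lab also attaches the source stream (with its
# auto-discovered schema) as input and the aggregate stream as output
ka.create_application(
    ApplicationName="wildrydes-aggregator",   # placeholder application name
    ApplicationCode=application_code,
)
```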
- Module 3: Process streaming data (DynamoDB, Lambda)
- Module 4: Store & query data (Kinesis Data Firehose, Athena)
- Summary
This application processed data generated by a producer simulating unicorns travelling across New York City in several different ways. First, the producer sent its data to an Amazon Kinesis stream, which a dashboard read directly to display the unicorns on a live map. From there, a SQL query in Kinesis Data Analytics aggregated the data and sent summary statistics out another Kinesis stream every minute. A Lambda function then took the records from this aggregate stream as they arrived and saved them in a DynamoDB table. Meanwhile, Kinesis Data Firehose delivered the stream data to an S3 bucket, over which an Athena table was defined so the data could be queried.
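To illustrate the Module 3 piece of the summary, here is a rough sketch of a Lambda handler that reads records off the aggregate Kinesis stream and writes them to DynamoDB; the table name, key attributes, and record layout are assumptions rather than the workshop's actual code.

```python
import base64
import json
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UnicornSensorData")   # placeholder table name

def lambda_handler(event, context):
    # Kinesis delivers a batch of base64-encoded records to the function
    for record in event["Records"]:
        payload = json.loads(
            base64.b64decode(record["kinesis"]["data"]),
            parse_float=Decimal,   # DynamoDB requires Decimal rather than float
        )

        # Assumed shape: one aggregate row per unicorn per minute
        table.put_item(Item={
            "Name": payload["Name"],              # partition key (assumed)
            "StatusTime": payload["StatusTime"],  # sort key (assumed)
            "Distance": payload.get("Distance"),
        })

    return {"recordsProcessed": len(event["Records"])}
```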