- Extract tweets from the Mexican National Seismological Service account using the Twitter API and Tweepy, and save them to a CSV file
- Check that the CSV file was created correctly
- Move the CSV file to HDFS
- Create a Hive table for storing the data
- Create a PySpark script to process the data and insert it into the Hive table
- Send an email notification when the data pipeline has completed
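The first two steps (extraction with Tweepy, then a sanity check on the CSV) could be sketched as below. This is a minimal sketch, not the project's actual code: the environment variable names, the CSV column layout, and the `SismologicoMX` handle are assumptions.

```python
import csv
import os

def tweets_to_rows(tweets):
    """Flatten tweet objects (or plain dicts) into (id, created_at, text) rows."""
    rows = []
    for t in tweets:
        # Works for both Tweepy Status objects and plain dicts.
        get = t.get if isinstance(t, dict) else (lambda k, _t=t: getattr(_t, k))
        rows.append([get("id"), str(get("created_at")), get("text")])
    return rows

def save_rows_to_csv(rows, path):
    """Write the rows to a CSV file with a header, then return the path."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(rows)
    return path

def csv_created_correctly(path):
    """Sanity check: the file exists, is non-empty, and starts with the header."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, encoding="utf-8") as f:
        return f.readline().strip().startswith("id,created_at,text")

# Only call the live Twitter API when credentials are present
# (hypothetical environment variable names).
if all(k in os.environ for k in ("TWITTER_API_KEY", "TWITTER_API_SECRET",
                                 "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")):
    import tweepy

    auth = tweepy.OAuthHandler(os.environ["TWITTER_API_KEY"],
                               os.environ["TWITTER_API_SECRET"])
    auth.set_access_token(os.environ["TWITTER_ACCESS_TOKEN"],
                          os.environ["TWITTER_ACCESS_SECRET"])
    api = tweepy.API(auth)
    tweets = api.user_timeline(screen_name="SismologicoMX", count=200)
    save_rows_to_csv(tweets_to_rows(tweets), "sismos_tweets.csv")
```

Keeping the row-flattening and the file check as pure functions makes them easy to test without Twitter credentials.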
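For the Hive and PySpark steps, one possible shape is a DDL statement plus a small job that stages the CSV and appends it into the table. The table name, column names, HDFS path, and the `RUN_SPARK_JOB` flag are all assumptions for illustration.

```python
import os

# Hypothetical Hive DDL for the tweets table; table and column names are assumed.
TWEETS_DDL = """
CREATE TABLE IF NOT EXISTS sismos_tweets (
    id BIGINT,
    created_at STRING,
    text STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
"""

def build_insert_sql(staging_view, table="sismos_tweets"):
    """SQL that appends the staged CSV rows into the Hive table."""
    return f"INSERT INTO {table} SELECT id, created_at, text FROM {staging_view}"

# Only attempt the Spark job when explicitly asked to (hypothetical flag),
# since it needs a cluster with Hive support.
if os.environ.get("RUN_SPARK_JOB"):
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("sismos_tweets")
             .enableHiveSupport().getOrCreate())
    spark.sql(TWEETS_DDL)
    df = spark.read.option("header", True).csv("hdfs:///data/tweets/sismos_tweets.csv")
    df.createOrReplaceTempView("staging_tweets")
    spark.sql(build_insert_sql("staging_tweets"))
```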
The whole pipeline is orchestrated by Airflow.
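An Airflow DAG wiring the six steps together might look like the sketch below. This is a configuration sketch only (it needs a running Airflow installation): the task ids, file paths, schedule, and email address are assumptions, and the extraction callable is a placeholder.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator

def extract_tweets_to_csv():
    """Placeholder for the Tweepy extraction step."""
    ...

with DAG(
    dag_id="sismos_twitter_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_tweets",
                             python_callable=extract_tweets_to_csv)
    # Fail the run early if the CSV is missing or empty.
    check = BashOperator(task_id="check_csv",
                         bash_command="test -s /tmp/sismos_tweets.csv")
    to_hdfs = BashOperator(task_id="move_to_hdfs",
                           bash_command="hdfs dfs -put -f /tmp/sismos_tweets.csv /data/tweets/")
    create_table = BashOperator(task_id="create_hive_table",
                                bash_command="hive -f /opt/pipeline/create_table.hql")
    process = BashOperator(task_id="process_with_pyspark",
                           bash_command="spark-submit /opt/pipeline/process_tweets.py")
    notify = EmailOperator(task_id="notify",
                           to="team@example.com",
                           subject="Tweet pipeline finished",
                           html_content="The data pipeline completed successfully.")

    extract >> check >> to_hdfs >> create_table >> process >> notify
```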
Tweets from the Mexican National Seismological Service are extracted and stored in an Amazon S3 bucket, with everything running on an EC2 instance.
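Since s3fs is installed, pandas can write a CSV straight to an S3 URL with the same call used for a local path; on an EC2 instance with an attached IAM role no explicit credentials are needed. A minimal sketch (the bucket name and function name below are hypothetical):

```python
import pandas as pd

def save_tweets(df: pd.DataFrame, path: str) -> str:
    """Write the tweets DataFrame to CSV. With s3fs installed, the same call
    accepts an S3 URL, e.g. "s3://my-tweets-bucket/sismos_tweets.csv"."""
    df.to_csv(path, index=False)
    return path
```

Locally you would pass a filesystem path; on the EC2 instance the `s3://...` form uploads the file directly to the bucket.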
- Python 3
- PySpark
- AWS
- Hadoop
- HDFS
- Hive
- Airflow
- datetime
- pandas
- requests
- json
- Tweepy
- s3fs