A Spark project that fetches data from the YouTube API with PySpark and creates tables in Google Cloud Storage, orchestrated by Airflow on Google Cloud Dataproc
This project showcases the workflow for fetching data from the YouTube API, performing transformations via PySpark, and storing the processed data in Google Cloud Storage in Parquet format.
Data Retrieval and Transformation: Data is retrieved from the YouTube API and transformed with PySpark, as sketched below.
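A minimal sketch of this step, assuming the google-api-python-client library, an API key supplied via a hypothetical YT_API_KEY environment variable, and a placeholder gs://my-bucket output path; none of these names, nor the specific endpoint and transformation, come from the project itself:

    import os
    from googleapiclient.discovery import build
    from pyspark.sql import SparkSession

    # Hypothetical: the real project may read its key differently.
    API_KEY = os.environ["YT_API_KEY"]

    # Fetch the current most-popular videos from the YouTube Data API v3.
    youtube = build("youtube", "v3", developerKey=API_KEY)
    response = youtube.videos().list(
        part="snippet,statistics",
        chart="mostPopular",
        regionCode="US",
        maxResults=50,
    ).execute()

    # Flatten the nested JSON into plain rows before handing it to Spark.
    rows = [
        {
            "video_id": item["id"],
            "title": item["snippet"]["title"],
            "channel": item["snippet"]["channelTitle"],
            "views": int(item["statistics"].get("viewCount", 0)),
        }
        for item in response["items"]
    ]

    spark = SparkSession.builder.appName("youtube-etl").getOrCreate()
    df = spark.createDataFrame(rows)

    # Example transformation, then persist as Parquet in GCS
    # (Dataproc images ship with the GCS connector preinstalled).
    top = df.filter(df.views > 0).orderBy(df.views.desc())
    top.write.mode("overwrite").parquet("gs://my-bucket/youtube/trending")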
Cloud Service Utilization: Google Cloud Dataproc runs the Spark workload, and Airflow automates the end-to-end data processing workflow.
DAG Configuration: The DAG defines parameters such as DAG_ID, PROJECT_ID, CLUSTER_NAME, REGION, and JOB_FILE_URI, which drive the setup and management of the Dataproc cluster environment (see the sketch below).
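A sketch of what this configuration might look like; every value below is a placeholder, and the PYSPARK_JOB dictionary is the conventional Dataproc job spec rather than the project's actual one:

    from datetime import datetime
    from airflow import DAG

    # Placeholder identifiers; substitute your own project values.
    DAG_ID = "youtube_dataproc_etl"
    PROJECT_ID = "my-gcp-project"
    CLUSTER_NAME = "youtube-etl-cluster"
    REGION = "us-central1"
    JOB_FILE_URI = "gs://my-bucket/jobs/youtube_etl.py"

    # Job definition later handed to DataprocSubmitJobOperator.
    PYSPARK_JOB = {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {"main_python_file_uri": JOB_FILE_URI},
    }

    dag = DAG(
        dag_id=DAG_ID,
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    )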
Cluster Configuration: The Dataproc cluster is configured through the ClusterGenerator class, which specifies the number of workers, machine types, disk sizes, initialization actions, and more.
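Continuing from the constants above, a possible ClusterGenerator setup; the machine types, disk sizes, image version, and init-action URI are illustrative assumptions, not the project's settings:

    from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator

    # make() turns the generator into the cluster-config dict
    # expected by DataprocCreateClusterOperator.
    CLUSTER_CONFIG = ClusterGenerator(
        project_id=PROJECT_ID,
        zone="us-central1-a",
        master_machine_type="n1-standard-2",
        worker_machine_type="n1-standard-2",
        num_workers=2,
        master_disk_size=32,
        worker_disk_size=32,
        init_actions_uris=["gs://my-bucket/init/install_deps.sh"],
        image_version="2.1-debian11",
    ).make()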
create_cluster: Task to create a Dataproc cluster leveraging DataprocCreateClusterOperator.
pyspark_task: Task for submitting a PySpark job to the created cluster via DataprocSubmitJobOperator.
delete_cluster: Task to delete the Dataproc cluster after job completion using DataprocDeleteClusterOperator. A sketch of all three task definitions follows this list.
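Continuing from the DAG and cluster config sketched above, the three tasks might be wired up as follows; the trigger_rule on delete_cluster is a common cost safeguard added here as an assumption, not something the project description states:

    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
        DataprocDeleteClusterOperator,
        DataprocSubmitJobOperator,
    )
    from airflow.utils.trigger_rule import TriggerRule

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
        dag=dag,
    )

    pyspark_task = DataprocSubmitJobOperator(
        task_id="pyspark_task",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
        dag=dag,
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        # Tear the cluster down even if the job fails, to avoid idle costs.
        trigger_rule=TriggerRule.ALL_DONE,
        dag=dag,
    )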
Task Dependencies: Dependencies are chained with the >> operator to ensure sequential execution: create_cluster runs first, followed by pyspark_task, and finally delete_cluster.
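In code, that chain is a single line:

    # The cluster exists before the job runs, and is deleted
    # only after the job finishes.
    create_cluster >> pyspark_task >> delete_cluster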