GDELT +

Introduction

GDELT is a very powerful dataset that is described as:

"an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day."

GDELT - What is it ?

Global Database of Events, Language, and Tone (GDELT)
A Global Database of Society
Monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages
Identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day

Problem

Problem:

A lot of analysis is already done to provide the data and classify if with a CAMEO code.
BUT there is data missing that is useful for further analysis - The content of each news (only URLs are provided)
Plus, one would have to use BigQuery or build a pipeline to get the data

Solution:

Enriched Data: Enrich the data by getting all the news' articles and bring them in play.
Make it easy: Let's provide a simple and easy way to access the data and keep track of data versions.
Make it visirual: Provide a simple dashboard with trends in the news

Pipeline & Tech Stack

Challenges

Queries' Performance:

300Gb of data with heavy joins
- Filter early
- Select only what is needed
- Used hashed columns to join

Queries' Performance:

15 min to download, scrape, clean and process data
- Find and tweak good news' scrapers
- Make usage of auto scaling groups
- Improve write process to PostgreSLQ

Slides & Demo

Project slides can be found here
Dashboard can be found here
Jupyter Notebook can be found (password is insight) here

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
airflow		airflow
config		config
data_processing		data_processing
flaskweb		flaskweb
quilt		quilt
tests		tests
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GDELT +

Introduction

GDELT - What is it ?

Problem

Pipeline & Tech Stack

Challenges

Slides & Demo

About

Releases

Packages

Contributors 2

Languages

nmduarteus/insight-gdelt

Folders and files

Latest commit

History

Repository files navigation

GDELT +

Introduction

GDELT - What is it ?

Problem

Pipeline & Tech Stack

Challenges

Slides & Demo

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages