GDELT +

Introduction

GDELT is a very powerful dataset, described by its creators as:

"an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day."

GDELT - What is it?

  • Global Database of Events, Language, and Tone (GDELT)
  • A Global Database of Society
  • Monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages
  • Identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day

Problem & Solution

Problem:

  • A lot of analysis is already done to provide the data and classify each event with a CAMEO code.
  • BUT data that is useful for further analysis is missing: the content of each news article (only URLs are provided).
  • Plus, one would have to use BigQuery or build a pipeline to get the data.

Solution:

  • Enriched Data: enrich the dataset by fetching the full text of each news article and bringing it into play (see the scraping sketch after this list).
  • Make it easy: provide a simple and easy way to access the data and keep track of data versions.
  • Make it visual: provide a simple dashboard with trends in the news.
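
The README does not name the scraping library, so the snippet below is only a minimal sketch of the enrichment step, using the newspaper3k package as a stand-in; the URL and the returned fields are illustrative, not the project's actual schema.

```python
# Minimal sketch of the article-enrichment step, assuming the newspaper3k
# package; the scraper actually used by the project may differ.
from newspaper import Article


def fetch_article(url: str) -> dict:
    """Download and parse a single news article referenced by a GDELT record."""
    article = Article(url)
    article.download()
    article.parse()
    return {
        "url": url,
        "title": article.title,
        "text": article.text,
        "publish_date": article.publish_date,
    }


if __name__ == "__main__":
    # Illustrative URL only; in the pipeline the URLs come from GDELT event records.
    print(fetch_article("https://example.com/some-news-story")["title"])
```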

Pipeline & Tech Stack

(Pipeline and tech stack diagram)

Challenges

Queries' Performance:

  • 300 GB of data with heavy joins
    • Filter early
    • Select only what is needed
    • Join on hashed columns (see the sketch below)
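
The README does not include the actual query code, so the following is only a minimal PySpark sketch of the three ideas above (filter early, select only the needed columns, join on hashed keys); the use of Spark, the paths, and the column names are assumptions.

```python
# Hypothetical PySpark job illustrating early filtering, column pruning,
# and joining on hashed keys; paths and schemas are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gdelt-enrichment").getOrCreate()

# Select only the columns the query needs and filter as early as possible.
events = (
    spark.read.parquet("s3://example-bucket/gdelt/events/")
    .select("GLOBALEVENTID", "SQLDATE", "EventCode", "SOURCEURL")
    .filter(F.col("SQLDATE") >= 20190101)
)

articles = (
    spark.read.parquet("s3://example-bucket/gdelt/articles/")
    .select("url", "title", "text")
)

# Join on a hashed key instead of the long URL string to cut shuffle cost.
# (A production job would also compare the original URLs to guard against
# hash collisions.)
events = events.withColumn("url_hash", F.hash(F.col("SOURCEURL")))
articles = articles.withColumn("url_hash", F.hash(F.col("url")))

enriched = events.join(articles, on="url_hash", how="inner")
enriched.write.mode("overwrite").parquet("s3://example-bucket/gdelt/enriched/")
```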

Processing Performance:

  • 15 min to download, scrape, clean, and process the data
    • Find and tweak good news scrapers
    • Make use of auto-scaling groups
    • Improve the write process to PostgreSQL (see the batched-insert sketch below)
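
The README does not show how the PostgreSQL write path was improved; a common technique is to batch inserts instead of writing row by row. The sketch below uses psycopg2's execute_values as an illustration, with placeholder connection details and table schema.

```python
# Batched insert into PostgreSQL with psycopg2's execute_values, which is
# typically much faster than issuing one INSERT per row. The connection
# parameters and the "articles" table are placeholders.
import psycopg2
from psycopg2.extras import execute_values

rows = [
    ("https://example.com/a", "Title A", "Full article text A"),
    ("https://example.com/b", "Title B", "Full article text B"),
]

conn = psycopg2.connect(host="localhost", dbname="gdelt", user="gdelt", password="...")
with conn, conn.cursor() as cur:
    execute_values(
        cur,
        "INSERT INTO articles (url, title, body) VALUES %s",
        rows,
        page_size=1000,  # send rows to the server in large batches
    )
conn.close()
```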

Slides & Demo

  • Project slides can be found here
  • Dashboard can be found here
  • Jupyter Notebook can be found here (password is insight)
