Data Exploration & Machine Learning, Hands-on

Practical Walkthroughs on Machine Learning, Data Exploration and Insight Finding







GDELT - World Events at Your Fingertips and for Free!





GDELT news globe





The Global Database of Events, Language and Tone (GDELT) Contains Some 40 Years of Worldwide News Sources Begging for Your Questions, Analysis, and Discoveries - Learn How to Maximize Your Querying Potential

This is going to be a quick flyover of GDELT. It’s a phenomenal resource that I don’t think enough data scientists and analysts know about or, if they do know about it, realize how easy it is to work with.

GDELT and BigQuery (BQ) are phenomenal tools that aren’t used enough in my opinion. They’re up there with Google Trends and Trending Searches for coolness.

Google's BigQuery provides free access to the GDELT database along with 1TB of free BigQuery processing every month. This is great and very generous, but keep in mind that 1TB with GDELT on BQ can go fast! So, to make it last, I’ll show you how to query only small subsets of the data using _PARTITIONTIME, along with a cool Chrome plugin a colleague turned me on to that estimates query costs.



What is GDELT?

"The Global Database of Events, Language and Tone is one of the largest datasets on the planet. It is the quantitative database of human society, relying on thousands of news sources from every corner of the globe dating back to 1979." (See https://www.gdeltproject.org/)



"The GDELT 2.0 Event Database is a global catalog of worldwide activities (“events”) in over 300 categories from protests and military attacks to peace appeals and diplomatic exchanges. Each event record details 58 fields capturing many different attributes of the event. The GDELT 2.0 Event Database currently runs from February 2015 to present, updated every 15 minutes and is comprised of 326 million mentions of 103 million distinct events as of February 19, 2016. This dataset uses machine translation coverage of all monitored content in 65 core languages, with a sample of an additional 35 languages hand translated. It also expands upon GDELT 1.0 by providing a separate MENTIONS table that records every mention of each event, along with the offset, context and confidence of each of those mentions."
(See: https://console.cloud.google.com/marketplace/details/the-gdelt-project/gdelt-2-events)



Let's Get Querying!

Here is a very simple query and a simple visualization. Let's pull the longitude and latitude of the latest 1,000 news events in the US:

In [12]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
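import pandas as pd

The query itself is standard SQL to run in the BigQuery console, with the results saved as a CSV file for the plotting step:
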
    SELECT 
        ActionGeo_Lat, 
        ActionGeo_Long 
    FROM 
        `gdelt-bq.gdeltv2.events_partitioned` 
    WHERE 
        _PARTITIONTIME >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 2 day)) 
    AND _PARTITIONTIME <= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)) 
    AND ActionGeo_CountryCode='US' 
    ORDER BY DATEADDED DESC 
    LIMIT 1000
In [13]:
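# load the exported query results (adjust the path to your own download) and plot them
news_geo_df = pd.read_csv('/Users/manuel/Downloads/results-20181108-172523.csv')
plt.scatter(news_geo_df['ActionGeo_Long'], news_geo_df['ActionGeo_Lat'], s=1)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid()
plt.show()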
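
If you would rather skip the CSV download, the same query can be sent from Python. The sketch below is one way to do it, not the method used in this walkthrough. It assumes the google-cloud-bigquery package is installed and that Google Cloud credentials are configured. The helper names build_geo_query and fetch_geo_events, and their parameters, are made up for illustration.

```python
def build_geo_query(country_code="US", start_days_ago=2, end_days_ago=1, limit=1000):
    """Build the lat/long query shown above for a window of daily partitions.

    Hypothetical helper: the parameters are illustrative, not a GDELT or BigQuery API.
    """
    if not (len(country_code) == 2 and country_code.isalpha()):
        raise ValueError("country_code must be a two-letter code such as 'US'")
    start_days_ago, end_days_ago, limit = int(start_days_ago), int(end_days_ago), int(limit)
    if end_days_ago < 0 or start_days_ago < end_days_ago or limit < 1:
        raise ValueError("expected start_days_ago >= end_days_ago >= 0 and limit >= 1")
    # Filtering on _PARTITIONTIME keeps the scan limited to a few daily partitions.
    return f"""
    SELECT ActionGeo_Lat, ActionGeo_Long
    FROM `gdelt-bq.gdeltv2.events_partitioned`
    WHERE _PARTITIONTIME >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL {start_days_ago} DAY))
      AND _PARTITIONTIME <= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL {end_days_ago} DAY))
      AND ActionGeo_CountryCode = '{country_code.upper()}'
    ORDER BY DATEADDED DESC
    LIMIT {limit}
    """


def fetch_geo_events(sql):
    """Run the query and return a pandas DataFrame (needs credentials and network)."""
    # Imported lazily so build_geo_query still works without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumption: default project and credentials are set up
    return client.query(sql).to_dataframe()
```

With credentials in place, news_geo_df = fetch_geo_events(build_geo_query()) would stand in for the pd.read_csv call above, and the plotting lines stay the same.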


_PARTITIONTIME & BigQuery Mate

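Here are two tips to make sure you get to query BQ without depleting your free 1TB too quickly: filter on _PARTITIONTIME, as in the query above, so that only a small subset of the data is scanned, and use the BigQuery Mate Chrome plugin to see an estimated query cost.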
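
BigQuery can also report how much data a query would scan before you run it, through a dry run. This is a different route from the Chrome plugin, sketched here under a few assumptions: the google-cloud-bigquery package is installed, credentials are configured, and the free 1TB is counted as 1 TiB (2**40 bytes). The function names estimate_query_bytes and share_of_free_tier are hypothetical.

```python
FREE_TIER_BYTES = 2 ** 40  # assumption: the monthly free 1TB is counted as 1 TiB


def share_of_free_tier(bytes_processed, free_bytes=FREE_TIER_BYTES):
    """Return the fraction of the monthly free allowance a query would use."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed cannot be negative")
    return bytes_processed / free_bytes


def estimate_query_bytes(sql):
    """Ask BigQuery how many bytes the query would scan, without running it."""
    # Imported lazily so share_of_free_tier works without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumption: default project and credentials are set up
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # a dry run returns immediately
    return job.total_bytes_processed
```

For example, share_of_free_tier(estimate_query_bytes(sql)) gives the share of the month's allowance that one run of sql would consume, which makes it easy to compare a query with and without the _PARTITIONTIME filter.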



Thanks for reading and thanks GDELT!!!

Manuel Amunategui

Author of Monetizing Machine Learning, curator of amunategui.github.io and ViralML