GDELT - World Events at Your Fingertips and for Free!
The Global Database of Events, Language and Tone (GDELT) Contains Some 40 Years of Worldwide News Sources Begging for Your Questions, Analysis, and Discoveries - Learn How to Maximize Your Querying Potential
This is going to be a quick flyover of GDELT. It’s a phenomenal resource that I don’t think enough data scientists and analysts know about or, if they do know about it, they don’t realize how easy it is to work with.
GDELT and BigQuery (BQ) are phenomenal tools that aren’t used enough in my opinion - they're right up there with Google Trends and Trending Searches for coolness.
Google's BigQuery provides free access to the GDELT database along with 1TB of free BigQuery processing every month. This is great and very generous, but keep in mind that 1TB with GDELT on BQ can go fast! So, to make this last, I’ll show you how to query only small subsets of the data using _PARTITIONTIME, and also a cool Chrome plugin a colleague turned me on to that estimates query costs.
What is GDELT?
"The Global Database of Events, Language and Tone is one of the largest datasets on the planet. It is the quantitative database of human society, relying on thousands of news sources from every corner of the globe dating back to 1979." (See https://www.gdeltproject.org/)
"The GDELT 2.0 Event Database is a global catalog of worldwide activities (“events”) in over 300 categories from protests and military attacks to peace appeals and diplomatic exchanges. Each event record details 58 fields capturing many different attributes of the event. The GDELT 2.0 Event Database currently runs from February 2015 to present, updated every 15 minutes and is comprised of 326 million mentions of 103 million distinct events as of February 19, 2016. This dataset uses machine translation coverage of all monitored content in 65 core languages, with a sample of an additional 35 languages hand translated. It also expands upon GDELT 1.0 by providing a separate MENTIONS table that records every mention of each event, along with the offset, context and confidence of each of those mentions."
Let's Get Querying!
Here is a very simple query and a simple visualization. Let's pull the longitude and latitude of the latest 1,000 news events in the US:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
SELECT ActionGeo_Lat, ActionGeo_Long
FROM `gdelt-bq.gdeltv2.events_partitioned`
WHERE _PARTITIONTIME >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
  AND _PARTITIONTIME <= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND ActionGeo_CountryCode = 'US'
ORDER BY DATEADDED DESC
LIMIT 1000
# load exploratory data and plot it
news_geo_df = pd.read_csv('/Users/manuel/Downloads/results-20181108-172523.csv')
plt.scatter(news_geo_df['ActionGeo_Long'], news_geo_df['ActionGeo_Lat'], s=1)
plt.grid()
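If you'd rather skip the CSV download, here is a minimal sketch of running the same query straight from Python. It assumes the `google-cloud-bigquery` client library is installed and that default credentials and a project are already configured. The helper names `build_latest_events_query` and `fetch_latest_events` are my own, purely for illustration.

```python
import re


def build_latest_events_query(country_code="US", limit=1000):
    """Return the 'latest events' SQL from above for one country (hypothetical helper)."""
    # Guard the two values that get spliced into the SQL text.
    if not re.fullmatch(r"[A-Z]{2}", country_code):
        raise ValueError("country_code must be two uppercase letters, e.g. 'US'")
    limit = int(limit)
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"""
SELECT ActionGeo_Lat, ActionGeo_Long
FROM `gdelt-bq.gdeltv2.events_partitioned`
WHERE _PARTITIONTIME >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
  AND _PARTITIONTIME <= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND ActionGeo_CountryCode = '{country_code}'
ORDER BY DATEADDED DESC
LIMIT {limit}
"""


def fetch_latest_events(country_code="US", limit=1000):
    """Run the query and return a pandas DataFrame (needs credentials and network)."""
    # Imported lazily so the query builder works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumption: default credentials and project are set up
    sql = build_latest_events_query(country_code, limit)
    return client.query(sql).to_dataframe()
```

Calling `fetch_latest_events()` should give you a DataFrame with the same two columns as the CSV, so the scatter plot code works unchanged on it.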
_PARTITIONTIME & BigQuery Mate
Here are two tips to make sure you get to query BQ without depleting your free 1TB too quickly:
- Query the partitioned GDELT tables with a _PARTITIONTIME filter, and limit the time range to only what you want to see
- Use a Google Chrome plugin like BigQuery Mate to translate your query size estimates into dollars - there's no mistaking those!
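If you work from Python rather than the browser, a dry run gives you a similar estimate before you spend anything. This is only a sketch under a few assumptions: the `google-cloud-bigquery` library is installed, the free "1TB" is counted as 2**40 bytes, and the price per TiB is something you look up and pass in yourself. All three function names are hypothetical.

```python
FREE_TIER_BYTES = 2 ** 40  # assumption: the monthly free "1TB" counted as 1 TiB


def free_tier_share(bytes_processed):
    """Fraction of the monthly free allowance a query would consume."""
    return bytes_processed / FREE_TIER_BYTES


def estimate_usd(bytes_processed, usd_per_tib):
    """Rough on-demand cost; usd_per_tib must come from the current price list."""
    return bytes_processed / 2 ** 40 * usd_per_tib


def dry_run_bytes(sql):
    """Ask BigQuery how many bytes a query would scan, without running it."""
    # Imported lazily so the arithmetic helpers work without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumption: default credentials and project are set up
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed
```

For example, `free_tier_share(dry_run_bytes(my_sql))` tells you how big a bite a query takes out of the free allowance. Try it with and without the _PARTITIONTIME filter to see the difference.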
Thanks for reading and thanks GDELT!!!