Data Exploration & Machine Learning, Hands-on


"

Recommended free walkthrough, check it out and boost your career:


Boost your data science career

"







Office Automation Part 2 - Using Pre-Trained Word-Embedding Vectors to Categorize the Enron Email Dataset

Practical walkthroughs on machine learning, data exploration and finding insight.






This walkthrough comprises three videos. In this second video, we use pre-trained word-embedding vectors to find clear logical and thematic clusters in the Enron email dataset.


Welcome to Part 2: Using Pre-Trained Word-Embedding Vectors


As you may remember from Part 1, we weren't overwhelmed by the clusters we created with our word2vec efforts. This time, let's be amazed and use pre-trained vectors! Pre-trained vectors are all the rage these days; why do all the heavy lifting when somebody else has already done it for you, and probably done it better?


Just as with image recognition, you can splice your model onto an existing, pre-trained model, like the Inception model (check out Siraj Raval's great video on the topic).




Hi there, this is Manuel Amunategui. If you're enjoying the content, find more at ViralML.com



GloVe: Global Vectors for Word Representation

"GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space."

GloVe is another word-embedding model, similar to word2vec, from Stanford's NLP group. We're going to use one of the trained outputs that they've graciously made available (and there are plenty more out there).

Download the Common Crawl file (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip, and unzip it. Warning: it's over 5 GB unzipped! (There are smaller ones if that's just too much for your equipment.)
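If you'd rather script the download, here is a minimal sketch using only the Python standard library. The URL is the one published on the Stanford GloVe project page, and the paths match the ones used later in this walkthrough; adjust to taste:

import os
import urllib.request
import zipfile

GLOVE_URL = 'http://nlp.stanford.edu/data/glove.840B.300d.zip'
GLOVE_ZIP = 'glove-dataset/glove.840B.300d.zip'

os.makedirs('glove-dataset', exist_ok=True)
if not os.path.exists(GLOVE_ZIP):
    # roughly a 2 GB download; this will take a while
    urllib.request.urlretrieve(GLOVE_URL, GLOVE_ZIP)

# extracts glove-dataset/glove.840B.300d.txt (several GB on disk)
with zipfile.ZipFile(GLOVE_ZIP) as z:
    z.extractall('glove-dataset')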

Intersecting Words between Enron and glove.840B.300d

The first part of the code should be familiar to you. It's the same as in Part 1: we open the Enron dataset and clean it up by removing everything but alphabetic characters.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import email
import numpy as np
import pandas as pd
from tqdm import tqdm

#########################################################
# Load Enron dataset
#########################################################

ENRON_EMAIL_DATASET_PATH = 'enron-dataset/emails.csv'

# load the Enron dataset
emails_df = pd.read_csv(ENRON_EMAIL_DATASET_PATH)
print(emails_df.shape)
emails_df.head()


#########################################################
# Sort out required email features: date, subject, content
#########################################################

# source https://www.kaggle.com/zichen/explore-enron
## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)

# Parse the emails into a list of email objects
messages = list(map(email.message_from_string, emails_df['message']))
emails_df.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails_df[key] = [doc[key] for doc in messages]
# Parse content from emails
emails_df['Content'] = list(map(get_text_from_email, messages))

# keep only Subject and Content for this exercise
emails_df = emails_df[['Date','Subject','Content']]

#########################################################
# change word2vec model to work with Enron emails
#########################################################

# point it to our Enron data set
emails_sample_df = emails_df.copy()

# clean up subject line (regex=True is required on newer pandas versions)
emails_sample_df['Subject'] = emails_sample_df['Subject'].str.lower()
emails_sample_df['Subject'] = emails_sample_df['Subject'].str.replace(r'[^a-z]', ' ', regex=True)
emails_sample_df['Subject'] = emails_sample_df['Subject'].str.replace(r'\s+', ' ', regex=True)

# clean up content
emails_sample_df['Content'] = emails_sample_df['Content'].str.lower()
emails_sample_df['Content'] = emails_sample_df['Content'].str.replace(r'[^a-z]', ' ', regex=True)
emails_sample_df['Content'] = emails_sample_df['Content'].str.replace(r'\s+', ' ', regex=True)

# create sentence list 
emails_text = (emails_sample_df["Subject"] + ". " + emails_sample_df["Content"]).tolist()

sentences = ' '.join(emails_text)
words = sentences.split()

print('Data size', len(words))
 

# get unique words and map to glove set
print('Unique word count', len(set(words))) 
 

# drop rare words
vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in tqdm(words):
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)

del words  
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])



Now that we have our Enron emails cleaned and ready to go, let's load the GloVe data set and extract the word-embedding vector for every intersecting word. A word of caution: normally you would not want to prune out all non-intersecting words, as you would lose interesting word relationships. Here, we only need a handle on the Enron data set, so keeping only intersecting words will drastically reduce the data, speed up clustering operations, and still reveal the prominent themes in the email set.

####################################################################
# find matches with glove 
####################################################################
GLOVE_DATASET_PATH = 'glove-dataset/glove.840B.300d.txt'

embeddings_index = {}
f = open(GLOVE_DATASET_PATH, encoding='utf-8')
word_counter = 0
for line in tqdm(f):
  # rsplit guards against the handful of GloVe tokens that contain spaces
  values = line.rstrip().rsplit(' ', 300)
  word = values[0]
  if word in dictionary:
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
  word_counter += 1
f.close()

print('Found %s word vectors matching enron data set.' % len(embeddings_index))
print('Total words in GloVe data set: %s' % word_counter)
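Before clustering, a quick sanity check that the vectors behave as advertised doesn't hurt. Here is a minimal sketch using cosine similarity; the word choices are just assumptions of mine, so substitute any words that made it into embeddings_index:

# sanity check: related words should score a higher cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# example words (assumptions, not from the original post)
if all(w in embeddings_index for w in ('gas', 'oil', 'church')):
    print('gas vs oil:   ', cosine_similarity(embeddings_index['gas'], embeddings_index['oil']))
    print('gas vs church:', cosine_similarity(embeddings_index['gas'], embeddings_index['church']))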



And now let's cluster these word-embedding vectors so we can understand some of the overarching themes contained in this dataset:

#########################################################
# Check out some clusters
#########################################################

# create a dataframe from the embedded vectors, one row per key word
enron_dataframe = pd.DataFrame(embeddings_index).T

# see what the clustering learns and look at top groups to pull out major themes in the data
CLUSTER_SIZE = 300
# cluster the vectors and investigate the top groups
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=CLUSTER_SIZE)
labels = kmeans.fit_predict(enron_dataframe)

cluster_frequency = collections.Counter(labels)
print(cluster_frequency)
cluster_frequency.most_common()

# map each cluster id to the list of words it contains
clusters = {}
for label, word in zip(labels, enron_dataframe.index):
    clusters.setdefault(label, []).append(word)

for k, v in cluster_frequency.most_common(100):
    print('\n\n')
    print('Cluster:', k)
    print(' '.join(clusters[k]))
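It is also handy to go the other way and ask which cluster a particular word landed in. A minimal helper, assuming the labels array and enron_dataframe from above (the lookup word is just an example):

# hypothetical helper: find the k-means cluster id for a given word
def cluster_for_word(word):
    if word not in enron_dataframe.index:
        return None
    return labels[enron_dataframe.index.get_loc(word)]

print(cluster_for_word('anthrax'))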

Let's analyze a few of these clusters:

It easily pulls out groups of names: surnames, female first names, Hispanic names, and so on:

Cluster: 33
adams allen anderson armstrong barnes barton bennett bentley bradley brooks bryant burton campbell carter clark cole collins cooper crawford curtis daniels davis douglas duncan dwight edwards ellis evans fisher forbes franklin gibson gilbert gordon graham greene griffin hamilton harris harrison hayes holmes howard hudson hughes jackson johnson jones jordan jr kennedy lamar lewis lloyd marshall matthews maxwell mcdonald meyer miller mitchell moore morgan morris morrison murphy murray nelson norman owens palmer parker penn perry peterson phillips powell reynolds rhodes richardson robinson rogers ross russell scott shaw simmons smith stanley stevens stewart sullivan taylor thompson troy tyler tyson wallace walt warner warren webb whitney williams willis wilson winston wright

Cluster: 56
alice alicia alison amanda amy angela angie ann anna anne ashley audrey beth betty brenda carmen carol caroline carrie cheryl christina christine cindy claire courtney cynthia danielle dee denise diana diane ellen emily erica erin eva fiona gina heather heidi holly jackie jamie jane janet jen jennifer jenny jess jessica jill julia julie karen kate katie katy kelly kim kimberly kristen kristin laura lauren lilly linda lindsay lisa liz louise lucy lynn maggie marie megan melanie melissa michele michelle molly mrs natalie nicki nicole paige pamela rachel rebecca renee riley sally samantha sandra sara sarah sharon simpson stephanie tara tiffany tina tori vanessa wendy

Cluster: 241
acosta aguilar alberto aldo alejandra alejandro alessandro alfonso almeida alonso alvarez alvaro alves andrade andres angelo armando arroyo arturo avila ayala baez barrera barrios batista beatriz beltran benitez bianchi cabrera calderon campos carlos carrillo caruso castillo castro cesar chavez claudio conde condit cortez cristian cristina cruz daniele dante delgado diaz dolores dominguez dona duarte duran eduardo elias emilio enrico enrique erick ernesto escobar espinoza esteban estrada fabian fabio federico felipe felix fernanda fernandes fernandez fernando...

Other clusters have clear themes, such as food, government and politics, and US cities:

Cluster: 264
apple bell berry bowl cheese cherry chicken chile chili chips cob cook cookies cooking corn cream cup curry dip dry extract fat fresh frozen fruit ginger honey hot juice lamb milk miso mix mixed nuggets nuts pan paste patty pie pinto platter pound powder pulp raw recipe rice salt scoop sherry sugar sweet taste turkey

Cluster: 174
abroad administration affairs agenda citizen citizens civil coalition constitution corruption countries country domestic embassy foreign gov government governments govt immigration independence justice leaders military minister minority nation nations opposition overseas policies policy political politically politicians protest protests public reform reforms regime relations republic security society terrorism union unions

Cluster: 62
albany atl atlanta austin baltimore birmingham boston charlotte chicago cincinnati cleveland columbus dallas dayton denver detroit dfw houston indianapolis jacksonville lauderdale louisville miami milwaukee minneapolis myers nashville omaha orlando orleans philadelphia pittsburgh portland raleigh richmond rochester seattle tampa tulsa tx

Scary words:

Cluster: 138
anthrax attack attacks blackout blackouts blamed cartel collapse cyber danger debacle devastated dire disaster emergencies explosion fears flood gouging halt imminent meltdown outage outages shortage shortages strike strikes terrorist threat threaten threatened threatening threatens threats toll warn warned warning warnings warns widespread woes worst

Spiritual words:

Cluster: 196
admonition afflictions almighty altar angelic angels anointed behold believer believers believing bible biblical bless blessed blessing blessings brethren brotherhood christ church churches commandments confess confession congregation counsels covenant creed deeds denomination denominations devotional disciples divine divinity dwell elders eternal eternity faith faithful faithfully fasting fellowship followers forgiven forgiveness fulfilled gentile glory god gods gospel grace hearts heaven heavenly heavens holy intercede jesus mankind mercy ministry miracle miracles miraculous missionary monks mourn obedience obey ordination orthodox pagan pardon parish pastor pastoral peace praise pray prayed prayer prayers praying preach preached preacher preaching priest priests proclamation prophecy prophet prophetic providence psalm pulpit recite reconciled reformation rejoice remembrance resurrection revelation righteous sacred sacrifice sacrificed sacrifices saints salvation satan savior scripture sect sermon servant servants shalt sin sins souls spiritually testimonies thee thou thy tidings unto verse verses vow vows witness worship wrath ye yourselves

News agencies:

Cluster: 3
abcnews accuweather acma ajc akamai amzn ballmer biogen bme bmps breakingnews britannica businessweek businesswire cheatsheet cheatsheets cios citysearch clearchannel clickz climo computerworld corbis customerservice dateline denverpost deseret dilbert echostar egm ejournal eletter elocity enews enewsletter enquirer enso eonline epicurious factbox fastcompany findlaw foxsports freestuff gannett gazeta genentech goodmorning ibd idg ifilm infospace inquirer intc intelligencer iptv khou ksl latimes looksmart mailbag marketwatch marketwire mckesson mnf msft nber newsday newsdesk newshour newsmax newsticker newsweek newswire newswires nightline northcoast nrdc ntsb nws nyg nytimes ocregister oklahoman oregonian pressreleases prnewswire proflowers razorfish rcn realnetworks reuters rrs saic sepracor sfgate smartmoney sportsline ssrn stockwatch technet techrepublic tennessean theatlantic theeconomist theonion thestreet topix travelzoo twc twx udate upi upn usatoday usnews washingtonpost weatherbug webi wec weekley wgbh wmo wsi wsj wspa xinhua xom zacks zagat zdnet
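The post's title promises categorization, so here is one simple way to close the loop. This is a sketch of mine, not from the video: score each email by how many of its words fall into each cluster and tag it with the dominant one, reusing the emails_text, labels, and enron_dataframe objects from above.

# sketch: tag each email with its dominant word cluster (not from the original post)
word_to_cluster = dict(zip(enron_dataframe.index, labels))

def dominant_cluster(text):
    # count how many of the email's words fall in each cluster
    counts = collections.Counter(
        word_to_cluster[w] for w in text.split() if w in word_to_cluster)
    return counts.most_common(1)[0][0] if counts else None

# categorize the first few cleaned emails
for doc in emails_text[:5]:
    print(dominant_cluster(doc))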

Manuel Amunategui - Follow me on Twitter: @amunategui