This walkthrough consists of three videos:
Part 1 of 3: word2vec on Tensorflow, modeling the Enron Email Dataset. We'll clean up the emails, model them with the word2vec skip-gram model, and cluster the resulting embeddings to discover themes.
Welcome to Part 1: Applying word2vec to the Enron dataset
word2vec probably doesn’t need an introduction, as it’s been around for a few years and is extremely popular in the NLP community. In a nutshell, it is a straightforward single-layer neural net: you feed it large bodies of text and it models complex word relationships. For more information on the model, check out one of the original papers: Distributed Representations of Words and Phrases and their Compositionality.
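To make the skip-gram idea concrete, here is a rough illustrative sketch (not the tutorial's code) of how a sentence is turned into (center, context) training pairs within a sliding window — the pairs the network then learns to predict:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for each word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
# → [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#    ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

The real tutorial batches these pairs and trains with noise-contrastive estimation, but the windowing logic is the core of the skip-gram formulation.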
word2vec Tutorial for Tensorflow
For information on word2vec’s Tensorflow tutorial, please visit Vector Representations of Words; meanwhile, go to the GitHub repo to copy and run the code in word2vec_basic.py (this assumes you have Tensorflow up and running).
Here is the final logged loss along with the nearest-word lists:
Average loss at step 100000 : 4.69704249334
Nearest to also: which, now, often, still, not, pulau, generally, it,
Nearest to on: in, through, upon, at, against, under, constituci, canaris,
Nearest to this: it, which, the, that, another, nn, balboa, some,
Nearest to new: bowled, equivalents, agouti, dasyprocta, rutger, hebrews, cardiomyopathy, cabins,
Nearest to or: and, agouti, while, than, but, operatorname, abet, thaler,
Nearest to all: many, some, these, agouti, except, mortimer, session, kifl,
Nearest to however: but, where, although, though, that, thibetanus, while, and,
Nearest to often: sometimes, commonly, widely, usually, also, generally, there, now,
Nearest to with: in, between, using, by, when, from, michelob, circ,
Nearest to can: may, would, could, will, should, must, might, cannot,
Nearest to most: more, some, less, many, kvac, bolo, several, callithrix,
Nearest to if: when, where, cannot, ursus, though, although, operatorname, thaler,
Nearest to use: balboa, albury, arrival, callithrix, adaptation, ibelin, marlow, warhead,
Nearest to war: alligator, dasyprocta, operatorname, heh, bomis, riots, subgenres, bluetooth,
Nearest to during: in, after, at, from, ursus, circ, despite, callithrix,
Nearest to may: can, would, will, could, must, might, should, upanija,
Some word relationships are good, like ‘can: may, would, could, will, should, must, might, cannot’; others, not so much. These results use the Text8 corpus — let’s try it with the Enron Email dataset.
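The "Nearest to" lists above are produced by ranking vocabulary words by cosine similarity to a query word's embedding. Here is a minimal sketch of that lookup with NumPy, using a tiny hypothetical embedding matrix rather than the trained model:

```python
import numpy as np

# Hypothetical toy embeddings: one row per word (not the trained model's values).
vocab = ["can", "may", "would", "banana"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
])

def nearest(word, k=2):
    """Return the k vocabulary words most cosine-similar to `word`."""
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[vocab.index(word)]
    order = np.argsort(-sims)  # descending similarity
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(nearest("can"))  # → ['may', 'would']
```

The tutorial does the same thing on the normalized embedding matrix at evaluation steps, which is where the logged "Nearest to" lines come from.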