Analyze Classic Works of Literature from Around the World with Project Gutenberg and R



Project Gutenberg is an incredible resource for anybody interested in accessing classic works of literature from around the world in digital format. It is free and powered by thousands of volunteers (if you find it useful, there is a donation button on their landing page). From the site:

Project Gutenberg offers over 50,000 free ebooks: choose among free epub books, free kindle books, download them or read them online.

Beyond enjoying all the great classics, it’s an enormous boon to those working in Natural Language Processing (NLP). It can be used to compare writing styles through the ages, vocabulary choices, gender balance, punctuation usage, etc…

The Tragedy of Romeo and Juliet

Let’s see how it works by downloading ‘The Tragedy of Romeo and Juliet’.

First, go to the main page and enter ‘Shakespeare’ in the top right search box. It will return many books by and about Shakespeare. Click on The Tragedy of Romeo and Juliet.


There are many file types available, and if you have a preference for a particular format, go ahead. I keep things simple and rely on the plain text format.

Click on ‘Plain Text UTF-8’ and the entire book should be readable in plain text format in your browser.


Now, let’s see how we can do this programmatically. Copy the link in the URL address bar (http://www.gutenberg.org/cache/epub/1112/pg1112.txt). There are different ways of downloading this data; a simple way is to use the readLines function, which can handle URL paths and returns a character vector with one element per line of text.

romeo_juliet <- readLines("http://www.gutenberg.org/cache/epub/1112/pg1112.txt") 
class(romeo_juliet)
## [1] "character"

length(romeo_juliet)
## [1] 4853
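
Because every readLines call on the URL fetches the book again, it can help to keep a local copy while experimenting. Here is a minimal sketch, assuming a hypothetical helper named read_cached_lines and a local file name of your choosing (neither is part of the original walkthrough):

```r
# Hypothetical helper: download the file only if no local copy exists yet,
# then read the local copy line by line.
read_cached_lines <- function(url, path) {
  if (!file.exists(path)) {
    download.file(url, destfile = path, quiet = TRUE)
  }
  readLines(path, encoding = "UTF-8")
}

# Offline demonstration: the cache file already exists, so nothing is downloaded.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), demo_path)
demo_lines <- read_cached_lines("http://example.invalid/book.txt", demo_path)
length(demo_lines)

# Intended use with the link from above (the file name is an assumption):
# romeo_juliet <- read_cached_lines(
#   "http://www.gutenberg.org/cache/epub/1112/pg1112.txt", "pg1112.txt")
```

With a helper like this, rerunning the script reads from disk instead of hitting the site each time.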


We’ve successfully imported 4853 lines of text. But we need to cut off some editorial data surrounding the actual play. The easiest way to tackle this is to print the first 100 lines of text:

# not showing all lines to save space
# head(romeo_juliet,100)

print(romeo_juliet[63:69])

## [63] ""                                                                       
## [64] "1595"                                                                   
## [65] ""                                                                       
## [66] "THE TRAGEDY OF ROMEO AND JULIET"                                        
## [67] ""                                                                       
## [68] "by William Shakespeare"                                                 
## [69] ""    


So the play really starts at line 64, and we need to remove everything before that. At the tail end, the story ends with THE END (this may be different for different books and requires manual inspection):

grep( "THE END", romeo_juliet)
## [1] 4484

Let’s prune our download so that it only contains data from line 64 to line 4484:

romeo_juliet <- romeo_juliet[64:4484]
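
The hard-coded line numbers above only work for this particular file. As a rough sketch, the same trimming could be driven by marker strings instead. The helper name trim_to_markers is hypothetical, and the markers still have to be chosen by inspecting each book:

```r
# Hypothetical helper: keep everything from the first line containing
# start_marker to the last line containing end_marker (both inclusive).
trim_to_markers <- function(lines, start_marker, end_marker) {
  start_hits <- grep(start_marker, lines, fixed = TRUE)
  end_hits <- grep(end_marker, lines, fixed = TRUE)
  if (length(start_hits) == 0 || length(end_hits) == 0) {
    stop("marker not found; inspect the text and pick different markers")
  }
  first <- start_hits[1]
  last <- end_hits[length(end_hits)]
  if (last < first) {
    stop("end marker appears before start marker")
  }
  lines[first:last]
}

# Toy input standing in for a downloaded book
toy_book <- c("site header", "license text", "THE PLAY",
              "line one", "line two", "THE END", "site footer")
trimmed <- trim_to_markers(toy_book, "THE PLAY", "THE END")
print(trimmed)
```

For the play itself, "THE END" is the end marker found above. A suitable start marker is an assumption you would confirm by printing the first lines, as we did earlier.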



And that’s how you find and download books - first search for the book on the Project Gutenberg page, get to the text link, copy it into a readLines function in R, download it, and finally trim the editorial elements. After that… the sky is the limit.

Note: The site tracks usage and may limit access if it goes beyond their accepted usage settings.

For example:

Let’s bring all our data into one big blob of text:

romeo_juliet_blob <- paste(romeo_juliet, sep="",collapse=" ")

Let’s remove all punctuation and create a vector of words:

romeo_juliet_words <- gsub(x=romeo_juliet_blob,pattern= "[[:punct:]]", replacement = ' ')   
romeo_juliet_words <- strsplit(romeo_juliet_words, ' ')
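
Once the text is a vector of words, a frequency table is a natural next step. The sketch below is one way to build it on a toy string. The function name word_frequencies is an assumption, and unlike the code above it lowercases the text and drops the empty strings that splitting on a single space leaves behind:

```r
# Hypothetical helper: lowercase, strip punctuation, split on whitespace,
# and count how often each word occurs (most frequent first).
word_frequencies <- function(text) {
  cleaned <- gsub("[[:punct:]]", " ", tolower(text))
  words <- unlist(strsplit(cleaned, "\\s+"))
  words <- words[words != ""]
  sort(table(words), decreasing = TRUE)
}

# Toy input, not the full play
toy_text <- "Romeo, Romeo! Wherefore art thou Romeo?"
freq <- word_frequencies(toy_text)
print(freq)
```

Applied to romeo_juliet_blob, the same function would rank every word in the play by how often it appears.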



Flyby Analysis

How many times were the words Romeo and Juliet used?

print('Romeo')
## [1] "Romeo"

sum(grepl(romeo_juliet_words[[1]], pattern='romeo', ignore.case = TRUE))
## [1] 156

print('Juliet')
## [1] "Juliet"

sum(grepl(romeo_juliet_words[[1]], pattern='juliet', ignore.case = TRUE))
## [1] 66

Interesting, right? Romeo was mentioned more than twice as often as Juliet.

How many times was the word man used?

sum(grepl(romeo_juliet_words[[1]], pattern='man', ignore.case = TRUE))
## [1] 176


And woman?

sum(grepl(romeo_juliet_words[[1]], pattern='woman', ignore.case = TRUE))
## [1] 9

Really? Man was mentioned nearly 20 times more often than woman?


But that isn’t fair: the grepl function matches man on its own but also inside other words, including woman. Thankfully, we can force grepl to match only the whole word man by wrapping the pattern in the word-boundary markers \< and \> (written \\< and \\> inside an R string):

sum(grepl(romeo_juliet_words[[1]], pattern='\\<man\\>', ignore.case = TRUE))
## [1] 77
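
To make the difference concrete, here is a small sketch on a made-up vector of words (the toy data is an assumption, not taken from the play):

```r
# Toy word vector: only "man" and "Man" are the whole word
toy_words <- c("man", "woman", "Romans", "Man", "gentleman", "manly")

# Substring match: counts every element that contains "man" anywhere
substring_count <- sum(grepl("man", toy_words, ignore.case = TRUE))

# Whole-word match: \\< and \\> anchor the pattern to word boundaries
whole_word_count <- sum(grepl("\\<man\\>", toy_words, ignore.case = TRUE))

# Since each element is already a single word, exact comparison also works
exact_count <- sum(tolower(toy_words) == "man")

print(c(substring = substring_count,
        whole_word = whole_word_count,
        exact = exact_count))
```

The last variant avoids regular expressions entirely, which may be simpler when every element of the vector is a single word.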


Similarly, we can count words synonymous with woman by putting multiple patterns in grepl, separated by the | symbol:

sum(grepl(romeo_juliet_words[[1]], pattern='woman|women|lady', ignore.case = TRUE))
## [1] 94
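
The combined pattern gives one total. To see how much each term contributes, a sketch like the following counts each term separately as a whole word. The helper name count_terms and the toy data are assumptions for illustration:

```r
# Hypothetical helper: whole-word, case-insensitive count for each term
count_terms <- function(words, terms) {
  vapply(terms, function(term) {
    pattern <- paste0("\\<", term, "\\>")
    sum(grepl(pattern, words, ignore.case = TRUE))
  }, integer(1))
}

# Toy word vector, not the play
toy_words <- c("woman", "Women", "lady", "Lady", "ladybird", "man")
counts <- count_terms(toy_words, c("woman", "women", "lady"))
print(counts)
```

Because of the word boundaries, "ladybird" is not counted as "lady" here, whereas the plain alternation pattern above would count it.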

The point here is that it’s trivially easy to run some analysis of literature using Project Gutenberg.


How many sentences are in The Tragedy of Romeo and Juliet?

The R language has some great NLP packages that can help us build a sentence tokenizer and split our text into actual sentences.


Instead, we’re going to cheat and use a simple trick to break our data into sentences that I used in a previous blog entry. This won’t be as precise as a real sentence tokenizer, but it is package-free and fast. We replace the characters !, ?, ., and ; with an alphabet-based code, ootoo. Then we remove all remaining punctuation and finally split our data on the ootoo code:

# create a blob of text like we did earlier
romeo_juliet_sentences <- paste(romeo_juliet, sep="",collapse=" ")

# swap all sentence ends with code 'ootoo'
romeo_juliet_sentences <- gsub(pattern=';|\\.|!|\\?', x=romeo_juliet_sentences, replacement='ootoo')

# remove all punctuation
romeo_juliet_sentences <- gsub(x=romeo_juliet_sentences,pattern= "[[:punct:]]", replacement = ' ')  

# remove contiguous spaces
romeo_juliet_sentences <- gsub(pattern="\\s+", x=romeo_juliet_sentences, replacement=' ')

# split sentences by split code
romeo_juliet_sentences <- unlist(strsplit(x=romeo_juliet_sentences, split='ootoo',fixed = TRUE))

# remove empty elements (when there are multiple points or question marks)
romeo_juliet_sentences <- romeo_juliet_sentences[romeo_juliet_sentences != ""]


How many sentences are in The Tragedy of Romeo and Juliet?

length(romeo_juliet_sentences)
## [1] 3751
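
As a variation on the trick above, the split can also be done directly on the sentence-ending characters, without a placeholder code. This is only a sketch: the function name split_sentences is hypothetical, and because it also drops fragments that contain nothing but whitespace, its count on the play may differ from the one above:

```r
# Hypothetical helper: split on runs of . ! ? ; then tidy each fragment
split_sentences <- function(text) {
  parts <- unlist(strsplit(text, "[.!?;]+"))
  # replace any remaining punctuation with spaces
  parts <- gsub("[[:punct:]]", " ", parts)
  # collapse repeated whitespace and trim both ends
  parts <- trimws(gsub("\\s+", " ", parts))
  # drop fragments that ended up empty
  parts[parts != ""]
}

# Toy input, not the play
toy_text <- "Hello there! How are you? Fine; thanks..."
toy_sentences <- split_sentences(toy_text)
print(toy_sentences)
length(toy_sentences)
```

Like the original approach, this treats every semicolon and period as a sentence end, so abbreviations and ellipses will still be split imperfectly.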




A special thanks to Lucas A. for the Shakespeare portrait!

Full source code:

romeo_juliet <- readLines("http://www.gutenberg.org/cache/epub/1112/pg1112.txt") 
class(romeo_juliet)
length(romeo_juliet)

# not showing all lines to save space
# head(romeo_juliet,100)
print(romeo_juliet[63:69])
grep( "THE END", romeo_juliet)

# remove unnecessary text
romeo_juliet <- romeo_juliet[64:4484]

# create a text blob
romeo_juliet_blob <- paste(romeo_juliet, sep="",collapse=" ")

# remove all punctuation and create a vector of words:
romeo_juliet_words <- gsub(x=romeo_juliet_blob,pattern= "[[:punct:]]", replacement = ' ')   
romeo_juliet_words <- strsplit(romeo_juliet_words, ' ')

# How many times were the words Romeo and Juliet used:
print('Romeo')
sum(grepl(romeo_juliet_words[[1]], pattern='romeo', ignore.case = TRUE))
print('Juliet')
sum(grepl(romeo_juliet_words[[1]], pattern='juliet', ignore.case = TRUE))

# How many times was the word ``man`` used?
sum(grepl(romeo_juliet_words[[1]], pattern='man', ignore.case = TRUE))

# And ``woman``?
sum(grepl(romeo_juliet_words[[1]], pattern='woman', ignore.case = TRUE))

sum(grepl(romeo_juliet_words[[1]], pattern='\\<man\\>', ignore.case = TRUE))
sum(grepl(romeo_juliet_words[[1]], pattern='woman|women|lady', ignore.case = TRUE))

# create a blob of text like we did earlier
romeo_juliet_sentences <- paste(romeo_juliet, sep="",collapse=" ")

# swap all sentence ends with code 'ootoo'
romeo_juliet_sentences <- gsub(pattern=';|\\.|!|\\?', x=romeo_juliet_sentences, replacement='ootoo')
 
# remove all punctuation
romeo_juliet_sentences <- gsub(x=romeo_juliet_sentences,pattern= "[[:punct:]]", replacement = ' ')  

# remove contiguous spaces
romeo_juliet_sentences <- gsub(pattern="\\s+", x=romeo_juliet_sentences, replacement=' ')

# split sentences by split code
romeo_juliet_sentences <- unlist(strsplit(x=romeo_juliet_sentences, split='ootoo',fixed = TRUE))

# remove empty elements (when there are multiple points or question marks)
romeo_juliet_sentences <- romeo_juliet_sentences[romeo_juliet_sentences != ""]

# How many sentences are in ``The Tragedy of Romeo and Juliet``?

length(romeo_juliet_sentences)