Beyond enjoying all the great classics, it’s an enormous boon to those working in Natural Language Processing (NLP). It can be used to compare writing styles through the ages, vocabulary choices, gender balance, punctuation usage, and more.
First, go to the main page and enter ‘Shakespeare’ in the top right search box. It will return many books by and about Shakespeare. Click on ‘The Tragedy of Romeo and Juliet’.
There are many file types available, and if you have a preference for a particular format, go ahead and use it. I keep things simple and rely on the plain text format.
Click on ‘Plain Text UTF-8’ and the entire book should be readable in plain text format in your browser.
Now, let’s see how we can do this programmatically. Copy the link from the URL address bar (http://www.gutenberg.org/cache/epub/1112/pg1112.txt). There are different ways of downloading this data; a simple way is to use readLines, which can handle URL paths and splits the text into lines on the newline character.
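Here is a minimal sketch of the readLines approach in R. The helper name read_book and the small sample file are my own assumptions for illustration; the sample file stands in for the Gutenberg text so the snippet runs without a network connection.

```r
# Hypothetical helper (not from the original post): read a plain-text book
# into a character vector, one element per line.
# readLines() accepts either a local file path or a URL.
read_book <- function(path_or_url) {
  readLines(path_or_url, encoding = "UTF-8", warn = FALSE)
}

# Offline stand-in for the Gutenberg file, so the sketch runs anywhere.
sample_file <- tempfile(fileext = ".txt")
writeLines(
  c("THE TRAGEDY OF ROMEO AND JULIET", "", "by William Shakespeare"),
  sample_file
)

book_lines <- read_book(sample_file)
length(book_lines)   # number of lines read
head(book_lines, 3)  # peek at the first few lines

# With network access, the same call should work on the URL from the text:
# book_lines <- read_book("http://www.gutenberg.org/cache/epub/1112/pg1112.txt")
```

Because each line becomes one element of the vector, blank lines in the book show up as empty strings, which you may want to filter out before any analysis.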