Open Source Shakespeare (in MySQL)
My good friend Richard Brent has often complained that my blog has very little Shakespeare content. Despite the domain name, I don't think I've ever blogged about The Big S. For shame! Fear not, my Brentish-Boy, this post is all about Shakespeare. And MySQL....
Ahem...
When I first started shkspr.mobi it was intended to be an easy way to get Shakespeare on your phone. At that time, there were no mobile-formatted texts of his plays and sonnets, so I had to create them. Finding Shakespeare's works in a suitable format for conversion wasn't too hard, but it meant lots of crufty code to read text files line-by-line. Yuck.
A few years later, I stumbled across Open Source Shakespeare. The project grew out of Eric Johnson's MA thesis. It's a remarkably good idea with only one minor problem. The database it uses is Microsoft Access.
MS Access, as a database, could best be described as
deformed, crooked, old and sere, ill faced, worse bodied, shapeless everywhere, vicious, ungentle, foolish, blunt, unkind, stigmatical in making, worse in mind
(Comedy of Errors, Act IV, Scene II)
There are a few Open Source Shakespeare projects on GitHub, but they don't seem very practical.
So, naturally, I've decided to create my own version of Shakespeare's works - in MySQL :-)
This is what it looks like:
You can download it from GitHub.
I've stripped out a lot of the extraneous stuff from the original version - word counts, etc. So it should be a fairly lean database which is easy to use. I'm not a database professional, so I would be grateful if you could suggest any improvements, either using this blog's comment form or on GitHub.
There are four tables:
Paragraphs
This is where the main body of text is. A typical row will look like this:
- WorkID: hamlet
- ParagraphID: 639015
- ParagraphNum: 3427
- CharID: hamlet
- PlainText: Has this fellow no feeling of his business, that he sings at\ngrave-making?
- Act: 5
- Scene: 1
Works
This is what translates the "WorkID" into something human-readable, plus some extra metadata:
- WorkID: hamlet
- Title: Hamlet
- LongTitle: Tragedy of Hamlet, Prince of Denmark, The
- Date: 1600
- GenreType: Tragedy
Characters
This is what translates the CharID into a human-readable name and description:
- CharID: hamlet
- CharName: Hamlet
- Abbrev: Ham
- Works: Tragedy of Hamlet, Prince of Denmark, The
- Description: son of the former king and nephew to the present king
Chapters
This gives the setting for each Act and Scene.
- WorkID: hamlet
- ChapterID: 18893
- Act: 5
- Scene: 1
- Description: Elsinore. A churchyard.
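As a rough illustration of how the four tables fit together, here is a minimal sketch that loads the sample rows above and joins them back into a readable line. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The column names come from the sample rows; the column types and keys are my assumptions, not the real schema.

```python
import sqlite3

# SQLite stands in for MySQL here so the sketch runs anywhere.
# Column names are taken from the sample rows; types and keys are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Works (WorkID TEXT PRIMARY KEY, Title TEXT, LongTitle TEXT,
                    Date INTEGER, GenreType TEXT);
CREATE TABLE Characters (CharID TEXT PRIMARY KEY, CharName TEXT, Abbrev TEXT,
                         Works TEXT, Description TEXT);
CREATE TABLE Chapters (WorkID TEXT, ChapterID INTEGER PRIMARY KEY,
                       Act INTEGER, Scene INTEGER, Description TEXT);
CREATE TABLE Paragraphs (WorkID TEXT, ParagraphID INTEGER PRIMARY KEY,
                         ParagraphNum INTEGER, CharID TEXT, PlainText TEXT,
                         Act INTEGER, Scene INTEGER);
""")

# One row per table, copied from the examples in this post.
conn.execute("INSERT INTO Works VALUES (?, ?, ?, ?, ?)",
             ("hamlet", "Hamlet", "Tragedy of Hamlet, Prince of Denmark, The",
              1600, "Tragedy"))
conn.execute("INSERT INTO Characters VALUES (?, ?, ?, ?, ?)",
             ("hamlet", "Hamlet", "Ham",
              "Tragedy of Hamlet, Prince of Denmark, The",
              "son of the former king and nephew to the present king"))
conn.execute("INSERT INTO Chapters VALUES (?, ?, ?, ?, ?)",
             ("hamlet", 18893, 5, 1, "Elsinore. A churchyard."))
conn.execute("INSERT INTO Paragraphs VALUES (?, ?, ?, ?, ?, ?, ?)",
             ("hamlet", 639015, 3427, "hamlet",
              "Has this fellow no feeling of his business, that he sings at\ngrave-making?",
              5, 1))

# Fetch every paragraph in one scene, with its speaker, play title and setting.
QUERY = """
SELECT w.Title, p.Act, p.Scene, ch.Description, c.CharName, p.PlainText
FROM Paragraphs AS p
JOIN Works      AS w  ON w.WorkID = p.WorkID
JOIN Characters AS c  ON c.CharID = p.CharID
JOIN Chapters   AS ch ON ch.WorkID = p.WorkID
                     AND ch.Act = p.Act AND ch.Scene = p.Scene
WHERE p.WorkID = ? AND p.Act = ? AND p.Scene = ?
ORDER BY p.ParagraphNum
"""
rows = conn.execute(QUERY, ("hamlet", 5, 1)).fetchall()
for title, act, scene, setting, speaker, text in rows:
    print(f"{title} {act}.{scene} ({setting}) {speaker}: {text}")
```

The same SELECT should carry over to MySQL largely unchanged, since it only uses plain joins and an ORDER BY.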
What's Next?
The next steps for the project are fairly obvious:
- Write some high-level example code to show people how to use the database.
- Make shkspr.mobi a showcase site which runs off the database.
- Fix any bugs and inconsistencies that people find.
You can download the Shakespeare MySQL Database from GitHub.
Eric Johnson says:
Terence, the OSS site itself runs on MySQL, and has since 2003, when I launched the beta version of the site. The download page provides Access and CSV files because those are the most easily-consumed versions of the database. For whatever reason, I've actually never been asked for a mysqldump version of the site -- probably because whenever someone has downloaded the db, they want to use the database in their own personal project, so they'd rather import the data into their own table structure, rather than replicating OSS's.
I'm glad you're finding the database useful, and I'm also glad to see it up on github.
Terence Eden says:
Thanks for the comment - and thank you for creating the original version.
Samuel Pickard says:
Has this fellow no feeling of his business, that he sings at\ngrave-making?
I think that this may be a question for both Terence and Eric then. The text has a new-line character \n in it, which really really annoys me far more than is reasonable. Does this text really need formatting in it? Can't I decide how to word wrap the text?
Terence Eden says:
Hi Samuel,
Great question.
Two points to note,
What one could do is create a separate table which lists where the line breaks should be, then remove them from the text. To be honest, I think it's probably easier for the user to strip out the \n if they're not needed.
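Stripping the line breaks on the client side might look something like this minimal Python sketch. The helper names `unwrap` and `split_lines` are made up for illustration, and it assumes the breaks are stored as ordinary newline characters.

```python
def unwrap(plain_text: str) -> str:
    """Join a paragraph into a single line, collapsing any run of whitespace."""
    return " ".join(plain_text.split())


def split_lines(plain_text: str) -> list[str]:
    """Keep the original line breaks, returning one string per line."""
    return plain_text.splitlines()


sample = "Has this fellow no feeling of his business, that he sings at\ngrave-making?"
print(unwrap(sample))       # one line, free to be word-wrapped by the caller
print(split_lines(sample))  # the original lines, for referencing a specific line
```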
T
Samuel Pickard says:
Good point, I'd not thought of referencing specific lines. \n is much, much better than <br/>
Joseph Haig says:
Thanks for this, which I found via Bill Thompson on Twitter (@billt). One thing I notice immediately is that the sql file doesn't have the table definitions. I can guess more or less how the tables should be created but it would be useful if these could be included. A 'mysqldump' should produce a file including all you need to recreate the database elsewhere.
Joseph Haig says:
... and another thing (now that I have imported your data).
You should avoid multi-valued columns such as the 'Works' column of the 'Characters' table. Instead, have a separate table with two columns: CharId and WorkId. This will make it much easier to extract the data.
I am a database professional (of sorts) but not a great expert on Shakespeare.
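The two-column table suggested above could be sketched roughly as follows, again with Python's sqlite3 standing in for MySQL. The table name `CharacterWorks` is hypothetical, and the `horatio` row and the small paragraph IDs are invented sample data. Rather than parsing the free-text 'Works' column, whose separator is not documented here, this sketch fills the new table from the CharID and WorkID pairs that already appear in Paragraphs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down Paragraphs table: only the columns this sketch needs.
CREATE TABLE Paragraphs (ParagraphID INTEGER PRIMARY KEY, WorkID TEXT, CharID TEXT);
-- Hypothetical junction table: one row per (character, work) pair.
CREATE TABLE CharacterWorks (
    CharID TEXT NOT NULL,
    WorkID TEXT NOT NULL,
    PRIMARY KEY (CharID, WorkID)
);
""")

# Illustrative rows: only 639015 comes from the post, the rest are made up.
conn.executemany("INSERT INTO Paragraphs VALUES (?, ?, ?)", [
    (639015, "hamlet", "hamlet"),
    (1, "hamlet", "hamlet"),
    (2, "hamlet", "horatio"),
])

# Each distinct speaker/play pairing becomes exactly one junction row.
conn.execute("""
INSERT INTO CharacterWorks (CharID, WorkID)
SELECT DISTINCT CharID, WorkID FROM Paragraphs
""")

pairs = conn.execute(
    "SELECT CharID, WorkID FROM CharacterWorks ORDER BY CharID"
).fetchall()
print(pairs)
```

One caveat with this approach: a character who is listed for a play but never speaks would have no row in Paragraphs, so would be missing from the new table.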
Andy Mabbett says:
Sir Tim Berners-Lee proposes "5 Stars of Linked Open Data", the last of which is "link your data to other data to provide context". Accordingly, I'd suggest you add a line (or lines) to your "Works" table, with the URIs of, say, the equivalent English Wikipedia articles, and/or, their DBPedia (data) equivalents.
Wikipedia has a Shakespeare bibliography whose list of links you may find useful. Lists of links for male Shakespearean characters and female Shakespearean characters are also available.
Andy Mabbett says:
Since I wrote that, Wikidata - a linked, open data repository sitting alongside Wikipedia - has become available.
So now you can use Wikidata URIs:
The Merchant of Venice == https://www.wikidata.org/entity/Q206400
and not just for works:
Lady Macbeth == https://www.wikidata.org/entity/Q2454065
Richard Morrison says:
I've been tinkering with this data and have tidied it up a bit (according to my own preferences and those of Ruby on Rails).
Not a finished work, but something to be built upon, perhaps.
See https://github.com/edent/Open-Source-Shakespeare/issues/1
Richard Morrison says:
Did a bit more on this - it's fun!
Romeo just edges Juliet in the stats for "Romeo and Juliet". Can you guess who is next with 10.7% of the lines (paragraphs)?
http://bardofavon.herokuapp.com/works/34/characters