Exploring Sherlock Holmes Stories and Topic Modeling

When using Mallet for topic modeling, I was surprised by how quickly the tool sorted through over two thousand text elements. I assumed it would take much longer to run 1000 iterations. After playing with the number of topics, I decided to pick three categories from each of four HTML outputs, which contained 75, 50, 30, and 15 topics. Though I started with 50 topics and moved upward, I found that a list of more than 100 topics was a lot to sift through: it was visually overwhelming, and some of the general themes began repeating past 100. My twelve selected topics are undoubtedly representative of Sherlock Holmes’ world. The topics I’ve chosen to label include:

Found Corpse       Crime in London        Murder            Bedroom

Baker Street         Holmes in his Room       Attack       Investigation

Examine         Sudden          House        Holmes Sitting

In three of the four groups of topics, there is at least one topic related to crime. “Found Corpse” (from 75 topics), “Murder” (from 50 topics), and “Attack” (from 30 topics) all reference the violent crimes that make Holmes’ mysteries so engaging. Though the lists of 75 and 30 topics differ significantly in size, the word “blood” still made it into both, showing its relevance and recurrence. Two of the outputs also include references to Victorian London: the output of 75 topics has a group of words that I labeled “Baker Street,” and the output of 50 topics has a topic I named “Crime in London.” Another commonality among the HTML outputs, whatever their number of topics, was groups of words relating to investigation or examination.
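The check for words that recur across outputs, like “blood” above, could also be done programmatically. Below is a minimal Python sketch, assuming the topics were also saved in Mallet’s plain-text topic-keys format (one topic per line: topic number, weight, and top words, separated by tabs). The function names and the sample words are made up for illustration; they are not my actual output.

```python
from collections import defaultdict


def parse_topic_keys(text):
    """Parse topic-keys text: one topic per line, 'id<TAB>weight<TAB>word word ...'."""
    topics = {}
    for line in text.strip().splitlines():
        topic_id, _weight, words = line.split("\t", 2)
        topics[int(topic_id)] = words.split()
    return topics


def shared_words(runs):
    """Return words that appear in topics from more than one run, with the runs they appear in."""
    seen = defaultdict(set)
    for label, topics in runs.items():
        for words in topics.values():
            for word in words:
                seen[word].add(label)
    return {word: sorted(labels) for word, labels in seen.items() if len(labels) > 1}


# Invented sample data standing in for the 75-topic and 30-topic outputs.
run_75 = "0\t0.5\tblood body found dead floor\n1\t0.5\tbaker street cab london door"
run_30 = "0\t0.5\tblood knife struck man night\n1\t0.5\troom holmes chair fire pipe"

runs = {75: parse_topic_keys(run_75), 30: parse_topic_keys(run_30)}
common = shared_words(runs)
print(common)  # {'blood': [30, 75]}
```

With real files, the two sample strings would be replaced by the contents of each run’s topic-keys file.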

Each time I changed the number of topics, the outputs still maintained references to Holmes, Watson, crime, and violence or murder. Since these are the most obvious elements of Arthur Conan Doyle’s fictional world, I was not surprised that they continuously surfaced. Instead, I was mainly interested that different groups of words still brought to light similar themes, no matter the number of topics. This reinforced my understanding that topic modeling does just what its name suggests: it models the overarching topics within a broad collection of texts. My only issue with this activity was in naming some groups of words. For some of my chosen categories, there was so much variation in the types of words (nouns, adjectives, verbs) that I needed more than one or two words to describe the topic. The ability to explore further details about a topic, such as the percentage of each story that it accounts for, is very useful, but in some instances it does not help in labeling a topic at face value.

