Topic Modeling Part 2: Graphing the Results

Instructions: For this assignment, you will take the data from the Topic Modeling Tool and use Google Fusion Tables to graph the 10 topics you have identified. You will then look for trends in the graphs (e.g. does “violence” rise in the Holmes stories over time? Does “writing” appear more in the early days of the stories?) and start to theorize about them in a 300-word blog post. Refer to http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/ and http://dsl.richmond.edu/dispatch/Topics for examples of how to analyze graphs. Make sure to include enough screenshots so that all your 10 topics appear (you can graph them individually or in pairs if you find interesting relations between them).

Due: 3/31 by 10am 4/3 by 8pm (4% of final grade)

Preparing Data:

  1. Download the zip file of your Holmes topic modeling from our last class and save it on the desktop.
  2. Right-click the zip file, and go to WinZip->“extract to here.”
  3. Your unzipped folder should contain two folders: output_csv and output_html. If it doesn’t, you’re missing some vital data, and you’ll need to quickly redo your topic modeling. If you have both folders, you’re ready to go!
  4. Right now, the data is a bit messy: TopicsInDocs.csv in your output_csv folder tells us the topic distribution for each chunk of each short story, but we need to average the chunks of each story together to compare entire stories to each other. We also need to add the full title and date of publication for each story. Normally this would be a time-consuming process, but I asked Daniel Lepage to build a small web tool to do this for us, and he kindly agreed.
  5. Navigate to the web app at http://holmes-processor.appspot.com/ and upload the file called “TopicsInDocs.csv.”
  6. The web tool will output a new spreadsheet: this spreadsheet has the story abbreviation, title, publication date, and percent of each topic for each story from your original spreadsheets.
  7. Now that your data is organized, you’re ready to upload it into Google Fusion Tables.
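For anyone curious about what "averaging the chunks" in step 4 means, here is a minimal Python sketch of the idea. It is not the web tool's actual code. It assumes, purely for illustration, that each chunk's file name begins with the story abbreviation followed by an underscore (e.g. "SCAN_01.txt") and that the topic weights have already been read into simple (file name, topic number, weight) rows; the function names and the numbers are made up.

```python
from collections import defaultdict


def story_of(filename):
    """Return the story abbreviation from a chunk file name.

    Assumes (hypothetically) names like 'SCAN_01.txt', where the part
    before the first underscore is the story abbreviation.
    """
    return filename.split("_", 1)[0]


def average_by_story(rows):
    """Average topic weights over all chunks of each story.

    rows: iterable of (chunk_filename, topic_number, weight) tuples.
    Returns {story: {topic_number: mean weight across that story's chunks}}.
    A chunk that does not list a topic counts as weight 0 for that topic.
    """
    chunks = defaultdict(set)                         # story -> chunk file names seen
    totals = defaultdict(lambda: defaultdict(float))  # story -> topic -> summed weight
    for filename, topic, weight in rows:
        story = story_of(filename)
        chunks[story].add(filename)
        totals[story][topic] += weight
    return {
        story: {topic: total / len(chunks[story]) for topic, total in topics.items()}
        for story, topics in totals.items()
    }


# Made-up numbers: two chunks of one story and one chunk of another.
sample = [
    ("SCAN_01.txt", 3, 0.40),
    ("SCAN_02.txt", 3, 0.20),
    ("SCAN_02.txt", 7, 0.10),
    ("REDH_01.txt", 3, 0.50),
]
averages = average_by_story(sample)
print(averages)
```

Dividing by the number of chunks in the story (rather than by the number of chunks that mention the topic) keeps stories with many chunks comparable to stories with few.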

Importing Data with Google Fusion Tables:

  1. Make sure you’re logged out of your Hawkmail account.
  2. Log into your non-Hawkmail Gmail account.
  3. Navigate to http://tables.googlelabs.com and click “Create a Fusion Table” to start.
  4. To upload your spreadsheet, select “From this computer,” “choose file,” and then browse until you find the spreadsheet on the Desktop. Highlight it, click “open,” and then click “Next.”
  5. It will give you a preview of the spreadsheet and ask if the column names are in row 1; double check that they are, and then click “Next.”
  6. Give your project a title and a description, and then click “Finish.” You’ve now imported the data!

Refining Data:

  1. Now, you need to make sure that the columns are all the correct data type:
    1. Click on “edit,” “Change Column,” and look at all the information. “Story ID” and “Title” should be “text,” “Publication Date” should be “Date/Time,” and everything else should be “Number.”
    2. If you’ve changed anything, click “Save.” If not, click the arrow to the left of the “Save” button to go back to the spreadsheet. Now you’re ready to graph!

Graphing Data:

  1. Click the red “+” sign and select “Add chart” to create a line graph.
  2. Select the second chart option (“Continuous Variable Chart”).
    1. NOTE: Do not select the “Categorical Chart” line graph further down the left-hand menu. It will not work correctly with our data.
  3. You should now see the graph of “Topic 1.” Make sure “Publication Date” is the label for the x-axis on the bottom of the graph (and change it if it’s not). The bottom of the graph has little scroll bars on either side of it; you can click and drag them to zoom in on different parts of the graph.
  4. Minimize the window, and look at the all_topics.html file that you used to choose the 10 topics you identified for today’s class. Write down the topic numbers.
  5. Go back to Google Fusion Tables’s “Continuous Variable Chart” page, and click the button labeled “Choose”: this will let you select the topic numbers of your 10 favorite topics.
  6. Select them one at a time (uncheck them to make them disappear), and then try comparing them to your other categories. Look for trends.
  7. When you’re happy with your chart, click “Done” to finish it and click the red “+” to start your next chart.
  8. Take screenshots (http://www.take-a-screenshot.org/) of the charts, and make sure that each of the 10 topics appears at least once in your images.
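If you want to double-check a trend you think you see in a chart, the same comparison can be sketched in a few lines of Python. This is only an illustration, not part of the assignment: the column names follow the ones mentioned above (“Story ID,” “Title,” “Publication Date,” “Topic 1”), but the date format (YYYY-MM-DD, with the day set to 01 as a placeholder) and every topic value are assumptions made up for the example.

```python
import csv
import io
from datetime import datetime

# Made-up sample in the shape the processed spreadsheet is described as having.
# The date format and all topic values are assumptions for illustration.
SAMPLE = """Story ID,Title,Publication Date,Topic 1,Topic 2
REDH,The Red-Headed League,1891-08-01,0.05,0.20
SCAN,A Scandal in Bohemia,1891-07-01,0.10,0.15
FINA,The Final Problem,1893-12-01,0.30,0.05
"""


def topic_series(csv_text, topic_column):
    """Return [(date, title, value)] for one topic, ordered by publication date."""
    rows = csv.DictReader(io.StringIO(csv_text))
    series = [
        (
            datetime.strptime(row["Publication Date"], "%Y-%m-%d").date(),
            row["Title"],
            float(row[topic_column]),
        )
        for row in rows
    ]
    return sorted(series)  # tuples sort by their first element, the date


series = topic_series(SAMPLE, "Topic 1")
peak = max(series, key=lambda point: point[2])  # story where the topic is strongest
for date, title, value in series:
    print(date, title, value)
print("Highest value:", peak[1])
```

Reading the values in date order is the same thing the line graph shows; the "peak" line simply names the story behind the tallest point.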

Writing the Blog Post:

  1. Now that you have your images, it’s time to analyze them.
  2. Write a 300-word blog post that points out some trends that you’ve found across the stories.
  3. If there aren’t any trends, provide evidence to back up that assertion.
  4. Is there any correlation between historical events (such as Women’s suffrage, the Second Boer War, Queen Victoria’s death) and spikes in your topics? (Check BRANCH (http://www.branchcollective.org/) and http://myweb.fsu.edu/cupchurch/Resources/Timeline_19thcBrit.html for ideas.) Do you think there’s a connection? Why/why not? What additional research would you need to do to decide?
  5. Include your screenshots throughout your post, and make sure to label them with your topic titles.
  6. You did it! Proofread the post, submit it, and be proud of what you’ve accomplished!

Topic Modeling Assignment

Preparing Data:

1. Click this link to download a zip file of all Sherlock Holmes short stories (text from https://sherlock-holm.es). (NOTE: Each of the 56 short stories has been broken into 40-60 smaller text files to improve the results of topic modeling, resulting in a total of 2845 files; each story can be identified by its abbreviation.)
2. Go to the Downloads folder and unzip the data.
a. Right-click the zip file, and go to WinZip->“extract to here.”
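The note in step 1 says each story was broken into smaller files. As a rough illustration of what that kind of splitting involves, here is a minimal Python sketch. It is not the script used to prepare the class data: the function names, the word-based splitting, and the "ABBR_01.txt" naming pattern are all assumptions for the example.

```python
def split_into_chunks(text, n_chunks):
    """Split text into n_chunks pieces of nearly equal word count."""
    words = text.split()
    size, extra = divmod(len(words), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional word each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks


def chunk_filenames(abbreviation, n_chunks):
    """Build hypothetical file names such as 'SCAN_01.txt', 'SCAN_02.txt', ..."""
    return [f"{abbreviation}_{i:02d}.txt" for i in range(1, n_chunks + 1)]


story = "word " * 103           # stand-in for the text of one story
chunks = split_into_chunks(story, 5)
names = chunk_filenames("SCAN", 5)
print([len(chunk.split()) for chunk in chunks])
print(names)
```

Keeping the story abbreviation at the front of every file name is what lets the chunks be grouped back into whole stories later.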

Preparing the Topic Modeling Tool

1. Go to https://topic-modeling-tool.googlecode.com/files/TopicModelingTool.jar to download the Topic Modeling Tool (a graphical user interface for Mallet).
2. Open it.

Topic Modeling Sherlock Holmes Stories:

1. Click on the button labeled “Select Input File or Dir” and choose the folder with the Holmes stories from your Downloads folder and select “Open.”
2. Click on the button labeled “Select Output Dir.”
a. Click the icon that looks like a folder in the upper right part of the window to create a new folder.
b. Click on the new folder title to rename it.
c. Click “Open.”
3. Under “Number of Topics,” type “50,” under “Number of Iterations,” type “1000,” and under “No. of topic words printing,” type “20.”
4. Click on “Learn Topics.”
5. Once it finishes running, go to your downloads folder and click on the folder you created.
6. Click on output_html, then all_topics to see your results.
7. Click on each topic to see how it’s used.
8. Repeat steps 3-7 with different numbers of topics and iterations to explore how this affects your results. Give the output folder a new name each time. Also experiment with checking and unchecking “Remove Stopwords” to see what happens.

Displaying Topic Models:

1. When you find the topics you like, save the data for later by turning it into a zip file (Right-click the file or folder, point to “Send to,” and then click “Compressed (zipped) folder.”) WE WILL USE THIS DATA NEXT TIME.

2. Choose at least 10 of your favorite topics Mallet produced.

3. Post the 10 topics and their words to the blog, and make sure to start your post by listing the settings you used to generate your topics (number of iterations, number of topic words printed, and number of total topics.)
4. Your blog post with the 10 topics and words is due on March 27th at 10am (NOTE: you do not need to write a 250-word blog post for Friday.)

Topic Modeling Identification Practice

Instructions:

Examine the following groups of words. Identify what each group has in common, and summarize that theme/topic in 1-2 words.

  1. New York Times (from http://journalofdigitalhumanities.org/wp-content/uploads/2013/02/nyt.jpg)

  1. Pennsylvania Gazette (1728-1800) (from http://www.common-place.org/vol-06/no-02/tales/)
    1. away reward servant named feet jacket high paid hair coat run inches master
    2. state government constitution law united power citizen people public congress
    3. good house acre sold land meadow mile premise plantation stone mill dwelling
    4. silk cotton ditto white black linen cloth women blue worsted men fine thread
    5. general officer enemy army troop men regiment major colonel soldier
    6. church life god society great friend christian good virtue religion minister rev
    7. book published vol new price school history printing sold paper english work
    8. court person justice committed goal trial jury taken murder prisoner guilty
  2. Martha Ballard’s Diary (from http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/)
    1. birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient
    2. meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt
    3. day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn
    4. gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
    5. lb made brot bot tea butter sugar carried oz chees pork candles wheat store pr beef spirit churnd flower
    6. unwell mr sick gave dr rainy easier care head neighbor feet relief made throat poorly takeing medisin ts stomach
  3. Richmond Daily Dispatch (1860-1865) (from http://dsl.richmond.edu/dispatch/Topics)
    1. TREET GOOD APPLY CORNER MAIN STREETS HOUSE JA CARY FRANKLIN BROAD STS ST HIRE COOK ROOM TS SERVANT FE WOMAN DE SERVANTS STORE YEAR
    2. NEGRO YEARS REWARD BOY MAN NAMED JAIL DELIVERY GIVE LEFT BLACK PAID PAY RAN COLOR RICHMOND SUBSCRIBER HIGH APPREHENSION AGE RANAWAY FREE FEET DELIVERED
    3. STATES GOVERNMENT PEOPLE UNION STATE SOUTH WAR UNITED CONSTITUTION PEACE POWER SOUTHERN FEDERAL NORTH RIGHTS COUNTRY CONGRESS SLAVERY PRESIDENT POLICY QUESTION CONFEDERACY ACTION FREE
    4. SERVICE MEN COMPANY ARMS STATE COMPANIES VIRGINIA WAR MARYLAND VOLUNTEERS ARTILLERY MILITIA VOLUNTEER CALL JOIN FURNISHED FIELD OFFICE CORPS RECRUITING RAISE CAVALRY RECRUITS ORDNANCE
    5. ENEMY WOUNDED KILLED GEN LEFT BATTLE MEN FIGHT LOSS LINE BACK ARTILLERY POSITION FORCE CAVALRY MILES FRONT ROAD BRIGADE TROOPS DAY SIDE MORNING NIGHT
    6. HUNDRED COTTON YEAR DOLLARS MONEY THOUSAND COUNTRY AMOUNT LARGE SUPPLY MILLIONS MADE GOVERNMENT TWENTY TRADE TEN PRICE NUMBER LABOR HALF PAPER BUSINESS FIFTY STATE

DHM293 Juxta Editions In-Class Lab: Create a Digital Edition of a Holmes Story

Assignment Goal:

For this assignment, you will use Juxta Editions to make a simple digital edition of a Sherlock Holmes story of your choosing. You will use page images either from the first edition of Adventures of Sherlock Holmes or from the original printing in The Strand Magazine. You will upload the page images and make a transcription of the story’s text to create your digital edition. You will then use Juxta Editions’s “Create a Website” option to publish your digital edition.

Due Date: February 27th, 10am (8% of final grade)

Getting Page Images:

  1. Choose page images from either the first edition of Adventures of Sherlock Holmes (https://archive.org/details/adventuresofsher00doyl1) or the original printing in The Strand Magazine (https://archive.org/details/StrandMagazine9).
  2. If you choose the version from Adventures of Sherlock Holmes:
    1. Click the full-screen button (in the shape of a rectangle with 4 small arrows pointing out), then click the button in the shape of an arrow pointing to the right to turn pages in the book. Click it until you find the first page of your Holmes story.
  3. If you choose the printing from The Strand Magazine:
    1. Use this website (http://www.sshf.com/encyclopedia/index.php/The_Strand_Magazine) to find the publication month of your Holmes story.
    2. Go back to the Internet Archive page for The Strand Magazine, and scroll below the page images to find a list of the different issues. Click on the correct issue to find your story.
    3. Once on the correct Internet Archive page for your story, scroll down to the list of contents. Find your story and take note of what comes before and after it.
    4. Scroll back up so you can see the page images. Click the full-screen button (in the shape of a rectangle with 4 small arrows pointing out), then click on the button with an arrow pointing to the right to turn pages in the book. Click it until you find the first page of your Holmes story.
  4. Right-click on the image of the first page of your story. Select “Save Image as,” rename it “1.jpg,” and save it to the desktop. Click the arrow to reach the next page.
  5. Repeat step 4 (i.e., save the image, rename it (2.jpg for the second page, and so on), and go to the next page) until you have saved an image file for every page in your story.
  6. Now, you need to get images for the “Front Matter” and “Back Matter” of your story (e.g. cover page, table of contents, and advertisements). Go to the beginning and end of the book to save and rename the images.

Juxta Editions Set-up:

  1. Create an account
  2. Watch instructional videos 1-3 and 5
  3. Click the blue “Create Edition!” button
  4. In the new window, under “Name,” put the title of your Holmes story. Under “Description,” write “Digital Edition of” and then the name of your Holmes story. Click “Create.”
  5. You are now looking at the main page for your digital edition. Click the button labeled “Add Document” to add a document to your edition.
  6. Under “Document Name,” put the Holmes story’s title, and leave “TEI lite” as the tag schema. Click “Add Document.”
  7. To start editing your document, click on the title of the story, which should be in blue, under the “Documents in this Edition” section. You’re now ready to add information to your edition.

Adding Metadata to Juxta Editions:

  1. It’s VITAL to any editions project that you say where your information came from, and this step will show you how.
  2. Click the words “TEI Header” at the top left side of the page to add information about this digital edition.
    1. Under the section labeled “Title Statement,” select “Main Title,” and make sure the title of your Holmes story is in the text field across from it.
    2. Under “Author,” write “Sir Arthur Conan Doyle.”
    3. Under “Editor,” write your name.
  3. Click the next tab, labeled “Publication Statement” (publication information about this digital edition.)
    1. Under “Name,” put “Digital Tools for the 21st Century”
    2. Under “Place,” write “New Paltz”
    3. Under “Date,” put today’s date
  4. Click the next tab, labeled “Source Description,” and click the button labeled “Structured” to give you some more boxes to fill out about the original story.
    1. Title: your story name
    2. Author: Sir Arthur Conan Doyle
    3. Name: either The Strand Magazine or “A.L. Burt Company” (for Adventures)
    4. Place: either London (Strand) or “New York” (Adventures)
  5. Click the “Save” button in the lower right-hand corner

Uploading Front and Back Matter into Juxta Editions (cover, title page, table of contents):

  1. Click on the icon of 4 lines with dots in front of them (next to the button labeled “Side by Side” at the bottom left of the screen), and select “Front Matter.”
  2. Click the button labeled “Upload Image.”
  3. In the next window, select the first image for the front matter and click “Open”
  4. In the right-hand window, directly opposite the image, transcribe the text from the image.
  5. To add a second page of front matter, click the button labeled “New Page” in the upper left immediately below the words “TEI Header.”
  6. Repeat steps 2-5 until you’ve added all the front matter. Then click “Save” (the upper right-hand button).
  7. Repeat steps 1-6, but select “Back Matter.”

Uploading and Transcribing Page Images of the Story:

  1. In the pop-up window on the left (where you selected “Front Matter”), select “Document Body.”
  2. Click on the button labeled “Transcription” (below the button labeled “New Page”) and select “General Use.”
  3. Click “Upload Image” and choose the first image of the story’s text.
  4. In another tab of your browser, navigate to https://sherlock-holm.es/ascii/ (plain text for all Holmes stories).
  5. In this website, find the title of your story and click the link.
  6. Find only the text that appears on the page you’ve uploaded into your digital edition, and copy it.
  7. Paste it into the empty box on the right-hand side of the Juxta Editions screen.
  8. Use the “Names, Dates, and Places” tagset to tag every name, date, and place in each page. Use the “General Use” tagset to bold and italicize anything that needs it in each page.
  9. Repeat steps 3, 6, and 7 for every page of the story. Make sure to save frequently.
  10. Once every page image has been uploaded and every page has been transcribed, proofread the transcription and correct errors.

Getting Credit for Website:

  1. Click the button labeled “Preview” (between “New Page” and “Export as XML”) to see what your edition will look like in its final version.
  2. If you don’t see anything to change, then you’re ready to submit it.
  3. Email me your username and password so I can access your website.

 

 

On Gathering Data

Your group may only have a few pieces of data, or it may have hundreds. Either way, you need to be careful when collecting your data. Here are some instructions:

Archives/editions

  1. If you have any images/logos from before 1923, you can use them (but check with me to make sure).
  2. If they are from after 1923, you must either get permission from the publishers/authors/companies or NOT USE THEM.
  3. You must include all page images, including the cover, front, and back matter, and full, proofread transcriptions of every page.
  4. You must cite the original projects following MLA citation rules.
  5. Sample citation for a Works Cited page:

Made-up, Author. This is the Title. City: Publisher, date. Print.

Topic Modeling

  1. The files must be plain text (.txt), not HTML (.html) or Word (.doc or .docx). Open the files in either Notepad (on a PC) or TextEdit (on a Mac). If you see extra characters in addition to the text, you’re probably looking at an HTML file. Go back to the website with your data, copy and paste the text into Notepad or TextEdit, then save as a .txt file.
  2. The name of each file should clearly identify the text (use date and short title).
  3. You must cite the data.
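If your group has too many files to open one by one, the check in step 1 can be roughed out in Python. This is a minimal sketch under stated assumptions, not a full HTML detector: the pattern, the function name, and the threshold of three tags are all choices made for the example, and a file that passes should still be spot-checked by eye.

```python
import re

# A rough pattern for HTML tags such as <p>, </div>, or <a href="...">.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")


def looks_like_html(text, threshold=3):
    """Return True when the text contains at least `threshold` HTML-style tags."""
    return len(TAG_PATTERN.findall(text)) >= threshold


html_sample = "<html><body><p>To Sherlock Holmes she is always the woman.</p></body></html>"
plain_sample = "To Sherlock Holmes she is always the woman."
print(looks_like_html(html_sample))
print(looks_like_html(plain_sample))
```

To use it on a real file, read the file's contents into a string first and pass that string to looks_like_html. The threshold avoids flagging a plain-text file just because it happens to contain one stray angle bracket.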

GIS

  1. If you are digitizing an historical map, make sure to cite the original (and to georectify it).
  2. If you are plotting points on a map, each entry should have a citation explaining where the information came from.
  3. If you are plotting lots of data from census records on a map and your data came from 1 source, make sure to cite that source below the map.

All Projects

  1. Your data should come from books published by reputable presses (e.g. university presses), peer reviewed or otherwise praised digital humanities projects, or other reputable websites (e.g. http://www.census.gov/), NOT Wikipedia.
  2. To avoid charges of plagiarism, make sure to write down the author, title, and publication information of each work before taking notes on it. When taking notes, write down the page number for each point.

Links for Thursday’s Class

Before Thursday’s class, in which we make various types of charts, I encourage you all to read the following articles:

  1. “Choose the Right Chart Type for Your Data”: http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/
  2. “Data Visualization Checklist”: http://annkemery.com/portfolio/dataviz-checklist/
  3. “How to Use 6 Basic Charts to Create Effective Reports”: http://fluidsurveys.com/university/use-different-chart-types/

This background information will ensure you understand the purpose of the different charts and graphs we will make in our next class.