The text here is excerpted from the beginning of “The LDA Buffet: A Topic Modeling Fable” by Matthew Jockers (CC BY-SA).
. . . imagine a quaint town, somewhere in New England perhaps. The town is a writer’s retreat, a place they come in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is a spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .
One afternoon Herman Melville bumps into Jane Austen at the bocce ball court, and they get to talking.
“You know,” says Austen, “I have not written a thing in weeks.”
“Arrrrgh,” Melville replies, “me neither.”
So hand in hand they stroll down Gibbs Lane to the LDA Buffet. Now, down at the LDA Buffet no one gets fat. The buffet only serves light (leit?) motifs, themes, topics, and tropes (seasonal). Melville hands a plate to Austen, grabs another for himself, and they begin walking down the buffet line. Austen is finicky; she spoons a dainty helping of words out of the bucket marked “dancing.” A slightly larger spoonful of words, she takes from the “gossip” bucket and then a good ladle’s worth of “courtship.”
Melville makes a beeline for the “whaling” trough, and after piling on an Ahab-sized handful of whaling words, he takes a smaller spoonful of “seafaring” and then just a smidgen of “cetological jargon.”
The two companions find a table where they sit and begin putting all the words from their plates into sentences, paragraphs, and chapters.
At one point, Austen interrupts this business: “Oh Herman, you must try a bit of this courtship.”
He takes a couple of words but is not really fond of the topic. Then Austen, to her credit, asks permission before reaching across the table and sticking her fork in Melville’s pile of seafaring words. “Just a taste,” she says. This work goes on for a little while; they order a few drinks and after a few hours, voilà! Moby Dick and Persuasion are written . . .
[Now, dear reader, our story thus far provides an approximation of the first assumption made in LDA. We assume that documents are constructed out of some finite set of available topics. It is in the next part that things become a little complicated, but fear not, for you shall sample themes both grand and beautiful.]
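The trip down the buffet line can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket contents and the plate proportions are invented for the example, the names `BUFFET` and `write_document` are hypothetical, and every word in a bucket is treated as equally likely. Full LDA also draws each document's topic proportions and each topic's word distribution from Dirichlet priors, which this sketch leaves out.

```python
import random

# Hypothetical topic "buckets"; the words are made up for illustration
# and do not come from a fitted model.
BUFFET = {
    "dancing": ["ball", "waltz", "partner", "music"],
    "gossip": ["rumor", "whisper", "scandal", "neighbor"],
    "courtship": ["suitor", "proposal", "engagement", "affection"],
    "whaling": ["harpoon", "whale", "blubber", "captain"],
    "seafaring": ["ship", "mast", "voyage", "sailor"],
    "cetology": ["sperm", "baleen", "fluke", "spout"],
}


def write_document(topic_weights, n_words, rng):
    """Draw n_words words: pick a topic by weight, then a word from that topic."""
    topics = list(topic_weights)
    weights = [topic_weights[t] for t in topics]
    words = []
    for _ in range(n_words):
        # Choose which bucket this word comes from.
        topic = rng.choices(topics, weights=weights, k=1)[0]
        # Simplification: every word in a bucket is equally likely.
        words.append(rng.choice(BUFFET[topic]))
    return words


rng = random.Random(0)

# Assumed plate proportions, loosely following the helpings in the fable.
austen_plate = {"dancing": 0.15, "gossip": 0.25, "courtship": 0.55, "seafaring": 0.05}
melville_plate = {"whaling": 0.60, "seafaring": 0.25, "cetology": 0.10, "courtship": 0.05}

persuasion = write_document(austen_plate, 20, rng)
moby_dick = write_document(melville_plate, 20, rng)
print(" ".join(persuasion))
print(" ".join(moby_dick))
```

Each author's plate is a set of weights over the same finite menu of topics, so two documents can share a topic (a taste of “courtship” or “seafaring”) while mixing it in very different amounts.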
. . . Filled with a sense of deep satisfaction, the two begin walking back to the lodging house. Along the way, they bump into a blurry-eyed Hemingway, who is just then stumbling out of the Rising Sun Saloon.
Having taken on a bit too much cargo, Hemingway stops on the sidewalk in front of the two literati. Holding out a shaky pointer finger, and then feigning an English accent, Hemingway says: “Stand and Deliver!”
To this, Austen replies, “Oh come now, Mr. Hemingway, must we do this every season?”
More gentlemanly then, Hemingway replies, “My dear Jane, isn’t it pretty to think so. Now if you could please be so kind as to tell me what’s in the offing down at the LDA Buffet.”
Austen turns to Melville and the two writers frown at each other. Hemingway was recently banned from the LDA Buffet. Then Austen turns toward Hemingway and holds up six fingers, the sixth in front of her now pursed lips.
“Six topics!” Hemingway says with surprise, “but what are today’s themes?”
“Now wouldn’t you like to know that, you old sot,” says Melville.
The thousand injuries of Melville, Hemingway had borne as best he could, but when Melville ventured upon insult he vowed revenge. Grabbing their recently completed manuscripts, Hemingway turns and runs toward the South. Just before disappearing down an alleyway, he calls back to the dumbfounded writers: “All my life I’ve looked at words as though I were seeing them for the first time. . . tonight I will do so again! . . . ”
[Hemingway has thus overcome the first challenge of topic modeling. He has a corpus and a set number of topics to extract from it. In reality, determining the number of topics to extract from a corpus is a bit trickier. If only we could ask the authors, as Hemingway has done here, things would be so much easier.]
. . . Armed with the manuscripts and the knowledge that there were six topics on the buffet, Hemingway goes to work.
After making backup copies of the manuscripts, he then pours all the words from the originals into a giant Italian-leather attaché. He shakes the bag vigorously and then begins dividing its contents into six smaller ceramic bowls, one for each topic. When each of the six bowls is full, Hemingway gets a first glimpse of the topics that the authors might have found at the LDA Buffet. Regrettably, these topics are not very good at all; in fact, they are terrible, a jumble of random unrelated words . . .
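Hemingway's shake-and-pour resembles the random initialization that sampling-based LDA implementations commonly start from: every word token is assigned to one of the topics at random before any refinement happens. The sketch below is a minimal illustration under that assumption. The two toy documents and the names `random_init` and `fill_bowls` are hypothetical, and nothing here refines the assignments afterward.

```python
import random
from collections import Counter


def random_init(documents, n_topics, rng):
    """Assign every word token to a topic chosen uniformly at random."""
    return [[rng.randrange(n_topics) for _ in doc] for doc in documents]


def fill_bowls(documents, assignments, n_topics):
    """Count how often each word landed in each topic's bowl."""
    bowls = [Counter() for _ in range(n_topics)]
    for doc, doc_topics in zip(documents, assignments):
        for word, topic in zip(doc, doc_topics):
            bowls[topic][word] += 1
    return bowls


# Two toy "manuscripts"; the words are invented for illustration.
docs = [
    "whale harpoon ship captain whale sea".split(),
    "courtship ball gossip proposal dance".split(),
]

rng = random.Random(42)
n_topics = 6  # the number Austen signaled with her fingers
assignments = random_init(docs, n_topics, rng)
topic_bowls = fill_bowls(docs, assignments, n_topics)

for k, bowl in enumerate(topic_bowls):
    print(k, dict(bowl))
```

Because the assignments ignore which words occur together, each bowl holds an arbitrary mix of whaling and courtship words, which is the jumble Hemingway sees at this stage.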