Gephi + MALLET + EMDA

As I prepare to attend Early Modern Digital Agendas next week, I’ve been exploring a few tools that have been on my to-try list for a while — things that have come up at many a DH-related event over the years.

Gephi

I’m embarrassed to say that I haven’t really done any data visualization myself beyond the occasional wordle. Reading a handful of data viz blogs or Tufte, for example, has therefore been an act of imagination rather than practicality. But today I tinkered with Gephi to get a visual glimpse of where EMDA participants’ interests lay. Here’s one of the better visualizations I made (words are stemmed):

Gephi text viz

To do this, I dumped all of our application essays into one big .txt file, stripping out essay titles and name/page number headers. Then I processed the text using Python and NLTK to make a Gephi-friendly XML file, following the algorithm and file format described in the article “Identifying the Pathways for Meaning Circulation using Text Network Analysis.” You can see my script on GitHub. (Don’t make too much fun of my novice code.)

This Python script spits out each stemmed non-stopword as a node, and counts word-pairs as edges. That is, an edge is created whenever one word occurs within 4 words of another word. The edge weight increases with the frequency of the word-pair. So “digit human” is a strong word pair because we mention digital humanities quite a lot.
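The counting step described above can be sketched in a few lines. This is a minimal illustration, not the script linked above. It assumes the words are already stemmed and stripped of stopwords, reads “within 4 words” as a look-ahead window of four positions, skips pairs of identical words, and writes GEXF, which is one XML format Gephi opens (the post does not say which format the original script produced). The function names and the sample tokens are made up for the example.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def cooccurrence_edges(tokens, window=4):
    """Count unordered word pairs occurring within `window` positions of each other."""
    edges = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:  # assumption: no self-loops
                edges[tuple(sorted((word, other)))] += 1
    return edges


def to_gexf(edges):
    """Serialize weighted word pairs as a GEXF 1.2 string (assumed target format)."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", mode="static", defaultedgetype="undirected")
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")
    # Each distinct word becomes a node; the word doubles as id and label.
    for word in sorted({w for pair in edges for w in pair}):
        ET.SubElement(nodes_el, "node", id=word, label=word)
    # Each word pair becomes an edge weighted by how often it was seen.
    for i, ((a, b), weight) in enumerate(sorted(edges.items())):
        ET.SubElement(edges_el, "edge", id=str(i), source=a, target=b, weight=str(weight))
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical input: already stemmed, stopwords removed.
tokens = ["digit", "human", "scholar", "digit", "human", "tool"]
edges = cooccurrence_edges(tokens)
gexf_xml = to_gexf(edges)
print(edges.most_common(1))
```

On this toy input the heaviest edge is the digit/human pair, which mirrors why that pair dominates the real network.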

My script’s output gave me 2,900 nodes and 11,000 edges. I filtered out nodes with a degree below 17 so we’d only be looking at the top 175 nodes. Then I used the modularity algorithm, which detects ‘communities’ (almost like topics?). With a modularity resolution of 2.0, I narrowed it down to 10 communities, which are indicated by color in the visualization above. They’re sort of clustered. I’m not really sure if this is a good visualization. It seems like it is, but I’m not experienced enough to critique knowledgeably.
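The degree filter can be approximated outside Gephi too. A rough sketch, assuming the edges live in a dictionary keyed by word pair (a hypothetical structure, not Gephi’s internals) and that a node’s degree is its number of distinct neighbors in the full graph. Community detection is left to Gephi.

```python
from collections import defaultdict


def filter_by_degree(edges, min_degree):
    """Keep nodes with at least `min_degree` distinct neighbors, plus edges among them."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Degree is measured once on the full graph, before anything is removed.
    keep = {node for node, nbrs in neighbors.items() if len(nbrs) >= min_degree}
    kept_edges = {
        pair: weight
        for pair, weight in edges.items()
        if pair[0] in keep and pair[1] in keep
    }
    return keep, kept_edges


# Hypothetical weighted edges; the real graph had about 11,000 of them.
sample_edges = {
    ("digit", "human"): 4,
    ("digit", "scholar"): 2,
    ("digit", "tool"): 1,
    ("human", "tool"): 2,
}
kept_nodes, kept_edges = filter_by_degree(sample_edges, min_degree=2)
print(sorted(kept_nodes))
```

With `min_degree=17` on the full data, the same idea would drop the long tail of rarely connected words before any layout is run.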

And what does it look like if the visualization considers all 2,900 nodes? Here’s one look:

2900 nodes in 39 communities, not really clustered at all, no labels, data party!

Gephi text viz
Circle layout, ordered by community. Crisscrossing lines show relationships between word communities.

MALLET

I also tried out topic modeling using MALLET on the same essay dump. Here are the topics when the model is limited to 5:

network, seminar, social, historical, milton, reading, scholarship, field, make, approach, sdfb, form, networks, terms, community, long, society, fact, benefit

digital humanities, work, early, projects, institute, research, english, teaching, scholarly, editions, current, library, working, students, future, experience, part, scholarship

digital research, shakespeare, institute, studies, university, tools, methods, dh, hope, language, graduate, develop, study, analysis, based, corpus, large, focus

early modern, project, texts, agendas, scholars, resources, eebo, questions, folger, ways, period, works, books, online, tcp, information, bring, existing

data, digital, literary, media, history, book, time, text, database, century, archives, order, learn, political, share, cultural, narrative, eager, press

And limited to 10:

early modern, scholars, network, social, texts, agendas, words, scholarship, sdfb, approach, inquiry, criticism, mining, world, chapter, ontologies, actors, persons

texts, ways, eebo, questions, folger, period, tcp, bring, existing, online, understand, books, reading, corpus, neh, work, present, develop, understanding

digital humanities, modern, university, tools, english, teaching, study, part, projects, scholarship, development, practice, future, professional, provide, past, technology, developing

data, work, digital, opportunity, seminar, database, archives, discussions, literary, narrative, text, eager, press, conversations, relationships, relationship, archive, interface, interested

shakespeare, research, work, dh, hope, language, methods, graduate, analysis, agendas, based, application, plan, training, university, literature, approaches, writing, benefit

media, milton, interest, field, historical, society, theory, means, performance, larger, arts, prose, write, reflect, professor, team, college, readings, basic

projects, resources, library, students, experience, faculty, collections, london, make, explore, information, place, john, practical, curation, center, important, end, moeml

history, early, book, time, project, technologies, build, share, paper, agendas, experiences, scale, poems, writers, space, ocr, thinking, courses, form

early modern, institute, research, project, studies, scholarly, editions, working, current, summer, knowledge, edition, large, collaborative, textual, electronic, renaissance, participation

literary, century, order, political, text, methods, historical, works, topic, natural, learn, great, scientific, computer, public, complex, discuss, eighteenth, applying

Well, this was rather fun. And all this from a relatively small text. Methinks my MacBook Air would explode if I cranked whole corpora through these exercises.