Reading Experience Database text mining project

Cross-posted at my Day of DH 2012 blog.

About the project

I have really been enjoying Dr. Cathy Blake’s Text Mining class this semester, in large part because I’ve been given access to data that really excites me. The kind souls at the Reading Experience Database (or RED, hosted at the UK’s Open University) sent me a .csv snapshot of the database from August 2011, for use in my final project. I first came across the Reading Experience Database in Bonnie Mak’s History of the Book course in 2010 as I pursued my interest in reading history. RED aims to collect all information about reading experiences in Britain from 1450 to 1945. (See an example record.) The data is crowd-sourced, and anyone can contribute a reading experience by filling out a detailed webform. A contribution must include the text of the evidence of a reading experience — that is, a reference to someone reading something in a manuscript or published work. The 26,000 records in the RED make a truly incredible resource.

My final class project is still taking shape, but my goal is to perform a sentiment analysis with the records to explore British readers’ attitudes toward literature throughout this period in history. What is particularly awesome about RED is that it includes details like the reader’s socio-economic group (e.g. “Gentry”) and the type of experience (e.g. “aloud, in company”), so I may be able to analyze sentiment within these different subsets of the data. Too cool!
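As a rough illustration of the kind of subgroup analysis I have in mind, here is a minimal sketch. The tiny word lists, the field names, and the sample records are all invented for illustration; a real run would use the RED snapshot's actual columns and a proper opinion lexicon:

```python
from collections import defaultdict

# Toy positive/negative word lists stand in for a real opinion lexicon.
POSITIVE = {"delight", "pleasure", "charming", "enjoyed"}
NEGATIVE = {"tedious", "dull", "disliked", "wearisome"}

def sentiment_score(text):
    """Crude lexicon score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def mean_sentiment_by_group(records, group_field="socio_economic_group"):
    """Average sentiment of the evidence text within each reader subgroup."""
    totals = defaultdict(lambda: [0, 0])  # group -> [score sum, record count]
    for rec in records:
        bucket = totals[rec.get(group_field, "[unknown]")]
        bucket[0] += sentiment_score(rec["evidence"])
        bucket[1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

# Two made-up records, shaped loosely like RED fields:
records = [
    {"evidence": "I enjoyed the novel, a real delight",
     "socio_economic_group": "Gentry"},
    {"evidence": "found the sermon tedious and dull",
     "socio_economic_group": "Clergy"},
]
print(mean_sentiment_by_group(records))  # -> {'Gentry': 2.0, 'Clergy': -2.0}
```

The same grouping would work for the experience-type field ("aloud, in company") just by changing `group_field`.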

Dr. Blake warned us that pre-processing the data may be the most time-consuming part of our projects, and as I’ve been playing with the RED data over the weekend, I don’t doubt it. Crowd-sourcing information is one of the best things about the internet (ordinary citizens can discover galaxies!), and the many thousands of RED records are only possible because of it. For a student text miner, however, crowd-sourced data can be pretty messy, especially when there is no authority control. There’s a lot of redundant data, or data that is simply inaccurate.

I’d hoped that a good way to get my feet wet with this project would be to make a quick series of infographics that reflected attributes of the authors and readers in RED, partly for practice but mainly to understand any biases the RED may have, such as having mostly male readers/authors (though comparing this data to accurate historical data is probably outside my project scope). But it will take me some time to groom the data into something more manageable. So far, the most interesting and (reasonably) accurate data I’ve been able to extract has been a list of the most popular authors listed as read in the RED, as determined by how often a reading experience involves a book attributed to these authors. I shall present it as a clumsy HTML table (how long has it been since I used an HTML table?!).
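For the curious, tallying author frequency from a CSV snapshot takes only a few lines of Python. The column names below (`author_first`, `author_last`) are my guesses for illustration, not the snapshot's actual field names:

```python
import csv
from collections import Counter

def top_authors(csv_path, n=50):
    """Count how many reading-experience records cite each author name."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = f"{row['author_first']} {row['author_last']}".strip()
            counts[name] += 1
    return counts.most_common(n)  # list of (name, record count), ranked
```

`Counter.most_common` does the ranking; the same tally could equally be a `GROUP BY ... ORDER BY COUNT(*)` query in SQL Developer.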

The 50 most popular authors in the Reading Experience Database:


Rank | Author | RED records
0 | [n/a] | 1956
0 | [unknown] | 1893
1 | William Shakespeare | 513
2 | Walter Scott | 414
3 | Jane Austen | 272
4 | George Gordon, Lord Byron | 267
5 | Charles Dickens | 222
6 | Alfred Tennyson | 217
7 | John Milton | 208
8 | William Wordsworth | 160
9 | Samuel Johnson | 145
10 | H. G. Wells | 143
11 | Samuel Richardson | 127
12 | Homer | 123
13 | Plato | 120
14 | William Makepeace Thackeray | 119
15 | Robert Browning | 112
16 | Alexander Pope | 105
17 | John Galsworthy | 102
18 | Charlotte Bronte | 98
19 | Thomas Carlyle | 96
20 | Percy Bysshe Shelley | 96
21 | Robert Southey | 91
22 | John Ruskin | 89
23 | Victor Alexander | 88
24 | Thomas Moore | 87
25 | Virgil | 87
26 | John Keats | 86
27 | Voltaire | 86
28 | Margaret Dilks | 85
29 | George Eliot | 85
30 | Robert Louis Stevenson | 84
31 | Daniel Defoe | 81
32 | Ernest E. Unwin | 80
33 | William Godwin | 79
34 | Maria Edgeworth | 77
35 | Edward Gibbon | 76
36 | Jean Jacques Rousseau | 74
37 | George Bernard Shaw | 74
38 | Dante Alighieri | 73
39 | Samuel Taylor Coleridge | 72
40 | Edmund Spenser | 71
41 | Jonathan Swift | 70
42 | Thomas Hardy | 69
43 | James Boswell | 68
44 | George Meredith | 68
45 | Oliver Goldsmith | 67
46 | Harriet Martineau | 67
47 | Elizabeth Gaskell | 65
48 | Henry James | 64
49 | Arnold Bennett | 62
50 | Frances Burney | 61
A note about the data

I have been using Oracle SQL Developer to explore the raw data, and I edited an exported CSV in Excel to refine things for this list. I consolidated various name spellings for the top-cited authors. I collapsed the various “n/a” and “anon/Anon./anonymous” etc. attributions into [n/a], and all the “unknown/Unknown/not known” etc. data into [unknown]. Many of the [n/a] texts are, as you would expect, holy scriptures, but there are also many, many newspapers listed as well. [unknown] may indicate author anonymity, reader uncertainty, or record contributor uncertainty. Note that I have not taken into account any duplicate records in the RED (where 2+ record contributors may have read the same memoir and noted the same historical reader’s experience). Note also that this data:

  • reflects only the data in the RED: who chose to contribute to RED (one busy participant entered 8,000 records) and what those contributors read
  • reflects what historically was published and sold in Britain (for instance, by my count there are only 8 female authors in this set of 50 named authors)
  • was not created with authority control
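The attribution clean-up described above amounts to a small normalization table. A sketch of that step (the variant lists are illustrative, not an exhaustive inventory of what actually appears in the RED):

```python
# Collapse the many crowd-sourced spellings of "anonymous" and "unknown"
# into two canonical labels. Variant lists here are examples only.
NA_VARIANTS = {"n/a", "na", "anon", "anon.", "anonymous"}
UNKNOWN_VARIANTS = {"unknown", "not known"}

def normalize_attribution(raw):
    """Map an author-attribution string to [n/a], [unknown], or itself."""
    value = raw.strip().lower()
    if value in NA_VARIANTS:
        return "[n/a]"
    if value in UNKNOWN_VARIANTS:
        return "[unknown]"
    return raw.strip()

print(normalize_attribution("Anon."))      # -> [n/a]
print(normalize_attribution("Not Known"))  # -> [unknown]
```

Consolidating variant *name spellings* (e.g. for the top-cited authors) is the same idea, just with a hand-built mapping per author instead of a shared variant set.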

Data is never free of its context. Case in point: Ernest E. Unwin is not actually a widely-read author, but he kept the minutes of the “XII Book Club” and is often listed in the RED reading his own work. Case in point II: The entries that cite Samuel Johnson refer to both Dr. Samuel Johnson (of the Dictionary) and the Reverend Doctor Samuel Johnson. Ideally, I’d be able to assign an authorID to each individual author based on the publication title. Realistically, not gonna happen in the next month. I do still want to make some data visualizations though, so stay tuned.

What does this data tell us? Most of these authors are included in various iterations of the Western or British canon, with the exceptions of Unwin and the many unnamed journalists whose newspapers were mentioned. What’s interesting to me is the diversity of these top authors’ output. The top 10 alone includes playwrights, poets, novelists, even a lexicographer. There’s also a lack of diversity, of course — only 8 of 50 are women. I will also confess that my English-major pride was a little hurt that there were some authors I hadn’t heard of before, like John Galsworthy and Harriet Martineau. There is always so much more to read! What else do you find interesting about this table?

4 comments

    1. Robin Camille Post author

      Stuart, thanks for the link! That looks like a great project. Love the RDF diagram. Yes, there’s quite a lot of data cleaning to be done. I would be interested to know how the sameAs statements were constructed to capture different spellings of Virginia Woolf’s name — manually checking the RED? Or data wizardry?

  1. Julia Pollack

    Hey Robin,

    I super enjoy reading this stuff… I am using you as suggested reading in my Text Mining class!

    hearts
    Julia Pollack

    1. Robin Camille Post author

      Julia — very belated thanks! Someday soon I’ll write about the actual meat of the project, using opinion corpora to examine historical text. Someday…

