I was, I thought, ahead of schedule in prepping my US history survey course, spending the morning adding weekly readings from the American Yawp to the syllabus. Then I saw the Stanford UP edition is coming next week.

Converting PDF files for use with LiquidText

I’ve begun to use an iPad to annotate PDF files. Many apps let you highlight text and write comments that can be extracted as a text file. Currently I’m using LiquidText, but there are many other options.

On files created from Word documents this works great. But if I’m working with an older PDF – say, one from JSTOR – it’s painful to use. The OCR layer doesn’t line up exactly with the visible text layer, so that as you try to highlight the visible text the selection is actually offset somewhat, and individual words are highlighted instead of a continuous passage. Here’s an example from Shari J. Stenberg, “Liberation Theology and Liberatory Pedagogies: Renewing the Dialogue” (2006):

The Stenberg article was also large – 6.1 MB for a 21-page piece. I’m getting near the 2 GB limit on my paid Zotero account, and would rather not upgrade to the next level until I really need to. So I decided to see if I could use Adobe Acrobat to reduce the size of the document, and to run a new OCR scan that would allow me to more easily highlight and otherwise mark passages.

The process was simpler than I thought it would be. I opened the file in Acrobat and went to Tools –> Optimize Scanned PDF –> Optimize Scanned Pages. I tested several different settings but found that the defaults worked best:

The process took 2-3 minutes on my 21-page, 6.1 MB file. The resulting file was one-twelfth the size: 480 KB. But was the OCR improved?

Yes! It was much easier to highlight passages, and the highlights look much better on the page.

P.S. On another PDF that was 53 MB but didn’t have the offset OCR problem, when I tried “Optimize Scanned Pages,” it returned an error because the pages were already rendered. But just using “Reduce File Size” with default options brought the file size down to 7 MB.

Chambliss and Takacs, How College Works: The Importance of Place

Chambliss and Takacs rely on sociologist Randall Collins’s theory of “how emotionally bonded groups come to exist.” What struck me is how material the theory is – how it’s not just institutions in the abstract, but the physical spaces themselves, that contribute to a sense of belonging. The best example in How College Works is Hamilton’s Science Center. Yet the emotional bonding that takes place at the Science Center is in part due to a sense of exclusivity, and is thus in tension with the inclusivity valued by critical librarianship.

Even within a selective liberal arts college like Hamilton, the sciences are seen as a further narrowing of the field. Science majors regard themselves as an elite. They come better prepared from their high schools and take harder courses in college. (The authors, themselves social scientists, believe that natural science courses really are more rigorous than those in other departments.) Yet the result is that women are underrepresented. They’re less likely to tolerate poor introductory teachers, and so leave for other majors. Students with less preparation from high school, disproportionately minorities, are at a disadvantage. Moreover, science is increasingly taught via research-based education, which may be better for future scientists but not for the majority of students. In short, the authors see a rigorous science education as a good thing, and something that should be available to more than a relatively small, self-selected share of students (119-126).

Collins argues that there are four requirements necessary for a “dynamics of belonging,” all of which are present within the sciences at Hamilton. The Science Center, and its many labs and classrooms designed for different disciplines, provide for the physical copresence of people.1 Lab experiments and activities make for a shared focus of attention, and for ritualized common activities. And the Science Center’s existence, set off from the rest of the college, as well as science students’ perceived status as an elite, give the sciences an exclusivity based on clearly defined boundaries (79-81).

The Writing Center, in contrast, gives up some of the exclusivity of the sciences for a deeper integration across the college. It still maintains a particular status because of the importance of writing at the college. The student tutors at the Writing Center meet peers from across the disparate subcommunities on the campus and thus are near the center of the larger social network among Hamilton students, which in turn reinforces the curricular primacy of writing at the college.

The Science Center and the Writing Center together point to ways libraries should consider their physical space. Libraries can, like the Science Center, fulfill Collins’s four requirements for a dynamics of belonging. The library allows for physical copresence with one’s peers. The work done at the library forms a ritualized common activity. And those students who spend time in the library come to recognize each other and to consider themselves an exclusive set.

This leaves out the second of Collins’s requirements: that there be a shared focus of attention. When considering the library as a study space, there is a tension between this requirement and C&T’s argument that studying alone is more effective than studying in groups. When studying alone, even co-present people have different foci of attention.

But libraries have moved away from their traditional role as a space for concentrated individual study. Instead they emphasize group-friendly additions like lounges, presentation rooms, and the “information commons.” Yet where else can students go for quiet study space? For this is what libraries, better than anywhere else, can provide. The library should enable the copresence of peers around the ritualized common activity of concentrated study, in an atmosphere that asserts the importance and the exclusivity of that concentrated study. Surely this is how the physical space of the library can best help college to work.

Other posts in this series:

  1. On Chambliss and Takacs, How College Works
  2. Chambliss and Takacs, How College Works: Teaching and Learning Skills

  1. C&T are, unsurprisingly, not big fans of MOOCs or of online learning in general. 

Browsing matters

There was a moment of serendipity on my Twitter feed today. First, Nnedi Okorafor wrote:

I had to discover African lit on my own by accident at the Michigan State Library. I walked past a wall of books in the stacks...

https://twitter.com/Nnedi/status/682961927467872258

...An Igbo name (Buchi Emecheta) caught my eye and I stopped and picked up the book. I ended up reading everything on that shelf.

https://twitter.com/Nnedi/status/682962118845546497

Then, Zeynep Tufekci penned a paean to the United States’ robust infrastructure, especially things we can take for granted like the post office and the library:

I bit my tongue and did not tell my already suspicious friends that the country was also dotted with libraries that provided books to all patrons free of charge. They wouldn’t believe me anyway since I hadn’t believed it myself. My first time in a library in the United States was very brief: I walked in, looked around, and ran right back out in a panic, certain that I had accidentally used the wrong entrance. Surely, these open stacks full of books were reserved for staff only. I was used to libraries being rare, and their few books inaccessible. To this day, my heart races a bit in a library.

For Tufekci, accessible stacks remain a powerful metonym for our infrastructure, while for Okorafor, the serendipity of browsing revealed a previously unthought-of literature.

Chambliss and Takacs, How College Works: Teaching and Learning Skills

In How College Works, Daniel Chambliss and Christopher Takacs (C&T) argue that “college works when it provides a thick environment of constant feedback, driven by the establishment and maintenance of social relations” (132). They’re particularly interested in how that environment helps students learn the skills of writing, speaking, and critical thinking that are central to a liberal arts education. (Note that they neither include information literacy as a separate skill nor even mention it.)

Though each department at Hamilton has its own disciplinary content, all are expected to teach certain core skills: writing, speaking, and critical thinking. Writing, in particular, is deeply embedded in Hamilton’s culture. Students must take several writing-intensive courses to graduate, often beginning at the 100 level. The basic messages about good writing are repeated over the next four years. Writing is not merely an intellectual activity in the curriculum, but a social activity. And, accordingly, students believe the most important factor in learning to write well is good feedback from their professors both written and, especially, face-to-face (106, 140-142).

Similarly, speaking and critical thinking are learned as part of a community. In speaking, short presentations and small efforts, accumulated over years, lead to large gains in skill and confidence. Speaking, too, has value in that it forces the student to be engaged and rewards emotional intensity, traits that assist learning. Critical thinking also demands emotional engagement by exposing the student to new people and ideas. Faculty emphasize critical thinking across the curriculum, with the goal that, like good writing and good speaking, it will become a habit (112-119).

Yet these skills are not learned without a context. C&T argue that the faculty do not need any special expertise in the theory of writing or speaking when the goal is to instill a basic fluency. Skills are learned by learning and doing, rather than by mastering formal rhetoric or technical perfection. Though critical thinking may be more a general than a professional skill, it is still taught and learned in the context of a particular discipline. The key, then, to learning these skills is a teacher who is competent and who cares about both their subject and their students. The discipline or program forms the necessary context, but it is the relationships with faculty and peers that are most crucial to learning to write, speak, and think critically (130-131, 133).

Though C&T don’t include information literacy in their list of core skills, it seems to me that the discipline-based learning they celebrate is a model for IL as well. As Barbara Fister said in her LILAC 2015 keynote “The Liminal Library,” “…librarians don’t teach students how to be information literate. This isn’t a failing. It’s the nature of the thing we want students to learn. … You learn how information works by encountering, using, and creating it. Having good guides helps, but this kind of learning only happens in the doing of it.”

Shimer College and Diversity

Jo Becker, writing in Vice:

In every class I've had at Shimer, women have struggled to speak and be heard. I've had classes where students had panic attacks and had to leave. One student, an undocumented Latina woman was called a "wetback" by a professor in casual conversation. Why should anyone pay tens of thousands of dollars into debt only to have their basic humanity disrespected?

I’ve been fascinated with Shimer for the past little while. In large part that’s been due to Adam Kotsko’s blogging. Kotsko is a Shimer professor and believes in the pedagogy and the curriculum. Much of his writing has been about his experience teaching Islam and the Qur’an. So, though I knew Shimer was based on the Great Books tradition, I thought that was balanced by an emphasis on women and minority voices in dialogue with that tradition, and that the culture of the college was in line with that dialogue. I’m disappointed to find out it isn’t so.

Oral History and Earthquakes

Ann Finkbeiner, “The Great Quake and the Great Drowning,” in Hakai:

From the Tolowa people in northern California: one autumn, the earth shook and the water began rising. People began running and when the water reached them, they turned into snakes. But a girl and a boy from the village, both adolescents, outran the water by running to the top of a mountain where they built a fire to keep themselves warm.

The New Yorker article “The Really Big One” touched briefly on the earthquake stories of the Pacific Northwest’s indigenous peoples. Finkbeiner’s piece is centered on those same stories. Very much worth reading.

On Chambliss and Takacs, How College Works

How College Works, by Daniel F. Chambliss and Christopher G. Takacs, is a book about the ingredients of a successful education at a small liberal arts college. The library is, to appearances, largely irrelevant to this education; the authors only mention the library a single time. Moreover, that is a remark in passing: time and space can be flexible at college, as for example how “…the library morphs into a social center” (86). The library is otherwise absent, buildings, collections, and librarians alike. Yet How College Works is worth librarians’ time, and I think a place exists for the library within the authors’ vision of higher education.

The subject of How College Works is upstate New York’s Hamilton College. The authors are, respectively, a sociology professor at Hamilton and an alumnus finishing a doctorate in sociology at the University of Chicago. The main source for the research is a series of interviews with former students at Hamilton, conducted five to ten years after the student’s arrival. The goal was to allow the students time and space to reflect on their college experiences so that they could identify the elements that significantly affected their lives, whether positive or negative. A second important source for the book is a qualitative, longitudinal study of student writing.

In short, Chambliss and Takacs (C&T) argue that person-to-person relationships are central to the college experience. Healthy relationships, among students and between students and faculty, are a prerequisite to learning. Ten years after students enter, these relationships with peers and with faculty — the long-lasting friendships students made — are what they remembered about their time at Hamilton. These friendships were far more significant than the particulars of any program of study (4-5).

A student does not have to have a large number of friends for their college experience to be a success. Rather, having a few close friends (two or three good friends and one or two faculty mentors) is sufficient, but also essential (17).

This process of building relationships begins even before classes start. If students miss out early on they’re likely to be at a permanent disadvantage. Where one lives makes a difference, and the “lifestyle integration” that C&T find essential to a successful college experience begins here, with “a selected group of residents, close living around the clock, meeting and interacting with others in a variety of roles, multiple uses of time and space, [and] separation from the rest of the world” (89).1

If a community of peers is essential to the college experience, so too are relationships with faculty. C&T identify four characteristics of good teachers: they excite the student about the course material, they are skilled and knowledgeable in their discipline, they are accessible, and they are engaging. One of the four can’t be used to differentiate between faculty: at an elite liberal arts college like Hamilton, all the faculty are skilled and knowledgeable. Thus, it’s the other three characteristics that make the difference. Taken together, they represent students’ reaction to teachers. And first impressions are important, for a discipline as much as for a person. An introductory teacher represents a whole field of study to a new student. A good teacher can remake a student’s entire career by drawing them into their field, while a single bad professor can keep a student from ever returning (47, 50-51).2

Close contact with faculty is essential to students’ well-being. Mentoring over the long term, whether from professors or from other figures like athletic coaches, is far more important than formal relationships with an assigned advisor. Good mentoring results in a virtuous cycle of success and attention. But even smaller gestures make significant differences. C&T found that, ten years later, students remembered and valued invitations to professors’ houses. This practice, though, may only be possible at smaller colleges within a particular geography and culture — urban universities, or large land-grant institutions, don’t offer this possibility (55-59).

It turns out that these relationships with peers, with faculty, and with a larger campus community are not merely pleasant, but are essential motivators in the learning process. By their junior years students are focusing on the specifics of their chosen discipline. They need to acquire skills, knowledge, and methods particular to those fields. This requires hard, concentrated work: “tasks [that] must be faced by each individual, and students studying alone seem to perform better than those working in groups.” The student might need to study alone for the deepest learning, but even then the community of peers and mentors provides important support (105-106).

To sum up, Chambliss and Takacs believe that “college works when it provides a thick environment of constant feedback, driven by the establishment and maintenance of social relations” (132). Later I’ll look at how those factors influence several topics covered in How College Works that are of particular interest to librarians: the teaching of skills, the importance of place, and assessment.


  1. C&T find that dorms with long hallways and shared facilities, on which students will be able to regularly interact with 30-100 peers, are ideal. Some will have the preexisting community of an intercollegiate sports team. Others might join the Greek system. These practices come much easier to middle and upper-middle class whites (still the preponderance of the student body) than to their working class or minority fellows. The authors also recommend a serious, meaningful orientation --- for example, a full week of outdoor adventure. But if this isn't paid for by the college it will leave disadvantaged students further behind. 
  2. The authors argue that, contrary to usual practice, the best teachers --- those who are not only expert, but also exciting, engaging, and accessible --- should be teaching introductory classes and the largest lecture classes. Less engaging faculty, though knowledgeable, shouldn't be a student's first experience of a given field. 

Historical Documents in a Digital Library: OCR, Metadata, and Crowdsourcing

This was originally written as a paper for Chris Tomer’s graduate class on Digital Libraries at the University of Pittsburgh this past spring. It’s my attempt to articulate some ideas about what makes online historical documents usable — or not usable — for researchers. Comments and criticism are welcome!

Over the past decade, a vast number of historical materials from the past three centuries have been digitized and placed on the Internet. The majority of these have been printed sources: newspapers and books. Some have been digitized as part of a proprietary system (for example, Readex’s Early American Imprints). Others have been made publicly accessible (Google Books, or projects from the Library of Congress under American Memory). The grand hope of all of these was to provide searchable full text online. This would be done through the magic of optical character recognition (OCR) software. Surely, librarians might have thought ten or fifteen years ago, software quality and processing power would improve rapidly, soon permitting quick and accurate reproduction of any text.

The promise of OCR has gone largely unfulfilled. While modern printed sources are easily read, older ones are not. This should lead us to reconsider how we think about these documents — how we categorize them. In a pre-digital world, there is not much difference between the modern newspaper and the eighteenth-century one. Both are opened and easily skimmed, column by column. Contrast that to a manuscript — a letter or diary — which is much harder to read.

But in the digital age, if images and computer-generated text are available over the web, the older newspapers have more in common with manuscripts than they do with newer printed materials. The latter are searchable; the former are not.

To profitably use a big digital collection of historic materials, a researcher needs to be able to search the contents, to winnow down centuries of text. In other words, he or she needs either quality OCR or quality metadata. If a large collection is well OCR’d, then you can get by without robust metadata. But if a collection is poorly OCR’d, text search will not work, and you need robust metadata for the library to be useful at all.

An example of the latter is the old microprint edition of Early American Imprints. The documents were in physical form and thus not searchable at all. But the makers had created robust metadata, and this meant that, when libraries began using digital catalogs, the metadata could be ported into that catalog. Early American Imprints would be searchable along with the rest of the library’s holdings.

Printed materials and manuscript materials should be seen as parts of a larger continuum. Indeed, the major difference is not between manuscript and print, but between modern (post-1950) printed materials on the one hand and pre-modern printed works and manuscripts together on the other. This primary difference is based on the ability to create an accurate OCR text from a high-quality scan of the paper source.

Modern printed materials can be easily transformed into accurate, searchable text. Twentieth-century printing methods produced a clear, precise, and, importantly, regular type. A computer can convert those shapes into text with little trouble.

An example is the work done on the George Washington papers by the Library of Congress. Many of Washington’s letters and papers had been transcribed and printed. These included works printed in the years 1898, 1931–1944, and 1976–1979. Even in 1998–2000, when OCR technology was significantly less powerful than it is today, librarians were able to achieve high accuracy rates: they claim 99.95%, or one error in every 2000 characters. It is a sign of how far the technology has advanced that today a rate of 99.98% (sixty percent fewer errors than the George Washington Papers project) is considered a bar for high accuracy.
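The arithmetic behind those two figures can be checked in a few lines. This is only a sketch of the calculation; the function name is my own, and exact fractions are used so that rounding does not muddy the comparison.

```python
from fractions import Fraction


def error_rate(accuracy_pct: str) -> Fraction:
    """Convert an accuracy percentage (given as a string) to an exact error rate."""
    return 1 - Fraction(accuracy_pct) / 100


# Accuracy figures quoted in the text.
gw_rate = error_rate("99.95")      # George Washington Papers project
modern_rate = error_rate("99.98")  # present-day bar for "high accuracy"

# Characters per error, and the relative reduction in errors.
gw_chars_per_error = 1 / gw_rate
modern_chars_per_error = 1 / modern_rate
reduction = 1 - modern_rate / gw_rate

print(gw_chars_per_error, modern_chars_per_error, reduction)
```

In other words, 99.95% is one error per 2000 characters, 99.98% is one per 5000, and the second rate has three-fifths (sixty percent) fewer errors than the first.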

On the other hand, handwriting recognition is still fraught with problems and errors, even when done with software designed to learn a particular person’s script. As a result, it isn’t currently possible for a computer to transcribe a historical manuscript into searchable text.

The category in the middle, pre-modern printed materials, is deceptive. To a human eye, it is very similar to modern print. But for the OCR program, it is much more like handwriting. The deficiency of current OCR capabilities when applied to older print sources is illustrated by a series of rigorous tests performed on the Nineteenth-Century Newspaper Project. This recent study found that character accuracy was 83.6%. Already this is alarming. But the corpus is searched not by character, but by word. And the word accuracy was significantly worse: 78%. Further, proper nouns (the names and places beginning with capital letters, and the words that a researcher would be most likely to search for) were recognized only 63.4% of the time.

Other researchers have found similarly appalling accuracy rates. A 2007 study by the Dutch National Library of the results produced by several OCR contractors found a significant amount of variance: “…the rates respondents gave for newspaper digitisation projects vary from 99.8% for 700,000 newspaper pages (word accuracy, manually corrected) to 68% (character accuracy, no correction) for 350,000 pages of early 20th century newspapers.” Another study, this one by the Australian Newspaper Digitisation Program, found similar variance: “In a sample of 45 pages to be representative of the libraries digitised newspaper collection 1803-1954, we found that raw OCR accuracy varied from 71% to 98.02% (character confidence, no correction).”

Clearly a raw character accuracy of 68 to 71% renders the resulting text useless for searching. Higher rates are more useful — yet even the 98% character accuracy at the high end of the Australian study will result in many missed words. If the British finding that proper nouns are recognized significantly less than their common-noun counterparts holds true here as well, then the power of full-text search is hampered even more. In sum, the researchers found that in their corpus of nineteenth-century newspapers, recall was high, precision was relatively low, and fallout was high.
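A rough way to see why even 98% character accuracy misses many words: if we assume, purely for illustration, that each character is misread independently of its neighbors, then a word survives only when every one of its letters does. The function below is a sketch of that simplified model, not a description of how any of the studies measured accuracy.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character in a word is read correctly,
    ASSUMING character errors are independent (a simplification)."""
    return char_accuracy ** word_length


# Share of words expected to come through intact at 98% character accuracy.
for length in (4, 7, 10):
    share = word_accuracy(0.98, length)
    print(f"{length}-letter words: {share:.1%} recognized")
```

Under this model, roughly one seven-letter word in eight is lost at 98% character accuracy, and longer words (surnames, place names) fare worse. Real OCR errors tend to cluster on damaged or blurry patches rather than falling independently, so measured word accuracy can come out higher than the model predicts; the point is only that word-level loss compounds character-level loss.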

This can be contrasted to a pre-digital form of searching: the index. An example is the comprehensive index to the Virginia Gazette from 1737 to 1790, prepared by historians at Colonial Williamsburg in 1950. In this index are contained references to proper names (people and places) and subject terms. (Colonial newspapers generally were populated by anonymous or pseudonymous pieces, so no authors are indexed.) An index like this, rigorously compiled and checked, provides a very different profile: very high precision, moderate to good recall, depending on the rigor, and low fallout.
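For readers who don't work with these measures daily, here is a minimal sketch of how precision, recall, and fallout are computed for a single search. The counts in the example are invented for illustration and do not come from any of the studies cited.

```python
def retrieval_metrics(hits: int, false_hits: int, misses: int, correct_rejections: int):
    """Return (precision, recall, fallout) for one search.

    hits               -- relevant articles the search returned
    false_hits         -- irrelevant articles the search returned
    misses             -- relevant articles the search failed to return
    correct_rejections -- irrelevant articles correctly left out
    """
    def ratio(part: int, whole: int) -> float:
        return part / whole if whole else 0.0

    precision = ratio(hits, hits + false_hits)
    recall = ratio(hits, hits + misses)
    fallout = ratio(false_hits, false_hits + correct_rejections)
    return precision, recall, fallout


# Hypothetical search of a 1,000-article corpus via a hand-built index:
# few false hits, some relevant articles the indexers never flagged.
precision, recall, fallout = retrieval_metrics(
    hits=40, false_hits=2, misses=10, correct_rejections=948
)
print(f"precision={precision:.2f} recall={recall:.2f} fallout={fallout:.4f}")
```

Precision asks how much of what came back was wanted, recall asks how much of what was wanted came back, and fallout asks how much of the unwanted material slipped in.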

#

What the Virginia Gazette index provides, in essence, is metadata. In a pre-digital world, this was the only way of “searching” the corpus. But in a world of digital libraries, such an index would seem unnecessary. And perhaps it would be, were the online text of newspapers acceptably accurate.

When digitizing the eighteenth-century run of the Virginia Gazette, the digital humanities specialists did not even seriously consider putting searchable text online. OCR was quickly found not to work well on the microfilm versions of the newspaper, and costs to have the text inputted manually were far beyond their budget. Instead they went through a laborious process of scanning and OCRing the index (which, typeset in Courier in the mid-twentieth century, could be done with high accuracy). They then placed the index online in HTML format, with links leading to the scanned images of newspaper pages. In this they were helped by another feature of the print index: it listed not just the issue date, but the page and column of the entry.

The creators envisioned a workflow that took advantage of the diligent labor of the mid-century index compilers and married it to the speed and convenience of the digital library. When working with the digital Virginia Gazette, a researcher would first search the index web page for a relevant term. Then he or she could tab back and forth between the index and a set of open images, quickly running through a list of results. All in all, the technique was successful; the disadvantage, of course, is that it is not so easy to read a run of consecutive issues, or even consecutive pages.
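The index-to-image linkage described here can be pictured as a simple lookup table. What follows is a hypothetical sketch, not the project's actual implementation: the entries, the dates, the `image_path` naming scheme, and every identifier are invented for illustration. The only thing taken from the text is that each index entry records an issue date, a page, and a column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    issue_date: str  # ISO date of the newspaper issue
    page: int
    column: int


# Invented sample entries; a real index would hold many thousands.
INDEX = {
    "ferry": [IndexEntry("1751-03-07", 3, 2), IndexEntry("1766-10-17", 1, 1)],
    "tobacco": [IndexEntry("1739-06-01", 2, 1)],
}


def image_path(entry: IndexEntry) -> str:
    """Build the (hypothetical) location of the scanned page image."""
    return f"images/{entry.issue_date}/page-{entry.page}.jpg"


def lookup(term: str):
    """Return (entry, image path) pairs for an index term, in index order."""
    return [(entry, image_path(entry)) for entry in INDEX.get(term.lower(), [])]


for entry, path in lookup("Ferry"):
    print(f"{entry.issue_date}, p. {entry.page}, col. {entry.column} -> {path}")
```

Because the entry carries the column as well as the page, a reader who opens the linked image knows where on the page to look, which is what makes tabbing between index and images quick.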

Yet not every old newspaper was printed in a town that had been bought by the scions of a Gilded Age dynasty. The money that John D. Rockefeller pumped into Colonial Williamsburg for restoration of the colonial city, and the research to make that possible, was not present everywhere. Thus, most eighteenth- and nineteenth-century newspapers do not have detailed, proofread indexes waiting as a gateway to a digital edition. Another source of metadata must be found.

OCR might be one option: it can read article titles with a moderate degree of accuracy, and, if it could pick out proper nouns with any consistency whatsoever, could index those. But, given the poor quality of the microfilm that is used to make scans of newspaper pages, OCR simply can’t cope with the demands. The amount of cleanup required would mean that librarians might as well just read the articles and index the text themselves. At least in this way they could index concepts and make a true subject index — not something that literal OCR software can do.

A workaround, tried by the Australian Newspaper Digitisation Program, is to correct the OCR version of only the headline and the first several lines of text, in hopes that this would catch the most significant aspects of the article. But in practice this still takes a great deal of time. And, from the historical point of view, another problem emerges. The standard journalist’s model of writing in the present day, in which a story begins with a lead paragraph containing a summary and essential details, was simply not part of the eighteenth- or nineteenth-century repertoire. Articles from those periods are just as likely to unfold slowly, like an oration rather than a summary of facts.

To recap: for most old newspapers no preexisting metadata exists, the software to correctly create that metadata is similarly absent, and the costs in time and money for librarians to create the metadata on their own are unworkable. Yet the quality of OCR means that, without metadata, the digital text itself is an imperfect and unreliable reflection of the actual content of the newspaper.

What is to be done? The Australian Newspaper Digitisation Program came up with an innovative solution: crowdsourcing. They made it possible for users to “view, edit, correct and save OCR text on whatever article they happened to be looking at.” Knowing that particular documents had unusually bad OCR, they highlighted those images to encourage patrons to improve them. The crowdsourcing was an instant success. Within three months of the project’s launch 1200 individuals had edited “700,000 lines of text within 50,000 articles.” Further, based on information from the two-thirds who had registered for accounts rather than working anonymously, the volunteer correctors were largely experts in the places and time period covered in the newspapers. This meant they were better able to use context to puzzle out difficult words.

As the project progressed, more and more users began to edit the newspaper text. They also developed elements of a community. Since the interface was very basic and there was no forum area, they took to using the comment mechanism as a way to interact with their fellow correctors. To their infinite credit, the Australian Newspaper Digitisation Program has not tried to exert particular control over user activities. Realizing that they’ve got a good thing going (valuable work being performed by a vibrant community), they have instead stood back and watched that community develop. They found that “having no moderation and being open and free like the internet has raised many questions but has so far resulted in bringing more advantages than issues to the program.”

This is the sort of project over which librarians in the United States seem to drag their feet, unwilling to give up control. The best example of crowdsourced editing of historical newspapers I know of in this country is that put together by University of Virginia history professor Ed Ayers (now president of the University of Richmond). Ayers had his lecture classes (often several hundred students) go through nineteenth-century Virginia newspapers and cull local news to be put into a database by county. This is, of course, a different kind of project: the intent is to produce a refined database rather than improve the primary sources. But, once again, hundreds or thousands of people working on small bits of a project produced usable data far superior to what modern software would have come up with.

#

So in at least two cases crowdsourcing has worked as a way to produce usable, index-ready text from image files and low-quality OCR. Old newspapers are but one source for which this technique has potential. Other printed materials could be made accessible, and beyond print is manuscript. Historical archives in the United States and elsewhere are notoriously conservative institutions. But it would take relatively little effort and not much more in the way of resources for them to provide the materials that could generate their own online community of researchers. It would be enough to provide a digital library of reasonably decent image files of manuscripts, and a web interface that allowed researchers to transcribe the material for their own use while also saving the transcription for other patrons’ benefit. Allow users to create tags for the material, as the Australian project does, and you also have the beginnings of a robust set of metadata.