Information Empires:
The Challenge of Excess Text

Geoffrey Rockwell

TSH 511A, McMaster University
1280 Main St. W.
Hamilton, ON, L8S 4M2

georock@mcmaster.ca

"The cultural record is currently fragmented over more or less arbitrary institutional boundaries—for example, the relevant materials for understanding one artist will be held in a dozen different museums, twenty libraries, and ten archives." (ACLS Commission on Cyberinfrastructure, 2005)

One of the challenges digital humanities should tackle is the overwhelming amount of distributed information, especially textual information relating to our cultural record. A cultural theorist who is interested in "the cool", for example, will find that Google reports 14,500,000 results. [1] For the first time we have an excess of information, mostly textual, that is online and can be treated as one "empire" of information for the purposes of research.

How can these large-scale aggregations, which we will here call "empires", be studied? [2] What questions can we ask of them? What methods can we use to study them? Naming the excesses of information is the first challenge. What are we talking about? I will use the word "empire" for aggregations of information that have the following properties.

Information empires pose a related set of interesting problems:

1. It is hard just to discover what exists across collections in order to identify an empire of study in the first place. The problem of discovery is what metadata harvesting should help us with, but once we have discovered the materials it is impossible to aggregate them so as to treat them as one.

2. Even if you could create a virtual library of texts, it is hard to study them as one. Studying an aggregated empire of information means:

The focus of this paper will be, however, on the problem of methods for empires. In short, our thesis is that the methods and tools developed for the study of literary texts will not scale to empires of information, because they are built on the concordance model: search for a pattern and then read the results. How can one meaningfully read a concordance with over a million results? Reading is not an option with aggregations over a certain size. Reading concordances is also not an option. What we need are new models that we can call text mining methods. These are suited to providing prospects, or meaningful views of the whole, rather than to finding something within the whole. A related problem is that the internet search engines, which are the only empire-wide tools available, are premised on the idea that we want to find documents. Google is designed to help you find a small set of documents that will answer your question. It is not designed to study the whole result set.
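The contrast between the two models can be made concrete with a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any existing tool: the function names `concordance` and `collocate_prospect`, the crude tokenizer, and the sample sentences are all hypothetical. The first function returns one keyword-in-context line per hit, which someone must then read; the second collapses the same hits into a short table of the words that most often appear near the keyword, which is one simple kind of prospect on the whole.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(docs, keyword, width=3):
    """Concordance model: return one keyword-in-context line per hit.

    The output grows with the number of hits, so it must be read line by line.
    """
    lines = []
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocate_prospect(docs, keyword, width=3, top=5):
    """Prospect model (illustrative): summarize all hits as the most frequent nearby words.

    The output stays at most `top` rows no matter how many hits there are.
    """
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
                counts.update(window)
    return counts.most_common(top)


if __name__ == "__main__":
    # Hypothetical sample texts, invented for this sketch.
    sample = ["The cool jazz of the fifties", "Cool is a cool word"]
    for line in concordance(sample, "cool", width=2):
        print(line)
    print(collocate_prospect(sample, "cool", width=2, top=2))
```

With a million hits the concordance would have a million lines, while the collocate table would still have only as many rows as requested. A real text mining method would need to go well beyond raw counts (stop words, normalization, statistical weighting), but the difference in the shape of the result is the point.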

In this paper I will lay out the problem of large-scale information methods, first by discussing the problems of empires and then by discussing methods for empires. I will propose an agenda for computing humanists for adapting data mining methods to text empires. I will connect this problem of methods to the conclusions of the recent Summit on Digital Tools for the Humanities, which proposed that the Exploration of Resources on the large scale is one of the major tool challenges ahead.