Extending Our Capabilities Through Automated Knowledge Acquisition

Omics have been popping up everywhere. While reading the paper “Quantitative Analysis of Culture Using Millions of Digitized Books” (published January 14, 2011 in Science) I gasped at the neologism culturomics. Surely the omics trend is out of control! But wait. It occurred to me that just a couple days ago I used an omics word in a post to this blog. The word was holonomic, which was used within the context of Dr. Karl Pribram’s holonomic brain theory. Didn’t those of us working with Karl at the time he was writing his book “Brain and Perception” talk about the holonomic brain theory? And wasn’t that during the pre-omic world? Time to go to Google’s Ngram Viewer!

Figure 1. The frequency of use of the words genomic and holonomic in English language books published from 1880 through 2008. Smoothing is set to zero to show raw data results.

A quick search for the words genomic and holonomic in books from 1880 through 2008 shows a typical lesson from history (see Figure 1 above). The words have been in use for a lot longer than I expected. Notice the little bumps as far back as the late 1890s. But how were the words used?

First let’s consider the accuracy of the data. When I look at books cited as containing the word genomic before the 1930s, most if not all of the citations are errors. The great majority are mistakes in optical character recognition. Many of the late 19th century mistakes come from the French word generale, and others from aenemic, economics, and Cenozoic. Some of the errors are due to wrong dates. For example, a book from 1982 may be listed as from 1882. The picture changes around the 1930s, when genuine uses begin to appear. For example, a 1939 lecture given by Richard Goldschmidt stated “The facts reported indicate differences between species which are on a chromosomal level and, maybe, frequently even on a genomic level.” This was published in a 1940 book titled “The Material Basis of Evolution.”

Note: A 1-gram is a string of characters uninterrupted by a space. An n-gram is a sequence of 1-grams, such as the phrases “holonomic brain” (a 2-gram) and “the neuron doctrine” (a 3-gram). Usage frequency (y-axis in graphs by Google’s Ngram Viewer) is computed by dividing the number of instances of an n-gram in a given year by the total number of n-grams in the corpus in that year.
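
For readers who like to see a definition in code, here is a minimal Python sketch of the usage frequency calculation described in the note above. It is only an illustration, not the code behind the Ngram Viewer. The function names (ngrams, usage_frequency) and the tiny corpus are made up for the example, and I am assuming that the denominator counts n-grams of the same length as the search phrase and that matching is case-sensitive.

```python
from collections import Counter


def ngrams(text, n):
    """Return every n-gram in text as a tuple of 1-grams."""
    # A 1-gram is a run of characters uninterrupted by a space.
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def usage_frequency(phrase, texts_by_year):
    """Map each year to count(phrase) / total n-grams of the same length."""
    target = tuple(phrase.split())
    n = len(target)
    result = {}
    for year, texts in texts_by_year.items():
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text, n))
        total = sum(counts.values())
        # Guard against years with no n-grams of this length.
        result[year] = counts[target] / total if total else 0.0
    return result


# Made-up toy corpus, for illustration only.
toy_corpus = {
    1990: ["the neuron doctrine was widely accepted"],
    1991: ["the holonomic brain theory", "a holonomic brain model"],
}

freq = usage_frequency("holonomic brain", toy_corpus)
for year in sorted(freq):
    print(year, freq[year])
```

In the toy corpus the 2-gram holonomic brain never occurs in 1990, while in 1991 it accounts for two of the six 2-grams, or one third.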

The y-axis in Figure 1 shows the search word’s percentage of all the words (1-grams; see note above) in all the books published in English that are currently part of the database. Even when genomic becomes relatively common, the word peaks at about 0.000300% of all words. That means genomic occurs 3 ten-thousandths of one percent of the time in English language books. The little bumps between 1930 and 1960 are far smaller: less than 0.00000050%, or 5 ten-millionths of one percent, during most of the 1930s and less than 0.00000250%, or 25 ten-millionths of one percent (5 times as much), during 1960. These little bumps aren’t even detectable in the graph shown in Figure 1, but they include a high percentage of real usage (rather than errors), in ways similar to the way the word is used today. Around 1970 the use of genomic becomes detectable in Figure 1 at 0.00000700%, or 7 millionths of one percent.
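
Percentages with this many zeros are hard to read, so here is a small Python sketch that restates the figures quoted above as occurrences per billion words. The helper name per_billion is my own invention, and the labels simply repeat the readings from the paragraph above. Nothing here comes from the Ngram Viewer itself.

```python
def per_billion(percent):
    """Convert a percentage of all words into occurrences per billion words."""
    return percent / 100 * 1_000_000_000


# Percentages as quoted in the text above (two of them are upper bounds).
readings = {
    "genomic, peak": 0.000300,
    "genomic, most of the 1930s (upper bound)": 0.00000050,
    "genomic, 1960 (upper bound)": 0.00000250,
    "genomic, around 1970": 0.00000700,
}

for label, pct in readings.items():
    print(f"{label}: about {per_billion(pct):,.0f} per billion words")
```

Read this way, the peak is about 3,000 occurrences per billion words, while the bumps in the 1930s amount to fewer than 5 per billion.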

Figure 2. The 2-gram holonomic brain appears in English language books beginning around the publication of “Brain and Perception” on June 1, 1991. Smoothing is set to zero to show raw data results.

Interestingly, even though the word holonomic has remained rare, early references actually pan out as genuine rather than errors. For example, a set of 16 books containing the word holonomic was returned for the years 1902 through 1905. All 16 citations were correct. On the other hand, the word is so rare in the corpus as to be barely detectable, at less than 0.00000040%, or 4 ten-millionths of one percent. Holonomic was defined as “a dynamical system for which a displacement represented by arbitrary infinitesimal changes in the coordinates is in general a possible displacement” in the 1904 book “A Treatise on the Analytical Dynamics of Particles and Rigid Bodies” by Edmund Taylor Whittaker. The word was not used in relation to the brain but in mathematical definitions of specific types of dynamical systems. It wasn’t until around the publication of “Brain and Perception” on June 1, 1991 that the 2-gram holonomic brain appeared in the literature (see Figure 2 above).

All of this points to how fun the tools and data set presented in the paper “Quantitative Analysis of Culture Using Millions of Digitized Books” can be. The paper states that over 15 million books (about 12% of all books ever published) have been digitized by Google so far. The authors carried out some cultural investigations on a subset of those data containing 5,195,769 books (about 4% of all books ever published).

Note: Those interested in research methods and other details should download the supporting online material for this article, an 88 page pdf file, available here.

Mass access to our published heritage is a positive development. However, even the most voracious reader can read only an extremely small percentage of published books and literature. As the authors said in the paper, “If you tried to read only English-language entries from the year 2000 alone, at the reasonable pace of 200 words/min, without interruptions for food or sleep, it would take 80 years.”
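
To get a feel for the scale in that quote, here is a back-of-the-envelope Python sketch that works out how many words 80 years of nonstop reading at 200 words per minute would cover. The word total is my own inference from the two quoted figures, not a number taken from the paper, and the function name words_read and the 365.25-day year are assumptions made for the example.

```python
WORDS_PER_MINUTE = 200  # reading pace quoted in the paper
YEARS = 80              # reading time quoted in the paper
MINUTES_PER_YEAR = 60 * 24 * 365.25  # assumes a 365.25-day year


def words_read(words_per_minute, years):
    """Words covered by reading nonstop at a fixed pace for a number of years."""
    return words_per_minute * MINUTES_PER_YEAR * years


implied_words = words_read(WORDS_PER_MINUTE, YEARS)
print(f"About {implied_words / 1e9:.1f} billion words")
```

Under these assumptions the quote implies roughly 8.4 billion words for the English-language entries from the year 2000 alone.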

How will we, as finite beings, be able to keep up? Even within our areas of special interest? Clearly twenty-first century breakthroughs will be about extending our capabilities through automated knowledge acquisition. That’s where the Semantic Web comes in.

Note: The full data set described in the paper is available for exploration or download at www.culturomics.org and ngrams.googlelabs.com.

