Notional Slurry Logo

Blog plugin I would like: Similar Old Books

Say I write a post. This one.

The plugin (which may suffer from CPU and storage issues, but screw that) compiles a trigram-based linguistic vector of the post. That is, it counts how many there are of every 3-letter substring in the entire text. It also notes the length of the post. Or maybe it creates some other sort of computationally linguistic indexing key, like a Statistically Improbably dohickey, or a discrete wavelet alphabetotron. One of those.

The plugin then takes this trigram frequency vector (or whatever), and scans the entire corpus of Project Gutenberg text files for the most similar passages of the same length.

So in the end, you get a link like “Public-domain eTexts which may be interesting,” and it links to the eText, with a same-length passage.

Maybe a few processing and storage considerations. Like I said. But offline processing would work….

Branko Collin said,

August 26, 2006 @ 4:52 pm

Maybe a few processing and storage considerations.

I’d say there are about 40 times 40 times 40 = 640,000 reasonable trigrams, unless you want case-sensitivity. These can be stored in a manner that is easy to retrieve (that is: sorted, hashed), so there should not be any real storage problems. You’d probably want to have your trigrams stored on a separate server that uses its spare time to update its database.

What would count as most similar though? The post that shares the most trigrams with a given etext, or the post that shares the rarest trigrams with a given etext?

Tozier said,

August 28, 2006 @ 8:51 am

My thought was inspired by traditional linguistic Ngram vector embeddings: For a passage, the occurrence rates of all trigrams are counted, and the resulting list of [more like] 30^3 elements is normalized so all values fall in a unit hypercube.

To simplify, consider what we would do if we were using an alphabet of two letters, a and b, then there are eight trigrams: aaa, aab, … bbb.

If we measure the occurrence of trigrams in two texts, we might get two “raw” vectors (6,2,7,0,0,2,1,12) and (12,12,14,2,9,1,8,11). When these are normalized, we get two vectors in an 8-cube: (6/30,2/30…12/30) and (12/69,12/69…11/69).

A “match” here might simply be the k nearest neighbors (by a Euclidean distance metric, or some other) to a point in this space.

Say we have only one ab text in our “corpus” (the equivalent of PG) , and it’s 1000 characters long. Suppose what we want to do (I’m experiencing mission creep a little as we talk about this) is find the 100-character passage in that text that most closely matches the trigram vector of the search text.

So we create an 8-element normalized vector for the search text. The brute force approach would start at the first 100 charcters of the corpus, and calculate its normalized vector, and then calculate a different vector for characters 2-101, and so on until we create 900 or so vectors covering overlapping substrings from the text.

But notice also that the normalized vector can only change a little bit when we shift the sampling window 1 character to the right or left. One trigram will leave, and one will enter. This suggests faster search methods for the corpus. If nothing else, we could create “keys” by making ten nonoverlapping 100-character normalized vectors for the 100-character corpus, and we can expect the best exact match for the search vector to lie between the best adjacent pair of those.

Tozier said,

August 28, 2006 @ 9:02 am

Right, I posted too quickly.

Let’s look at a more reaslistic example.

So for a 2×106-character corpus, with 30 characters in the alphabet (ignoring diacritical marks for now), suppose I’m wanting a 200-character passage. So with the semi-dumb search mentioned above, that’s 104 nonoverlapping “keys”, each of which represents (independent of how you compress them) a 303-element vector of rational numbers.

A hash or well-crafted database will indeed manage the key data, and there are relatively fast nearest-neighbor search methods. But I’ve still kept the numbers very small, compared to the actual PG corpus, now and in the future….

Branko Collin said,

September 1, 2006 @ 8:41 pm

I had overlooked the “passages” bit, or rather interpreted that as “entire documents”, of which there are only 20k in PG.

Still, why do you assume that matching histograms will produce the most similar passages? And why use trigrams instead of bigrams? Or why not even use other measurements, such as word lengths, or letters, and the way in which they are distributed in a small window?

What is the assumption you base this method on?

Tozier said,

September 2, 2006 @ 6:56 am

Trigram classification is a standard design pattern in Natural Language Processing research: a good starting point, with a lot of evidence to show that bigrams are insufficient to differentiate, and 4-grams a bit too sparse for clustering [most] documents. But trigrams do only capture local information about documents: the structure and frequency of certain words.

My goals, perhaps more important than my assumptions, are that one wants such a thing to balances exploration and exploitation: to present surprising texts, because that’s the nature of surfing and blogging; but at the same time to be sufficiently similar to the blog entry to seem nonrandom, and potentially share topics or stylistic material.

The sort of “statistically improbable phrases” approach used by Amazon would probably not work, because the phrases themselves tend to be subject-specific. A blgo entry bitching about a bad meal at McDonalds would probably not come up with anything in the 19th Century, by phrase, but may bring up other bitching passages. Depending on how it’s phrased. Use enough words like “greasy” and “oleaginous” and you may be surprised at the links that arise spontaneously….

RSS feed for comments on this post · TrackBack URI

Leave a Comment