<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Blog plugin I would like: Similar Old Books</title>
	<atom:link href="http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books/feed" rel="self" type="application/rss+xml" />
	<link>http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books</link>
	<description>Pontification without all the gritty gravitas</description>
	<pubDate>Thu, 07 Aug 2008 23:47:13 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Tozier</title>
		<link>http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1752</link>
		<dc:creator>Tozier</dc:creator>
		<pubDate>Sat, 02 Sep 2006 10:56:01 +0000</pubDate>
		<guid isPermaLink="false">http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1752</guid>
		<description>Trigram classification is a standard design pattern in Natural Language Processing research: a good starting point, with a lot of evidence to show that bigrams are insufficient to differentiate, and 4-grams a bit too sparse for clustering [most] documents. But trigrams do only capture local information about documents: the structure and frequency of certain words.

My goals, perhaps more important than my assumptions, are that one wants such a thing to balance exploration and exploitation: to present surprising texts, because that's the nature of surfing and blogging; but at the same time to be sufficiently similar to the blog entry to seem nonrandom, and potentially share topics or stylistic material.

The sort of "statistically improbable phrases" approach used by Amazon would probably not work, because the phrases themselves tend to be subject-specific. A blog entry bitching about a bad meal at McDonald's would probably not come up with anything in the 19th Century, by phrase, but may bring up other bitching passages. Depending on how it's phrased. Use enough words like "greasy" and "oleaginous" and you may be surprised at the links that arise spontaneously....</description>
		<content:encoded><![CDATA[<p>Trigram classification is a standard design pattern in Natural Language Processing research: a good starting point, with a lot of evidence to show that bigrams are insufficient to differentiate, and 4-grams a bit too sparse for clustering [most] documents. But trigrams do only capture local information about documents: the structure and frequency of certain words.</p>
<p>My goals, perhaps more important than my assumptions, are that one wants such a thing to balance exploration and exploitation: to present surprising texts, because that&#8217;s the nature of surfing and blogging; but at the same time to be sufficiently similar to the blog entry to seem nonrandom, and potentially share topics or stylistic material.</p>
<p>The sort of &#8220;statistically improbable phrases&#8221; approach used by Amazon would probably not work, because the phrases themselves tend to be subject-specific. A blog entry bitching about a bad meal at McDonald&#8217;s would probably not come up with anything in the 19th Century, by phrase, but may bring up other bitching passages. Depending on how it&#8217;s phrased. Use enough words like &#8220;greasy&#8221; and &#8220;oleaginous&#8221; and you may be surprised at the links that arise spontaneously&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Branko Collin</title>
		<link>http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1727</link>
		<dc:creator>Branko Collin</dc:creator>
		<pubDate>Sat, 02 Sep 2006 00:41:06 +0000</pubDate>
		<guid isPermaLink="false">http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1727</guid>
		<description>I had overlooked the "passages" bit, or rather interpreted that as "entire documents", of which there are only 20k in PG.

Still, why do you assume that matching histograms will produce the most similar passages? And why use trigrams instead of bigrams? Or why not even use other measurements, such as word lengths, or letters, and the way in which they are distributed in a small window?

What is the assumption you base this method on?</description>
		<content:encoded><![CDATA[<p>I had overlooked the &#8220;passages&#8221; bit, or rather interpreted that as &#8220;entire documents&#8221;, of which there are only 20k in PG.</p>
<p>Still, why do you assume that matching histograms will produce the most similar passages? And why use trigrams instead of bigrams? Or why not even use other measurements, such as word lengths, or letters, and the way in which they are distributed in a small window?</p>
<p>What is the assumption you base this method on?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tozier</title>
		<link>http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1519</link>
		<dc:creator>Tozier</dc:creator>
		<pubDate>Mon, 28 Aug 2006 13:02:57 +0000</pubDate>
		<guid isPermaLink="false">http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1519</guid>
		<description>Right, I posted too quickly.

Let's look at a more realistic example.

So for a 2x10&lt;sup&gt;6&lt;/sup&gt;-character corpus, with 30 characters in the alphabet (ignoring diacritical marks for now), suppose I'm wanting a 200-character passage. So with the semi-dumb search mentioned above, that's 10&lt;sup&gt;4&lt;/sup&gt; nonoverlapping "keys", each of which represents (independent of how you compress them) a 30&lt;sup&gt;3&lt;/sup&gt;-element vector of rational numbers.

A hash or well-crafted database will indeed manage the key data, and there are relatively fast nearest-neighbor search methods. But I've still kept the numbers very small, compared to the actual PG corpus, now and in the future....</description>
		<content:encoded><![CDATA[<p>Right, I posted too quickly.</p>
<p>Let&#8217;s look at a more realistic example.</p>
<p>So for a 2&#215;10<sup>6</sup>-character corpus, with 30 characters in the alphabet (ignoring diacritical marks for now), suppose I&#8217;m wanting a 200-character passage. So with the semi-dumb search mentioned above, that&#8217;s 10<sup>4</sup> nonoverlapping &#8220;keys&#8221;, each of which represents (independent of how you compress them) a 30<sup>3</sup>-element vector of rational numbers.</p>
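<p>The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the variable names are illustrative, and the figures are the ones quoted in the comment.</p>

```python
# Back-of-envelope sizes for the example above.
# All figures come from the comment; the variable names are illustrative.
corpus_chars = 2 * 10**6   # characters in the example corpus
alphabet_size = 30         # letters, ignoring diacritical marks
passage_chars = 200        # length of the passage we want back

num_keys = corpus_chars // passage_chars   # nonoverlapping windows ("keys")
vector_len = alphabet_size ** 3            # one slot per possible trigram

print("keys:", num_keys)
print("elements per key:", vector_len)
print("elements in total if stored densely:", num_keys * vector_len)
```

<p>Most of those elements would be zero for any 200-character window, which is one reason a sparse or hashed representation seems the natural fit here.</p>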
<p>A hash or well-crafted database will indeed manage the key data, and there are relatively fast nearest-neighbor search methods. But I&#8217;ve still kept the numbers very small, compared to the actual PG corpus, now and in the future&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tozier</title>
		<link>http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1518</link>
		<dc:creator>Tozier</dc:creator>
		<pubDate>Mon, 28 Aug 2006 12:51:35 +0000</pubDate>
		<guid isPermaLink="false">http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1518</guid>
		<description>My thought was inspired by traditional linguistic Ngram vector embeddings: For a passage, the occurrence rates of all trigrams are counted, and the resulting list of [more like] 30^3 elements is normalized so all values fall in a unit hypercube.

To simplify, consider an alphabet of just two letters, &lt;tt&gt;a&lt;/tt&gt; and &lt;tt&gt;b&lt;/tt&gt;. Then there are eight trigrams: &lt;tt&gt;aaa&lt;/tt&gt;, &lt;tt&gt;aab&lt;/tt&gt;, ... &lt;tt&gt;bbb&lt;/tt&gt;.

If we measure the occurrence of trigrams in two texts, we might get two "raw" vectors (6,2,7,0,0,2,1,12) and (12,12,14,2,9,1,8,11). When these are normalized, we get two vectors in an 8-cube: (6/30,2/30...12/30) and (12/69,12/69...11/69).

A "match" here might simply be the k nearest neighbors (by a Euclidean distance metric, or some other) to a point in this space.

Say we have only &lt;i&gt;one&lt;/i&gt; &lt;tt&gt;ab&lt;/tt&gt; text in our "corpus" (the equivalent of PG), and it's 1000 characters long. Suppose what we want to do (I'm experiencing mission creep a little as we talk about this) is find the 100-character passage in that text that &lt;i&gt;most closely matches&lt;/i&gt; the trigram vector of the search text.

So we create an 8-element normalized vector for the search text. The brute force approach would start at the first 100 characters of the corpus, and calculate its normalized vector, and then calculate a different vector for characters 2-101, and so on until we create 900 or so vectors covering overlapping substrings from the text.

But notice also that the normalized vector can only change a little bit when we shift the sampling window 1 character to the right or left. One trigram will leave, and one will enter. This suggests faster search methods for the corpus. If nothing else, we could create "keys" by making ten nonoverlapping 100-character normalized vectors for the 1000-character corpus, and we can &lt;i&gt;expect&lt;/i&gt; the best exact match for the search vector to lie between the best adjacent pair of those.</description>
		<content:encoded><![CDATA[<p>My thought was inspired by traditional linguistic Ngram vector embeddings: For a passage, the occurrence rates of all trigrams are counted, and the resulting list of [more like] 30^3 elements is normalized so all values fall in a unit hypercube.</p>
<p>To simplify, consider an alphabet of just two letters, <tt>a</tt> and <tt>b</tt>. Then there are eight trigrams: <tt>aaa</tt>, <tt>aab</tt>, &#8230; <tt>bbb</tt>.</p>
<p>If we measure the occurrence of trigrams in two texts, we might get two &#8220;raw&#8221; vectors (6,2,7,0,0,2,1,12) and (12,12,14,2,9,1,8,11). When these are normalized, we get two vectors in an 8-cube: (6/30,2/30&#8230;12/30) and (12/69,12/69&#8230;11/69).</p>
<p>A &#8220;match&#8221; here might simply be the k nearest neighbors (by a Euclidean distance metric, or some other) to a point in this space.</p>
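<p>As a rough illustration, the normalization and nearest-neighbor steps might look like the following Python sketch. The function names are hypothetical, the two raw vectors are the ones from the example above, and Euclidean distance is just one of the metrics that could be used.</p>

```python
import math

def normalize(raw):
    """Scale raw trigram counts so they sum to 1 (each value then lies in [0, 1])."""
    total = sum(raw)
    if total == 0:
        return [0.0] * len(raw)
    return [count / total for count in raw]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def k_nearest(query, candidates, k):
    """Return the indices of the k candidate vectors closest to query."""
    order = sorted(range(len(candidates)),
                   key=lambda i: euclidean(query, candidates[i]))
    return order[:k]

# The two "raw" vectors from the example (they sum to 30 and 69).
text_a = normalize([6, 2, 7, 0, 0, 2, 1, 12])
text_b = normalize([12, 12, 14, 2, 9, 1, 8, 11])

print(euclidean(text_a, text_b))
```

<p>Sorting every candidate is fine for a toy case; for a real corpus one would presumably want a spatial index or an approximate nearest-neighbor method instead.</p>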
<p>Say we have only <i>one</i> <tt>ab</tt> text in our &#8220;corpus&#8221; (the equivalent of PG), and it&#8217;s 1000 characters long. Suppose what we want to do (I&#8217;m experiencing mission creep a little as we talk about this) is find the 100-character passage in that text that <i>most closely matches</i> the trigram vector of the search text.</p>
<p>So we create an 8-element normalized vector for the search text. The brute force approach would start at the first 100 characters of the corpus, and calculate its normalized vector, and then calculate a different vector for characters 2-101, and so on until we create 900 or so vectors covering overlapping substrings from the text.</p>
<p>But notice also that the normalized vector can only change a little bit when we shift the sampling window 1 character to the right or left. One trigram will leave, and one will enter. This suggests faster search methods for the corpus. If nothing else, we could create &#8220;keys&#8221; by making ten nonoverlapping 100-character normalized vectors for the 1000-character corpus, and we can <i>expect</i> the best exact match for the search vector to lie between the best adjacent pair of those.</p>
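<p>The one-trigram-leaves, one-trigram-enters observation can be sketched in Python roughly as follows. This is a minimal sketch for the two-letter alphabet only: the names <tt>count_trigrams</tt>, <tt>distance_sq</tt> and <tt>best_window</tt> are made up for illustration, the window is assumed to be at least three characters wide, and squared Euclidean distance stands in for whatever metric is finally chosen.</p>

```python
from itertools import product

ALPHABET = "ab"
TRIGRAMS = ["".join(p) for p in product(ALPHABET, repeat=3)]  # aaa, aab, ... bbb
SLOT = {trigram: i for i, trigram in enumerate(TRIGRAMS)}

def count_trigrams(text):
    """Raw counts of every overlapping trigram in text."""
    counts = [0] * len(TRIGRAMS)
    for i in range(len(text) - 2):
        counts[SLOT[text[i:i + 3]]] += 1
    return counts

def distance_sq(counts_a, counts_b):
    """Squared Euclidean distance between the normalized forms of two count vectors."""
    total_a = sum(counts_a) or 1
    total_b = sum(counts_b) or 1
    return sum((a / total_a - b / total_b) ** 2
               for a, b in zip(counts_a, counts_b))

def best_window(corpus, query, width):
    """Slide a width-character window over corpus (width of 3 or more assumed).

    Returns (start, distance_sq) for the window closest to query.
    """
    target = count_trigrams(query)
    counts = count_trigrams(corpus[:width])
    best_start, best_dist = 0, distance_sq(counts, target)
    for start in range(1, len(corpus) - width + 1):
        # Shifting right by one: the leftmost trigram leaves, a new rightmost one enters.
        counts[SLOT[corpus[start - 1:start + 2]]] -= 1
        counts[SLOT[corpus[start + width - 3:start + width]]] += 1
        dist = distance_sq(counts, target)
        if best_dist > dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

print(best_window("abbabaabbbabaababbab", "babaab", 6))
```

<p>Each shift touches only two counters, so the counting cost per window no longer depends on the window width. The distance calculation still visits every slot, which is cheap with 8 trigrams but would need more care with 30^3 of them.</p>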
]]></content:encoded>
	</item>
	<item>
		<title>By: Branko Collin</title>
		<link>http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1492</link>
		<dc:creator>Branko Collin</dc:creator>
		<pubDate>Sat, 26 Aug 2006 20:52:08 +0000</pubDate>
		<guid isPermaLink="false">http://williamtozier.com/slurry/2006/08/24/blog-plugin-i-would-like-similar-old-books#comment-1492</guid>
		<description>"&lt;i&gt;Maybe a few processing and storage considerations.&lt;/i&gt;"

I'd say there are about 40 times 40 times 40 = 64,000 reasonable trigrams, unless you want case-sensitivity. These can be stored in a manner that is easy to retrieve (that is: sorted, hashed), so there should not be any real storage problems. You'd probably want to have your trigrams stored on a separate server that uses its spare time to update its database.

What would count as most similar though? The post that shares the most trigrams with a given etext, or the post that shares the rarest trigrams with a given etext?</description>
		<content:encoded><![CDATA[<p>&#8220;<i>Maybe a few processing and storage considerations.</i>&#8221;</p>
<p>I&#8217;d say there are about 40 times 40 times 40 = 64,000 reasonable trigrams, unless you want case-sensitivity. These can be stored in a manner that is easy to retrieve (that is: sorted, hashed), so there should not be any real storage problems. You&#8217;d probably want to have your trigrams stored on a separate server that uses its spare time to update its database.</p>
<p>What would count as most similar though? The post that shares the most trigrams with a given etext, or the post that shares the rarest trigrams with a given etext?</p>
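<p>To make the two options concrete, here is a small Python sketch of both scores. Everything in it is an assumption for illustration: the function names, the use of case-folded character trigrams, and especially the log-based rarity weight (borrowed in spirit from inverse document frequency), which nobody in this thread has proposed.</p>

```python
import math

def trigram_set(text):
    """Distinct character trigrams in text, lowercased (case-insensitive)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def shared_count(post, etext):
    """Option 1: number of distinct trigrams the two texts have in common."""
    return len(trigram_set(post).intersection(trigram_set(etext)))

def rarity_score(post, etext, corpus):
    """Option 2: shared trigrams, each weighted by how few corpus texts contain it.

    The weight log((1 + N) / (1 + df)) is an assumed, IDF-like choice:
    a trigram found in every text contributes 0, a rare one contributes more.
    """
    shared = trigram_set(post).intersection(trigram_set(etext))
    corpus_sets = [trigram_set(doc) for doc in corpus]
    score = 0.0
    for trigram in shared:
        doc_freq = sum(1 for s in corpus_sets if trigram in s)
        score += math.log((1 + len(corpus)) / (1 + doc_freq))
    return score

corpus = ["the cat sat", "the dog sat", "oleaginous fries"]
post = "greasy, oleaginous"
for etext in corpus:
    print(etext, shared_count(post, etext), rarity_score(post, etext, corpus))
```

<p>The first score favors long texts that happen to share many common trigrams; the second rewards overlap on unusual ones, which may or may not be what one wants for surprising matches.</p>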
]]></content:encoded>
	</item>
</channel>
</rss>
