The All-Thing: Research
https://all-thing.net/Research
Mankind's Decision-Making Apparatus

Zettair
https://all-thing.net/Research/zettair.html
William, 2004-12-14
<p>We’re going to start some more IR experiments at work and I was looking for a decent open-source full-text search engine to play with. SourceForge turns up a motley collection of overly-complex, overly-web-oriented software, chock full of <span class="caps">PHP</span> and <span class="caps">SQL</span> dependencies. People, let me tell you two things right now:</p>
<ol>
<li><span class="caps">PHP</span> is the Visual Basic of the web. Use it, and no one will take you seriously.</li>
<li>Putting a <span class="caps">SQL</span> backend on a search engine is like building a car around a winch motor. Yah, it’s really “powerful”. It’s also really “slow”.</li>
</ol>
<p>The closest things I could find were <a href="http://www.htdig.org/">htdig</a>, which was way more web-specific and heavy-weight than I desired, and <a href="http://www.namazu.org/">namazu</a>, which had the advantage of Ruby support, but the disadvantage of being a buggy dead project.</p>
<p>Finally a colleague pointed me to <a href="http://www.seg.rmit.edu.au/zettair/">Zettair</a>, from the makers of <a href="http://www.amazon.com/exec/obidos/tg/detail/-/1558605703/qid=1103065908/sr=8-1/ref=pd_ka_1/104-6987541-0123117?v=glance&s=books&n=507846">Managing Gigabytes</a>. It’s open-source, fast, simple, written in C, and extensible. It builds on a ton of platforms and has a <span class="caps">C API</span>. And these guys know what they’re doing. I can’t wait to start playing around with it.</p>

Importance Sampling
https://all-thing.net/Research/importance_sampling.html
William, 2004-11-05
<p>We learned about a neat trick when doing some Monte Carlo simulations last week. It’s not new by any means, but it’s cool, very useful, and I hadn’t heard of it before.</p>
<p>Consider the application of Monte Carlo to the task of determining the probability that an event occurs. Let <math>$S$</math> be the state space and <math>$R \subseteq S$</math> be the region where the event occurs. Assume you have a (“generative”) model that allows you to calculate the probability of any state, <math>$p(s)$</math>, and that you also have a delta function</p>
<math> \[ \delta(s) = \begin{cases} 1 & \text{if } s \in R \\ 0 & \text{otherwise} \end{cases} \] </math>
<p>Monte Carlo says: <math> \[ \int_0^1 f(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N f(x_i) \] </math> where <math>$x_i$</math> is chosen uniformly at random. So to estimate the size of <math>$R$</math> we have:</p>
<math> \[ \sum_{s\in S} p(s)\delta(s) \approx \frac{|S|}{N}\sum_{i=1}^N p(s_i)\delta(s_i) \] </math>
<p>where <math>$s_i$</math> are chosen uniformly at random from <math>$S$</math>. However, computationally speaking, this is not ideal, as typically <math>$p(s)$</math> is extremely small, necessitating software solutions like rational number packages to deal with the underflow.</p>
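<p>To make the underflow concrete, here’s a minimal sketch with a made-up generative model — a sequence of independent biased coin flips. The model, bias, and sequence length are my assumptions for illustration, not the actual setup described here:</p>

```python
import random

# Hypothetical generative model (an assumption for illustration):
# a state s is a sequence of n independent flips of a q-biased coin,
# so p(s) factorizes as q^heads * (1-q)^tails.
n, q = 2000, 0.3

def p(s):
    heads = sum(s)
    return q ** heads * (1 - q) ** (n - heads)

random.seed(0)
# Even a perfectly typical state has a probability far below the
# smallest representable double, so p(s) silently underflows to 0.0.
s = [1 if random.random() < q else 0 for _ in range(n)]
print(p(s))  # 0.0
```

<p>At that point the naive estimator is summing zeros, which is why you’d otherwise need rational-number or log-space arithmetic.</p>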
<p><a href="http://en.wikipedia.org/wiki/Importance_sampling">Importance sampling</a> to the rescue. If we have a function <math>$g(x)$</math> “similar” to <math>$f(x)$</math>, importance sampling says we can rewrite the Monte Carlo equation as</p>
<math> \[ \int_0^1 \frac{f(u)}{g(u)}g(u)\,du \approx \frac{1}{N}\sum_{i=1}^N \frac{f(y_i)}{g(y_i)} \] </math>
<p>where <math>$y_i$</math> is drawn according to <math>$g(y)$</math> rather than uniformly.</p>
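<p>As a quick sanity check of the rewritten equation, here’s a sketch with a toy integrand of my own choosing (f, g, and the sample size are assumptions, not from the post): <math>$f(x)=3x^2$</math> on <math>$[0,1]$</math>, whose integral is exactly 1, with proposal density <math>$g(y)=2y$</math>.</p>

```python
import random

random.seed(1)
N = 100_000

f = lambda x: 3 * x * x   # integrand; the true integral over [0,1] is 1
g = lambda y: 2 * y       # proposal density on [0,1], roughly "similar" to f

# Plain Monte Carlo: uniform x_i, average f(x_i).
plain = sum(f(random.random()) for _ in range(N)) / N

# Importance sampling: draw y_i ~ g (the square root of a uniform
# variate has density 2y), then average the ratio f(y_i)/g(y_i).
ys = [random.random() ** 0.5 for _ in range(N)]
weighted = sum(f(y) / g(y) for y in ys) / N

print(plain, weighted)  # both estimates are close to the true value 1
```

<p>Both converge to 1, but the weighted summand <math>$1.5y$</math> varies far less than <math>$3x^2$</math> does — which is exactly the payoff of choosing a <math>$g$</math> that tracks <math>$f$</math>.</p>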
<p>In our case, if we choose <math>$g(s)=p(s)$</math>, we have</p>
<math> \[ \sum_{s\in S} p(s)\delta(s) \approx \frac{1}{N}\sum_{i=1}^N \delta(s_i) \] </math>
<p>where we choose <math>$s_i$</math> according to <math>$p(s)$</math>. And now we’re accumulating integers rather than very small floating-point numbers. In our case this led to a speedup of many orders of magnitude.</p>
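<p>The whole trick fits in a few lines. Here’s a sketch using an invented toy model (biased coin flips again; the event, bias, and sizes are my assumptions): draw each <math>$s_i$</math> from <math>$p$</math> itself and simply count how often the event fires — <math>$p(s)$</math> is never evaluated at all.</p>

```python
import math
import random

# Toy stand-in for a generative model (an assumption, not the post's):
# n flips of a q-biased coin; the event R is "at least k heads".
n, q, k = 20, 0.3, 10

def delta(s):
    return 1 if sum(s) >= k else 0   # indicator of s in R

random.seed(2)
N = 200_000
# Draw each s_i from p(s) itself and accumulate integers.
hits = sum(delta([random.random() < q for _ in range(n)]) for _ in range(N))
estimate = hits / N

# Exact binomial tail probability, for comparison.
exact = sum(math.comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k, n + 1))
print(estimate, exact)  # the two agree to a few decimal places
```

<p>No tiny probabilities are ever multiplied, so there’s nothing to underflow.</p>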
<p>Neat eh?</p>