The All-Thing

All stick and no carrot, since ought-three.

各位能夠讀中文得來賓您好。小的這還在學中文中,恐怕中文寫得不太好,希望你們還看得懂。


宮詞 (顧況)

玉樓天半起笙歌,風送宮嬪笑語和。
月殿影開聞夜漏,水晶簾卷近秋河。

Contact:
| web page

Other views:
RSS 1.0
RSS 0.91
Plain (good for lynx)

Past posts:

March
Sun Mon Tue Wed Thu Fri Sat
  4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31      

Recent comments:

Recent search referers:
korean bit torrent movies
pr0n torrent
gay torrent (x2)
"chinese movies" torrent
老女人torrent
"xiii crack" (x2)
XIII torrent (x2)
xiii no cd crack
"ack too long"
shogun total war bit torrent
i want look for gay fried (x2)
"living in Boston"
bit torrent books
Last Samurai bit torrent
taboo bit torrent
24 torrent season 1

Exits:
William's Aggregated Feeds



Creative Commons License
This work is licensed under a Creative Commons License.

       
Fri, 19 Dec 2003

Burrows-Wheeler Transform

So we've been talking a little about this at work and I just found a good page describing it here: http://dogma.net/markn/articles/bwt/bwt.htm

It's like magic. Basically, the BWT is a completely reversible O (n log n) transformation that increases the compressibility of text. It's pretty mind-blowing to follow through the algorithm because it's really not clear, at least on the first pass, how it works, much less why it works. But the basic idea is that it permutes characters by sorting on their (left or right) contexts, which tends to group similar characters together, as they typically have similar contexts. Of course, doing this in a reversible manner is the brilliant part.

So this will work on any data that exhibits local contexts, which is the vast majority of data that we, as humans, care about (though not all—there are apparently certainly types of image data for which this doesn't hold true, though I'm not clear on the specifics). It would be interesting to compare the effect this has on, say, English vs. Chinese at the character level vs. Chinese at the byte level in different encodings.

I think bzip2 uses the BWT as a first pass, so they've apparently decided it is useful for arbitrary binary data.

Posted at 20:22 | /computing | 1 comment | permalink


   

When in doubt, tell the truth. -- Mark Twain