Monday, October 23, 2006

Pi compression

This idea I've had for a long time, and if it were mathematically and computationally feasable, it would probably have already been done. But never mind all that. I still like the idea, and I'm still going to spout it off.

I chose pi because I just like it, but e or any of those other fun infinite-digit, non-repeating numbers would probably work, and possibly be better. An unfathomably good pseudo-random number generator might even work. But let's call it pi for now, because the important thing is that it's this infinite sequence in which, theoretically, every possible sub-sequence exists. Somewhere in that infinitely-long series of digits, there'll be a hundred sequential zeros. Somewhere, there'll be a million. And somewhere, there'll be a sequence that matches the sequence of ASCII values for this blog entry, and every other text you've ever read (and ever will read... but that's a different story).

The idea is, whatever data you have, your term paper, that movie you swear you didn't download, or that log of leaked AOL searches, that data is already encoded somewhere in Pi. It may be several centillion digits out, but it's there. Theoretically, you just need to let a grep algorithm slide along Pi until it finds a match, and then just remember the offset and the length to get it back out. Suddenly all files, all volumes, all anything, can be represented by two numbers.

Parts of this have already been implemented, though on a scale much too small to be useful. (Four billion binary digits of Pi? Pff, weak sauce.)
Finding sequences in pi:
Getting hex digits (half-bytes) out of pi:

Our CPUs' abilities to crunch numbers aren't advancing as fast as our networks' abilities to pump big bunches of data around, so this whole thing isn't terribly practical. Still, you could back up your whole hard drive on a floppy diskette. The idea is fun even if it wouldn't be worth doing.

Let's take it a little further, though. Maybe we can still do something useful with those first four billion binary digits. Play around with that Pi-search link above. It only uses a 5-bit character set rather than ASCII's 8, but let's experiment anyway. I'm going to search for "penduin".
In 5-bit-per-character binary, that's:
The sequence is found at 8748556 binary digits in. 8748556 in binary is:

...that's a noticably smaller sequence than the binary representation of "penduin". Our data is 7 "bytes" long, which in binary is:
Those two numbers, the offset and length, happen to take fewer bits to represent than the string "penduin", in ascii or even in 5-bit characters. Houston, we have compression! But hang on, let's say that sequence didn't happen within 4 billion bits of pi (according to that site, an arbitrary 7-character sequence has just an 11% chance of being found that early). We could still find "pend" and "uin", though - the odds of finding three or four letters are essentially 100%. If we do that, though, we'd need more bits than the raw ascii just to keep the two offset/length pairs, let alone any padding we'd need for any sanely-packed descriptor.

That packed descriptor would be 35 bits per chunk: 32 bits (maximum value:
4,294,967,295) for offset, and 3 bits for length. Why 3? We're unlikely to find an arbitrary 8-letter pattern (8 is 4 digits, 1000 in binary) in just the first 4 billion bits of pi, so our algorithm won't even try. So every 35 bits in our "compressed" file points to the next up-to-7-letter chunk of our data. On a good day, that'll beat ASCII, which would take 56 bits to store 7 letters. But, we're still playing with 5-bit characters, a paltry subset of ASCII. We're also playing with odds where only 11% of our chunks will be able to point to the maximum 7 characters.

Looking at hex digits instead of 5-bit letters gives us a better picture of how useful this might be. Any byte (8 bits) can be expressed in 2 hex digits. Any four sequential bytes has about a 60% chance of being found in these first four billion bits of pi, and sequences of three and a half (7 hex digits) are pretty much guaranteed. 7 was the maximum length of our chunks from before anyway, so let's keep using that. So, to recap:
With our 35 bits descriptor, we can know where in pi to find up to three and a half bytes, or 28 bits, of data.
As far as compression algorithms go, that doesn't seem particularly good. :^) But we're chopping into 3.5-byte pieces or less because of the near statistical certainty of finding any 3.5 bytes in the first half-gigabyte of pi (four billion bits). Give us twice as many digits of pi, and we'd only need one more address bit, plus another bit or so for length, as longer patterns would have improved odds of being found. Keep bumping up the searchable pi digits and we could eventually hit a sweet spot beyond which we'd actually be able save disk space. Once that happens, we can keep on increasing the available pi digits to achieve better compression, but more interestingly, we can add another layer of logic in which we take our packed bits, chunk them up, and look for those sequences in pi. If we're going to do that, we need to pack an extra bit for each chunk to say either "I point to data" or "I point to a pointer", but if we've got so many pi digits to search that we're actually saving space every time, we can still wind up with simply a pi digit offset and size, which would lead to a lot more lookups, and eventually, chunk by chunk, the data we care about.

A tip of the hat to the first person to find this post within pi. It's there! Extra points if you also find my next post before I post it. :^)

Free Driver's Ed, part 2

Lesson # 2: Speed Matching
Here we go again, Utahns. You can keep right on driving like that, but I'm gonna keep on complaining and attempting to educate you. Because I know that under your tough facades, at the bottom of your low-mileage hearts, you really do give a rat's ass.

Picture this. You're not on the freeway, but you want to be. You find one of those big ramps that gets you there. Great! You notice the right lane of the freeway isn't moving. Dumbasses. You zip past all those slowpokes and merge in either at the very last moment or several hundred yards past where that line painted on the road suggested. You can do this, because you are the single most important person on Earth.

But wait! The story isn't done! Turns out, those dumbasses weren't moving because several people in front of you were also the most important person on Earth. Rather than matching speed and slipping in undisruptively, they waited until the last moment or went several hundred yards past the line. The "dumbasses" who were already on the freeway had to give them room to merge, and the dumbasses behind those dumbasses, not wanting to crash, had to slow down too. Maybe they're not the dumbasses after all.

The merging technique that involves the least amount of dumbassery is, astonishingly, to match the speed of the lane you're about to merge into. If they're booking it, you should too. If they've come to a grinding halt, you should too. I know, you're super-important, you deserve to dart ahead so you can be "first", but that leads to the dark side... and to you being a dumbass. If even a small fraction of you 'Tahns would merge correctly, everybody's commute would be an awful lot smoother.

Behold, the amazing Utah formation-driving Thunderbirds! You can't help but smile as you approach these adorable blockades. The perfectly-synchronized motion soothes the soul. And it doesn't stop there -- you too can be a Thunderbird! Find your buddy in the next lane, and maybe even the lane on the other side, and lock in! If they speed up, you stay right beside 'em. If they slow down, you make damn sure to follow along. Herd behaviour is a sure path to safety, as well as a live exhibit of your vast intellect. Unison! Parity! Brotherhood! Thunderbiiirds!

Come on. Formation driving is about as cool as Utah Valley's bar-to-church ratio. If you don't want to speed up when you use the passing lane, then don't use it.

Thursday, October 19, 2006


lem-ma [lem-uh]
noun; a situation requiring a choice between desirable alternatives.