Adam Smith

The Upcoming Generation of Lossless Data Compression, Part 2

July, 2010

My last post about the future of lossless compression algorithms was about cross-file content-aware dedupe. (Hint: this is different from what most people think of as just 'dedupe'.)

Part 2 is about lossless compression algorithms and systems that support incremental addition and removal operations.

For example, in NTFS, probably the most popular file system in the world, if I turn compression on and store the same file twice, each copy is compressed and stored independently, leaving additional compression opportunities on the table.
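
To see what gets left on the table, here's a quick back-of-the-envelope sketch in Python. (zlib is just a stand-in compressor here, not what NTFS actually uses, but the effect is the same in spirit.)

    import os
    import zlib

    # Pretend we're storing the same 16 KB file twice.
    # (Arbitrary bytes; any file contents would do.)
    file_contents = os.urandom(16 * 1024)

    separately = 2 * len(zlib.compress(file_contents))  # each copy compressed on its own
    together = len(zlib.compress(file_contents * 2))    # both copies in one stream

    print("two copies, compressed separately:", separately, "bytes")
    print("two copies, compressed together:  ", together, "bytes")

The second copy lands inside DEFLATE's 32 KB window, so in the combined stream it collapses to a handful of back-references; compressed separately, you pay for it in full. A per-file scheme can never see that redundancy.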

Another example: MySQL doesn't support transparent compression of BLOB data. The first step toward that would be compressing each BLOB value individually. The second step would be looking for redundancies across BLOB values.

PostgreSQL, another popular database, does the first step (it compresses large values individually, via its TOAST mechanism) but not the second.

In many cases this second step will matter enormously. For example, FriendFeed circa 2009 stored lots of roughly 300-byte JSON blobs. Each blob is too small to compress well on its own, but collectively the blobs share a great deal of structure, so the difference between compressing two million 300-byte blobs individually versus together could be huge. (More background on FriendFeed's use of MySQL here.)
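
To make that concrete, here's a rough simulation in Python using zlib and made-up records (not FriendFeed's real data), comparing three options: compress each blob on its own, compress everything in one stream, or compress each blob against a shared preset dictionary so rows stay individually readable.

    import json
    import random
    import zlib

    random.seed(0)

    def make_blob(i):
        # Hypothetical feed entry: the keys and much of the text repeat
        # from blob to blob; only the values change.
        return json.dumps({
            "id": i,
            "user": "user%d" % random.randrange(10000),
            "service": random.choice(["twitter", "blog", "flickr", "youtube"]),
            "title": "entry number %d from a simulated feed" % i,
            "comments": [{"user": "user%d" % random.randrange(10000),
                          "body": "short comment text " * 3}],
        }).encode()

    blobs = [make_blob(i) for i in range(10000)]

    raw = sum(len(b) for b in blobs)
    individually = sum(len(zlib.compress(b)) for b in blobs)
    together = len(zlib.compress(b"".join(blobs)))

    # Middle ground: compress each blob on its own, but against a shared
    # preset dictionary sampled from existing data (zlib's zdict parameter,
    # Python 3.3+). Rows stay individually readable, yet cross-blob
    # redundancy is exploited. Decompression needs the same dictionary:
    # zlib.decompressobj(zdict=zdict).
    zdict = b"".join(blobs[:200])[-32768:]
    def compress_with_dict(b):
        c = zlib.compressobj(zdict=zdict)
        return c.compress(b) + c.flush()
    with_dict = sum(len(compress_with_dict(b)) for b in blobs)

    print("raw:                        ", raw)
    print("compressed individually:    ", individually)
    print("compressed with shared dict:", with_dict)
    print("compressed all together:    ", together)

The "all together" number is roughly the ceiling; the shared-dictionary number shows how much of that you can claim while still reading one row at a time, which is what a database engine would actually need.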

Of course this feature is fairly obvious when you're looking at something like a file system or database where adds and deletes are natural, but it's hard to pull off in practice due to the risk of inconsistencies.

(This is precisely the reason it should be done at the database engine layer, rather than the application layer.)
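
To make "incremental addition and removal" concrete, here is one toy, deliberately oversimplified shape such an engine-level store could take: content-addressed chunks, compressed once, shared across objects via reference counts, so adds and deletes touch only the affected chunks. This isn't how any particular database does it; it's just a sketch of where the consistency burden lives.

    import hashlib
    import zlib
    from collections import defaultdict

    class IncrementalStore:
        """Toy sketch: chunks are compressed once, shared across objects,
        and reference-counted so deletes can reclaim space safely.
        A real engine would need crash-safe metadata (e.g. a write-ahead
        log) to keep chunk refcounts and object maps consistent -- which
        is exactly the hard part alluded to above."""

        CHUNK = 4096  # fixed-size chunking; real systems often use content-defined chunks

        def __init__(self):
            self.chunks = {}              # chunk hash -> compressed bytes
            self.refs = defaultdict(int)  # chunk hash -> reference count
            self.objects = {}             # object id -> ordered list of chunk hashes

        def put(self, obj_id, data):
            hashes = []
            for i in range(0, len(data), self.CHUNK):
                piece = data[i:i + self.CHUNK]
                h = hashlib.sha1(piece).hexdigest()
                if h not in self.chunks:
                    self.chunks[h] = zlib.compress(piece)
                self.refs[h] += 1
                hashes.append(h)
            self.objects[obj_id] = hashes

        def get(self, obj_id):
            return b"".join(zlib.decompress(self.chunks[h])
                            for h in self.objects[obj_id])

        def delete(self, obj_id):
            for h in self.objects.pop(obj_id):
                self.refs[h] -= 1
                if self.refs[h] == 0:     # no object references this chunk anymore
                    del self.refs[h]
                    del self.chunks[h]

Storing the same blob under two keys costs roughly one compressed copy plus bookkeeping; deleting either key leaves the other readable, and deleting both reclaims the space. The hard part in production is making put() and delete() atomic with respect to crashes, which is why the engine layer, not the application, is the right home for it.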

But just because it's hard doesn't mean it won't be pervasive in MySQL deployments in 10 years or so.

Final thought -- I'm sure there are technologies out there that do this already. The real shift/disruption will come when it becomes part of a non-proprietary standard, as happened with 7zip/LZMA, and/or gets integrated into popular database engines.
