This is PFFX



What is PFFX?

This article is an overview of the data compression algorithm I've been working on, which is called PFFX. PFFX is a general-purpose lossless data compression algorithm. It fills the same role as DEFLATE, LZMA, zstd, etc. It's based on a neural autoencoder.

Neural autoencoders are compression algorithms. An autoencoder learns to take a block of data, squish it down to a much smaller size, and then re-inflate it. That is exactly data compression: the top half of the network is a compressor and the bottom half is a decompressor.
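
As a cartoon of that split, here's a minimal PyTorch sketch; the layer sizes are invented for illustration and aren't PFFX's actual architecture.

```python
import torch.nn as nn

# Toy autoencoder: made-up sizes, just to show the compressor/decompressor split.
encoder = nn.Sequential(              # the "top half": block -> small embedding
    nn.Linear(4096, 512), nn.ReLU(),
    nn.Linear(512, 64),
)
decoder = nn.Sequential(              # the "bottom half": embedding -> reconstructed block
    nn.Linear(64, 512), nn.ReLU(),
    nn.Linear(512, 4096),
)
```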

Why not just an autoencoder?

The problem is that it's lossy, and lossy in a way that really isn't acceptable for data compression. We tolerate lossy compression in things like JPEG or MP3 because the compression artifacts those algorithms generate don't look to the human eye or sound to the human ear like real content; they read as noise. The human brain is very good at filtering out noise. When you hear an MP3 compression artifact it's just some chirp that doesn't lie on the same distribution as, say, the voice recording you're listening to, and when you see a JPEG compression artifact it's some convolution-looking ghost edge that is so unnatural the eye filters it out as noise.

When neural autoencoders generate compression artifacts we call those artifacts "hallucinations". The word "hallucination" in the context of AI systems means exactly compression artifacts. We call them hallucinations because they look like the kind of content you're trying to compress, which is a problem because the human eye and the human ear don't filter that out. If you're listening to a recording that contains silence and the compression algorithm fills that silence with babble (nonsense human speech), people don't like hearing that. Or when your image compression algorithm starts putting eyes on things that aren't supposed to have eyes, that's disconcerting and people don't like it. Stochastic demonic possession isn't a desirable property of a compression algorithm. So that has been the main barrier to using neural autoencoders for data compression.

So we have an approach to data compression that can get incredible compression efficiency, but it has this problem of hallucinations. What do we do about them? The idea behind PFFX is that we can treat the compression artifacts (hallucinations, whatever you want to call them) as a form of data corruption and use error correcting codes to correct that corruption. Specifically, we use Low Density Parity Check (LDPC) codes.

Error Correcting Codes

Error correcting codes solve the following problem (in the context in which they're typically used and the way textbooks describe them): you have some data being sent over a noisy information channel. You're sending it over the air by radio, or through a wire that can pick up interference, and so on. To keep everything in the digital domain (this is sometimes described in the analog domain and then translated, but I think it's simpler to stay digital): you're sending bits over, say, a wire, and those bits are getting flipped at random; electromagnetic noise induced into the wire is causing random bit flips. What you'd like is that, given some assumptions about the probability that bits flip (assuming the probability that any given bit flips is below some threshold), you can get back the original data. You want perfect fidelity (reconstruct the data exactly) despite those random bit flips.
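
Concretely, the digital-domain channel model is just this kind of thing (a toy sketch; the flip probability is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.02                                     # assumed per-bit flip probability
sent = rng.integers(0, 2, size=4096)         # the block we transmit
flips = rng.random(4096) < p                 # each bit flips independently with probability p
received = sent ^ flips                      # what comes out the other end of the channel
```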

You do that by adding a little bit of extra information. We're using what's called a linear block code (LDPC codes are linear block codes), which works by dividing the data you're sending into blocks of some fixed size (say, 4096 bits). To each block you add some extra bits, called check bits. Then there's some math magic that takes the block with bits randomly flipped, plus the check bits (which can also be randomly flipped), stirs the pot, and out comes exactly what was sent.

The way that works is you treat the block you've received as a probability distribution over blocks, which is just another way of saying that bits can be randomly flipped. One way of saying literally "bits can be randomly flipped" (not the usual framing, which is analog, but let's keep things digital) is to treat the block as a vector of independent Bernoulli random variables: a vector of bits where each bit has some probability of having been flipped.

Your error correcting code is defined as a set of blocks (check bits included); the blocks in that set are the blocks that are "in the code". Encoding is choosing check bits that make the block, together with its check bits, be a block in the code. It's called a linear code because the function that determines whether a block is in the code is a matrix (called the check matrix) whose number of columns is the size of the block (one column per bit) and whose number of rows is the number of check bits. In GF(2), the check matrix multiplied by the block should equal the zero vector (a column of all zeroes). GF(2) is the finite field where multiplication is bitwise AND and addition is bitwise XOR. These are sometimes called "parity codes" (LDPC stands for Low Density Parity Check) because XOR-ing a vector of bits computes the parity of those bits, i.e. whether the number of one bits is even or odd. So the matrix multiply in GF(2) is doing a parity check of the input AND-ed with each row of the check matrix.
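
Here's a toy numpy sketch of that, with a small made-up check matrix rather than a real low-density one: encoding picks check bits so all the parity checks come out zero, and a bit flip shows up as a nonzero parity check.

```python
import numpy as np

# A toy systematic linear block code: H = [A | I], 8 data bits + 4 check bits.
# (Small made-up matrices for illustration; a real LDPC check matrix is much
# larger and sparse.)
A = np.array([
    [1, 0, 1, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 1, 1],
])
H = np.hstack([A, np.eye(4, dtype=int)])

data = np.array([1, 0, 1, 1, 0, 0, 1, 0])
check = A @ data % 2                      # encoding: pick check bits so the block is "in the code"
block = np.concatenate([data, check])

print(H @ block % 2)                      # [0 0 0 0] -> the block is in the code
block[3] ^= 1                             # a random bit flip from the channel
print(H @ block % 2)                      # nonzero parity checks reveal the corruption
```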

So decoding is root finding, but root finding of an under-determined system. There are many solutions, and an arbitrary solution is, with high probability, garbage. We need an objective function to make sure we find the right solution and not just any solution, and that objective function is likelihood under our probability distribution. Put another way, decoding is maximum likelihood estimation subject to a system of linear constraints (our check matrix).
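
Spelled out, that's the standard constrained maximum-likelihood formulation (just restating the paragraph above in symbols, with $H$ the check matrix and $p$ the distribution over blocks implied by the received bits):

$$
\hat{x} \;=\; \operatorname*{arg\,max}_{x \in \{0,1\}^n} \log p(x)
\qquad \text{subject to} \qquad H x \equiv 0 \pmod 2
$$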

There are a couple of algorithms for doing this. One is to treat the check matrix as the adjacency matrix of a Markov random field and use belief propagation. A much simpler algorithm that also works really well is what the literature calls "the bit-flip algorithm", which is basically binary coordinate descent. You go through the block and, for each bit, evaluate the objective function with the bit set to 1 and with it set to 0, keep the value with the better score, and continue until all the constraints are satisfied.
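
A bare-bones sketch of that coordinate descent (illustrative only, not a tuned LDPC decoder):

```python
import numpy as np

def bit_flip_decode(block, H, log_likelihood, max_sweeps=100):
    """Binary coordinate descent over the bits of a block.

    block          : initial 0/1 guess for the received block (check bits included)
    H              : parity-check matrix over GF(2)
    log_likelihood : callable giving the log-likelihood of a candidate block
    """
    def score(b):
        # Unsatisfied parity checks are penalized; likely blocks are rewarded.
        unsatisfied = int((H @ b % 2).sum())
        return unsatisfied - log_likelihood(b)      # lower is better

    block = block.copy()
    for _ in range(max_sweeps):
        if not (H @ block % 2).any():               # all constraints satisfied: done
            return block
        for i in range(len(block)):
            trial = block.copy()
            trial[i] ^= 1
            if score(trial) < score(block):         # keep whichever bit value scores better
                block = trial
    return block   # gave up after max_sweeps; a real decoder would signal failure
```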

Using LDPC codes with a neural network

This is all fine and good for error correction, but we're trying to do compression. How do we apply this to our neural autoencoder?

The autoencoder is a Mixture Density Network, so the output of the autoencoder is a probability distribution (specifically a gaussian mixture model). What's encoded in the file is the autoencoder embedding vector plus the check bits for the error correcting code.

We run our LDPC decoder, but instead of the probability distribution coming from bits received over an information channel, it's the distribution given to us by the neural network decoder. We do the optimization to find the maximum likelihood solution that satisfies our parity check bits, and that gives us our output block.

The structure of the neural network

The front-end of the network is very LLM-like: byte pair encoding followed by word2vec/GloVe style token embedding. Specifically, for the embedding we take the token-token co-occurrence matrix, embed the rows of that matrix with an MLP autoencoder, and use the autoencoder's embedding vectors as our token embeddings. Then we take a block of embedding vectors (block size = 4096 bits, i.e. 256 16-bit tokens) and run it through a convolutional autoencoder whose decoder outputs the parameters of a gaussian mixture model. The CNN autoencoder is trained with the loss from the mixture density network paper (which is basically just maximum likelihood).
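
For reference, the mixture density network loss is just the negative log-likelihood of the targets under the predicted mixture. A minimal 1-D numpy version (the shapes here, one mixture per position, are my own simplification, not necessarily how PFFX arranges it):

```python
import numpy as np

def gmm_nll(x, weights, means, sigmas):
    """Negative log-likelihood of targets x under per-position 1-D gaussian mixtures.

    x                      : (N,)   target values
    weights, means, sigmas : (N, K) mixture parameters predicted by the decoder
    """
    log_comp = (np.log(weights)
                - np.log(sigmas)
                - 0.5 * np.log(2 * np.pi)
                - 0.5 * ((x[:, None] - means) / sigmas) ** 2)
    # log-sum-exp over the K mixture components, then sum over positions
    m = log_comp.max(axis=1, keepdims=True)
    log_px = (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).squeeze(1)
    return -log_px.sum()
```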

In the file we store the decoder weights, and for each block we store the embedding vector and the check bits. We 8-bit quantize all of the floating point values in the file (both the network weights and the embedding vectors). The quantization is done squeezenet-style, using 1D k-means with k = 256. The whole thing is then wrapped in an additional layer of zstd compression.
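
A bare-bones version of that 1D k-means quantizer might look like this (a sketch; the initialization and tie-breaking details are guesses, not what PFFX actually does):

```python
import numpy as np

def quantize_1d(values, k=256, iters=20):
    """1-D Lloyd's k-means: returns 8-bit codes plus a k-entry float codebook."""
    values = np.asarray(values, dtype=np.float64).ravel()
    # start the centroids at evenly spaced quantiles so no cluster starts empty
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = values[codes == j]
            if members.size:
                centroids[j] = members.mean()
    return codes.astype(np.uint8), centroids

# dequantization is just a table lookup: centroids[codes]
```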

Decoding is:

  • decompress with zstd
  • load the network weights
  • for each block
    • run the embedding vector through the network to get the mixture distribution
    • run the bit-flip algorithm (though our coordinates here are actually tokens, not bits, so instead of flipping a bit we search over the 2^16 possible token values)

The search rule for a given coordinate isn't literally "minimize hamming_distance(check bits, target check bits) - likelihood(block)"; it's: of the candidate values that give a lower hamming distance than the current one, choose the one with the highest likelihood.
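
Here's one coordinate update written out as a sketch, pieced together from the description above; the names, signatures, and the systematic check-bit layout are my own assumptions, not the PFFX file format or API.

```python
import numpy as np

def update_token(tokens, pos, A, stored_checks, token_logp, vocab_size=2**16):
    """One coordinate update of the token-level search.

    tokens        : current guess for the block, as 16-bit token ids
    A             : GF(2) matrix mapping the block's bits to its check bits
    stored_checks : check bits read from the file
    token_logp    : token_logp(pos, tok) -> log-likelihood of token `tok` at `pos`
                    under the network's mixture distribution
    """
    def check_distance(toks):
        bits = np.unpackbits(toks.astype('>u2').view(np.uint8))
        return int((A @ bits % 2 != stored_checks).sum())

    current = check_distance(tokens)
    best_tok, best_logp = None, -np.inf
    for tok in range(vocab_size):            # exhaustive sweep over the 2^16 tokens
        trial = tokens.copy()
        trial[pos] = tok
        # only candidates that strictly reduce the check-bit hamming distance qualify;
        # among those, keep the one the network considers most likely
        if check_distance(trial) < current:
            lp = token_logp(pos, tok)
            if lp > best_logp:
                best_tok, best_logp = tok, lp
    if best_tok is not None:
        tokens[pos] = best_tok
    return tokens
```

A real implementation would presumably not re-evaluate every parity check for all 65,536 candidates per position; this is just meant to make the selection rule concrete.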

Postamble

I've released parts of this as separate libraries already. The byte-pair encoding implementation is here and the LDPC decoder implementation is here.