- Fri 05 December 2025
- Progress Report
I have a search index that stores, for each of a bunch of documents, the set of tokens that occur in that document. I encode that set as a bit vector: each token has an ID, and the bit at that ID's position indicates whether the token is present in the document. Since most tokens don't occur in most documents, these vectors are sparse (most of the bits are zeros).
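As a concrete sketch (the names here are illustrative, not the actual index code), building one document's bit vector looks like:

```python
# Hypothetical sketch: pack one document's token set into 64-bit words.
# `token_ids` maps token -> integer ID; each token sets bit (ID % 64)
# of word (ID // 64).

def bitvector(doc_tokens, token_ids, num_tokens):
    num_words = (num_tokens + 63) // 64
    words = [0] * num_words
    for tok in doc_tokens:
        i = token_ids[tok]
        words[i // 64] |= 1 << (i % 64)
    return words
```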
Since the vectors are sparse I can save a lot of space by not storing all the zero bits. The simplest way to do this is with different flavors of run-length encoding, but to make intersection and subset tests fast I actually use a representation that's more like the nodes in an array-mapped trie: there's a bitmask with one bit per 64-bit word of the bit vector, indicating whether that word contains any nonzero bits, and we only store the 64-bit words that aren't zero.
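Here's a minimal sketch of that encoding, plus the intersection it's designed to speed up. This is my illustration of the idea, not the index's actual code; the popcount indexing is the standard array-mapped-trie trick, and `int.bit_count()` needs Python 3.10+.

```python
def encode(words):
    """Compress a list of 64-bit words into (mask, nonzero words)."""
    mask, packed = 0, []
    for i, w in enumerate(words):
        if w:
            mask |= 1 << i          # mark word i as nonzero
            packed.append(w)
    return mask, packed

def intersect(a, b):
    """Intersect two encoded vectors, touching only shared nonzero words."""
    mask_a, packed_a = a
    mask_b, packed_b = b
    mask_out, packed_out = 0, []
    m = mask_a & mask_b             # words nonzero in both inputs
    while m:
        i = (m & -m).bit_length() - 1                         # lowest set bit
        wa = packed_a[(mask_a & ((1 << i) - 1)).bit_count()]  # popcount index
        wb = packed_b[(mask_b & ((1 << i) - 1)).bit_count()]
        if wa & wb:
            mask_out |= 1 << i
            packed_out.append(wa & wb)
        m &= m - 1                  # clear lowest set bit
    return mask_out, packed_out
```

A subset test has the same shape: a ⊆ b holds iff `mask_a & ~mask_b == 0` and every shared word satisfies `wa & ~wb == 0`.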
Either way, whether we're using RLE or the mask encoding, the longer our contiguous runs of zeros are, the less data we have to store. While we don't get to pick which tokens occur in which documents, we do get to pick which tokens are next to each other in the vector; i.e., we're free to choose whatever token IDs we want. We want to choose token IDs such that the one bits are all clumped together.
Another way to say we want the one bits clumped together is that we want to maximize the probability that a particular bit is a one given that its neighbors are ones, and the probability that a bit is a zero given that its neighbors are zeros.
We can calculate those probabilities from a co-occurrence matrix. That's a symmetric matrix whose rows and columns are indexed by tokens, where the entry at (a, b) is the number of documents in which both a and b occur.
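A sketch of building that matrix (dense, so only viable for a modest vocabulary); `docs` is assumed to be a list of per-document sets of token IDs, and the names are mine:

```python
import numpy as np
from itertools import combinations

def cooccurrence(docs, num_tokens):
    C = np.zeros((num_tokens, num_tokens), dtype=np.int64)
    for tokens in docs:
        for a, b in combinations(tokens, 2):
            C[a, b] += 1            # symmetric: count each pair both ways
            C[b, a] += 1
    return C
```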
Now we want to choose an ordering of the rows in which tokens that co-occur heavily end up near each other. We do that by finding the eigenvector of the matrix with the largest eigenvalue, multiplying the matrix by that vector (which gives us a single column vector), and sorting tokens by the values of that column. Since multiplying the matrix by its own eigenvector just rescales it, this is the same as sorting by the eigenvector's entries directly. This is basically doing PCA down to 1D.
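In numpy that whole step is a few lines. A sketch, assuming the dense `C` from above; since `C` is symmetric, `eigh` applies, and the eigenvector's overall sign is arbitrary, so either sort direction clumps equally well:

```python
import numpy as np

def spectral_order(C):
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues come back ascending
    top = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    return np.argsort(top)                  # tokens in their new left-to-right order

order = spectral_order(C)                   # order[k] = old ID of the token at slot k
new_id = np.empty_like(order)
new_id[order] = np.arange(len(order))       # old token ID -> new token ID
```

For a realistically large vocabulary you'd probably want a sparse matrix and an iterative solver like `scipy.sparse.linalg.eigsh` rather than a full dense eigendecomposition.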
This works in practice: it gives me an overall 12% reduction in index size compared to choosing token IDs at random.