One Weird Hashing Trick


Here's a little thing that's obvious once you see it but isn't obvious initially (or at least it wasn't to me).

Say you want to generate low-dimensional projections of bag-of-words (or bag-of-ngrams) vectors (in the rest of this post wherever I say "word" you can substitute "ngram"). Johnson-Lindenstrauss says you can get a good projection by just multiplying the bag-of-words vector by a random matrix (!), and the number of output dimensions you need for a given distortion doesn't grow with the number of input dimensions at all (!!), only logarithmically with the size of your database (!!!). And even better, that matrix can just have entries that are 1 or -1 with equal probability.
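Just to make that baseline concrete, here's roughly what the explicit version looks like (a minimal sketch; the names and the use of rand() are mine, not anything from the rest of this post): fill a k-by-d matrix with random 1/-1 entries once, then project count vectors by ordinary matrix-vector multiplication.

#include <stdint.h>
#include <stdlib.h>

/* Fill a k x d projection matrix with +1/-1 entries, equal probability.
 * rand() is just for illustration; any fixed random source works. */
void fill_sign_matrix(int8_t *m, size_t k, size_t d) {
    for (size_t i = 0; i < k * d; i++)
        m[i] = (rand() & 1) ? 1 : -1;
}

/* Project a d-dimensional count vector down to k dimensions. */
void project(const int8_t *m, const int32_t *counts, size_t k, size_t d,
             int32_t *out) {
    for (size_t row = 0; row < k; row++) {
        int64_t acc = 0;
        for (size_t col = 0; col < d; col++)
            acc += (int64_t) m[row * d + col] * counts[col];
        out[row] = (int32_t) acc;
    }
}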

But if the number of input dimensions is extremely large, as it is with bag-of-words, then that matrix has to be extremely wide and takes up a lot of memory, and so do the vectors themselves. The vectors are sparse, so you can save memory by storing only the nonzero entries, but the projection matrix is not sparse.

If you declare that a word's column number is its hash, then you can store each vector as just its word counts. And since good hash functions produce values that are effectively independent and uniformly distributed, you can use the hash value itself as your random number generator and replace the matrix with a function: maybe mix the bits (useful if you're using the hash elsewhere and want this one to be uncorrelated with it), then take (mix(hashv) % 2) ? -1 : 1. Since the matrix has multiple rows that each need an independent sign, we can do mix(rotl(hashv, row_number)) % 2 ? -1 : 1 (where rotl is bitwise left rotation).
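In code, that "matrix entry" might look something like the sketch below (the function name is made up; mix32 and rotl32 are the helpers defined further down):

#include <stdint.h>

// One entry of the implicit projection matrix, derived on the fly from the
// word's hash; no matrix is ever stored.
static inline int matrix_entry(uint32_t word_hash, unsigned int row) {
    return (mix32(rotl32(word_hash, row)) % 2) ? -1 : 1;
}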

We've gotten out of having to generate the random matrix, but can we get out of computing the word vectors too? Yes we can! Recall that a matrix-vector multiplication is the dot product between the vector and each of the rows of the matrix, and that a dot product is the sum of the products of the corresponding elements. Each element of our bag-of-words vector is a count, i.e. a sum of 1 for each occurrence of the word. We can distribute the multiplication through that sum and replace the matrix entry we're multiplying by with the hash function we defined before, and now we have, for a given output dimension, sum(mix(rotl(hash(word), row_number)) % 2 ? -1 : 1 for word in document). And then we can do all of this with only one pass over the document by doing:

for word in document:
    for row in range(0, n_dims):
        outp_vector[row] += (mix(rotl(hash(word), row)) % 2 ? -1 : 1)

Putting all that together and including handling ngrams, doing multiple independent projections at the same time, and doing L1 normalization at the end:

#include <limits.h>   // CHAR_BIT, used by rotl32
#include <stdint.h>
#include <stdlib.h>   // abs
#include <string.h>   // memset

// NDIM, NGRAM_WIDTH, NORMALIZATION_BIG_NUMBER and NgramFeaturizer come from
// the surrounding project; the fallback values here are only so the listing
// stands on its own.
#ifndef NDIM
#define NDIM 32                            // rows per projection; see note below
#endif
#ifndef NGRAM_WIDTH
#define NGRAM_WIDTH 4                      // longest ngram, must fit in Ngram.tokens
#endif
#ifndef NORMALIZATION_BIG_NUMBER
#define NORMALIZATION_BIG_NUMBER (1 << 20) // fixed-point scale for the L1 norm
#endif
typedef struct NgramFeaturizer NgramFeaturizer; // opaque, unused below

// Up to four 16-bit token ids packed into one 64-bit value for hashing.
typedef union Ngram {
    uint16_t tokens[4];
    uint64_t v;
} Ngram;

// Bit-mix a 32-bit key so every output bit depends on all the input bits.
static inline uint32_t mix32(uint32_t key) {
    key = ~key + (key << 15); // key = (key << 15) - key - 1;
    key = key ^ (key >> 12);
    key = key + (key << 2);
    key = key ^ (key >> 4);
    key = key * 2057; // key = (key + (key << 3)) + (key << 11);
    key = key ^ (key >> 16);
    return key;
}


// Hash a 64-bit key (here, a packed ngram) down to a well-mixed 32-bit value.
static inline uint32_t mix64to32(uint64_t key){
    key = (~key) + (key << 18); // key = (key << 18) - key - 1;
    key = key ^ (key >> 31);
    key = key * 21; // key = (key + (key << 2)) + (key << 4);
    key = key ^ (key >> 11);
    key = key + (key << 6);
    key = key ^ (key >> 22);
    return (uint32_t) key;
}


static inline uint32_t rotl32(uint32_t value, unsigned int count) {
    // Mask the count so the shifts are well-defined for any rotation amount.
    const unsigned int mask = CHAR_BIT * sizeof(value) - 1;
    count &= mask;
    return (value << count) | (value >> (-count & mask));
}

// Fill res with n_outputs independent NDIM-dimensional projections of the
// ngram counts of tokens[0..n_tokens-1].
void ngram_featurizer_featurize(NgramFeaturizer *self,
    uint16_t *tokens, size_t n_tokens,
    size_t n_outputs, int32_t *res) {

    memset(res, 0, NDIM*n_outputs*sizeof(int32_t));
    for (size_t ngram_width=1; ngram_width <= NGRAM_WIDTH; ngram_width++) {
        // Slide the ngram window over the document, including partial ngrams
        // that hang off either end.
        for (int offset = -((int) ngram_width - 1); offset < (int) n_tokens; offset++) {

            uint32_t ngram_hash;
            Ngram tokv;
            tokv.v = 0; // unused slots must be zero so equal ngrams hash equally

            if (offset < 0) {
                // Partial ngram hanging off the start: copy only the tokens
                // that exist, keeping their position inside the window.
                for (int i=0; i < (int) ngram_width + offset && i < (int) n_tokens; i++) {
                    tokv.tokens[i - offset] = tokens[i];
                }
            } else if (offset + (int) ngram_width > (int) n_tokens) {
                // Partial ngram hanging off the end.
                for (size_t i=offset; i < n_tokens; i++) {
                    tokv.tokens[i-offset] = tokens[i];
                }
            } else {
                // Full ngram.
                for (size_t i=0; i < ngram_width; i++) {
                    tokv.tokens[i] = tokens[offset + i];
                }
            }
            ngram_hash = mix64to32(tokv.v);

            for(size_t outp_i=0; outp_i < n_outputs; outp_i++) {

                // Rotate-and-mix so each output projection sees an effectively
                // independent hash of the ngram.
                uint32_t outp_hash = mix32(rotl32(ngram_hash, outp_i));
                int32_t *res_o = &(res[outp_i * NDIM]);

                for (size_t row=0; row < NDIM; row++) {

                    // Each row gets its own sign. Note rotl32 has only 32
                    // distinct rotations, so signs repeat if NDIM (or
                    // n_outputs) exceeds 32.
                    uint32_t row_hash = mix32(rotl32(outp_hash, row));
                    if (row_hash % 2 == 0) {
                        res_o[row]++;
                    } else {
                        res_o[row]--;
                    }

                }

            }
        }
    }

    // L1 normalize vectors
    for (size_t outp_i=0; outp_i < n_outputs; outp_i++) {

        int64_t sum = 0;
        int32_t *res_o = &(res[outp_i * NDIM]);

        for (size_t res_i=0; res_i < NDIM; res_i++) {
            sum += abs(res_o[res_i]);
        }

        if (sum > 0) {
            // Fixed-point normalization: scale by NORMALIZATION_BIG_NUMBER so
            // the normalized entries stay integers.
            for (size_t res_i=0; res_i < NDIM; res_i++) {
                res_o[res_i] = (((int64_t) res_o[res_i]) * NORMALIZATION_BIG_NUMBER) / sum;
            }
        }

    }

}
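For reference, here's a hypothetical driver for the function above (the token ids are made up, and since the featurizer state isn't used by the code above, NULL stands in for it):

#include <stdio.h>

int main(void) {
    uint16_t tokens[] = {17, 902, 33, 7, 7, 512}; // made-up token ids
    int32_t res[2 * NDIM];                        // room for two projections
    ngram_featurizer_featurize(NULL, tokens,
                               sizeof tokens / sizeof tokens[0], 2, res);
    for (size_t i = 0; i < NDIM; i++)
        printf("%d ", (int) res[i]);              // print the first projection
    printf("\n");
    return 0;
}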

This seems to work, and it embeds about 665k documents in 12.3 seconds on my 2018 ThinkPad.