- Mon 10 November 2025
- Progress Report
I'm working on adding proper search support to GotoSocial, the fediverse server that I use. GtS has some pretty extreme design requirements for a search indexer:
- needs to run well on machines that have under 1 GB of RAM (and ideally on machines that have under 512 MB, which some low-end VPSs do)
- database sizes regularly run well into the millions of documents (even my little personal one is ~700k documents)
- needs to run well on single-board computers running off SD cards
- needs to handle moderate to high write volume
- data needs to be queryable immediately (no big index rebuilds)
- can't run anything out-of-process (so no Solr or Elasticsearch)
- everything needs to be a Go library that can be installed with no external dependencies and no extra build steps (C and C++ are okay as long as all the source files are in one directory)
To my knowledge there isn't a search system that meets all those requirements today, so I'm writing my own that does. The memory and the "don't thrash the disk (SD card)" requirements basically mean it has to be vector search; traditional postings lists would either require way too many writes per document insert or would only let you query data added before the last index rebuild. If you can embed documents into one or a handful of vectors and just index those (or low-dimensional projections of them), then a document insert is only a handful of writes of Morton code keys into an on-disk btree (I'm probably going to use LMDB), and that meets all the requirements.
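For concreteness, here's a rough sketch (Python rather than the eventual Go, and not the real indexer code) of turning a low-dimensional document vector into a single Morton code key for an ordered KV store like LMDB. Quantize each dimension to a few bits and interleave the bits; vectors that are close together then tend to get keys that sort near each other:
def morton_key(vec, bits_per_dim=8):
    """Interleave the bits of a quantized vector into one integer key (Z-order)."""
    # quantize each dimension (assumed to lie in [0, 1]) to bits_per_dim bits
    q = [int(round(min(max(x, 0.0), 1.0) * (2**bits_per_dim - 1))) for x in vec]
    key = 0
    for bit in range(bits_per_dim):
        for dim, val in enumerate(q):
            key |= ((val >> bit) & 1) << (bit * len(q) + dim)
    return key

# e.g. a 4-d projection of a document embedding becomes one 32-bit key;
# a range scan over the btree then visits approximate neighbours together
print(hex(morton_key([0.1, 0.9, 0.5, 0.2])))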
This being 2025, of course my first thought for how to do the embedding was to use a neural network. Using something like Llama or Qwen (or even RoBERTa) on this hardware is totally out of the question, so we need something smaller, and that means training our own, which I did. The basic architecture is the CNN one from Collobert and Weston (2011).
The first step of doing that kind of embedding is tokenization and token vectorization. The tokenization I did with my byte-pair encoding library Hashtok. I then did token embedding old-school LSA style, by just doing PCA over the token-token co-occurrence matrix:
import numpy as np
from hashtok import Hashtok
import pickle
import os
import gzip
import lxml.html
import json
with open('tokens.pickle', 'rb') as inf:
    tokens = pickle.load(inf)
exists = os.path.exists('token_affinity.bin')
n_tokens = len(tokens) + 256
# keep the co-occurrence counts in an on-disk memmap so the whole matrix never has to fit in RAM
affinity_mat = np.memmap('token_affinity.bin', dtype=np.float32, shape=(n_tokens, n_tokens),
                         mode=('w+' if not exists else 'r'))
if not exists:
    tok = Hashtok()
    tok.set_tokens(tokens)
    max_length = 0
    doc_tokens = []
    with gzip.open('/mnt/sd/posts.txt.gz', 'rb') as inf:
        for row in inf:
            # each line is a JSON string containing the post HTML; strip it down to plain text
            text = lxml.html.fromstring(json.loads(row)).text_content().encode('utf-8')
            if not text:
                continue
            start = 0
            token_ids = []
            while start < len(text)-1:
                n_read, local_token_ids = tok.transform(text[start:])
                token_ids.append(local_token_ids)
                start += n_read
            if not token_ids:
                continue
            token_ids = np.concatenate(token_ids)
            doc_tokens.append(token_ids)
            if token_ids.shape[0] > max_length:
                max_length = token_ids.shape[0]
            # pair every token in the post with the token rot positions away (wrapping around),
            # which accumulates co-occurrence counts across the whole post
            for rot in range(token_ids.shape[0]):
                ax2 = np.roll(token_ids, rot)
                affinity_mat[token_ids, ax2] += 1
                affinity_mat[ax2, token_ids] += 1
    print(max_length)
    # zero the diagonal (tokens co-occurring with themselves)
    affinity_mat[np.arange(affinity_mat.shape[0]), np.arange(affinity_mat.shape[0])] = 0
    with open('posts_tokens.pickle', 'wb') as outf:
        pickle.dump(doc_tokens, outf)
from sklearn.decomposition import PCA
vectors = PCA(n_components=128).fit_transform(affinity_mat).astype(np.float32)
# normalize vectors so all values lie in [0, 1]
mv = np.min(vectors)
if mv < 0:
    vectors -= mv  # mv is negative here, so this shifts the minimum up to 0
vectors /= np.max(vectors)
vectors.tofile('lsa128.bin')
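The later scripts read those vectors back with a reshape, since tofile just writes a flat float32 array:
# shape comes back as (n_tokens, 128)
vectors = np.fromfile('lsa128.bin', dtype=np.float32).reshape(-1, 128)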
Then, once all the documents in the training set (the text from the database backup of my GtS instance) are tokenized, we use them to train our CNN. The CNN architecture is pretty simple, just a few Darknet blocks, treating the word-vector matrix as if it were a 2D image:
def darknet_block(self, factor, in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels=in_channels, out_channels=(16 * factor), kernel_size=3),
        nn.BatchNorm2d(16 * factor),
        nn.LeakyReLU(inplace=True),
        nn.Conv2d(in_channels=(16 * factor), out_channels=(128 * factor), kernel_size=3),
        nn.BatchNorm2d(128 * factor),
        nn.LeakyReLU(inplace=True),
    )

# in the model's __init__: widen the channels block by block, pool twice,
# then squash everything down to a single-channel feature map
self.encoder = nn.Sequential(
    self.darknet_block(1, 1),
    self.darknet_block(2, 128),
    self.darknet_block(3, 256),
    nn.MaxPool2d(2),
    self.darknet_block(4, 384),
    self.darknet_block(5, 512),
    self.darknet_block(6, 640),
    nn.MaxPool2d(2),
    self.darknet_block(7, 768),
    self.darknet_block(8, 896),
    nn.Conv2d(in_channels=1024, out_channels=1, kernel_size=1),
)
The training loss is a little more complicated. Collobert and Weston trained their net against a collection of supervised learning tasks, but we don't have training data for that. Instead we use the sum of two losses: a modern "fill in the blank" loss, and the projection error between the embeddings and the context windows (each flattened out into a single vector).
# sample a batch of context windows
batch = np.empty((self.batch_size, MATRIX_N_ROWS, MATRIX_N_COLS))
batch.fill(-1)
for i in range(self.batch_size):
    # look up the LSA vectors for (at most) the first MATRIX_N_ROWS tokens of a random post
    post = self.vectors[random.choice(self.posts)[:MATRIX_N_ROWS]]
    if post.shape[0] < MATRIX_N_ROWS:
        # tile short posts out to fill the whole context window
        post = np.tile(post, (ceil(MATRIX_N_ROWS/post.shape[0]), 1))
        post = post[:MATRIX_N_ROWS, :]
    batch[i, :MATRIX_N_ROWS, :] = post
mat = torch.from_numpy(batch).float().unsqueeze(1)
masked_mat = batch.copy()
# choose a random row in each context window to blank out
mask_indexes = np.random.randint(0, MATRIX_N_ROWS, self.batch_size)
# replace the blanks with random tokens (one blanked row per example in the batch)
masked_mat[np.arange(self.batch_size), mask_indexes, :] = self.vectors[np.random.randint(0, self.vectors.shape[0], self.batch_size), :]
masked_mat = torch.from_numpy(masked_mat).float().unsqueeze(1)
# get our context window embeddings
enc = self.model.encode(masked_mat)
row_weights, row_delta = self.model.predict(enc)
# the target for guessing where the blanks were
mask_vectors = nn.functional.one_hot(torch.from_numpy(mask_indexes), num_classes=MATRIX_N_ROWS)
# this is where we're filling in the blanks:
# row_weights is the embedding run through a linear layer and a softmax to output MATRIX_N_ROWS
# row_delta is the embedding just run through a linear layer to output MATRIX_N_COLS
# row_delta is guessing a vector to add to the random one to push it toward the correct one
# so if we multiply row_weights transpose by row_delta we get a matrix the same shape as the context window,
# where adding that matrix to the input should produce the correct output (or closer to it)
perturbation = row_weights.transpose(0, 1) @ row_delta
perturbed = masked_mat + perturbation
recon_loss = self.loss(mat, perturbed)*0.5 + self.mask_loss(mask_vectors, row_weights)*0.5
# this is the projection error loss:
# we're taking the L1 loss between the normalized pairwise distances of the input batch and the normalized pairwise distances of the output batch
# yes, I know a random projection can achieve that too, but just think of this as a regularizer
enc_pdist = nn.functional.pdist(enc, p=1)
enc_pdist = enc_pdist / (torch.sum(enc_pdist) + 0.000001)
inp_pdist = nn.functional.pdist(masked_mat.reshape((self.batch_size, MATRIX_N_ROWS*MATRIX_N_COLS)), p=1)
inp_pdist = inp_pdist / (torch.sum(inp_pdist) + 0.000001)
pdist_loss = self.pdist_loss(enc_pdist, inp_pdist)
loss = recon_loss + pdist_loss
loss.backward()
self.optimizer.step()
You may have noticed that row_weights transpose times row_delta is a diffusion model. That means we can use it to generate! What happens when we do? Well, with uniform random input it quickly converges to something like: seeeseee@eeeeeee@ ees@@e e@@@ e@e eeeeeee @@ee@@@se@es@ee@@s e@e@eee@e@e@e@e@s@eeeeeee@@ seeeeeeeeeeeeeeseseee@ee@e @@@e eeee@@. This is the sort of "playing the averages" result that you would expect.
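Concretely, the generation loop is just the training-time perturbation applied over and over to a random window; something like the sketch below, where model, MATRIX_N_ROWS, and MATRIX_N_COLS are the same as in the training code and the nearest-token snap at the end is just one simple way to turn rows back into token ids:
import numpy as np
import torch

# the LSA token vectors, restored to shape (n_tokens, 128)
vectors = torch.from_numpy(np.fromfile('lsa128.bin', dtype=np.float32).reshape(-1, MATRIX_N_COLS))
x = torch.rand(1, 1, MATRIX_N_ROWS, MATRIX_N_COLS)  # uniform random starting window
delta_scale = 1.0   # bump this to e.g. 10 to push it out of the first equilibrium
noise_scale = 0.0   # set above 0 to inject noise every iteration

with torch.no_grad():
    for _ in range(200):
        enc = model.encode(x)
        row_weights, row_delta = model.predict(enc)
        # same update as in training: add the predicted perturbation to the window
        x = x + delta_scale * (row_weights.transpose(0, 1) @ row_delta)
        if noise_scale > 0:
            x = x + noise_scale * torch.randn_like(x)

# snap each row to its nearest token vector (L1 distance); mapping the ids back
# through the BPE vocabulary then gives text
rows = x.squeeze(0).squeeze(0)
token_ids = [int(torch.argmin(torch.abs(vectors - row).sum(dim=1))) for row in rows]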
What happens if we multiply the delta matrix by a factor, say 10, to push it out of that equilibrium? Well, it turns out there's another equilibrium it falls into:

And if we inject noise into the matrix every iteration to stop it from converging:

What does this mean? Nothing, I just thought it was funny. It's probably reproducing some copypasta that happens to be low entropy.
Does this embedding work for search queries? Yes, maybe too well. After playing around with it as a search index, I realized that I don't want semantic search, I want predictable search that gives me results containing my query terms. So I'm changing tack and ditching neural networks for hashing-trick random projections of bag-of-ngrams vectors. I'll probably write a post about that in the next few days.
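Roughly the kind of thing I mean, with placeholder sizes and crc32 standing in for whatever hash actually gets used (character trigrams here, but the ngram choice is still up in the air):
import numpy as np
import zlib

N_BUCKETS = 4096   # width of the hashed bag-of-ngrams count vector
N_DIMS = 64        # dimensionality after the random projection

# a fixed random projection matrix; generated once and stored alongside the index
rng = np.random.default_rng(0)
projection = rng.standard_normal((N_BUCKETS, N_DIMS)).astype(np.float32)

def embed(text, n=3):
    # hashing trick: every character ngram increments one of a fixed number of buckets
    counts = np.zeros(N_BUCKETS, dtype=np.float32)
    data = text.lower().encode('utf-8')
    for i in range(len(data) - n + 1):
        counts[zlib.crc32(data[i:i+n]) % N_BUCKETS] += 1
    # random-project the counts down to something small enough to index
    vec = counts @ projection
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec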