Options optimization à la Bellman

Mon 02 March 2026
Progress Report

Say I gave you a time machine that let you look into the future and see the prices over time of a particular stock for the next two years. You don't get to look at all the stocks; you get to pick which stock to look at and then turn on the machine and see the future prices. But actually it's kind of an unreliable time machine, so there's some probability that the price is wrong. In other words, you get to see $P(price, time)$ for all prices and times between now and two years from now.

What is the trading strategy that gives you the most profit given that information? If the price is higher in the future than it is today then you can just buy it today, but can you do better than that? And what if the price is lower than it is today? Instead of just buying the stock you can buy options. But when to buy them? And at what strike and expiration?

One way to think about this is as a Markov decision process (MDP). States are portfolios (options and cash), actions are trades, and rewards are payoffs.

A portfolio is a set of options, which are (price, strike, expiration) triples, and an amount of cash.

A payoff is the profit from selling or exercising an option, or interest at the risk-free rate for cash. This is our reward.

A trade is selling or exercising an option, or buying an option.

An option which is expired can't be exercised or sold.

At a given time, a given trade (buy/sell, option, price) has a fill probability $P(fill)$. We don't know that probability, but if we know the current price and the volatility then we can estimate it with an exponential distribution whose knee is at the current price and whose variance is the square of the volatility. That is, the probability that we get a fill increases exponentially the worse a deal it is for us, and approaches zero the worse a deal it is for our counterparty. If your time machine gives you probability estimates over prices of the underlying instead of over option prices, traditional finance has standard answers for how to translate one to the other.
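Here is one way that fill estimate might look in code. This is a minimal sketch, not a calibrated model: the function name `fill_probability` and its parameters are made up for illustration, volatility is assumed to be in the same units as the price, and an order at or through the current price is assumed to fill with certainty. An exponential distribution with variance $\sigma^2$ has scale $\sigma$, so the fill probability decays as $e^{-d/\sigma}$, where $d$ is how far the order sits on the wrong side of the market for the counterparty.

```python
import math


def fill_probability(side: str, limit_price: float,
                     market_price: float, volatility: float) -> float:
    """Estimate P(fill) for a limit order (hypothetical model, see lead-in).

    The shortfall is how much worse than the current price the order is for
    the counterparty. P(fill) is the survival function of an exponential
    distribution with scale = volatility, i.e. variance = volatility ** 2.
    """
    if volatility <= 0:
        raise ValueError("volatility must be positive")
    if side == "buy":
        shortfall = market_price - limit_price   # bidding below the market
    elif side == "sell":
        shortfall = limit_price - market_price   # asking above the market
    else:
        raise ValueError("side must be 'buy' or 'sell'")
    if shortfall <= 0:
        return 1.0  # assumption: at or through the market always fills
    return math.exp(-shortfall / volatility)


# A bid one "volatility" below the market fills with probability 1/e.
p_example = fill_probability("buy", 9.5, 10.0, 0.5)
```

The knee is the point where `shortfall` crosses zero: on one side the probability is pinned at one, on the other it falls off exponentially.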

$P(fill)$ gives our MDP its transition probabilities. Let's call trades actions $a$, portfolios states $s$, and the fill probability for a given action $P_a(s, s')$, where $s'$ is the portfolio after the trade fills. If the trade isn't filled, nothing happens, so $P_a(s, s) = 1 - P_a(s, s')$.

We can only buy options that cost less than the amount of cash we have. For some approaches that makes the reward function annoyingly non-smooth, but for us, since we're considering a finite graph, it just prunes some edges, which is fine.

We choose a time base, e.g. one hour, where we have a new state at every time step. If no actions are performed in a given time step, the cash amount increases by the interest it earned at the risk-free rate over that interval.

Our reward function for a given trade, $R_a(s, s')$, is our immediate payoff for that trade plus the risk-free rate interest on our cash since the previous tick.
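As a rough sketch of that reward, assuming hourly ticks over 251 trading days of seven hours each, and assuming the risk-free rate is quoted annually and compounded per tick (both the compounding convention and the names `tick_interest` and `reward` are my own choices for illustration):

```python
TICKS_PER_YEAR = 251 * 7  # hourly ticks: 251 trading days, 7 hours per day


def tick_interest(cash: float, annual_rate: float,
                  ticks_per_year: int = TICKS_PER_YEAR) -> float:
    """Interest earned on `cash` over one tick at the risk-free rate.

    Assumption: the annual rate compounds per tick, so one full year of
    ticks grows the cash by exactly (1 + annual_rate).
    """
    per_tick_rate = (1.0 + annual_rate) ** (1.0 / ticks_per_year) - 1.0
    return cash * per_tick_rate


def reward(payoff: float, cash: float, annual_rate: float) -> float:
    """R_a(s, s'): immediate payoff of the trade plus one tick of interest."""
    return payoff + tick_interest(cash, annual_rate)


# Doing nothing for a tick is just reward(0.0, cash, annual_rate).
idle_reward = reward(0.0, 100_000.0, 0.04)
```

A tick with no trade is the same function with a payoff of zero, which matches the rule that idle cash just accrues interest.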

Since we have a reward function and a transition probability distribution, we can plug those into value iteration, and now we have an algorithm that gives us optimal returns. We can make sure the number of states is tractable by choosing a time base, a resolution for the amount of cash (e.g. quantize into $1000 increments), and a resolution for the price and strike of options. For example, if we have a time horizon of two years and a time resolution of one hour, that's two years times 251 trading days per year times seven hours per day, so $t_{max} = 2 * 251 * 7 = 3514$. If the number of options contracts (the length of the options chain) is $o_{max} = 100$, and our bankroll is up to $1M at $1000 increments, so $b_{max} = \frac{1{,}000{,}000}{1{,}000} = 1000$, then the size of the state space is $\lvert S \rvert = t_{max} * o_{max} * b_{max} = 351{,}400{,}000$, which is an array that fits without trouble on a modern computer.
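To make that concrete, here is a small sketch of both pieces: the state-count arithmetic, and a generic value iteration over a dictionary of transitions. None of this is tuned for the full problem. The names (`state_space_size`, `value_iteration`, `toy`) and the three-tick toy MDP with its made-up fill probabilities and payoffs are illustrative assumptions, and interest is left out of the toy rewards to keep the numbers easy to check by hand. At 8 bytes per value, the full table from the example above would take about 2.8 GB.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# transitions[s][a] is a list of (probability, next_state, reward) outcomes.
# A state with no actions is terminal.
Transitions = Dict[State, Dict[Action, List[Tuple[float, State, float]]]]


def state_space_size(years: int = 2, days_per_year: int = 251,
                     hours_per_day: int = 7, chain_length: int = 100,
                     bankroll: int = 1_000_000, cash_step: int = 1_000) -> int:
    """|S| = t_max * o_max * b_max for the quantization described above."""
    t_max = years * days_per_year * hours_per_day
    b_max = bankroll // cash_step
    return t_max * chain_length * b_max


def value_iteration(transitions: Transitions, gamma: float = 1.0,
                    tol: float = 1e-9, max_sweeps: int = 10_000):
    """In-place value iteration; returns (values, greedy policy)."""
    values = {s: 0.0 for s in transitions}

    def q(outcomes):  # expected return of one action
        return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)

    for _ in range(max_sweeps):
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue  # terminal state keeps value 0
            best = max(q(outcomes) for outcomes in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: q(actions[a]))
              for s, actions in transitions.items() if actions}
    return values, policy


# Toy MDP: states are (tick, holding). Buying costs 1 and fills half the
# time; selling pays 3 and fills 80% of the time. Unfilled trades stay put.
toy: Transitions = {
    (0, "cash"): {
        "wait": [(1.0, (1, "cash"), 0.0)],
        "buy": [(0.5, (1, "option"), -1.0), (0.5, (1, "cash"), 0.0)],
    },
    (1, "cash"): {"wait": [(1.0, (2, "cash"), 0.0)]},
    (1, "option"): {
        "hold": [(1.0, (2, "option"), 0.0)],
        "sell": [(0.8, (2, "cash"), 3.0), (0.2, (2, "option"), 0.0)],
    },
    (2, "cash"): {},
    (2, "option"): {},  # the option expires worthless if never sold
}

values, policy = value_iteration(toy)
```

Because the tick is part of the state, the graph has no cycles and the sweeps settle after a few passes even with `gamma = 1.0`. In the toy, the greedy policy buys at tick 0 and sells at tick 1.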

If your time machine gives you point samples of $P(price, time)$ instead of the whole distribution, then you can use Gaussian mixtures or a kernel density estimator to fill in the in-between values.
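A kernel density estimator is only a few lines. This sketch handles one time slice, treating the samples as prices drawn for a single future time; the function name `gaussian_kde_1d`, the sample prices, and the hand-picked bandwidth are all assumptions for illustration.

```python
import math
from typing import Callable, Sequence


def gaussian_kde_1d(samples: Sequence[float],
                    bandwidth: float) -> Callable[[float], float]:
    """Return a density function that smooths `samples` with Gaussian kernels.

    Each sample contributes a Gaussian bump of standard deviation
    `bandwidth`; the bumps are averaged so the result integrates to one.
    """
    if len(samples) == 0:
        raise ValueError("need at least one sample")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical price samples for one future time step.
price_density = gaussian_kde_1d([98.0, 100.0, 101.0, 103.0], bandwidth=1.5)
p_at_100 = price_density(100.0)  # density at a price between the samples
```

The bandwidth controls how far each sample's influence spreads: too small and the density is spiky around the samples, too large and it washes out the shape.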

If you don't have a time machine, then you can estimate future prices as a Gaussian random walk, but now you're just doing a worse version of Black-Scholes, and honestly you're better off leaving that to people who have direct access to the exchanges.