## TL;DR

How to find a value in a range of integers when every test carries a cost – with pretty graphs at the end.

## Don’t ruin that soufflé

You know the problem. If you take it out of the oven too soon, you get a raw egg; leave it in for too long, and you’ll end up with a fat, flat omelet (or is it omelette?). A soufflé is like Schrödinger’s cat: if you open the oven to observe, it will collapse from the good-bad superposition into a slobbery glob of goo. So what to do?

The real answer is to get a good cookbook or tips from someone who knows. But let’s not bother with reality: there’s an interesting problem hiding here that’s worth exploring. Let’s assume you can only make one soufflé at a time, and that if you open the oven to check, you can’t put it back in. It will either be underdone, exactly ready, or overdone. Also, let’s assume that the ideal time to bake a soufflé is a constant, integer number of minutes. You need to find the best baking time quickly because you are hosting a party in a few hours and want to be prepared. Finally, let’s assume that we tried to make one, and after baking it for 30 minutes found that it was overdone.

So what’s the *fastest* way to find out the perfect baking time for a soufflé?

## So here’s the actual question

To state the problem more generally: we have a test, a boolean function of integers *f(t)*, that tells us whether a desired value *s* is less than or equal to the tested value *t*. Or, in Haskell:

```haskell
f :: Integer -> Bool
f t = t >= s
```

Every time we do the test (evaluate *f(t)*) we pay a penalty – in the case of a soufflé, we waste time: we have to restart the process and start baking a new one. Let’s call this penalty the *cost function*, and let it depend on the value we are testing. For our simple purposes, let it be a function from integers to integers:

```haskell
cost :: Integer -> Integer
```

Finally, we point out that *s*, the value we’re trying to find, is a random variable, with a probability function *pf* for finding *s* in a range *[x,y]*. This function satisfies:

```haskell
pf :: Integer -> Integer -> Double
-- the rule is: pf x y = sum . map pmf $ [x..y]
```

Where *pmf* (probability mass function) is a distribution on the total range *[a,b]* where *s* is known to reside:

```haskell
pmf :: Integer -> Double
```

Being a probability mass function, the rule that *pmf* must satisfy is:

```haskell
sum . map pmf $ [a..b] == 1  -- or: pf a b == 1
```
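As a concrete sketch of these rules (the range *[a,b] = [0,7]* and the uniform *pmf* are assumptions, just for illustration):

```haskell
-- Hypothetical example range; any [a,b] works the same way.
a, b :: Integer
a = 0
b = 7

-- Uniform probability mass function on [a,b].
pmf :: Integer -> Double
pmf _ = 1 / fromIntegral (b - a + 1)

-- Probability that s lies in [x,y].
pf :: Integer -> Integer -> Double
pf x y = sum . map pmf $ [x..y]

main :: IO ()
main = print (pf a b)  -- the whole range sums to 1.0
```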

The problem is then to find an algorithm that **finds s with minimal expected cost**. The average total cost of checking a range *[x..y]*, when testing a value *t* in the range, can be calculated recursively. Let’s say our algorithm decides which value to test next using a function *guessVal(x,y)*. First we need to define the conditional probability *cpf* of finding *s* in a range *[x’,y’]*, given the fact that it is known to lie in an extended range *[x,y]*:

```haskell
cpf :: Integer -> Integer -> Integer -> Integer -> Double
cpf x y x' y' = ... -- related to pf; depends on the distribution
```

Then we can write:

```haskell
avgCost :: Integer -> Integer -> Double
avgCost x y =
  if x == y
    then 0
    else fromIntegral (cost t)
         + cpf' x t     * avgCost x t
         + cpf' (t+1) y * avgCost (t+1) y
  where
    t    = guessVal x y
    cpf' = cpf x y
```

This assumes that *t* is actually in the range *[x,y]* and that *s* is known to be in that range. If the range is a single integer, there is nothing to check and thus zero cost. Otherwise, the cost is the probability-weighted average of the two next branches (the guessed value yielded a test result of false or true). Note that for any distribution, the total of the two *cpf* values above will always be 1, since we are covering the entire range on which the condition is applied. We need to minimize the expected cost, so the general problem is:

Choose *guessVal* that minimizes *avgCost* for a given range *[a,b]*.

A minor subtlety to note is that we never need to test the highest value of a range because it will always test positive. This can be expressed as the rule:

```haskell
x <= (guessVal x y) < y
```

### Simplifying assumption

Let’s assume that our probability distribution is uniform. In that case the probability function depends only on the size of the range being tested and a constant that depends on the complete range size:

```haskell
pf x y = (y - x + 1) / (b - a + 1)
-- and conditional probability:
cpf x y x' y' = (y' - x' + 1) / (y - x + 1)
```
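These closed forms can be checked directly. A runnable sketch, with an assumed example range *[a,b] = [0,7]* and `fromIntegral` added since the arithmetic mixes integers and doubles:

```haskell
-- Assumed example range.
a, b :: Integer
a = 0
b = 7

-- Uniform probability of s lying in [x,y], over the whole range [a,b].
pf :: Integer -> Integer -> Double
pf x y = fromIntegral (y - x + 1) / fromIntegral (b - a + 1)

-- Conditional probability of s in [x',y'] given that s is in [x,y].
cpf :: Integer -> Integer -> Integer -> Integer -> Double
cpf x y x' y' = fromIntegral (y' - x' + 1) / fromIntegral (y - x + 1)

main :: IO ()
main = print (cpf 0 7 0 3 + cpf 0 7 4 7)  -- the two branches cover the range
```

As claimed above, the two branch probabilities sum to 1.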

Under these conditions the avgCost function may be rewritten as:

```haskell
avgCost x y =
  if x == y
    then 0
    else fromIntegral (cost t)
         + c * ( fromIntegral (t - x + 1) * avgCost x t
               + fromIntegral (y - t)     * avgCost (t+1) y )
  where
    t = guessVal x y
    c = 1.0 / fromIntegral (y - x + 1)
```
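Here is a runnable version of this recursion, parameterised over the cost function and the guessing strategy (the `midpoint` strategy and constant cost below are hypothetical choices, just to exercise it):

```haskell
-- Expected cost on [x,y] under a uniform distribution, for a given
-- cost function and guessing strategy.
avgCost :: (Integer -> Double)             -- cost of testing a value
        -> (Integer -> Integer -> Integer) -- guessVal
        -> Integer -> Integer -> Double
avgCost cost guessVal = go
  where
    go x y
      | x == y    = 0
      | otherwise = cost t
                    + c * ( fromIntegral (t - x + 1) * go x t
                          + fromIntegral (y - t)     * go (t + 1) y )
      where
        t = guessVal x y
        c = 1 / fromIntegral (y - x + 1)

-- Binary-search-style midpoint guess.
midpoint :: Integer -> Integer -> Integer
midpoint x y = x + (y - x) `div` 2

main :: IO ()
main = print (avgCost (const 1) midpoint 0 7)  -- 3 tests on a range of size 8
```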

In many cases a uniform distribution is not a good pick – for example, when we have a pretty good guess about where *s* lies in the range, a normal distribution would be more appropriate.

## Cost functions

#### Constant cost

In the trivial case, the cost function is just a constant – most trivially, it is always 1. In that case our problem boils down to minimizing the **number of tests** and is the same as finding a value in a sorted array. The well-known solution to that is **binary search**. Our average cost is then:

```haskell
avgCost x y =
  if x == y
    then 0
    else 1 + 2 * (1/2) * avgCost x t  -- both halves cost the same
  where
    t = x + (y - x) `div` 2
```
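Since each test leaves a subrange of (roughly) half the size, the number of tests for a range of size *n* follows a simple halving recurrence – a small sketch (`numTests` is a name introduced here for illustration):

```haskell
-- Number of tests binary search needs on a range of size n:
-- each test leaves a subrange of size at most ceil(n/2).
numTests :: Integer -> Integer
numTests n
  | n <= 1    = 0
  | otherwise = 1 + numTests ((n + 1) `div` 2)

main :: IO ()
main = print (numTests 8)  -- log2 8 = 3
```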

This reduces to the expected value of *log(n)* (rounded up), where *n* is the size of the complete range.

#### Linear cost

Unfortunately, when baking a soufflé every attempt has more than a constant cost – you have to wait a certain amount of time to find out if a given baking duration is too short or too long. If time is of the essence, an appropriate cost function would be **linear and increasing**, such as *cost(t) = t*. In that case the equation becomes:

```haskell
avgCost x y =
  if x == y
    then 0
    else fromIntegral t
         + c * ( fromIntegral (t - x + 1) * avgCost x t
               + fromIntegral (y - t)     * avgCost (t+1) y )
  where
    t = guessVal x y
    c = 1.0 / fromIntegral (y - x + 1)
```
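As a runnable sketch of this equation with a concrete (and not necessarily optimal) midpoint strategy plugged in for *guessVal*:

```haskell
-- Expected cost under cost(t) = t, uniform distribution, guessing the
-- midpoint every time. Whether the midpoint is optimal here is exactly
-- the open question.
avgCostLinear :: Integer -> Integer -> Double
avgCostLinear x y
  | x == y    = 0
  | otherwise = fromIntegral t
                + c * ( fromIntegral (t - x + 1) * avgCostLinear x t
                      + fromIntegral (y - t)     * avgCostLinear (t + 1) y )
  where
    t = x + (y - x) `div` 2
    c = 1 / fromIntegral (y - x + 1)

main :: IO ()
main = print (avgCostLinear 0 3)
```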

Just a reminder: the problem is to minimize the average cost by picking the right **guessVal** function – picking the right *t* every time. If anyone solves this, let me know.

## Visualization

As suggested by this answer^1, any sensible algorithm is equivalent to a binary decision tree (given a range *[a,b]*). At each node, the algorithm tests a value *x*, spending *cost(x)*, to determine which of the branches to take. If we assign a cost value to each node in the tree, the minimal expected cost is the weighted average of the root-to-leaf total costs (where the weights depend on the distribution *pmf*; for a uniform distribution the weights are identical and can be ignored). The leaves themselves have zero cost because they signify that the result is known.

I wrote a small program that finds the best algorithm (decision tree) for a given range and cost function, and prints out the results in graphviz-compatible .dot format. Here they are for constant cost (= binary search), linear and quadratic costs for the range [0,14]. Since I don’t (yet) have a solution for the linear cost case, I was surprised to find that the best algorithm for linear cost is still binary search (**update:** that’s not true – this was a side effect of testing only small trees; I’ll write a new post about the updated analysis), despite having what is intuitively a considerable bias toward the beginning of the range. For quadratic cost (*cost(t) ~= t^2*), the best algorithm is quite lopsided, as you can see below. It’s possible that there’s a slight bias effect also for linear cost that is only prevalent for larger intervals.

## The dreaded Space Leak

The range is that small because the number of trees to test increases very fast – I could only get up to 14 before running out of memory, which is another story. The implementation is naive: the program builds all the trees before calculating the average cost of each – the right way would be to calculate the cost of each tree as soon as it is built and then discard it, so that the GC could get rid of it when memory gets tight.
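One way around building every tree explicitly is to compute the minimal expected cost per subrange directly, memoising on the pair *(x, y)*. This is a sketch of the dynamic-programming idea that comes up in the comments below, not the program that produced the graphs, and it assumes a uniform distribution:

```haskell
import qualified Data.Map as M

-- Minimal expected cost over all decision trees on [a,b], uniform
-- distribution, memoised over subranges via a lazy Map.
bestCost :: (Integer -> Double) -> Integer -> Integer -> Double
bestCost cost a b = table M.! (a, b)
  where
    table = M.fromList [ ((x, y), best x y) | x <- [a..b], y <- [x..b] ]
    best x y
      | x == y    = 0
      | otherwise = minimum [ branch x y t | t <- [x .. y - 1] ]
    branch x y t =
      cost t + c * ( fromIntegral (t - x + 1) * table M.! (x, t)
                   + fromIntegral (y - t)     * table M.! (t + 1, y) )
      where c = 1 / fromIntegral (y - x + 1)

main :: IO ()
main = print (bestCost (const 1) 0 7)  -- matches binary search: 3 tests
```

Only one `Double` per subrange is ever retained, so memory grows quadratically in the range size instead of with the number of trees.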

## So what about that soufflé?

As visualized in the graph for linear cost above, when time is the issue **you should still use binary search** (**update:** the next post will explain why this is wrong). If the cost is quadratic, give higher priority to lower values.

## Open questions

- Solve the minimization problem for the recursive equation for **avgCost** when the cost function is linear: **cost(t) = t**.
- Take into consideration other probability distributions. For example, solve for a linearly increasing distribution (where larger values are more likely to equal **s**) and a normal distribution around the middle of the range.
- How do I improve the implementation of the brute-force algorithm so it doesn’t explode in RAM?

Let me know when you’re done :)

If you have any comments or insights, I’d be glad to hear them! My family can take a few more soufflé attempts…

1. Since I couldn’t find any information about this problem I asked this question on cstheory.stackexchange.com. I got some backlash from one person about this not being a “research-level question” as dictated by the site’s FAQ… You can check out the other questions on the site to decide for yourself if this question exceeds the maximum dumbness level.

## Comments

It would also be interesting if, say, over- and under-cooking had different costs.

I’m thinking of that interview question where you have to find out from what floor a dropped egg will break in the fewest number of steps, but you only have n eggs (n = 2 most of the time, but it’s variable).

In that case *cost(t)* becomes a random variable with the rule:

`cost t = if t < s then 0 else 1`

In the `avgCost` function we’d have to plug in a probability: `P(t >= s | s in [x,y])`

Giving it a little more thought: for a uniform distribution that conditional probability ~~is linear in t, and the problem is equivalent to linear cost.~~ Wait… no, it isn’t linear in t, because it depends on the subrange [x,y] which we are calculating (where we are in the recursion). I’ll have to think about it more…

I think you might be able to improve your implementation by using dynamic programming.

Done – it solves the memory issue completely and runs faster. I’ll post an update about that; with larger trees you see that linear cost is *not* equivalent to constant cost (binary search).

Ultimately, aren’t you in all cases doing binary search, not in feature space but in *cost* space? That is, at each decision point, you choose two subtrees such that the total cost of exploring each subtree is equal? This should lead to a closed-form solution, but I’ll have to think about it more.

Noah, counterexample: in a range [a,b], let cost(a) = 0, and for any other value let cost = 1. In that case your “balanced cost subtrees” approach would lead to picking some value other than a, but a itself must be the right choice for minimizing cost overall (we can test it for free).

I also think there’s a way to show equivalence of this problem to the regular searching one.

~~For integer cost, one way would be to map the range [a,b] to a “dilated” range where each value x in the original range takes up cost(x) cells in a target array, with order retained.~~ (actually that’s also wrong)

Pingback: Penalty-search problem revisited: Dynamic Programming & Control.Parallel to the rescue | Enough Blogging!