# See it.

I spend too much time in traffic.

So much time, that I start to see things. I see patterns.

You can get stuck in a slow lane because the others are going so fast you never manage to merge. So on average, how much time do you spend stuck in unescapable lanes? Does it make a difference in the cosmic scheme? (No, it doesn’t)

I decided to build a simulator and see how driving styles affect traffic.

Source is on github.

# xml-to-json is now a library

A while back I needed to convert a ton (millions) of small xml files to json, so I could store them in MongoDB. To that end I wrote a teensy-tiny tool called xml-to-json (github, Hackage). Originally it was just a command-line tool with all the code thrown in a single file.

So, I did a quick refactor this week to split it into a library + executable, and pushed it to github (to deafening cries of joy).

## Features

First, a non-feature. xml-to-json is “optimized” for many small xml files. If you have many small xml files, you can easily take advantage of multiple cores / cpu’s. You should be aware that for large files (over 10MB of xml data in a single file) something starts to eat up RAM, around 50 times the size of the file.

Other features:

• You can filter xml subtrees to convert, by element name regex (and you can skip the matching tree root if you wish, converting only the child elements and down).
• Output either a top-level json object or json array.
• (Optionally) simplify representation of xml text nodes in attribute-less elements (e.g. “<elem>test</elem>” -> { elem: “test” })

## Packages used

For XML decoding, I’m using hxt (over expat using hxt-expat). I tried a few of the xml packages on hackage, and hxt + expat was the only way I could parse quickly while avoiding nasty memory issues. Apparently, tagsoup can be used with Bytestrings to avoid the same issue but I didn’t try.

JSON is encoded using aeson.

C# Pitfalls, Compiler Bugs & .NET gotchas

Rarely, but not rarely enough, I find myself banging my head against some unexpected semantics of the C# language. Sometimes I’m simply misunderstanding the language, other times it’s a limitation, or even a compiler bug.

To document my findings I’ve started repository on github for a list of C# pitfalls, compiler bugs, and various other .NET framework gotchas. These are not “beginner tips” – they are things that may surprise a seasoned developer (or is it just me?).

In any case, all are welcome to join this effort – don’t hesitate to fork & send your pull requests.

# Flat design sucks.

Recently Microsoft updated the design of its Dynamics CRM to a “flat” look. Since then, we can’t figure out what is clickable or editable, and what is just content.

My plea to designers: make that clickable thingy look clickable. Make that editable input box look like an input box, and not just when I hover over it.

Buttons and interface widgets, when present, need to be easily distinguishable from content. They need to have good affordances that invite users to action. In the absence of strong signifiers, they can get ignored, and users may find themselves lost and disoriented.

Not all “flat” design has indistinguishable chrome from content, as pointed out in the above article.

# Brackets – a great editor for Javascript, HTML and CSS

http://brackets.io/

Besides being great, it also has a fast release cycle (2.5 weeks) and is open source. It is being developed by folks at Adobe (plus others from the OSS comunity).

Favorite features:

• Nice hinting and inline code helpers for CSS.
• Live editing of HTML and CSS – you can fire-up a browser (currently only Chrome) that is updated as soon as you change the code, no need for F5. It also highlights the HTML your cursor is at.
• Very lightweight. Yet, it has everything you need to start working on a web project. You just install it.
• It comes with JSLint built-in that immediately bugs you about your Javascript pitfalls. Other IDEs I’ve used leave it to you to set up that kind of stuff.
• Extensible. The extensions I’ve installed include CSS color hinting, and a context-menu to open a file’s folder in the OS’s file system thingy (explorer, in Windows).

Weaknesses: a few basic editor features (such as code folding or more intelligent search) are lacking, but the basics are good enough to be very productive. Extension management is currently done via manual file copying, but according to the dev blogs, they’re working on an extension manager.

Platform support: currently built only for Windows and Mac OSX.

Tip: if you’re using Live Development on a site that uses AJAX to access it’s own web server, you need to either enable cross-site requests on your development web server, or (easier) tell brackets where your development web server is serving your html/js/css from – it’s under File…Project Settings. See here for more.

# Windows 8 is the worst.

Don’t install Windows 8. Don’t buy a laptop that has it.

I tried. I installed windows 8 on my personal laptop. It was so cheap – $15 to upgrade – I decided to give it a shot. What a mistake. If you do anything at all with your computer, don’t use Windows 8. For a tablet – maybe, I don’t know. For a computer – no way. Just say no. I’m not a hater or a flamer, I use almost only Microsoft products all day long (Windows 7 and Visual Studio most of my work day). I like Windows 7. Here’s a very short list of stuff that kills me (there are many other problems): 1. Windows 8 restarts your machine WHILE YOU ARE USING IT to finish installing updates. It warns you but never gives you the chance to prevent it. You could be in the middle of reviewing an emergency patient’s diagnosis, and *BOOM* machine shuts down and is unusable for at least 5 minutes. You have no option to save your work. You have no option to stop this from happening. 2. The “metro UI” or “new start screen” or “tile area” is horribly confusing because it hides everything else. You can’t see what’s running at a glance. You can’t change this behavior. 3. The “charms” bar requires you to point your mouse to a magical location. You can’t change this. You can’t disable this (and make the freakin’ bar always visible) 4. There is no bloody CLOCK or calendar on the default screen, that annoying “start screen”. You can’t add one without installing a third-party “windows store app”. 5. The control panel is split between the “classic” control panel and some super-simplified – yet non-overlapping – set of options in the “charms” settings area. Go figure where to fix something. Too bad, Microsoft really screwed this one up. p.s. some search engine food: windows 8 sucks, windows 8 is bad, windows 8 is horrible, terrible, nasty, don’t use windows 8, don’t install windows 8, keep windows 7, windows 7 is better than windows 8. Google, you get the point. # Penalty search, part 4 and finale: the x-section algorithm # What’s the problem? This is the last post in a series dedicated to exploring the problem of “penalty search”. Penalty search involves an ordered array: the first part of the array has values that do not satisfy some property, while the second part – the rest of the values – do satisfy the property. Testing a value for the property carries an index-dependent penalty, given by a cost function. We want to find the first value (lowest index) in the array that satisfies the property. The “total cost” of a search attempt is the total penalty spent looking for the value until it is found. The goal is to devise an algorithm that searches the array with minimal expected total cost. # Why is this interesting? Besides being fun to wrestle with, this problem is interesting because it deals with a realistic physical situation: you need to find out the ideal timing for some process. Any time-dependent experiment can be modeled by this problem. Every time you try, you spend time, so the cost of the test is time being spent. The goal is to find the solution as quickly as possible, so we want to minimize the expected time spent searching for a solution. # Linear cost – recap The last post explored how a linear cost function affects the recursive difference equation that calculates the expected cost. Linear cost is interesting because first, it is easier to analyze, and second, because it fits problems where the penalty is the amount of time spent waiting to check a time-sensitive process. Analytically, the nice property of linear cost is that the size of the array being searched is what determines the cost (up to a constant), not the specific range of indexes. In that last post we took advantage of this translational quasi-invariance of a linear cost function to simplify the problem. After defining a couple of help functions we ended with the following equation: $h(N)=N(at_N+b)+h(t_N+1)+h(N - t_N - 1)+a(N - t_N - 1)(t_N+1)$ Where: • $N$ is the size of the region (array) that’s being searched • $h(N)$ represents the expected cost of a region (almost – you need to divide by $N$ to get the actual cost) • $t_N$ is the index that our algorithm decides to test in the current step • and $at_N+b$ is the cost of testing at index $t_N$ # Constant ratio search For linear cost, we expect the solution – the optimal search algorithm – to be somewhat similar to binary section search: we always pick a constant ratio (“section”) of the current range to pick our next search target. Although I can’t prove that this is a valid assumption, I think it’s interesting enough to see what the ratio should be if we do assume constant ratio search. To begin, we define $t_N := \lfloor rN \rfloor$ where $r \in [0,1)$ (remember we are already assuming that the region is $[0,N-1]$). To make things simpler (though admittedly less precise) we’ll drop the floor notation and treat $rN$ as if it were an integer. Substituting in the last equation, we get: $h(N)=N(arN +b)+h(rN +1)+h(N - rN - 1)+a(N- rN - 1)(rN +1)$ With a little simplification (saving time using Maxima) we get: $h(N)=h(rN+1)+h((1-r)N-1)+(2ar-a r^2)N^2+(-2ar+b+a)N-a$ Notice that the two recursive references to the function $h$ nicely represent testing the left and right sub-regions around the split point $rN$. I’m personally not any closer to solving that equation, so here’s where we use approximations. In the last post we saw that for $N=1$ the value must be zero. We can use this information to solve $N=2$: $h(2)=h(2r+1)+h(2(1-r)-1)+4(2ar-ar^2)+2(a+b-2ar)-a$ Following the same logic as in that last post, we know that we can only split a region of size 2 into two subregions of size 1. This means $h$ must be referencing recursively calls to $h(1)$ twice. This limits the range of $r$: $\lfloor 2r \rfloor +1=1$ and $2- \lfloor 2r \rfloor -1=1$ Implying: $\lfloor 2r \rfloor =0 \implies r < 1/2$ Here we have a first result: we must split somewhere lower than the middle. Less than makes sense if our linear curve is increasing, such that $a>0$. I didn’t have time to go back and check, but I’m guessing I somewhere made that assumption implicitly along the way. It also makes sense: if the cost is increasing linearly, the upper bound for a constant split ratio is the middle – which is exactly binary section when cost is constant (not increasing at all, $a=0$). It’s an interesting exercise to plug values for $N$ and see where the bounds on $r$ go. I’m leaving that for another day, or for you to try. That’s all. If you find interesting instances of this problem, let me know! # Penalty search, part 3: exploring linear cost ## Recap The previous post – a second in a series discussing optimal search under test penalty – concluded with an open question. We can write an equation to describe the expected cost, which we want to minimize the following by picking the best $t_i$ for each interval $[x_i,y_i]$: $E[cost(x_i,y_i)] = c(t_i) + E[cost(x_{i+1},y_{i+1})]$ The open question is: how? $c(t_i)$ is the test cost (penalty) function. Different values have different costs, so we expect that the optimal search will take into account the different costs and be smart about picking which values to test. As we saw previously, for constant cost the algorithm is simply binary section search. We have a recursive function of two variables (the range – $x_i, y_i$) which we want to minimize by picking yet another function (the one that picks $t_i$ given the current range). I don’t know how to solve this generally, which is why I resorted to manual calculations and pretty graphs. In that same previous post I used dynamic programming – a natural problem for Haskell – to find the optimal decision tree for a specific range and cost function. Thanks to Control.Parallel it was easy to improve performance by taking advantage of today’s multi-core processors. What I would really like, though, is to actually find a closed-form solution, one that defines the optimal search algorithm. ## Disclaimer I’m no mathematician (nor a computer science wizard). If you think there’s a mistake anywhere, let me know. ## When you don’t know, make the problem simpler So what to do when you don’t know how to solve a problem? Make more assumptions.TM Hopefully, we’ll learn something from the solution of the simpler problem. ### Uniform Probability The first assumption we can make is that of uniform probability (of the search target, in the range). Here’s our equation: $E[cost(x_i,y_i)] = c(t_i) + E[cost(x_{i+1},y_{i+1})]$ The range $[x_{i+1},y_{i+1}]$ is what our algorithm decides to search in the next step, and $t_i$ is the current searched value. We expand the expectation $E[]$ of the next test (next level in the decision tree). For clarity I’ll dump the indexes since we don’t need them now: $E[cost(x,y)] = c(t) + P_{x,y}(s \le t)E[cost(x,t)] + P_{x,y}(s > t)E[cost(t+1,y)]$ Where $P_{x,y}(A) := P(A | s \in [x,y])$ is a conditional probability. From now on I’ll skip writing the expectation $E$. Using our assumption of uniform probability we can write: $cost(x,y) = c(t) + \frac{1}{y-x+1} \left ((t-x+1)cost(x,t) + (y-t)cost(t+1,y) \right )$ Notice that now our problem is mostly a function of range size, so we can define $N_{x,y}:=(y-x+1)$ and write: $cost(x,y) = c(t) + \frac{1}{N_{x,y}}(N_{x,t}cost(x,t) + N_{t+1,y}cost(t+1,y))$ Multiply by $N_{x,y}$ to get: $N_{x,y}cost(x,y) = N_{x,y}c(t) + N_{x,t}cost(x,t) + N_{t+1,y}cost(t+1,y)$ Being a non-mathematician, here’s where I said “Aha!”. We can make things simpler if we define: $g(x,y) := N_{x,y}cost(x,y)$ Since$g$is just$cost$multiplied by$y-x+1$, minimizing$g$will also minimize$cost$. Using this definition the equation becomes: $g(x,y) = N_{x,y}c(t) + g(x,t) + g(t+1,y)$ …which is beginning to look a lot more manageable. And remember – so far, the only assumption we’ve made is that of uniform probability. But how to proceed? We still have a recursive equation with two variables, x and y. ### From two variables to one Someone must have once said: “one is better than two.” Variables. In a recursive function. Let’s add another assumption – that the cost function is linear: $c(t) = at+b$ Where $a$ and $b$ are arbitrary constants. Why linear? This cost function has the nice property that the cost of any search tree is invariant under translation (of the searched range) up to a constant addition. That is, $cost(x,y)+a u=cost(x+u,y+u)$ or $g(x,y)+N_{x,y} a u = g(x+u,y+u)$ How do I know this? Here’s an informal proof. A search tree consists of nodes representing the values being tested (and cost being spent). If we translate the underlying range, we’re in effect changing the linear cost function by adding a constant that depends on the amount of translation. So the cost of each node in the search tree will vary by the same constant, leaving the distribution of cost throughout the tree equivalent and keeping the expected cost minimal compared to other trees on the same range. This property means that given a range size $N$ we can find the best search tree for just one translation of that range (say, $x=0$) and it will be correct for any other range of the same size. Plugging in the linear cost in the equation from before gives us: $g(x_i,y_i) = N_i a t_i + N_i b + g(x_i,t_i) + g(t_i+1,y_i)$ I’m explicitly writing that the choice of $t$ depends on $x,y$, just to make things clearer. We’ll use the indexed notation $t_i$ to signify that $t$ depends on the $i$th range, and use it also for the range size, that is $N_i := N_{x_i,y_i}$. Now we can use the translation invariance to replace $x_i$ everywhere with 0. At the same time we can get rid of $y_i$ since it will be equivalent to make the function depend on $N_i=y_i+1$. To that end, define: $h(N) := g(0,N-1) = g(x,x+N-1)-Nax = N(cost(x,x+N-1)-ax)$ Then, the equation above for $g(x_i,y_i)$ becomes: $h(N)=Nat_N+Nb+h(t_N+1)+h(N - t_N - 1)+a(N - t_N - 1)(t_N+1)$ That last term is due to the required translation of the $h(N-t-1)$ from the origin to $t+1$ (the original value was $g(t_i+1,y_i)$. ## Results please… Let’s look at the actual value of the expected cost for a few small N’s (region sizes): ### N=1 When the size of the region, $N$ is 1, there is nothing to check – by assumption the target value is in the region and there is only one cell, which must be it. Since there is nothing to check, there is no cost, so$cost(0,1)=0=h(1)$. Also, we don’t care what the offset (translation) of this one-celled region is – so we don’t need to add the additive term when calculating translated calls such as cost(x,x+1) because the value is always 0. ### N=2 Let’s move on to $N=2$. For a region of two cells, the only logical algorithm is to check the first cell. If it yields the value 0, then the target (first post-step cell) is cell #2, otherwise (value at first cell is 1) – the target is cell #1. So the expected cost is just the cost of the first cell – which has index 0 – which is$0a+b=b=cost(0,1)\$.

That’s what simple logic dictates. Now let’s see how our equation fairs:

$h(2) = 2at+2b+h(t+1)+(h(2-t-1)+a(2-t-1)(t+1))$

Remember that $t \in [0,N-1]$, so in this case t can only have the value 0 (matches our logic that we can only test cell 0). Since the second recursion ($h(2-t-1)$) turns out to be a 1-celled region, we delete the additive term and replace it with 0:

$h(2) = 2b+h(1)+h(1)=2b$

Since $h(2)=2cost(0,1)=2b$ the result matches our “manual logic” from above.

### N=3

Here’s where things start to get interesting. $N=3$ is the first case where we have to compare different values for $t$ to find out which one yields the minimal expected cost. The possible values are 0 and 1:

• $t=0 \Rightarrow h(3)=3b+h(1)+h(2)+2a=3b+0+2b+2a=2a+5b$
• $t=1 \Rightarrow h(3)=3a+3b+h(2)+h(1)+0=3a+3b$

So which is less? We should pick $t=0$ iff $2b \le a$. Otherwise, pick $t=1$.

This is interesting because it means that the optimal search algorithm depends on how our linear cost is defined. If the slope $a$ is “twice more dominant” than the constant $b$, then we should bias our search towards 0. Otherwise, we go with binary section and take the middle cell with index 1.

# What’s next

We could spend the rest of our lives calculating larger and larger values of N. What we really want, though, is a general, closed-form solution for any N. In my next post I’ll make another weakening assumption that will make it possible.

# Penalty-search problem revisited: Dynamic Programming & Control.Parallel to the rescue

My last post was about searching with a test penalty. The problem boils down to minimizing a recursive difference equation. Since I didn’t (and still don’t) have a closed-form solution, I had to base any conclusions on brute-force tests that find the best tree. The run time was dreadfully slow so I could calculate the solution for no more than a 15-cell array. For small arrays and a small cost constant, the solution for linear cost is just a half-interval search tree.

After writing that post I realized several things and here are the new results.

As suggested by several people, the solution can be found more efficiently using dynamic programming. I’ve implemented a revised version of the code that finds the solution by recursively finding it for subtrees. It runs much faster after this change. The DP approach also eliminates the space leak – the program uses very little memory. Also, I’ve added parallelization with a tiny bit of code using parMap to cut the execution time by a factor of 4 (on my core i5). It’s pretty cool how easy it is to parallelize purely functional code! After these changes, calculating an array one cell larger takes almost exactly three times longer.

Here’s my theory for why three times longer. Since we split the array at the test index and have two branches for each possible root node, an array of size n requires solving the subtrees for arrays of sizes 1 through n-1, twice [see note 1]. So, relative to the n-1 case we have at least twice as much work (for the two extremities that have n-1 sized sub-arrays). Additionally we have the twice the sum from 2 to n-1 of 2^-x amount of work to do (relative to the n-1 case) for the smaller sub-arrays, which converges very quickly to 1. In total, we need to do 3 times as much work when using the DP approach.

With the improved code I’ve calculated the solutions for slightly larger arrays. Here are the results for the range [0,17] (an array of size 18):

Constant cost – binary section search

Linear cost – slight bias towards lower indices

Quadratic cost – strong bias towards lower indices

## Revised recommendation for confectionery research

Ok, I’ll skip the soufflé analogy this time. Assuming a monotonically increasing cost function, a good strategy is what you’d expect: prioritize searching in the lower-costing area, using some sort of biased interval search (rather than half interval).

## Open question still…

I’m still looking for a closed-form solution to the cost equation. For those who didn’t read the last post, the problem is to find the search tree that minimizes the expected cost of a search on an array, where each test carries a cost (given by a cost function). To solve the problem, one must minimize the following value by picking the optimal test point t depending on the current range [x,y]:

\large \inline \dpi{120} \begin{align*} E[cost(x,y)] = c(t)&+P(t \ge s | s \in [x,y])cost(x,t)\\ &+ P(t

The simplifying assumptions were uniform distribution of the search target, and linear cost:

\large&space;\inline&space;\dpi{120}&space;\begin{align*}&space;E[cost(x,y)]&space;&=&space;(c_0&space;t+c_1)\\&space;&+\frac{(t-x+1)cost(x,t)&space;+(y-t)cost(t+1,y)}{y-x+1}&space;\end{align*}

Another possible simplification: assume that the search algorithm uses a constant ratio to pick the next search target. That is, for a range [x,y] the next search target will always be c * (y – x + 1) (rounded down) where c is a constant, between 0 and 1. I’ve toyed around with this one but ended up with the same basic problem – how to solve recursive equation?

The code is now up on github for your doubtless enjoyment.