xml-to-json – new version released, constant memory usage

I’ve released a new version (1.0.0) of xml-to-json, which aims to solve memory issues encountered when converting large XML files. The new version includes two executables: the regular (aka “classic”) version, xml-to-json, which includes the various features, and the newly added executable xml-to-json-fast, which runs with constant memory usage and can process files of arbitrary size. It does this by not validating the input xml, and by basically streaming json output as it encounters xml elements (tags) in the input. The implementation takes advantage of the cool tagsoup library for processing XML.

Check the README.md for more details. Hackage is updated.

Generate Javascript classes for your .NET types

We open-sourced another library: ClosureExterns.NET (on github and nuget). It generates Javascript classes from .NET-based backend types, to preserve type “safety” (as safe as Javascript gets) across front- and backend. As a bonus you get Google closure annotations. The type annotations are understood by WebStorm (and other editors) and improve your development experience. Also, if you use Google Closure to compile or verify your code, it will take these types into account. We use it extensively with C#. We haven’t tried it with F#, but it’s supposed to work with any .NET type.

ClosureExterns.NET makes it easier to keep your frontend models in sync with your backend. The output is customizable – you can change several aspects of the generated code. For example you can change the constructor function definition, to support inheritance from some other Javascript function. For more details see ClosureExternOptions.

Getting Started

First, install it. Using nuget, install the package ClosureExterns.

Then, expose a method that generates your externs. For example, a console application:

public static class Program
{
public static void Main()
{
var types = ClosureExternsGenerator.GetTypesInNamespace(typeof(MyNamespace.MyType));
var output = ClosureExternsGenerator.Generate(types);
Console.Write(output);
}
}


You can also customize the generation using a ClosureExternsOptions object.

Example input/output

Input

class B
{
public int[] IntArray { get; set; }
}


Output

var Types = {};

// ClosureExterns.Tests.ClosureExternsGeneratorTest+B
/** @constructor
*/
Types.B = function() {};
/** @type {Array.<number>} */
Types.B.prototype.intArray = null;


For a full example see the tests.

See it.

I spend too much time in traffic.

So much time, that I start to see things. I see patterns.

You can get stuck in a slow lane because the others are going so fast you never manage to merge. So on average, how much time do you spend stuck in unescapable lanes? Does it make a difference in the cosmic scheme? (No, it doesn’t)

I decided to build a simulator and see how driving styles affect traffic.

Source is on github.

xml-to-json is now a library

A while back I needed to convert a ton (millions) of small xml files to json, so I could store them in MongoDB. To that end I wrote a teensy-tiny tool called xml-to-json (github, Hackage). Originally it was just a command-line tool with all the code thrown in a single file.

So, I did a quick refactor this week to split it into a library + executable, and pushed it to github (to deafening cries of joy).

Features

First, a non-feature. xml-to-json is “optimized” for many small xml files. If you have many small xml files, you can easily take advantage of multiple cores / cpu’s. You should be aware that for large files (over 10MB of xml data in a single file) something starts to eat up RAM, around 50 times the size of the file.

Other features:

• You can filter xml subtrees to convert, by element name regex (and you can skip the matching tree root if you wish, converting only the child elements and down).
• Output either a top-level json object or json array.
• (Optionally) simplify representation of xml text nodes in attribute-less elements (e.g. “<elem>test</elem>” -> { elem: “test” })

Packages used

For XML decoding, I’m using hxt (over expat using hxt-expat). I tried a few of the xml packages on hackage, and hxt + expat was the only way I could parse quickly while avoiding nasty memory issues. Apparently, tagsoup can be used with Bytestrings to avoid the same issue but I didn’t try.

JSON is encoded using aeson.

C# Pitfalls, Compiler Bugs & .NET gotchas

Rarely, but not rarely enough, I find myself banging my head against some unexpected semantics of the C# language. Sometimes I’m simply misunderstanding the language, other times it’s a limitation, or even a compiler bug.

To document my findings I’ve started repository on github for a list of C# pitfalls, compiler bugs, and various other .NET framework gotchas. These are not “beginner tips” – they are things that may surprise a seasoned developer (or is it just me?).

In any case, all are welcome to join this effort – don’t hesitate to fork & send your pull requests.

Flat design sucks.

Recently Microsoft updated the design of its Dynamics CRM to a “flat” look. Since then, we can’t figure out what is clickable or editable, and what is just content.

My plea to designers: make that clickable thingy look clickable. Make that editable input box look like an input box, and not just when I hover over it.

Buttons and interface widgets, when present, need to be easily distinguishable from content. They need to have good affordances that invite users to action. In the absence of strong signifiers, they can get ignored, and users may find themselves lost and disoriented.

Not all “flat” design has indistinguishable chrome from content, as pointed out in the above article.

Brackets – a great editor for Javascript, HTML and CSS

http://brackets.io/

Besides being great, it also has a fast release cycle (2.5 weeks) and is open source. It is being developed by folks at Adobe (plus others from the OSS comunity).

Favorite features:

• Nice hinting and inline code helpers for CSS.
• Live editing of HTML and CSS – you can fire-up a browser (currently only Chrome) that is updated as soon as you change the code, no need for F5. It also highlights the HTML your cursor is at.
• Very lightweight. Yet, it has everything you need to start working on a web project. You just install it.
• It comes with JSLint built-in that immediately bugs you about your Javascript pitfalls. Other IDEs I’ve used leave it to you to set up that kind of stuff.
• Extensible. The extensions I’ve installed include CSS color hinting, and a context-menu to open a file’s folder in the OS’s file system thingy (explorer, in Windows).

Weaknesses: a few basic editor features (such as code folding or more intelligent search) are lacking, but the basics are good enough to be very productive. Extension management is currently done via manual file copying, but according to the dev blogs, they’re working on an extension manager.

Platform support: currently built only for Windows and Mac OSX.

Tip: if you’re using Live Development on a site that uses AJAX to access it’s own web server, you need to either enable cross-site requests on your development web server, or (easier) tell brackets where your development web server is serving your html/js/css from – it’s under File…Project Settings. See here for more.

Windows 8 is the worst.

Don’t install Windows 8. Don’t buy a laptop that has it.

I tried. I installed windows 8 on my personal laptop. It was so cheap – \$15 to upgrade – I decided to give it a shot.

What a mistake.

If you do anything at all with your computer, don’t use Windows 8. For a tablet – maybe, I don’t know. For a computer – no way. Just say no.

I’m not a hater or a flamer, I use almost only Microsoft products all day long (Windows 7 and Visual Studio most of my work day). I like Windows 7.

Here’s a very short list of stuff that kills me (there are many other problems):

1. Windows 8 restarts your machine WHILE YOU ARE USING IT to finish installing updates. It warns you but never gives you the chance to prevent it. You could be in the middle of reviewing an emergency patient’s diagnosis, and *BOOM* machine shuts down and is unusable for at least 5 minutes. You have no option to save your work. You have no option to stop this from happening.

2. The “metro UI” or “new start screen” or “tile area” is horribly confusing because it hides everything else. You can’t see what’s running at a glance. You can’t change this behavior.

3. The “charms” bar requires you to point your mouse to a magical location. You can’t change this. You can’t disable this (and make the freakin’ bar always visible)

4. There is no bloody CLOCK or calendar on the default screen, that annoying “start screen”. You can’t add one without installing a third-party “windows store app”.

5. The control panel is split between the “classic” control panel and some super-simplified – yet non-overlapping – set of options in the “charms” settings area. Go figure where to fix something.

Too bad, Microsoft really screwed this one up.

p.s. some search engine food: windows 8 sucks, windows 8 is bad, windows 8 is horrible, terrible, nasty, don’t use windows 8, don’t install windows 8, keep windows 7, windows 7 is better than windows 8. Google, you get the point.

What’s the problem?

This is the last post in a series dedicated to exploring the problem of “penalty search”. Penalty search involves an ordered array: the first part of the array has values that do not satisfy some property, while the second part – the rest of the values – do satisfy the property. Testing a value for the property carries an index-dependent penalty, given by a cost function. We want to find the first value (lowest index) in the array that satisfies the property. The “total cost” of a search attempt is the total penalty spent looking for the value until it is found.

The goal is to devise an algorithm that searches the array with minimal expected total cost.

Why is this interesting?

Besides being fun to wrestle with, this problem is interesting because it deals with a realistic physical situation: you need to find out the ideal timing for some process. Any time-dependent experiment can be modeled by this problem. Every time you try, you spend time, so the cost of the test is time being spent. The goal is to find the solution as quickly as possible, so we want to minimize the expected time spent searching for a solution.

Linear cost – recap

The last post explored how a linear cost function affects the recursive difference equation that calculates the expected cost. Linear cost is interesting because first, it is easier to analyze, and second, because it fits problems where the penalty is the amount of time spent waiting to check a time-sensitive process.

Analytically, the nice property of linear cost is that the size of the array being searched is what determines the cost (up to a constant), not the specific range of indexes. In that last post we took advantage of this translational quasi-invariance of a linear cost function to simplify the problem. After defining a couple of help functions we ended with the following equation:

$h(N)=N(at_N+b)+h(t_N+1)+h(N - t_N - 1)+a(N - t_N - 1)(t_N+1)$

Where:

• $N$ is the size of the region (array) that’s being searched
• $h(N)$ represents the expected cost of a region (almost – you need to divide by $N$ to get the actual cost)
• $t_N$ is the index that our algorithm decides to test in the current step
• and $at_N+b$ is the cost of testing at index $t_N$

Constant ratio search

For linear cost, we expect the solution – the optimal search algorithm – to be somewhat similar to binary section search: we always pick a constant ratio (“section”) of the current range to pick our next search target. Although I can’t prove that this is a valid assumption, I think it’s interesting enough to see what the ratio should be if we do assume constant ratio search.

To begin, we define $t_N := \lfloor rN \rfloor$ where $r \in [0,1)$ (remember we are already assuming that the region is $[0,N-1]$). To make things simpler (though admittedly less precise) we’ll drop the floor notation and treat $rN$ as if it were an integer. Substituting in the last equation, we get:

$h(N)=N(arN +b)+h(rN +1)+h(N - rN - 1)+a(N- rN - 1)(rN +1)$

With a little simplification (saving time using Maxima) we get:

$h(N)=h(rN+1)+h((1-r)N-1)+(2ar-a r^2)N^2+(-2ar+b+a)N-a$

Notice that the two recursive references to the function $h$ nicely represent testing the left and right sub-regions around the split point $rN$. I’m personally not any closer to solving that equation, so here’s where we use approximations.

In the last post we saw that for $N=1$ the value must be zero. We can use this information to solve $N=2$:

$h(2)=h(2r+1)+h(2(1-r)-1)+4(2ar-ar^2)+2(a+b-2ar)-a$

Following the same logic as in that last post, we know that we can only split a region of size 2 into two subregions of size 1. This means $h$ must be referencing recursively calls to $h(1)$ twice. This limits the range of $r$:

$\lfloor 2r \rfloor +1=1$ and $2- \lfloor 2r \rfloor -1=1$

Implying:

$\lfloor 2r \rfloor =0 \implies r < 1/2$

Here we have a first result: we must split somewhere lower than the middle. Less than makes sense if our linear curve is increasing, such that $a>0$. I didn’t have time to go back and check, but I’m guessing I somewhere made that assumption implicitly along the way.

It also makes sense: if the cost is increasing linearly, the upper bound for a constant split ratio is the middle – which is exactly binary section when cost is constant (not increasing at all, $a=0$).

It’s an interesting exercise to plug values for $N$ and see where the bounds on $r$ go. I’m leaving that for another day, or for you to try.

That’s all. If you find interesting instances of this problem, let me know!