xml-to-json is now a library

A while back I needed to convert a ton (millions) of small xml files to json, so I could store them in MongoDB. To that end I wrote a teensy-tiny tool called xml-to-json (GitHub, Hackage). Originally it was just a command-line tool, with all the code thrown into a single file.

So, I did a quick refactor this week to split it into a library + executable, and pushed it to GitHub (to deafening cries of joy).

Features

First, a non-feature. xml-to-json is “optimized” for many small xml files: with lots of them, you can easily take advantage of multiple cores / CPUs. Be aware that for large files (over 10MB of xml data in a single file) something starts to eat up RAM, to the tune of around 50 times the size of the file.
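To make the "many small files, many cores" point concrete, here is a rough sketch using only base. It is not the code from xml-to-json: convertFile is a made-up stand-in for the real per-file conversion, the file names are invented, and the round-robin split is just one simple way to hand out work.

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (finally)
import Control.Monad (forM)

-- | Deal a list out into n buckets, round-robin.
roundRobin :: Int -> [a] -> [[a]]
roundRobin n xs =
  [ [ x | (i, x) <- zip [0 :: Int ..] xs, i `mod` n == k ] | k <- [0 .. n - 1] ]

-- | Run an action on every item, using one worker thread per bucket,
-- and return once all workers have finished.
forEachParallel :: Int -> (a -> IO ()) -> [a] -> IO ()
forEachParallel workers act xs = do
  dones <- forM (roundRobin (max 1 workers) xs) $ \bucket -> do
    done <- newEmptyMVar
    -- 'finally' signals completion even if the action throws.
    _ <- forkIO (mapM_ act bucket `finally` putMVar done ())
    return done
  mapM_ takeMVar dones

-- | HYPOTHETICAL stand-in for the real per-file conversion.
convertFile :: FilePath -> IO ()
convertFile path = putStrLn ("would convert " ++ path)

main :: IO ()
main = do
  n <- getNumCapabilities
  -- The file names are placeholders.
  forEachParallel n convertFile ["a.xml", "b.xml", "c.xml"]
```

To actually spread the workers over several cores, this needs to be compiled with -threaded and run with +RTS -N. Since each file is small and independent, no coordination between workers is needed beyond waiting for them to finish.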

Other features:

  • You can filter xml subtrees to convert, by element name regex (and you can skip the matching tree root if you wish, converting only the child elements and down).
  • Output either a top-level json object or json array.
  • (Optionally) simplify the representation of xml text nodes in attribute-less elements (e.g. <elem>test</elem> -> {"elem": "test"})
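The subtree filtering and the text-node simplification can be sketched on a toy tree type. This is an illustration of the idea only, not xml-to-json's actual code: the Xml and Json types are made up, a plain predicate on the element name stands in for the regex, and the "value" key used for the non-simplified form is my assumption for the sake of the example.

```haskell
import Data.List (intercalate)

-- Toy XML tree: name, attributes, children.
data Xml = Element String [(String, String)] [Xml] | Text String
  deriving (Eq, Show)

-- Toy JSON value.
data Json = JObject [(String, Json)] | JArray [Json] | JString String
  deriving (Eq, Show)

-- | JSON value for an element's attributes and children.
-- With simplify on, an attribute-less element holding a single text
-- node collapses to a bare string.
elementValue :: Bool -> [(String, String)] -> [Xml] -> Json
elementValue True [] [Text t] = JString t
elementValue simplify attrs children =
  JObject
    (  [ (k, JString v) | (k, v) <- attrs ]
    ++ [ ("value", JString t) | Text t <- children ]  -- ASSUMED key name
    ++ [ (n, elementValue simplify as cs) | Element n as cs <- children ] )

toJson :: Bool -> Xml -> Json
toJson simplify (Element n as cs) = JObject [(n, elementValue simplify as cs)]
toJson _ (Text t) = JString t

-- | Outermost subtrees whose element name satisfies the predicate
-- (a predicate stands in for the element name regex here).
select :: (String -> Bool) -> Xml -> [Xml]
select p e@(Element n _ cs)
  | p n = [e]
  | otherwise = concatMap (select p) cs
select _ (Text _) = []

-- | Skip a matching root, keeping only its child elements.
childElements :: Xml -> [Xml]
childElements (Element _ _ cs) = [ c | c@(Element _ _ _) <- cs ]
childElements (Text _) = []

-- | Minimal renderer; 'show' escaping is only adequate for plain ASCII.
render :: Json -> String
render (JString s) = show s
render (JArray xs) = "[" ++ intercalate "," (map render xs) ++ "]"
render (JObject kvs) =
  "{" ++ intercalate "," [ show k ++ ":" ++ render v | (k, v) <- kvs ] ++ "}"

main :: IO ()
main = do
  let e = Element "elem" [] [Text "test"]
  putStrLn (render (toJson True e))   -- {"elem":"test"}
  putStrLn (render (toJson False e))  -- {"elem":{"value":"test"}}
```

Wrapping the converted subtrees in JArray, versus merging them into one JObject, corresponds to the choice between a top-level array and a top-level object.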

Packages used

For XML decoding, I’m using hxt (over expat, via hxt-expat). I tried a few of the xml packages on Hackage, and hxt + expat was the only combination that let me parse quickly while avoiding nasty memory issues. Apparently, tagsoup can be used with ByteStrings to avoid the same issues, but I didn’t try.

JSON is encoded using aeson.
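For a feel of the aeson side, here is a minimal example that builds the simplified value from the features list by hand and encodes it. The value is constructed directly for illustration; it does not show how xml-to-json derives it from the XML.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Value, encode, object, (.=))
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Text (Text)

-- Hand-built JSON value for <elem>test</elem> in its simplified form.
simplified :: Value
simplified = object ["elem" .= ("test" :: Text)]

main :: IO ()
main = BL.putStrLn (encode simplified)  -- {"elem":"test"}
```

encode produces a lazy ByteString, so the result can be written straight to a file or stdout without going through String.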

 


One thought on “xml-to-json is now a library”

  1. We use xml-conduit extensively for this kind of thing. It satisfies all of those needs perfectly – conduit is specifically designed for the kind of streaming you are describing (and there is also a lazy bytestring interface if you want to stick with that), simple and straightforward DOM-like and SAX-like interfaces are available for simple XML parsing, and XML cursors – kind of a monadic approach to XPath – are available for more complex XML parsing needs.
