Summer Work And Gigantron Processor Of Data

» 02 Jun 2008

This summer I am again working for NASA DEVELOP to create visualizations of carbon flumes over the US to demonstrate some of the possibilities with the proposed ASCENDS mission. Basically, my team will be taking a large amount of random data and a bunch of scientific models and trying to squeeze impressive visuals out of it. We don’t really know exactly how things will work out, but there will be a lot of data processing and exploration. Our team will also probably be tasked with helping other teams process data for their projects.

In my time with NASA DEVELOP, I have done a lot of data processing scripting. Most of the time these tasks are pretty straightforward and simple, but things become pains in our asses when we have to go in later and refine and build on them. Generally these scripts are just hacks and not meant to grow. A little bit of organization and encapsulation would go a long way in keeping everything sane.

Of course, I’m lazy, so making an organized attempt at data processing with directory structures and buttons not labeled “Shock the Monkey”[1] for a simple script that will take at most 10 minutes to complete and hand test is not something that is going to happen.

Over the last year, I have spent an inordinate amount of time being some form of a Rails fanboy (I also spent some time stalking Alan Kay and preparing a hippie van to follow a Phish tour), so I had seen the benefits of arcane magick in programming. Thinking about my upcoming woes with the maintenance of poorly thought-out DP scripts, I realized that I could have my organization without having to give up time that could be spent making references to episodes of Family Guy. Clearly, code generation has my back in this situation.

In this spirit, I spent yesterday hacking away with RubiGen to create Gigantron: Processor of Data. It is currently a simple set of generators that create a directory and file structure to allow for organized and tested DP projects.

I was inspired by a blog post I found on reddit about Organized Bioinformatics Experiments to ape those techniques for my data processing. Gigantron imports data into SQLite (or any other db you feel like) and accesses it through DataMapper models. The actual logic and transformations that we write for the big bucks go in Rake tasks that operate on these models.

Backing data into an RDBMS and then getting at it through the DataMapper ORM adds some initial legwork and bloats some simple tasks, but it also encourages better treatment of input sources and couples your code and its data much less. Data formats and requirements change, and having to untangle business logic from uncommented, shortcut, good-enough parse jobs is a way to ruin a perfectly good day. The other advantage to this method is that if, for whatever reason (performance), the data must stay in whatever godawful format it was shipped in (damn you HDF), the model abstraction can still make it appear like any other data source that comes out of our DB. It can be our little secret.
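To make the "little secret" idea a bit more concrete, here is a minimal sketch in plain Ruby. It is not Gigantron code and it does not use the DataMapper API; the class names, the `all` method, and the sample numbers are all made up for illustration. One class stands in for a DB-backed model, the other keeps its data in the raw text it was shipped in, and the processing code can't tell the difference.

```ruby
# Hypothetical stand-in for a database-backed model (NOT the DataMapper API).
# Rows live in an in-memory array so the sketch runs without any gems.
class TableReading
  ROWS = [
    { site: "A", co2_ppm: 384.2 }, # made-up illustrative values
    { site: "B", co2_ppm: 386.9 }
  ].freeze

  def self.all
    ROWS
  end
end

# Hypothetical model whose data stays in the format it was shipped in.
# Comma-separated text stands in for something nastier like HDF.
class RawFileReading
  RAW = "site,co2_ppm\nC,385.1\nD,387.4\n" # made-up illustrative values

  def self.all
    # Skip the header line and parse each record on demand.
    RAW.lines.drop(1).map do |line|
      site, ppm = line.chomp.split(",")
      { site: site, co2_ppm: ppm.to_f }
    end
  end
end

# Processing logic only relies on the shared `all` interface, so it does
# not care whether the rows came from a table or from a raw file.
def mean_co2(model)
  rows = model.all
  rows.sum { |r| r[:co2_ppm] } / rows.size
end

puts mean_co2(TableReading)
puts mean_co2(RawFileReading)
```

If the raw-file parsing ever gets replaced by a proper import, only `RawFileReading.all` has to change; `mean_co2` stays exactly as it is.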

I’m not entirely convinced that Rake is the right golden hammer for this job, but it does make a compelling case for itself. I need to actually try it out in the real world to see if it will work with me rather than against me in production. I think it is the Right Way™, but I need usage and feedback to know for sure.

That’s the other thing. I have written only enough of a framework for it to be something that I would give a try in production. I don’t know what works or doesn’t, but I have a system that I think will save me some time and can be evolved into something that totally doesn’t suck. It is the usage of it that will allow me to find areas of improvement and evolve Gigantron into a true processor of data/destroyer of cities. Hopefully a few gigabytes of real satellite and atmospheric data will help clarify my ideas with it.

If this looks like something that could help in some way for your work, I would be tickled pink to have other people using it. I am very interested in adding anything that could help productivity as that is the true concern here.

Check it out:

[1] I listen to Peter Gabriel while hacking sometimes, sue me.