How CoordinatedBolt Works » 03 Jan 2012

In which I don’t mention Clojure at all

Storm comes with a neat implementation of a common DRPC pattern: Linear DRPC, which handles the common case where the computation is a linear sequence of steps. The ReachTopology in storm-starter is an example of a highly parallel Linear DRPC topology. The cool thing is that at any step, for any request that comes through, you can emit as many tuples pertaining to that request as you want, and you can even specify operations that should occur only once a step has seen every tuple it will ever get for that request. The coordination that allows for this magic is completely invisible to the user and is handled through CoordinatedBolt.

A question about how CoordinatedBolt works came up on the mailing list, so I decided to look at the source code to figure out how it operates. As part of the process, I annotated some source code for my own edification. Reading code is good, so check out the annotated code.

The first thing to understand is that LinearDRPCTopologyBuilder significantly changes your topology. This is what the Reach Topology actually looks like (click for fullsize):


You can see the structure of the ReachTopology encased in the framework of the Linear DRPC topology. The bolts that implement the computation are all wrapped by CoordinatedBolts. Direct streams have been added between all of the CoordinatedBolts. The final step in the ReachTopology gets an additional input stream from prepare-request that is grouped on the request id and is simply a stream of the ids of all the requests that have come in. There is also scaffolding that carries the information needed to return the result to the proper DRPC client, which is handled by JoinResult.

CoordinatedBolts add a layer of tracking on top of other bolts. A CoordinatedBolt delegates to the underlying bolt for everything that isn't part of its own bookkeeping. Internally, each task keeps data for every request it has seen: the number of tuples received from the previous bolt (tracked by the OutputCollector when user code acks or fails a tuple, totaled across all tasks of the previous bolt), the number of tuples that each previous task has sent to this task, and the number of previous tasks that have reported how many tuples they sent. The reports from previous tasks are received over the direct stream, and this task sends its own reports downstream only once it is considered "finished". In this way, the "finished" status asynchronously cascades down the topology.

A task is only ever considered "finished" on a per-request basis, and whether it is depends on a few different factors (in the code, this is the checkFinishId method). A task in the first bolt is finished once the single request tuple from prepare-request is acked or failed. A task in a middle bolt is finished once all the tasks in the previous step have reported the number of tuples they sent to this exact task (a task that sent none must still report 0) and the number of tuples this task has acked or failed (not counting CoordinatedBolt's bookkeeping tuples) matches the number the previous step told it to expect. A task in the final bolt is finished when the conditions for a middle task are met AND it has received the id tuple from prepare-request. All of this bookkeeping is keyed by the request id in field 0 of every tuple.
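The middle-task case can be modeled with a few counters. The sketch below is a hypothetical illustration in plain Java, not Storm's actual CoordinatedBolt code; the class and method names are made up for clarity:

```java
// Hypothetical model of the per-request bookkeeping for one middle task.
class CoordinationTracker {
    private final int numUpstreamTasks;  // number of tasks in the previous step
    private int received = 0;            // tuples acked/failed by the wrapped bolt
    private int expected = 0;            // sum of counts reported by upstream tasks
    private int reports = 0;             // upstream tasks heard from so far

    CoordinationTracker(int numUpstreamTasks) {
        this.numUpstreamTasks = numUpstreamTasks;
    }

    // Called when the wrapped bolt acks or fails a (non-bookkeeping) tuple.
    void ack() { received++; }

    // Called when a count report arrives on the direct stream (0 is still a report).
    void reportCount(int count) {
        expected += count;
        reports++;
    }

    // The middle-task condition: every upstream task has reported,
    // and we've processed exactly as many tuples as they said they sent.
    boolean isFinished() {
        return reports == numUpstreamTasks && received == expected;
    }
}
```

The real implementation keeps one such record per request id and layers on the extra conditions for the first and last steps.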

Once a task is finished, if the underlying bolt implements FinishedCallback, the finishedId callback is invoked with the request id. After that, the task iterates through all the tasks in the next step, sending each one (over the direct stream) the number of tuples it sent to that task for the request. The order is important because finishedId may (and usually does) emit more tuples, which affects the final counts.

A task checks whether it is finished every time it receives a bookkeeping tuple and every time a tuple is acked or failed by the user-provided bolt.

Once the topology completes the request, JoinResult puts the result together with the DRPC return info. ReturnResult handles the actual sending of the result back to the DRPC client that made the call.

The really cool part of all of this is that it is built entirely on top of normal Storm primitives. As Nathan said on the mailing list:

Just want to point out the underlying primitives that are used by CoordinatedBolt: 1) When you call the "emit" method on OutputCollector, it returns a list of the task ids the tuple was sent to. This is how CoordinatedBolt keeps track of how many tuples were sent where. 2) CoordinatedBolt sends the tuple counts to the receiving tasks using a direct stream. Tuples are sent to direct streams using the "emitDirect" method whose first argument is the task id to send the tuple to. 3) CoordinatedBolt gets the task ids of consuming bolts by querying the TopologyContext.
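To make primitive (1) concrete, here is a toy model in plain Java of how a sender can use the task ids returned by emit to tally per-destination counts that it would later report over the direct stream. This is illustrative only; the class name, the round-robin routing, and the method shapes are inventions, not Storm's API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: "emit" returns the task ids a tuple was routed to,
// letting the sender count how many tuples went to each downstream task.
class EmitTracker {
    private final List<Integer> downstreamTasks;
    private final Map<Integer, Integer> sent = new HashMap<>();
    private int next = 0; // round-robin stand-in for a real stream grouping

    EmitTracker(List<Integer> downstreamTasks) {
        this.downstreamTasks = downstreamTasks;
    }

    // Stand-in for OutputCollector.emit(tuple): route the tuple and
    // return the ids of the tasks it was sent to.
    List<Integer> emit(String tuple) {
        int task = downstreamTasks.get(next++ % downstreamTasks.size());
        sent.merge(task, 1, Integer::sum);
        return Collections.singletonList(task);
    }

    // The per-task counts that would be emitDirect-ed downstream
    // once this task finishes a request.
    Map<Integer, Integer> counts() {
        return Collections.unmodifiableMap(sent);
    }
}
```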

Testing Storm Topologies Part 2 » 21 Dec 2011

Previously, I wrote about testing Storm topologies using the built-in Clojure testing utilities. You should read Part 1 to understand what Storm gives you by default. This should be enough to test many topologies that you may want to build. This post digs in to more advanced testing scenarios, using the RollingTopWords topology from storm-starter as an example. I’ve forked that project to write tests for the provided examples.

But first, a brief digression.

Why using Clojure to test your Java topologies is not so bad

Currently, the testing facilities in Storm are only exposed in Clojure, though this seems likely to change in 0.6.2. Even if you write nearly everything in Java, I think Clojure offers a lot of value as the testing environment. You've already paid the price for the Clojure runtime through the use of Storm, so you might as well get your money's worth out of it. Clojure macros and persistent data structures turn out to be really helpful when writing tests. In normal usage, mutable data structures shared between threads can be a good fit if you are careful with thread safety and locks, but tests benefit from different constraints. Especially when testing a system like Storm, you often want to capture the state at a given time, perform some operation, and then ensure that the state changed as expected. While this can be accomplished using careful bookkeeping and setup, it's almost pathetically easy when you can compare the old state with the new state side by side. Clojure is also significantly terser than Java, so you can experiment with new tests with less typing.

Learning Clojure isn’t exceptionally difficult, especially if you have had some exposure to functional programming (Ruby counts). I read a book on it a month ago and have an acceptable grasp on it. The amount that you need to know to write tests in it is pretty small. You can mostly just use Java in it like so:

(Klass. arg1) ; new Klass(arg1)

(Klass/staticMethod) ; Klass.staticMethod()

(.method obj arg1 arg2) ; obj.method(arg1, arg2)

In any case, I personally like using Clojure to test topologies, no matter what language they were originally written in.

Dances with RollingTopWords

RollingTopWords is a pretty cool example that takes in a stream of words and returns the top three words in the last ten minutes, continuously. You have a counter bolt (“count” in the topology) that uses a circular buffer of buckets of word counts. In the default configuration, there are 60 buckets for 10 minutes of data, so the current bucket gets swapped out every 10 seconds. When a word comes in, that word’s count in the current bucket is incremented, and the bolt emits the total count of that word in all buckets. A worker thread runs in the background to handle the clearing and swapping of buckets. The word and its count are then consumed by the “rank” bolt, which updates its internal top 3 words and then, if it hasn’t sent out an update in the last 2 seconds, emits its current top 3 words. This is consumed by one “merge” bolt that takes the partial rankings from each “rank” task and finds the global top 3 words. If it hasn’t sent out an update in the last 2 seconds, it emits the rankings.
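The bucketed counting scheme behind the "count" bolt can be sketched roughly like this. This is a simplified, hypothetical version in plain Java; the actual storm-starter bolt differs in details such as thread safety and when it emits:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a rolling counter: a circular buffer of buckets.
// A background thread would call rotate() periodically (every 10 seconds
// in the default configuration described above).
class RollingCounter {
    private final Map<String, long[]> counts = new HashMap<>();
    private final int numBuckets;
    private int current = 0;

    RollingCounter(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    // Increment a word's count in the current bucket and return the
    // word's total across all buckets (what the bolt would emit).
    long increment(String word) {
        long[] buckets = counts.computeIfAbsent(word, w -> new long[numBuckets]);
        buckets[current]++;
        long total = 0;
        for (long c : buckets) total += c;
        return total;
    }

    // Advance to the next bucket, wiping whatever it held: counts older
    // than numBuckets rotations fall out of the window.
    void rotate() {
        current = (current + 1) % numBuckets;
        for (long[] buckets : counts.values()) {
            buckets[current] = 0;
        }
    }
}
```

With 60 buckets and one rotation every 10 seconds, a count survives for 10 minutes before its bucket is reused.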

This topology's behavior depends extensively on time, which makes it harder to test than topologies that are simply a pure function of their input. In writing the test for RollingTopWords, I first had to make a few changes to the source code to allow time simulation. Storm comes with the utilities backtype.storm.utils.Time and backtype.storm.utils.Utils that allow for time simulation. Anywhere you would normally use System.currentTimeMillis(), use Time.currentTimeMillis(), and where you would use Thread.sleep(ms), use Utils.sleep(ms). When you are not simulating time, these methods fall back on the normal ones. The timing element also makes complete-topology kind of useless for getting any sort of interesting results, so I use capturing-topology from my own storm-test library. It is basically an incremental, incomplete complete-topology.
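The idea behind the Time utility can be illustrated with a minimal simulatable clock. This is a sketch only, with invented names; the real backtype.storm.utils.Time also coordinates sleeping threads, which is what makes Utils.sleep(ms) simulation-aware:

```java
// Minimal sketch of a clock that can be switched into a simulated mode,
// inspired by (but much simpler than) backtype.storm.utils.Time.
class SimulatedClock {
    private static boolean simulating = false;
    private static long simulatedMillis = 0;

    static void startSimulating() {
        simulating = true;
        simulatedMillis = 0;
    }

    static void stopSimulating() {
        simulating = false;
    }

    // Code under test calls this instead of System.currentTimeMillis(),
    // so tests control the clock while production falls back to wall time.
    static long currentTimeMillis() {
        return simulating ? simulatedMillis : System.currentTimeMillis();
    }

    // Tests advance time explicitly; no real waiting required.
    static void advanceTime(long ms) {
        simulatedMillis += ms;
    }
}
```

This is why ten minutes of simulated time can pass in a fraction of a second of real time: the test just moves the counter.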

Testing is now a matter of ensuring two things:

  1. Word counts are tabulated for a time period and then rotated.
  2. Ranks are actually calculated and emitted correctly.

The first is especially time-sensitive, since a bucket is current for all of 10 (simulated) seconds. The capturing-topology helpers wait-for-capture and feed-spout-and-wait! both depend on simulate-wait, which takes at minimum 10 simulated seconds (and up to TIMEOUT seconds, in increments of 10). advance-cluster-time from backtype.storm.testing also requires care, as by default it only advances the simulated time one second at a time (which is slow in real time). If you jack the increment amount up past the heartbeat timeout (30 by default), which seems reasonable when you're trying to jump 10 minutes into the future, your cluster will start restarting itself for lack of heartbeats. In this example, any increment greater than 10 will confuse the worker thread handling the cleanup, producing weird results. Time is stopped while simulating, so, while still complicated, you can be fairly precise in your control.

To test the first, the boilerplate looks like:

(deftest test-rolling-count-objects
    (with-simulated-time-local-cluster [cluster]
      (with-capturing-topology [ capture
                                 :mock-sources ["word"]
                                 :storm-conf {TOPOLOGY-DEBUG true} ]

At this point, the time is now 10s.

It’s time to test the single bucket functionality by feeding in a bunch of words and making sure the count is as we expect.

        (feed-spout! capture "word" ["the"])
        (feed-spout! capture "word" ["the"])
        (feed-spout! capture "word" ["the"])
        (feed-spout-and-wait! capture "word" ["the"])
        (is (= ["the" 4]
               (last (read-current-tuples capture "count"))))

The time is now 20s because of the wait after the four tuples are fed in.

We should advance time so we can test the case where multiple buckets are in play.

        (advance-cluster-time cluster 50 9)

Time is now 70s, advanced in increments of 9 to let the worker thread do its business and avoid nasty timeouts.

        (feed-spout! capture "word" ["the"])
        (feed-spout-and-wait! capture "word" ["the"])
        (is (= ["the" 6]
               (last (read-current-tuples capture "count"))))

Time is now 80s. Let’s advance the cluster so the first bucket is now a long lost memory, but the second bucket we wrote to is still in play. To check that, we pump another word in and check the counts coming out.

        (advance-cluster-time cluster 540 9)
        (feed-spout-and-wait! capture "word" ["the"])
        (is (= ["the" 3]
               (last (read-current-tuples capture "count"))))

And that's that. Over 10 minutes of fake time simulated in under 10 seconds of real time. The only thing left in this test is to close it out in true Lisp fashion:

        )))

The test for the rankings that come out of the system is similar, but much simpler: as long as there are at least 2 seconds between each ranking-producing tuple and less than 10 minutes of total simulated test time, things pretty much just work. The feed-spout-and-wait! calls give at least 10 seconds of spacing, which works out perfectly. The details of that test can be seen in test/storm/starter/test/jvm/RollingTopWords.clj.


I released storm-test version 0.1.0 today. It's installable using the standard lein/clojars magic as [storm-test "0.1.0"]. In addition to the capturing-topology that this blog post demonstrated, it also has the quiet-logs functionality and a visualizer for topologies that could be helpful on certain hairier setups.

I should probably plug my company, NabeWise, as it is the reason I get to get my hands dirty with all of this data processing. We’re doing really exciting things with Clojure, Node.js, Ruby, and geographic data.

Testing Storm Topologies (in Clojure) » 17 Dec 2011

"Storm": is a very exciting framework for real-time data processing. It comes with all sorts of features that are useful for incremental map reduce, distributed RPC, streaming joins, and all manner of other neat tricks. If you are not already familiar with Storm, it is well documented on the "Storm wiki": . At "NabeWise":, we are in the process of creating and rolling out a new system that builds on top of Storm. Storm's Clojure DSL is really very good and allows us to write normal Clojure code that we can then tie up into topologies. This system will enable a large chunk of our feature set and will touch much of our data. Testing that the functionality works as expected is extremely important to us. By using Clojure, we can test much of our system without thinking about storm at all. This was critical while we were writing core code before even having decided on using storm. The functions that end up running our bolts are tested in the usual ways without dependency or knowledge of their place in a topology. We still want to be able to test the behavior of our entire topology or some part of it to ensure that things still work as expected across the entire system. This testing will eventually include test.generative style specs and tests designed to simulate failures. Luckily, Storm ships with a ton of testing features that are available through Clojure (and currently only through Clojure, though this is liable to change). You can find these goodies in "src/clj/backtype/storm/testing.clj": These tools are pretty well exercised in "test/clj/backtype/storm/integration_test.clj": . We will look into the most important ones here. h4. with-local-cluster This macro starts up a local cluster and keeps it around for the duration of execution of the expressions it contains. You use it like:
  ; assumes topology is defined elsewhere
  (with-local-cluster [cluster]
    (submit-local-topology (:nimbus cluster)
                           "test"
                           {TOPOLOGY-DEBUG true}
                           topology)
    (Thread/sleep 1000))
This should be used when you mostly just need a cluster and are not using most of the other testing functionality. We use this for a few of our basic DRPC tests.

h4. with-simulated-time-local-cluster

This macro is exactly like the previous one, but sets up time simulation as well. The simulated time is used in functions like complete-topology when time could have some impact on the results coming out of the topology.

h4. complete-topology

This is where things start getting interesting. @complete-topology@ will take in your topology, cluster, and configuration, mock out the spouts you specify, run the topology until it is idle and all tuples from the spouts have been either acked or failed, and return all the tuples that have been emitted from all the topology components. It does this by requiring all spouts to be FixedTupleSpouts (either in the actual topology or as a result of mocking). Mocking spouts is very simple: just specify a map of spout id to vector of tuples to emit (e.g. @{"spout" [["first tuple"] ["second tuple"] ["etc"]]}@). Simulated time also comes into play here, as every 100 ms of wall-clock time looks to the cluster like 10 seconds. This has the effect of making timeout failures materialize much faster. You can write tests with this like:
  ; assumes topology is defined elsewhere
  (with-simulated-time-local-cluster [cluster]
    (let [ results (complete-topology cluster
                                      topology
                                      :mock-sources
                                      {"spout" [["first"]
                                                ["second"]]}) ]
      (is (ms= [["first transformed"] ["second transformed"]]
               (read-tuples results "final-bolt")))))
All the tuples emitted from any bolt or spout can be found by calling @read-tuples@ on the result set with the id of the bolt or spout of interest. Storm also comes with the testing helper @ms=@, which behaves like normal @=@ except that it converts all arguments into multi-sets first. This prevents tests from depending on ordering (which is not guaranteed or expected).

As cool as @complete-topology@ is, it is not perfect for every scenario. FixedTupleSpouts do not declare output fields, so you can't use them when you use a field grouping on a bolt straight off of a spout. (*Correction*: Nathan Marz pointed out that FixedTupleSpouts will use the same output fields as the spout they replace.) You also give up some control over timing (simulated or otherwise) of the dispatch of your tuples, so some scenarios, like the RollingTopWords example in "storm-starter": which only emits tuples after a certain amount of time has passed between successive tuples, will not be predictably testable using complete-topology alone.

This is where simple testing seems to end. I'm including the next macro for completeness and because I think it could be potentially useful for general testing with some wrapping.

h4. with-tracked-cluster

This is where things start to get squirrelly. This creates a cluster that can support a tracked topology (which must be created with @mk-tracked-topology@). In your topology, you most likely want to mock out spouts with FeederSpouts constructed with @feeder-spout@. The power of the tracked topology is that you can feed tuples directly in through the feeder spout and wait until the cluster is idle after those tuples have been emitted by the spouts. Currently, this seems to be mainly used to check the behavior of acking in the core of Storm. It seems like, with AckTracker, it would be possible to create a @tracked-wait-for-ack@ type function that could be used to feed in tuples and wait until they are fully processed. This would open up testing with simulated time for things like RollingTopWords.

h3. Testing Style

The first thing I like to do with my tests is to keep them as quiet as possible. Storm, even with TOPOLOGY_DEBUG turned off, is very chatty. When there are failures in your tests, you often have to sift through a ton of Storm noise (thunder?) to find them. Clojure Contrib's logger and Log4J in general are surprisingly hard to shut up, but tucking the following code into a utility namespace does a pretty good job of keeping things peaceful and quiet.
  (ns whatever.util
    (:use [clojure.contrib.logging])
    (:import [org.apache.log4j Logger]))

  (defn set-log-level [level]
    (.. (Logger/getRootLogger)
        (setLevel level))
    (.. (impl-get-log "") getLogger getParent
        (setLevel level)))

  (defmacro with-quiet-logs [& body]
    `(let [ old-level# (.. (impl-get-log "") getLogger
                           getParent getLevel) ]
       (set-log-level org.apache.log4j.Level/OFF)
       (let [ ret# (do ~@body) ]
         (set-log-level old-level#)
         ret#)))
For testing the results of a topology, I like to create a function that takes the input tuples and computes the expected result in the simplest way possible. It then compares that result to what comes out the end of the topology. For sanity, I usually ensure that this predicate holds for the empty case. As an example, here is how I would test the word-count topology in storm-starter:
  (defn- word-count-p
    [input output]
    (is (= (frequencies
             (reduce
               (fn [acc sentence]
                 (concat acc (.split (first sentence) " ")))
               [] input))
           ; works because last tuple emitted wins
           (reduce
             (fn [m [word n]]
               (assoc m word n))
             {} output))))
  (deftest test-word-count
    (with-simulated-time-local-cluster [cluster :supervisors 4]
      (let [ topology (mk-topology)
             results (complete-topology
                       cluster
                       topology
                       :mock-sources
                       {"1" [["little brown dog"]
                             ["petted the dog"]
                             ["petted a badger"]]
                        "2" [["cat jumped over the door"]
                             ["hello world"]]}
                       :storm-conf {TOPOLOGY-DEBUG true
                                    TOPOLOGY-WORKERS 2}) ]
        ; test initial case
        (word-count-p [] [])
        ; test after run
        (word-count-p
          (concat (read-tuples results "1")
                  (read-tuples results "2"))
          (read-tuples results "4")))))
h3. Conclusion

This is my current thinking about testing Storm topologies. I'm working on some tests that incorporate more control over ordering and timing, as well as hooking a topology test into test.generative or something of that sort, so that I can test how a large number of unpredictable inputs will affect the system as a whole.

"Part 2": is now available.

Painless Widget Armor » 09 Nov 2010

Developing attractive widgets for embedding on random pages can be an exercise in frustration. For "NabeWise":, we've been through many iterations of our widgets for purely technical reasons with almost no change in styling (though we have some new designs in the pipeline that will significantly improve look and feel). Our first iteration was a simple iframe embed. After a frantic call from our "SEO guy":http:// , we realized that we probably wanted to get some Google Juice out of these things, so we finally dove into the hell that is CSS armoring.

The current version that we're offering is based on "": . This armor is very thorough, but it's a pain to actually work with, causing the simple header and footer that wrap around the content iframe to take almost as much time to style as the actual content of the widget.

We're in the process of drastically changing how we do widgets, shifting from the iframe technique to fully JavaScript-templated widgets through Mustache and a new (private) API. This of course means dealing with more armor (a fate we tried and failed to cheat through less thorough CSS resetting). When it finally became clear that we were going to have to use real armor (1am, last night), there was much gnashing of teeth. Luckily, the process this time around was painless and finished in time to catch the tail end of a rerun of Millionaire Matchmaker (1:45 am, Bravo).

!/images/armor.jpg![1]

The key this time was using "CleanSlateCSS": . Hey, if it's good enough for the BBC, it's good enough for me. The two changes necessary for this to work were adding the class "cleanslate" to our widget container and then changing all of our CSS rules to !important. We already had all of our CSS written, and I had no intention of adding !important to each declaration manually (and then remembering to always do so in the future), so I whipped up a quick hack to do it for me based on "CssParser": . Just call CssImportantizer.process(css_string) and it's done for you.
  require 'css_parser'

  class CssImportantizer
    class << self
      include CssParser

      def process(string)
        s = string.dup
        parser = Parser.new
        parser.load_string!(s)
        parser.each_rule_set do |rule_set|
          rule_set.each_declaration do |prop, val, important|
            rule_set.add_declaration!(prop, val + " !important")
          end
        end
        parser.to_s
      end
    end
  end
Because of the way CssParser handles existing !important values (setting them as a separate flag on the parsed data), just resetting the declaration with string concatenation of " !important" works. The one major caveat is that the resulting string will not be compressed, so you're going to want to pass the result through the YUI CSS Compressor or similar before using it. In any case, this worked like a charm, and I think I had to change exactly one other rule in our CSS to make it perfect.

fn1. Photo from "marfis75":

Quick, Nimble Languages » 01 Oct 2009

h4. or Why the Mainstream Will Never Steal Our Surplus

I am a programming languages bigot. There — I admit it. I write Rails code for startups and turn my nose up at those who slave away in Eclipse working with Java. I have always assumed that the promised land of modern, dynamic languages bears fruit for anyone who seeks it. I never really considered the "why" of Java, as that would interfere with my unbridled hate.

My Software Engineering class is centered around the Miltonian task of justifying Java's ways to man. The professor constantly harps on the benefits for maintenance that safety, strictness, and explicit verbosity provide to the development team, yet dynamic languages feel more productive and seem to entirely outstrip Java development in getting things done.

Many prominent figures support this observation. David Heinemeier Hansson refers to the productivity gap as the "great surplus":, while Steve Yegge comments on "how much more productive dynamic languages are": . Yegge's talk addresses how the current complaints about tooling, performance, and maintainability in dynamic languages are mostly bunk. DHH believes that if the Mainstream started using Rails, we'd lose our competitive advantage over them. The maintenance issue in large systems is generally dodged by asserting that those who use dynamic languages don't NEED millions of lines of code.

Would switching to nimble, dynamic languages give mainstream Java code shops the surplus or productivity boost that so many of us enjoy? From the code I've seen, the average developer writes overly verbose dynamic code. This can be quite bad because of the extreme expressiveness of these languages, where every extra line makes the program more likely to hit upon the issues that legitimately make dynamic code unsafe. The added strictness and constant line noise of Java more readily highlight potential sources of failure. The other issue is that many programmers depend on rich IDE tooling and "debuggers": to understand even the standard operation of their code.

!/images/safety-scissors.jpg!

Java makes doing dangerous (or interesting) things painful. It protects programmers from themselves. If nothing else, it allows bad code to be isolated and encapsulated away from the rest of the system, protecting the other workers. This comes at the cost of velocity. I believe that this speed penalty is only a major factor for better programmers, because Java only adds to the slowdown caused by inefficient tools. The other form of slowdown is the kind caused by not fully understanding the problem or solution. The latter is by far the more significant source of pain for the developer, and in comparison the specific technology stack is largely irrelevant.

The development team needs to consist of good programmers to make the best use of dynamic languages. Larger or more mixed teams are probably better off sticking with safer Java. This is also advantageous because nearly everyone knows Java, so finding staff is much easier.

This helps explain why dynamic languages are such a great fit for startups. For startups, time is the critical factor, teams are small and carefully selected, earning potential is high, and the applications tend to be exciting and consumer oriented. These factors conspire to attract developers who can make the most of the latest and greatest technologies. Dynamic languages might be our "secret" advantage, but there is little danger that the mainstream will be able to use them to overtake us.

Snippet: List the Gems Your App Needs » 06 Aug 2009

When you aren't careful, it is easy to slip gems into your app without properly accounting for them. Oftentimes it is simpler to just lean on system gems than to mess with @config.gem@. This makes deployment more difficult and can make bringing a new development environment online take significant time and energy. To fix this later, you need some idea of the gems on which your app depends. Put this snippet into the Rakefile below the boot line and run @rake test | grep GEM@:
    module Gem
      class << self
        alias orig_activate activate
        def activate(*args)
          puts "GEM: #{args.first}"
          orig_activate(*args)
        end
      end
    end

Expiring Rails File Cache by Time » 13 Jul 2009

One of the major weaknesses of the Rails cache file store is that cached pages cannot be expired by time. This poses a problem for keeping caching as simple as possible. The solution I came up with stores cached content as a JSON file containing the content and the ttl. Expiration need only be set when @cache@ is called; @read_fragment@ should know nothing about expiration.
  <% cache('home_most_popular', :expires_in => 10.minutes) do %>
    <%= render :partial => 'most_popular' %>
  <% end %>
The code that makes this work should go in @lib/hash_file_store.rb@
  require 'json'

  class HashFileStore < ActiveSupport::Cache::FileStore
    def write(name, value, options = nil)
      ttl = 0
      if options.is_a?(Hash) && options.has_key?(:expires_in)
        ttl = options[:expires_in]
      end
      value = JSON.generate({'ttl' => ttl, 'value' => value.dup})
      super(name, value, options)
    end

    def read(name, options = nil)
      value = super
      return unless value

      # malformed JSON is the same as no JSON
      value = JSON.parse(value) rescue nil

      fn = real_file_path(name)
      if value && (value['ttl'] == 0 ||
          (File.mtime(fn) > (Time.now - value['ttl'])))
        value['value']
      end
    end
  end
Put the following line in @config/environment.rb@ and it should be good to go.
  ActionController::Base.cache_store = :hash_file_store,
                                       "#{RAILS_ROOT}/tmp/cache"

Old Yeller: Finding Dead Code » 04 Jul 2009

As Rails applications age, they tend to cruft up in ways that make maintenance difficult. Refactoring is the solution, and common best practices like unit testing and regular runs of "flog": and "flay": help facilitate this. Unfortunately, these techniques do not greatly help with the elimination of dead code.

Rails makes the accumulation of dead code, code that is never run in the application, very easy. Be it templates that are overridden by better, more format-specific alternatives (e.g. @.html.erb@ vs. @.erb@), helper methods that are not excised when the views are changed, convenience methods on models that no longer reflect a use of the model, or controller actions for which no route exists, dead code confuses and complicates working with a code base. Refactoring a block of code that is never actually used is also an exercise in wasted frustration.

To combat this, I have released "Old Yeller": . Using the power of RCov, Shoulda, Rails routing, and hilarious monkey patching, Old Yeller will tell you what Ruby code is not run and which templates are never rendered. It automatically generates test cases for every rule in your routes and then runs the code. This creates an RCov coverage report for the application and a list of unused templates.

h4. Caveats

This tool is not perfect. For it to be effective, you must correctly configure parameters for your actions in @dead_code.rb@ and specify working data in your test fixtures. Only routes that specify both controller and action will be run; Old Yeller is just too old to deal with catch-all routes or route precedence. Code that is reported as not run may actually be live code that gets called in some scenario that is not exercised by the test cases or test data. Code that is reported as being run is most certainly live. Before deleting code, use @ack@/@grep@ and common sense.

Once again, the link is , tell your friends.

Common Rails Beginner Issues » 27 Jun 2009

I've recently become somewhat addicted to "Stack Overflow":, and I have noticed some areas of confusion with using Ruby on Rails. Convention over configuration and awesome magick are pretty foreign concepts in most of CS, so the confusion is quite understandable. The Rails community also seems to have a love for demonstrating simple applications being created simply (Blog in 5 minutes, extreme!) and complex corner cases solved through advanced techniques. There isn't much in the way of middle ground. If you are planning on doing a lot of Rails and are okay with buying dead-tree books, you should go buy The Rails Way by Obie Fernandez at your soonest convenience. It is worth its (considerable) weight in gold[1].

The following are what I've observed to be among the hardest issues for people moving from idealized Rails applications to the realities of actual websites. I will try to mostly avoid the issues of separation between the levels of MVC and any of the philosophical opinions of DRY.

h3. Dependencies: How to make Rails find your classes

Starting out, Rails is amazing at automagically including everything you need to make the code run without any need for @require@ or @load@ lines. Things get a little more confusing when you write your own code and find that it is not being included where you expect. The key here is that Rails uses *naming conventions* to handle code loading. The naming convention is that class names are in CamelCase and the files containing them are named using underscores.
  # filename: foo.rb
  class Foo
  end
  # filename: foo_bar.rb
  class FooBar
  end
  # filename: admin/foo_bar.rb
  class Admin::FooBar
  end
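The mapping is essentially ActiveSupport's @underscore@ applied to the constant name. A simplified stand-in for it (not the real implementation, which handles more edge cases):

```ruby
# Simplified version of the class-name -> file-name convention.
# Rails' real lookup uses ActiveSupport's String#underscore.
def underscore(class_name)
  class_name.
    gsub("::", "/").                    # Admin::FooBar -> Admin/FooBar
    gsub(/([a-z0-9])([A-Z])/, '\1_\2'). # FooBar        -> Foo_Bar
    downcase                            # -> foo_bar
end
```

So when Rails hits an unknown constant @Admin::FooBar@, it looks for @admin/foo_bar.rb@ on the load path.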
Classes that serve as models, views, controllers, or sweepers should go in their properly named folders under @app/@. Other classes should either be placed in @lib/@ or factored out into plugins. It is worth noting that files are only loaded when the class name is referenced in the code. Whenever an unknown class name (or any constant) is used in code, Rails attempts to find a file that might define it (according to the file name) and then loads it.

h3. Routing

@config/routes.rb@ is kind of a mess. This is what you need to know.

h5. Connect a url to a controller action

This makes @/foo/42@ go to FooController#bar with params[:id] = 42
  map.connect '/foo/:id', :controller => 'foo', :action => 'bar'
This does the same, but only for GET requests
  map.connect '/foo/:id', :controller => 'foo', :action => 'bar', 
      :conditions => {:method => :get}
This validates that params[:id] is a number (/foo/42 will match, /foo/baz will not)
  map.connect '/foo/:id', :controller => 'foo', :action => 'bar', 
      :id => /\d+/
h5. Named routes

Linking to controllers through @link_to "click here", :controller => 'foo', :action => 'bar', :id => 42@ is cumbersome. Named routes allow you to reference this same path with @foo_path(42)@
  map.foo '/foo/:id', :controller => 'foo', :action => 'bar'
All previously discussed options work the same for named routes.

h5. RESTful resources

If you use REST (and you should), routes get a whole new load of magick.
  map.resources :products
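For reference, here is what that one line buys you. @map.resources :products@ generates the seven conventional CRUD routes (written out here as a plain hash from verb and path pattern to controller action):

```ruby
# The seven routes generated by `map.resources :products`.
RESOURCE_ROUTES = {
  ["GET",    "/products"]          => "index",
  ["GET",    "/products/new"]      => "new",
  ["POST",   "/products"]          => "create",
  ["GET",    "/products/:id"]      => "show",
  ["GET",    "/products/:id/edit"] => "edit",
  ["PUT",    "/products/:id"]      => "update",
  ["DELETE", "/products/:id"]      => "destroy",
}
```

You also get the named route helpers (@products_path@, @new_product_path@, @edit_product_path(id)@, and friends) for free.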
A related gotcha that bites people writing REST actions for non-browser clients: Rails' CSRF protection will reject their POSTs with an InvalidAuthenticityToken error. You can skip the check for a specific action:
  skip_before_filter :verify_authenticity_token, 
      :only => :action_name
That's it for now. The Rails API docs are a great reference for other issues that might come up.

fn1. Ron Paul has proposed using copies of The Rails Way as the basis for US currency.

Using Fluid For Convenient Rails Diagnostics » 02 May 2009

I recently got a Macbook Pro and have been quite impressed with it. I have also been doing a ton of work for "BigThink": I'm a tabbed browsing nut, and I discovered that if I ever wanted to get anything done I had to limit my tabs to the point where they all still fit in the bar. This forces me to actually read or act upon the things I have thrown into tabs rather than just letting them simmer. Working on a large Rails project means that I often find myself needing information from trendy websites like Lighthouse, Hoptoad, and New Relic. That's three more tabs towards my limit. Every time I closed those tabs to make more room for normal web browsing, something happened that caused me to have to check them again. Then I discovered that Fluid supports tabs and saves the tab session per application.

The trick is to set up the SSB to one site (in my case New Relic) and then on first run open up tabs and go to the other diagnostic services (Hoptoad and Lighthouse for me). The end result is a full diagnostic panel that pops up whenever you click the icon.

Kubuntu Intrepid Uselessness » 27 Nov 2008

I just don't understand why Kubuntu 8.10 had to switch to a desktop environment that is just not ready in any way for daily use. KDE4 is almost completely unusable on my machine due to the dual monitors and nvidia video card. Many of the features of KDE4 just feel incomplete. It is very pretty, but pretty does not make up for stability. Normally this wouldn't matter, but 8.04.1 causes system death with my wireless card. It all worked so wonderfully over the summer, but now it is broken. I need my Linux to be a stable platform for development as I have a large amount of work that needs to be done. I'm in the process of reinstalling 8.10 with the intent of switching to XFCE. I don't actually like XFCE, but I need something that works. If 8.10 does not solve my system freeze problem, then I'm going to have to switch distros. I don't really have the time for this. I hate it when this happens.

Why Windows, Why? » 29 Oct 2008

Pop Quiz: You are writing an operating system that sometimes needs to restart to install updates. How do you accommodate this?

# Place an icon informing the user of the need to reboot in the system tray.
# Pop up a dialog giving the user the option to restart now or not.
# Do nothing; the restart will happen someday.
# Pop up a dialog informing the user of an imminent reboot. If there is no answer, reboot in 5 minutes. If the user cancels the reboot, pop up the dialog again in 5 minutes. Repeat until the user goes to use the bathroom or get a snack, then go down for reboot while killing all unsaved data with extreme prejudice.

Bonus points if this reboot can knock out a carefully arranged workspace full of consoles and Vim windows. Ensure that your desktop environment also provides no session management.

Just when I start finding Windows usable, it goes and sucks harder. The fact that Ruby on Rails development is a nightmare and a half on Windows further contributes to my foul demeanor. I don't understand how Windows manages to make Ruby _slower_, an outcome I considered impossible. I also find Windows' lack of being UNIX disturbing. That all said, I'm pretty sure Linux won't work on my laptop anyway, and it certainly wouldn't work particularly well (being a tablet and all). I wish things had worked out in a less complex way that would have involved me being able to buy a MacBook (but that is another story entirely). VMWare also boots incredibly slowly on my laptop. My laptop is a 2.5ghz Intel Core 2 Duo with 4gb of RAM, so I can only assume that the 10 minutes of near lockup that occurs every time I press the VM power on button is related to Windows in some way. If I can get VMWare running well, then I will be back in business.

Fog Creek Interview » 28 Oct 2008

Last week I was flown up to New York City by "Fog Creek Software": for an interview for a summer internship. The whole trip was amazing. Fog Creek took great care of me, I like the city, and I got to see one of my cousins while I was there. The actual interviews were grueling, and I didn't get the job, but I still have nothing but the warmest feelings about Fog Creek as a company. This all started about three weeks ago when I sent in my cover letter and resume. I saw Fog Creek as pretty much the ultimate long shot internship (read about the intern perks and the reasons will become clear), so I basically expected to be condescendingly dismissed for even attempting to apply to such an elite position. Within a few days, I had a response asking for a phone screen. The next Monday I was on the phone with a developer being brutalized over data structures and the like. The phone screen was a rather grueling hour. I walked around in a daze for the rest of the day. I felt like I did somewhat acceptably, but figured that this was where my adventure ended. I was just honored to have gotten that far. I woke up the next day to an email inviting me to interview in person at their Lower Manhattan office. After much celebration (including a few misguided attempts at dance), I rushed to Barnes and Noble to pick up a copy of K&R C to study for the interview. My schedule is fairly complex due to the general expectation that I attend class and don't fail exams, but I had a surprising free spot in my schedule after my Sanskrit midterm last Monday afternoon. A week after my phone screen, I was on a small airplane from Charlottesville Airport (CHO) to LaGuardia (LGA). Charlottesville Airport really is a nice airport. The view into the mountains is wonderful. It is a clean, modern, convenient facility. 
My biggest complaint with it is that I scheduled a lot of extra time for dealing with the usual crap of modern air travel, so I had a huge amount of time to just sit around after I picked up a boarding pass and cleared security in under ten minutes. The fall is a quite nice time to fly from the mountains and over the East Coast. I arrived in New York and found a limo waiting for me. I felt pretty pimp. I arrived at the hotel and checked into my beautiful suite with no incident, though I did fear that the jig was up when I was asked my age (18, which is less than 21 which is probably what hotels say they require). I then headed out to Greenwich Village to meet up with my cousin. The next day I got up early, cleaned up, and hit the town. My interview was not until 10:00, so I took the opportunity to explore the financial district. Fog Creek's new office is a block or two away from Wall Street. I slicked back my hair, put on suspenders, and started my corporate raiding because greed, for lack of a better word, is good. After those shenanigans, I headed up to the 25th floor of 55 Broadway to get interviewed. They really have a nice set up with a great view. I met with two people in the office and one guy for lunch. The first interview was on data structures. It took a little while for me to get my brain in gear and I was too slow starting off. I gradually improved as I got into the right mode for it, but I wasn't exceptional at it. A lot of this was my general inexperience with data structures and a lot of it was my nerves. I know the fundamentals of data structures, but I just don't practice with them too much. I get a little lazy and generally hide behind my abstractions. Going forward, I will definitely work on my DS chops as that seems to be a good investment. The second interview was on pointers and recursion. One section of it involved de-obfuscating a bit of C code. Generally I'm pretty good with pointer arithmetic and foolishness, but I was way off that day. 
I think I just got overwhelmed initially and tried to depend on idioms that I only half remembered rather than actual thought. I stumbled through it and mixed up silly things, but I was eventually prodded to recovery. It was completely embarrassing. I beasted the recursion though, so that was decent. For lunch, we went to a nice Italian place, and I had a nice conversation with a member of the Smalltalk cult. I quite fancy Smalltalk myself, but I have not really walked the walk. This conversation convinced me to give GNU Smalltalk a shot. It was actually a really fun lunch. After lunch, we went back to the office and I was told that I was done and could go on my merry way. I pretty much knew that I was still on the job market at this point, but I was only a little bummed. I walked back to my hotel room and watched some Daily Show and Family Guy. Then I hit the streets and just walked around soaking it all in. I managed to, through Brownian motion, make my way from just south of TriBeCa to somewhere in Greenwich Village. I wandered for a while looking at things. At some point my cousin called and I happened to be a few blocks away from his apartment, which was an odd bit of chance. One of my cousin's roommates is an entrepreneur and founder of "BigThink":, which is pretty cool. The next morning, I got into a limo headed to the airport and prepared for life back in Charlottesville. A few days later I got the email from Fog Creek telling me that I didn't get the gig, but I was neither surprised nor disappointed. It was a pretty amazing couple of days and a great trip regardless. I do hope to find a cool internship this summer, but I am not dreadfully worried about it (yet).

New New-Blog » 22 Sep 2008

After setting up my blog with hobix installed on my Dreamhost account yesterday, I realized that I wasn't dreadfully fond of that situation either. Ideally I should be able to write content in the most convenient manner possible and have it somehow get to the interwebs. I'm much happier using my local install of Vi and having full control of all the libs that bake my blog into a delicious website ready for dissemination. I am also happy when my data can survive me randomly overwriting critical things. Given these general desires, I set out to modify hobix to run on my laptop and then, using git and github, get the compiled website to my webserver.

Hobix seemed to operate in a very simple manner that would make it easy to modify, so I forked it on github and got to work. I quickly added in a hook and configuration option to commit and push to the remote repo specified by @blahg@. I then hooked that up to a repo on github. When changes are pushed to github, a post-receive message is sent to a PHP file on my webserver that then performs a @git pull@. All the plumbing is working now.

While I was hacking this together, I ran into some issues that required going deeper into the code of Hobix. I ended up completely deleting the lockfile capabilities and also changing some of the path handling code that screwed up the templates when generated on Windows. While I was going through this, I realized just how much code was there. It seems to be an extremely fully featured and complicated piece of blogging software. I'm not entirely sure how I feel about that. I'm getting somewhat burnt out on blog features. I kind of feel that I should be able to compile my blog using @make@ like god intended. If I get some more free time, I will probably end up trying my hand at a Rake based blog build system. I really like my current setup and I think Hobix is quite the tool. I will continue using Hobix while I think of ways that might make me happier about my blog system.
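The publish-side hook is dead simple; something in the spirit of this sketch (hypothetical method and commit message, the real hook lives inside my Hobix fork and the remote name comes from the @blahg@ config option):

```ruby
# Hypothetical sketch of the post-publish hook: after Hobix bakes the
# site, commit the generated output and push it to the `blahg` remote.
def push_blahg(remote = "blahg")
  commands = [
    "git add -A",
    %Q{git commit -m "publish #{Time.now.strftime('%Y-%m-%d')}"},
    "git push #{remote} master",
  ]
  commands # a real hook would run each of these with system()
end
```

From there, github's post-receive notification hits the PHP file, which runs @git pull@, and the baked site is live.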

New Blog » 21 Sep 2008

Typo angered me for the final time by not working for the last month. I am now trying this Hobix tomfoolery. I will be gradually migrating old postings over to my new blag.

Integrating Ruby-Processing Into An Existing Project » 24 Jun 2008

"Processing": and "Ruby-Processing": are really awesome programs for visualizing things and making pretty doodads. Ruby-Processing is great because it uses the JRuby Java bridge to expose all of Processing's immense power to normal Ruby code. Since I use JRuby at work for data processing anyway, Processing seemed like a natural fit for being braindead easy for drawing. When I went to hack it into my current project, I was sorely disappointed to discover that Java was squawking about SecurityExceptions and signer mismatches when run from @jruby@. First, how is code-signing a first-class language function? Second, this appears to be a "known issue": I really hate Java.

Out of frustration, I decided to hack at it and try to recreate the jar so it would be unsigned and happy. Doing this seemed to work, as I can now use Processing wherever I want in my application. If you want to use Ruby-Processing in your JRuby app, download "core.jar": and "ruby-processing": and place them wherever your other lib files live. I created this jar by unzipping the normal ruby-processing core.jar, removing all metadata, and rebuilding it like

  jar cf core.jar processing/core/*.class

This seems to work and is generating a "color bar": for me now.

Ascends Data Visualization And Further Adventures With Gigantron » 22 Jun 2008

I started work Monday. It's fun to work on an Air Force Base, because your commute gets all sorts of fun things like F-15s and F-22s making their landing approach right over where you are driving. My team is pretty cool so we should be able to get stuff done. When I arrived we hadn't gotten our data yet, so I experimented with the different directions in which my chair could swivel. The other guys on my team know the admin password and have to do IT grunt work. They have also picked up various data processing tasks from other teams. Because of this, I'm the one currently tasked with working on the actual ASCENDS project.

We got the data Wednesday and were pretty shocked by what we saw. The data formats don't seem to correspond in any sort of logical manner. I can't find the sync-points between altimeter/GPS readings and the CO_2 instrument data. Judging from my skimming of the IDL code that was on the DVD with the data, it can't sync them either. There are magic numbers somewhere that we just don't know.

The fallback in the face of ignorance is, of course, plotting what we can and making it pretty in Google Earth. To this end, I'm using gigantron. It's weird how helpful it is for these sorts of tasks. The "ascends_viz": project on github shows the current work. With the exception of a few hitches (JRuby JDBCSqlite3 driver mostly), it has been pretty smooth for data exploration. My teammates also seem to like what they have seen of gigantron.

Work is also demonstrating how annoying Java is. My team lead has been warring with it for days. CLASSPATH is an evil signature of Satan, and JNI and the hdf4 jars are among the signs of the apocalypse. Observing how well Java complicates and obfuscates nearly everything makes me wonder if humanity has committed some form of mortal sin that has caused God to hate us. Write-once run-anywhere might be the cruelest joke ever. Most of the C/C++ that I write is more cross-platform. Java just makes it difficult enough to even get things working on one platform that you can't be dicked into testing it on others and therefore must assume that it works as advertised. I hate Java. Most of the JRuby issues we have had have been Java issues (or Ubuntu trying to use gcj without telling us). JRuby is a really cool implementation.

The one thing that I hadn't noticed when I was setting up the gigantron database stuff was the Work-In-Progress status of the JDBCSqlite3 driver. It seems to work, but it errors out in odd ways when using migration methods. I believe that MySQL will behave much better for our project, and I will try it and benchmark that tomorrow. All in all, it is coming together. Code is being written and I am enjoying myself. Cool technical details or fun KML will be posted when I can.

Summer Work And Gigantron Processor Of Data » 02 Jun 2008

This summer I am again working for "NASA DEVELOP": to create visualizations of carbon flumes over the US to demonstrate some of the possibilities with the proposed ASCENDS mission. Basically, my team will be taking a large amount of random data and a bunch of scientific models and trying to squeeze impressive visuals out of it. We don't really know exactly how things will work out, but there will be a lot of data processing and exploration. Our team will also probably be tasked with helping other teams process data for their projects.

In my time with NASA DEVELOP, I have done a lot of data processing scripting. Most of the time these tasks are pretty straightforward and simple, but things become pains in our asses when we have to go in later and refine and build on these tasks. Generally these scripts are just hacks and not meant to grow. A little bit of organization and encapsulation would go a long way in keeping everything sane. Of course, I'm lazy, so making an organized attempt at data processing with directory structures and buttons not labeled "Shock the Monkey[1]" for a simple script that will take at most 10 minutes to complete and hand test is not something that is going to happen.

Over the last year, I have spent an inordinate amount of time being some form of a Rails fanboy (I also spent some time stalking Alan Kay and preparing a hippie van to follow a Phish tour), so I had seen the benefits of arcane magick in programming. Thinking about my upcoming woes with the maintenance of poorly thought out DP scripts, I realized that I could have my organization without having to give up time that could be spent making references to episodes of Family Guy. Clearly, code generation has my back in this situation. In this spirit, I spent yesterday hacking away with "RubiGen": to create "Gigantron: Processor of Data": It is currently a simple set of generators that create a directory and file structure to allow for organized and tested DP projects.
I was inspired by a blog post I found on reddit about "Organized Bioinformatics Experiments": to ape those techniques for my data processing. Gigantron imports data into SQLite (or any other db you feel like) and accesses it through DataMapper models. The actual logic and transformations that we write for the big bucks go in Rake tasks that operate on these models.

Backing data into an RDBMS and then getting at it through the DataMapper ORM adds some initial leg work and bloats some simple tasks, but it also encourages better treatment of input sources and couples your code and its data much less. Data formats and requirements change, and having to untangle business logic from uncommented, shortcut, good-enough parse jobs is a way to ruin a perfectly good day. The other advantage to this method is that if, for whatever reason (performance), the data must stay in whatever godawful format it was shipped in (damn you HDF), the model abstraction can still make it appear like any other datasource that comes out of our DB. It can be our little secret.

I'm not entirely convinced that Rake is the right golden hammer for this job, but it does make a compelling case for itself. I need to actually try it out in the real world to see if it will work with me rather than against me in production. I think it is the Right Way(TM), but I need usage and feedback to know for sure. That's the other thing. I have written only enough of a framework for it to be something that I would give a try in production. I don't know what works or doesn't, but I have a system that I think will save me some time and can be evolved into something that totally doesn't suck. It is the usage of it that will allow me to find areas of improvement and evolve Gigantron into a true processor of data/destroyer of cities. Hopefully a few gigabytes of real satellite and atmospheric data will help clarify my ideas with it.
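The division of labor looks roughly like this (a plain-Ruby sketch with a Struct standing in for a DataMapper model, since the point is the shape, not the ORM; all names here are made up):

```ruby
# Stand-in for a DataMapper model: one row of imported instrument data.
Reading = Struct.new(:altitude, :co2_ppm)

# Stand-in for a Rake task body: transformations only ever talk to
# models, never to whatever godawful format the data shipped in.
def average_co2(readings)
  total = readings.inject(0.0) { |sum, r| sum + r.co2_ppm }
  total / readings.size
end
```

The import step is the only place that knows about the raw format; everything downstream sees models.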
If this looks like something that could help in some way for your work, I would be tickled pink to have other people using it. I am very interested in adding anything that could help productivity as that is the true concern here. Check it out:

* "Gigantron RubyForge":
* "Gigantron Github":

fn1. I listen to Peter Gabriel while hacking sometimes, sue me.

Mapping The Distraction That Is Wikipedia » 25 Apr 2008

It happens to us all. We go to Wikipedia with the intention of finding some specific information on some specific topic. Several hours later we realize that we are reading about "Sexual Abuse on Pitcairn Island": or something equally unrelated to "High-Energy Particle Physics": or whatever the initial topic was. Oftentimes the articles in between are forgotten and only revealed when one transforms into the smarmy smart ass at the next game of Trivial Pursuit. Much like waking up in a bathtub full of ice, missing a kidney, this loss of time and memory raises unsettling questions about recent events. A rather old XKCD confirmed that I am not the only person experiencing this most curious malady.

I'm tired of the contents of the "3 hours of fascinated clicking" time block being unknown. I think I am reasonably sound of mind, and the connections that get me from point A to point Z on the Wiki would make some sense in context. I might be wrong, but I want to find out with evidence.

I've hacked together a simple hacktempt at graphing this. Basically I have an extremely simple greasemonkey script that runs on Wikipedia pages and captures the current page and the referer. It then runs some AJAX that tells a local mongrel hackjob to update the database of connections. The local mongrel server also has an HTTPHandler (localhost:9999/show) that uses Graphviz to render a fresh hot png to be delivered to the web-browser. This handler also takes a query string with start and end to set the date range of interest. The code is uglier than Fergie on a rainy day, but it works and I find the results to be pretty fascinating. The code is available on "Github": If it amuses you or you have any suggestions let me know.
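The graphing half reduces to collecting (referer, page) edges and emitting Graphviz DOT; a minimal sketch of that piece (names are assumptions, and the real handler also filters edges by the start/end query string):

```ruby
# Turn a list of [from, to] page pairs into a DOT digraph that the
# `dot` command can render to a png.
def to_dot(edges)
  lines = edges.map { |from, to| %Q{  "#{from}" -> "#{to}";} }
  "digraph wiki {\n" + lines.join("\n") + "\n}"
end
```

Feed the resulting string to @dot -Tpng@ and you get the clickstream graph.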

Wal-Mart Sense Isn't On Sale » 17 Mar 2008

For those living in rural regions, the Wal-Mart serves as a center of commerce and culture. Much like the village greens of the old world, Wal-Mart provides a meeting place for the like-minded pillars of community. As one browses through camouflage clothing that serves equally well for hunting as it does for gaining social acceptance at the area schools or peruses the selection of NASCAR memorabilia, one can meet and greet all the movers and shakers of the hamlet. Discount sales of Big Mouth Billy Bass wall mountings and John Wayne posters are guaranteed to bring together the finest ladies and gentlemen. The spring time places this retail location as the center of the thriving debutante culture of the region. Suitors and suitees swarm the aisles for some of that old fashioned summer loving. Verily, this shop among shops provides for all the needs of the community in which it resides. Wal-mart stocks everything needed to maintain a proper and culturally fulfilled existence. One can experience the diversity of the world's cuisine in the comfort of their own home through a thousand varieties of Hamburger Helper, microwaveable burritos, and the finest DiGiorno Pizza. After an enjoyable meal the likes of which could not be replicated in any larger, morally bankrupt community, one can retire to a tastefully decorated sitting room that exudes the patriotism that can only come from having an eagle statuette with a Thomas Kinkade 9/11 with rippling American flag chiaroscuro screen printed on the wings. The soothing melodies of prima uomos like Scott Stapp of Creed and Austin Winkler of Hinder can delicately waft through one's leisure time. The impeccable selection and wide range of goods offered by Wal-Mart is unmatched by other retailers. One can tell the quality of the establishment from the fact that it sells warm PBR, Dale Earnhardt shower curtains, and all material goods in between.
While the commercial utility of the institution cannot be overstated, it serves an important — perhaps even more important — role as the center of society. As the place to see and be seen, one sees the arbiters of fashion and trends carefully lurking around the self checkout lanes, checking people out. It is truly a place for every age. From the children stuffed five to a cart being pushed around by consummate professionals in the field of tanning, to the hormonal teenage males in their pickup trucks desperately clinging on to any shred of female affection sent their way, and the adults in their slightly better pickup trucks, Wal-mart contains representation from every race, creed, color, national origin, and societal level that the average product of our fine educational programs could imagine. To further enhance the socialization, one must intermingle with all these groups at a snail's pace. This is the simple by-product of the legions of ambling socialites who must exert all concentration and effort on merely clearing the sides of the aisles without beaching themselves. Their single minded pursuit of inefficient locomotion is adequately offset by the depth and breadth of their conversations -- which are conveniently related at volume levels that are amicable to the prospect of idle eavesdropping. The conversations cover the range of happenings at both Tunstall AND Dan River. The Wal-mart elite certainly understands the importance of being true to one's school, even if it has not been one's school for the better part of three decades. Without exception, these conversations are exceptionally interesting and relevant to the affairs of a grown citizen. Whether it is the smell of deer urine being used in a last ditch effort to woo the opposite sex in the fall or the feeling of the pursuit of someplace that is open after 8p.m., Wal-mart exudes the ephemeral nostalgia of a better era.
Once one has experienced the Wal-mart scene in all of its infinitesimal[1] glory, one cannot be satiated by lesser means. One cannot overstate the importance of this integral institution to this community or many others. It exists in a manner that enriches and extends the opportunities present in this delightful slice of society. It protects and shelters the surrounding citizens from unwanted experiences in the broadening of culture and knowledge. It insulates the people from the higher quality of other retailers. It ensures the status quo. It creates such a level of comfort, that referencing such a place without a well-positioned definite article elicits cries of blasphemy and heresy. The Wal-mart is the natural progression[2] of this nation's retail organizations. Wal-mart has the buying power to stock more useless stuff cheaper. The only way Wal-mart could become better is if it offered an in-house payday loan office! In one-thousand years, after the moral and societal depravities that we hypocritically denounce cause the collapse of this golden era, anthropologists (and other tweed clad, effete, out-of-touch, elitist, liberal, intellectual types) will look back on the machinations of corporations like Wal-mart and finally understand it for the good that it was.