Monday, August 25, 2008

Improving HAML performance

I started using HAML about 10 months ago and have really gotten used to it. It had some performance problems until version 1.8, but it has performed pretty well ever since.

Anyway, I decided to investigate whether there are any options that can improve HAML performance a bit. One of them is the so-called 'ugly' mode, in which HAML doesn't try to indent the output according to its nesting level. It's turned off by default, which makes the 'View Source' output easier to read. But no one really needs that in production, so let's play with it.

Here I use integration tests to benchmark the Rails application's performance (influenced by this great article). The benchmarked page relies heavily on partial rendering to represent hierarchical navigation, so it's a good candidate to play with.


oleg-desktop ~/Projects/grecipes/dev(dev) $ ruby test/integration/index_performance_test.rb
Loaded suite test/integration/index_performance_test
time: 1.020378947258 ± 0.206319272827473
memory: : allocated: 11714K total in 276027 allocations, GC calls: 2, GC time: 182 msec
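
The "time" line above is an average over several runs plus its spread. A minimal sketch of that kind of measurement in plain Ruby follows; it is not the harness from the article I used. The helper names `mean_and_stddev` and `measure` are made up for illustration, and I assume a population standard deviation, which may differ from what the real test reports.

```ruby
require 'benchmark'

# Hypothetical helper: returns [mean, population standard deviation]
# for an array of numeric samples.
def mean_and_stddev(samples)
  mean = samples.sum / samples.size.to_f
  variance = samples.sum { |s| (s - mean)**2 } / samples.size
  [mean, Math.sqrt(variance)]
end

# Hypothetical helper: runs the given block `runs` times and returns
# the mean and standard deviation of the wall-clock time in seconds.
def measure(runs = 10)
  times = Array.new(runs) { Benchmark.realtime { yield } }
  mean_and_stddev(times)
end

# Usage: time some throwaway work and print it in the same style as above.
mean, stddev = measure(5) { 10_000.times.map { |i| i * i } }
puts "time: #{mean} ± #{stddev}"
```

With a spread this large compared to the mean, it is worth comparing runs with and without a change over several samples rather than trusting a single number.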

With these lines added to environment.rb:

# Skip HAML's output indentation in production-like environments
if ENV['RAILS_ENV'] == 'production' || ENV['RAILS_ENV'] == 'triage'
  Haml::Template.options[:ugly] = true
end

oleg-desktop ~/Projects/grecipes/dev(dev) $ ruby test/integration/index_performance_test.rb
Loaded suite test/integration/index_performance_test
time: 0.923057436943054 ± 0.146305060993389
memory: : allocated: 11714K total in 276033 allocations, GC calls: 2, GC time: 227 msec

Roughly 10% faster on average: not much, but still a good result for a one-liner! Of course, the next thing to battle will be those 2 GC calls, but that's another story.

Monday, August 18, 2008

Mechanize and memory consumption

Recently I was writing a small scraping script for one of my projects. The whole idea is simple: there is a bunch of categories on the main page of the source site, each category leads to a list of subcategories, and each subcategory contains a paginated list of entities to scrape. We needed only the URLs of these entities in order to retrieve the actual contents later.

The problem didn't seem very hard at first sight. After all, we have Mechanize, the great Ruby library for automated web surfing.

My first version actually worked. The only issue was finding the proper elements on the page, which was solved with some XPath goodness (thanks, Hpricot). In the first couple of minutes progress was nice: about 30000 freshly scraped URLs. But then the problems appeared.

I noticed that my machine's performance became very poor. Htop showed me that my scraping.rb was the memory consumption leader: even Firefox took 2nd place with a humble 330 MB, while the scraping process used 500 MB (and growing...). So I started investigating.

Fortunately, I found the solution pretty soon. Mechanize tries to remember all visited pages (to return to them later or to check whether some page was already visited). The default behaviour is to keep an infinite history. You can imagine how a history of 30k+ items hurt the script's performance and increased its memory consumption. So setting max_history = 1 helped a lot.
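
To show why the cap matters, here is a toy model of a page history in plain Ruby. It is only a sketch of the idea, not Mechanize's actual implementation: the `ToyHistory` class and the example.com URLs are made up for illustration, and real pages are far heavier than the short strings used here.

```ruby
# Toy model of a browsing history; NOT Mechanize's real implementation.
class ToyHistory
  attr_reader :pages

  # max_size = nil means "keep everything", like an unlimited history.
  def initialize(max_size = nil)
    @max_size = max_size
    @pages = []
  end

  # Remember a page, dropping the oldest entries once the cap is exceeded.
  def push(page)
    @pages << page
    @pages.shift while @max_size && @pages.size > @max_size
    self
  end
end

unbounded = ToyHistory.new
bounded   = ToyHistory.new(1)

30_000.times do |i|
  page = "http://example.com/page/#{i}" # hypothetical URL standing in for a fetched page
  unbounded.push(page)
  bounded.push(page)
end

puts "unbounded history: #{unbounded.pages.size} pages" # prints 30000
puts "bounded history:   #{bounded.pages.size} page(s)" # prints 1
```

Every entry in the unbounded history stays reachable, so the garbage collector can never free it. The bounded one only ever holds on to the latest page.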

Remember: whenever you use Mechanize on a lot of pages and you don't need any information about previously visited pages, don't forget:

agent = WWW::Mechanize.new  # plain Mechanize.new in later Mechanize releases
agent.max_history = 1