Recently I was writing a small scraping script for one of my projects. The idea is simple: the main page of the source site has a bunch of categories, each category leads to a list of subcategories, and each subcategory contains a paginated list of entities to scrape. We needed only the URLs of these entities, to retrieve the actual contents later.
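The crawl structure described above can be sketched as three nested loops. The fetch_* helpers below are hypothetical stand-ins for the real HTTP and parsing code; here they just return canned data so the shape of the loop is visible:

```ruby
# Structural sketch: categories -> subcategories -> paginated entity lists.
# All fetch_* helpers are made-up stubs, not the real scraping code.
def fetch_categories
  ['books', 'music']
end

def fetch_subcategories(category)
  ["#{category}/new", "#{category}/used"]
end

# Returns one page of entity URLs, or an empty list when pagination ends.
def fetch_entity_urls(subcategory, page)
  page < 2 ? ["http://example.com/#{subcategory}/item-#{page}"] : []
end

urls = []
fetch_categories.each do |category|
  fetch_subcategories(category).each do |subcategory|
    page = 0
    loop do
      batch = fetch_entity_urls(subcategory, page)
      break if batch.empty?
      urls.concat(batch)
      page += 1
    end
  end
end

puts urls.length  # 8 entity URLs collected from the stub data
```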
The problem didn't seem very hard at first sight. After all, we have Mechanize, a great Ruby library for automated web browsing.
My first version actually worked. The only tricky part was finding the right elements on the page, which was solved with some XPath goodness (thanks, Hpricot). For the first couple of minutes progress was nice: about 30,000 freshly scraped URLs. But then the problems appeared.
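To illustrate the XPath step on its own: the real script used Hpricot through Mechanize (pages respond to search with an XPath expression), but the same idea can be shown standalone with REXML from Ruby's stdlib. The markup and the entity class name here are made up:

```ruby
require 'rexml/document'

# A static fragment standing in for a fetched page; the "entity" class
# is a hypothetical marker for the links we want to collect.
html = <<-HTML
<ul>
  <li><a class="entity" href="/items/1">First</a></li>
  <li><a class="entity" href="/items/2">Second</a></li>
  <li><a class="nav" href="/page/2">Next</a></li>
</ul>
HTML

doc = REXML::Document.new(html)
urls = []
# Select only the links marked as entities, skipping navigation links.
REXML::XPath.each(doc, "//a[@class='entity']") do |link|
  urls << link.attributes['href']
end

puts urls.inspect  # ["/items/1", "/items/2"]
```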
I noticed that my machine's performance had become very poor. Htop showed that my scraping.rb was the memory-consumption leader: even Firefox took second place with a humble 330 MB, while the scraping process used 500 MB (and growing...). So I started investigating.
Fortunately, I found the solution pretty quickly. Mechanize remembers every visited page (so it can return to them later or check whether a page was already visited), and the default behaviour is to keep an unlimited history. You can imagine how a 30,000+ item history hurt the script's performance and inflated its memory consumption. Setting max_history = 1 helped a lot.
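Conceptually, max_history caps the page history like a bounded buffer: the oldest entry is discarded once the limit is reached, so memory use stays flat no matter how many pages you visit. A toy illustration of that idea (not Mechanize's actual internals):

```ruby
# Toy model of a bounded page history: pushing past max_size evicts
# the oldest entries, keeping memory use constant.
class BoundedHistory
  def initialize(max_size)
    @max_size = max_size
    @pages = []
  end

  def push(page)
    @pages << page
    @pages.shift while @pages.length > @max_size
  end

  def size
    @pages.length
  end
end

history = BoundedHistory.new(1)
30_000.times { |i| history.push("http://example.com/page/#{i}") }
puts history.size  # 1 -- versus 30,000 entries with an unbounded history
```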
Remember: whenever you use Mechanize, especially across a lot of pages, and you don't need any information about previously visited pages, don't forget:
agent = WWW::Mechanize.new
agent.max_history = 1