Thursday, May 7, 2009

Caching rarely updated data with Rails

Let's say you have an external query for some data in your Rails application, like search results from Yahoo or Google, or some feed, and fetching it could take anywhere from 1 to 5 seconds. Of course, it makes sense to cache it, especially if the pages that use this external data are accessed frequently. One obvious solution would be to fragment cache it using the memcache store (see the sketch after this list), but that may not be optimal if some of the following cases apply:

  • you need to cache a LOT of data, like millions of results, and it would take up too much memory;
  • you only need to update this data like once per month, so it doesn't make sense to keep it hanging in memory for that long;
  • you don't want this data to be evicted when memcached runs out of memory, or to lose everything when memcached is restarted;
  • you want to be able to change the look and feel of your external data on the page without expiring all your fragment cache related to this data.
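
For reference, the fragment caching approach being ruled out here would look roughly like this (a minimal sketch; the partial name, cache key, and instance variables are made up for illustration):

<%# config/environments/production.rb: config.cache_store = :mem_cache_store %>
<% cache("external_search/#{@query}") do %>
  <%= render :partial => 'search_result', :collection => @results %>
<% end %>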

That was exactly our case, so I came up with a very simple but effective solution: store everything in a filesystem cache and fetch results through the Rails caching mechanism, wrapped in a small class.

First, we need to be able to auto-expire file cache entries. This can easily be achieved by extending ActiveSupport::Cache::FileStore:

class FileStoreWithExpiration < ActiveSupport::Cache::FileStore

  # FileStore itself ignores :expires_in, so we emulate expiration by
  # comparing the cache file's mtime against the requested lifetime.
  def read(name, options = nil)
    expires_in = options.is_a?(Hash) && options.has_key?(:expires_in) ? options[:expires_in] : 0
    file_path = real_file_path(name)
    # Treat a stale file as a cache miss; fetch will then recompute
    # the value and overwrite the file.
    return if expires_in > 0 && File.exists?(file_path) && (File.mtime(file_path) < (Time.now - expires_in))
    super
  end

end
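
A quick illustration of how the store behaves, say, in script/console (the key and path are made up for the example):

store = FileStoreWithExpiration.new("tmp/cache/demo")

store.write("greeting", "hello")
store.read("greeting", :expires_in => 10.minutes)  # => "hello" while fresh

# Once the file's mtime is older than :expires_in, read returns nil,
# so fetch falls through to the block and rewrites the entry:
store.fetch("greeting", :expires_in => 10.minutes) { "recomputed" }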

Then we just plug this store into our external data fetch wrapper:

require 'digest/md5'

class ExternalSearch
  RESULTS_CACHE = FileStoreWithExpiration.new("tmp/cache/external_searches")

  def initialize(query)
    @query = query
  end

  # Serve results from the file cache, refreshing them at most once
  # every 30 days.
  def results
    RESULTS_CACHE.fetch(cache_key, :expires_in => 30.days) do
      parse_search_results("http://example.com/fetch.xml?query=#{URI.encode(@query)}")
    end
  end

  def parse_search_results(source)
    # Here goes the logic to fetch and parse the required data
  end
end

Now we only need to build a good cache key for our @query. As always, keep in mind that having too many files in a single directory can be a performance killer, so one good approach is to hash the search key and use the first few characters of the digest as a directory hierarchy:

  def cache_key
    # MD5 spreads keys evenly; two 2-character directory levels keep
    # any single directory from growing too large.
    key = Digest::MD5.hexdigest(@query)
    "#{key[0..1]}/#{key[2..3]}/#{key[4..-1]}"
  end
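
For example (the digest shown below is illustrative, not the real MD5 of the string):

ExternalSearch.new('some keywords').cache_key
# => "3f/2a/0c4e5d..." (two short directory levels, then the rest of the digest)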

Then the resulting class can be used as shown below:

data = ExternalSearch.new('some keywords').results

All subsequent queries for 'some keywords' will be served from the cache for 30 days after this first call.
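
One caveat: expired entries are only ignored on read and overwritten on the next fetch; they are never deleted on their own, so queries that stop coming in leave stale files behind. A hypothetical Rake task (not part of the setup above) could prune them periodically:

namespace :cache do
  desc "Delete external search cache files older than 30 days"
  task :prune_external_searches => :environment do
    Dir.glob("tmp/cache/external_searches/**/*.cache").each do |path|
      File.delete(path) if File.mtime(path) < 30.days.ago
    end
  end
end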
