Dan Wood: The Eponymous Weblog (Archives)

Dan Wood Dan Wood is co-owner of Karelia Software, creating programs for the Macintosh computer. He is the father of two kids, lives in the Bay Area of California USA, and prefers bicycles to cars. This site is his older weblog, which mostly covers geeky topics like Macs and Mac Programming. Go visit the current blog here.

Useful Tidbits and Egotistical Musings from Dan Wood

Categories: Business · Mac OS X · Cocoa Programming · General · All Categories

Sun, 28 Jan 2007

File Caching Schemes

I recently added an on-disk data cache in order to speeed up a slow aspect of my application. I've come up with caching solutions before, but I thought I'd take a fresh look at the problem.

If you look in the contents of ~/Library/Caches/ you will see that a lot of applications have some sort of cache in there. It turns out that many of them are actually just generated by NSURLRequest.

Figuring that Apple probably has given some thought to cache file organization, I took a close look at the structure of a typical application's cache folder.

A typical cache file sub-path is something like this:

13/02/4281199315-1410279211.cache

It appears that the first two directories are given a name of decimal 00 through 15; the actual file name is comprised of two decimal thirty-two-bit unsigned integers.

Therefore, each cache entry is indexed by, essentially, 2^72 or a nine-byte number. A hash function that reduces to 72 bits is obviously going to have a fair number of collisions, so the cache file needs to store the source URL in the file itself. I'm guessing that a cache file of this type stores multiple data for each hash value, rather than just punting if a hash doesn't match, but I haven't tried digging into the binary format to confirm this.

What I did for my needs was to start out similarly to Apple's approach, creating two levels of directories based on the first eight bits of the hash value. I'm not that familiar with the intracacies of file systems, but I am guessing that Apple used this technique for a reason. (If any reader has any further insight on this, I'd be curious to learn more about it in the comments.)

For the hashing, I decided to use an SHA1 hash on the source URL, which hashes to 20 bytes (160 bits), way less likely to have a collision than Apple's 72-bit hash. (So much so, in fact, that I just didn't deal with the possibility of collision; it's not that important. I just store the actual binary data in the cache file.) The first byte of the hash determines the two enclosing folders, the last nineteen bytes get converted to hexadecimal for the long file name. So a typical path looks like this:

05/07/f1e53115eb37b4e4df9e1879bc530daa65666e.tiff

I'm not married to this particular approach; in fact it's not being used in any production code yet. I'm curious if any readers have come up with similar or different approaches, or even found (or written) a nice API around the notion of cache files. I'm happy to throw away this technique if somebody else has come up with a better solution!

Update: Just to clarify, the reason I'm coming up with my own caching scheme is that the data that I want to cache do not come from URL requests. They are generated thumbnails of images and quicktime movies. It would be great if apple exposed something that would allow me to use their system but with my own data.