Dan Wood: The Eponymous Weblog (Archives)

Dan Wood is co-owner of Karelia Software, creating programs for the Macintosh computer. He is the father of two kids, lives in the Bay Area of California USA, and prefers bicycles to cars. This site is his older weblog, which mostly covers geeky topics like Macs and Mac Programming. Go visit the current blog here.

Useful Tidbits and Egotistical Musings from Dan Wood

Wed, 20 Dec 2006

Karelia's technique for building "Apple Help" files

One of the cool new features of Sandvox 1.1 is that we now have an Apple Help book.

Before, choosing "Sandvox Help" from the Help menu would just direct the user to view our online wiki. (Some people have wondered why we didn't choose Sandvox to build our help. The answer is that although it is technically possible, Sandvox just isn't the right tool for that kind of content.) We had originally chosen a wiki (MediaWiki) as the authoring system for our help because it is good at managing content; just by creating a link to a new page, it shows up on a list of orphan pages, so there's an automatic to-do list as you start authoring pages. It's pretty easy to create a site without links to nonexistent pages this way. During the early stages, we were also able to rely on some user contributions to the site. As we started dedicating some resources to building the help, we were able to edit the site "live" so that the help got better and better as time went by.

At some point, we decided that we were ready to start documenting features that were not part of the current 1.0.x release. To do that, we made a clone of the wiki and put it on a new subdomain, where only the official authors of the site (myself, Terrence, and Mike) could view and edit it. A private wiki seems kind of strange, but we were documenting unreleased software, and I didn't want to confuse users of the released version.

Many users were probably more confused by being directed to a wiki than helped by it. There is a lot of extraneous information on a MediaWiki page that a help reader really doesn't need. So I decided to work on exporting the wiki to a simpler page look.

I built up some shell scripts to automate the process. The first step is to download the wiki, as rendered as HTML, onto the local computer. The second step, the bulk of the work, is to clean up the HTML so that only the essential content remains. The third step just merges the cleaned-up HTML into our source tree so it will be part of the application.

Getting the HTML, the first step, is fairly brute-force. There might be some cleverer way to extract information from the MediaWiki database, but I wanted something simple. It boils down to a single wget command (simplified here just a bit):

wget --domains=private_domain --level=2 \
	--no-parent --convert-links --html-extension \
	--recursive --reject "*\?*" \
	http://private_domain/Special:Allpages

wget is a great little utility; it's too bad it's not included with Mac OS X by default. One thing that I find annoying is that it seems to match the reject patterns after downloading a URL, rather than before. So it uses up a lot of time and bandwidth on the many special links on a MediaWiki page that I don't actually want. The download takes a while, but at least it works for putting a static copy of the main pages of the wiki onto my local system, ready for processing.

The second step is just a big shell script that operates on the files. It performs the following tasks:

  1. Copy the downloaded wiki into a new directory for editing the files (so that I don't have to re-download the original files if my script isn't quite right)
  2. Remove certain pages and directories that I don't want for the user documentation (developer & designer pages, mediawiki "meta" pages, the top-level pages that are replaced in the Apple Help pages, etc.)
  3. Loop through all the pages and build up an index page, properly taking redirect pages into account
  4. Repair all links to "redirect" pages to point to the proper target pages
  5. Using the sips utility, try converting the PNG files from the wiki to JPEG 2000 format. Only the files that are actually smaller are kept as JP2. (We also have a hand-maintained blacklist of files to exclude from this process because some of the images look terrible when converted to JP2.) Overall, this technique shrinks the images from 8.4 megabytes to 4.3 megabytes!
  6. The remaining PNG files are run through optipng to shrink them down as much as possible. This shrinks the files down just a bit more.
  7. Using perl -pi -e 's/source/destination/g' ..., do a bunch of substitutions on the .html files to remove the junk we don't want, move keywords that we have explicitly defined at the bottom of each page into the <meta> tags, etc. There are actually a lot of sub-steps here that I won't get into, but I will note that in order for this to really work, I first changed \n to \r so that it would essentially treat the file as a single line of text. I just couldn't get perl to do its substitution across \n line breaks.
  8. Run tidy on the pages so the HTML is readable
  9. Do a check for dead links.
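
For what it's worth, the cross-line substitution problem in step 7 can usually be handled without swapping \n for \r. Here is a minimal sketch, not the script we actually use: perl's -0777 switch reads each file as a single record, and the /s modifier lets "." match newlines. The strip_footer name and the <div id="footer"> marker are made up for illustration, and the non-greedy match would stop too early if the block contained nested divs.

```shell
#!/bin/sh
# Sketch only: remove a multi-line block from HTML files in place.
# Assumption: the unwanted block is a <div id="footer">...</div> with no
# nested <div> inside it (a hypothetical marker, not from the real wiki).

strip_footer() {
    # -0777  slurp the whole file as one record, so the pattern can span \n
    # -p -i  loop over each file and write the result back in place
    # /s     let "." match newline; ".*?" stops at the first closing </div>
    perl -0777 -pi -e 's{<div id="footer">.*?</div>\s*}{}gs' "$@"
}
```

Calling strip_footer on one or more .html files edits them in place, so it is worth running it on the working copy from step 1 rather than on the original download.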

The final step merges the edited files into our source tree, taking care not to clobber Subversion tags; it also leaves alone the hand-built, static pages (such as the initial page and "Discover Sandvox"). With the CSS there, the final web site looks and behaves a lot like many of the other help books that come from Apple.
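
A rough sketch of that kind of merge follows; it is not our actual script. It assumes that the Subversion data to protect lives in .svn directories inside the source tree, and that the hand-built pages are listed one relative path per line in a skip-list file. The merge_help name and the skip-list file are hypothetical. Copying file by file, rather than replacing whole directories, is what keeps the existing .svn directories untouched.

```shell
#!/bin/sh
# Sketch only: copy cleaned pages into a working copy one file at a time.
# Assumptions (not from the original script): Subversion metadata lives in
# .svn directories, and hand-built pages are named in a skip-list file.

merge_help() {
    src=$1 dest=$2 skiplist=$3
    # List every regular file in the cleaned export, as a relative path.
    ( cd "$src" && find . -type f ! -path '*/.svn/*' ) | sed 's|^\./||' |
    while IFS= read -r rel; do
        # Leave the hand-maintained pages alone (exact whole-line match).
        if grep -Fqx -- "$rel" "$skiplist"; then
            continue
        fi
        # Create missing directories, then overwrite or add the one file.
        mkdir -p "$dest/$(dirname "$rel")"
        cp "$src/$rel" "$dest/$rel"
    done
}
```

Something like merge_help cleaned_html path/to/help_folder skip.txt would then update the generated pages and leave everything else in the destination as it was. Files that were deleted from the wiki are not removed by this sketch.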

I've replicated this process on our own server as well, so that the website docs.karelia.com also contains our help pages. There are a few differences with the home page, and we don't remove the special categories such as the Developer and Designer pages. This allows us to link to specific pages on the web when communicating with Sandvox users who need help. (Now if Google would just index docs.karelia.com, that website would be searchable as well!)

On the pages for Apple Help, the trickiest part was getting everything just right so that Apple Help would do the right things: list Sandvox in its list of applications, show the correct initial page, be searchable, and so forth. The documentation for Apple Help is sorely lacking (actually, this "preliminary" document from 2004 is probably the most useful), but their mailing list fills in the gaps.

OK, this post had nothing to do with Cocoa. Sosumi.