To Hell with UTF-16 ".strings" Files

For years, we've followed Apple's documented recommendation that .strings files should be saved in UTF-16 format:

Note: It is recommended that you save strings files using the UTF-16 encoding, which is the default encoding for standard strings files. It is possible to create strings files using other property-list formats, including binary property-list formats and XML formats that use the UTF-8 encoding, but doing so is not recommended. For more information about Unicode and its text encodings, go to http://www.unicode.org/ or http://en.wikipedia.org/wiki/Unicode.

But this has caused so many headaches that I finally decided it's time to defy the recommendation and go UTF-8 for our .strings files.

UTF-8 is an encoding that is consistent across all machines and byte-order architectures.  There is never any ambiguity.  It also works pretty well for file comparisons, especially when most of the characters are in the Roman alphabet. Yeah, maybe it's not as efficient at storing languages whose characters need three bytes instead of UTF-16's two, but that's not really a big concern.
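Here's a quick Python sketch of the byte-level difference (just an illustration, not anything from our build):

```python
# One character, one UTF-8 byte sequence -- but two possible UTF-16
# byte orders, plus an optional BOM on top.
s = "é"
print(s.encode("utf-8").hex())      # c3a9 -- identical on every machine
print(s.encode("utf-16-le").hex())  # e900
print(s.encode("utf-16-be").hex())  # 00e9
print(s.encode("utf-16").hex())     # BOM plus the machine's native byte order
```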

UTF-16, on the other hand, has turned into a big nightmare for us. We get localizations from people around the world (some using old PowerPC machines, some using Intel machines). It's never clear whether we are going to get:

  • Little-endian, without the Byte Order Marker (BOM), a.k.a. UTF-16LE
  • Big-endian, without the BOM, a.k.a. UTF-16BE
  • Little-endian, with FF FE BOM, a.k.a. UTF-16 sometimes
  • Big-endian, with FE FF BOM, a.k.a. UTF-16 other times.
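Telling those four apart is mechanical, at least. Here's a rough Python sketch of the guessing game (BOM first; failing that, a heuristic that exploits the fact that .strings files are mostly ASCII, so in UTF-16 every other byte is usually NUL):

```python
def sniff_utf16(data: bytes) -> str:
    """Guess which UTF-16 flavor a mostly-ASCII .strings file uses."""
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE with BOM"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE with BOM"
    # No BOM: for ASCII text, the NUL halves of each code unit land on
    # odd byte offsets in little-endian, even offsets in big-endian.
    if data[1:20:2].count(0) > data[0:20:2].count(0):
        return "UTF-16LE, no BOM"
    return "UTF-16BE, no BOM"
```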

It's amazing how many times I've gotten an update from a localizer, and when I try to merge it into our source base, it looks like everything has changed. Opening the file with SubEthaEdit (the best text editor I've found for dealing with character encoding issues), I often find that the text looks like a dense bunch of oriental characters. (The Japanese call this mojibake.) Then I have to fiddle with the encoding for a bit until it's a consistent, um, little-endian UTF-16. After re-saving, I can finally find out whether any contents of the file really were changed.

No more! I've converted every .strings file to UTF-8! Everything seems to work even though I have dared to stand up to the forces of UTF-16.

The crazy thing is, I had to write my own little Cocoa tool to convert the strings files to UTF-8. I call it "UTF8ifier".  The standard command-line utility, iconv, has never been particularly useful for this. My utility, on the other hand, uses +[NSString stringWithContentsOfFile:usedEncoding:error:] for reading, and if the encoding wasn't already UTF-8, it writes the file back out as UTF-8.
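The real tool is Cocoa, but the logic is simple enough that a Python sketch captures it (BOM detection standing in for NSString's encoding guess; `utf8ify` is just my illustrative name here):

```python
import codecs

def utf8ify(path: str) -> bool:
    """Rewrite a UTF-16 .strings file as UTF-8; return True if converted."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")
    elif b"\x00" in data:  # BOM-less UTF-16: ASCII text is full of NULs
        order = "le" if data[1:2] == b"\x00" else "be"
        text = data.decode("utf-16-" + order)
    else:
        return False  # assume it's already UTF-8 (or plain ASCII)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True
```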

Have I committed a cardinal sin here by using UTF-8?  Or is this something that other developers have done? Any interest in making UTF8ifier public?

Update: @bwebster commented on Twitter that Xcode now converts strings files to UTF-16 in the built product, so that means I shouldn't have to worry about manipulating and storing them in my source tree as UTF-8. Another good reason to use UTF-8 in source files is that GitHub can actually show differences. (If they are UTF-16, it treats the files like binary blobs!)

There's one more hoop I realized I have to jump through. When we build an archive, we run genstrings, which sadly has no option for output in UTF-8. So I think I need to add a line to all of our projects, after the genstrings step (as these guys do), to convert the UTF-16 output to UTF-8, so that only legitimate changes get checked back into our source tree.
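That extra step could look something like this sketch: sweep the genstrings output directory and rewrite any BOM-prefixed UTF-16 .strings files as UTF-8 (the function name and directory layout are just examples):

```python
import codecs, glob, os

def utf8ify_strings_dir(directory: str) -> int:
    """Convert genstrings' UTF-16 output to UTF-8; return the count converted."""
    converted = 0
    for path in glob.glob(os.path.join(directory, "*.strings")):
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            with open(path, "w", encoding="utf-8") as f:
                f.write(data.decode("utf-16"))  # the BOM supplies the byte order
            converted += 1
    return converted
```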

I guess I should file a Radar, requesting that genstrings be able to output in UTF-8.

Update 2: @cocoanetics and @0xced suggested DTLocalizableStringScanner, an open-source rebuild of genstrings.  I've added a "-utf8" output option in a branch.

© Dan Wood 2012-2016.