Do You Want Coffee with That Mojibake?
Character encodings and CFMX
By: Paul Hastings
Apr. 13, 2004 12:00 AM
This is the second in a series of articles on globalizing ColdFusion MX (CFMX) applications. This article examines character encodings and CFMX, BIDI (bidirectional text), the use of Cascading Style Sheets (CSS) in application globalization (G11N), and why we should all just use Unicode. Space is limited, so I'm going to assume that you've read the first article, which covered globalization concepts and terminology.

No, mojibake isn't a new kind of Krispy Kreme donut. Mojibake (文字化け) is the Japanese term for the garbled characters you see when text is displayed using a character encoding other than the one it was written in.

Why do we have to worry about these sorts of things? Pornography aside, text is by far the most commonly used data type in Web applications. It should be obvious, then, that it's critical that people are able to understand the content your Web application is delivering (otherwise what's the point?). The key to this is making sure Web applications, Web servers, database back ends, and users' browsers are all in agreement regarding character encoding. With that in mind, the purpose of this article is to:
What Are Character Encodings?

Characters, Glyphs, and Other Sea Creatures

A character repertoire is simply a set of distinct abstract characters; some folks refer to this as a "character set." In practice, a character repertoire usually corresponds to an alphabet (your ABCs) or a symbol set (musical notation, for instance). Note that a character repertoire can contain characters that look the same in some presentations, such as Latin uppercase A and Cyrillic uppercase A, but which are in fact logically distinct. Once again, you need to separate the way a character looks from what it actually represents.

A character code is a mapping from a set of abstract characters to a set of nonnegative (but not necessarily consecutive) integers - the abstract character made real to computers, if you will. Each abstract character's mapped integer is called its "code point." For example, in Unicode the code point for "A" is 65; the code point for the first letter of the Thai alphabet, ko kai (ก), is 3585 (U+0E01).

A character encoding is a method or algorithm for presenting characters in a digital form by mapping sequences of code numbers of characters to sequences of bytes. For example, in the MS-874 (Thai) encoding, ko kai (ก) is stored as the single byte 0xA1.

The visual representations of characters are called glyphs. You need to understand that text presentations, such as fonts, are applied to glyphs and not to the abstract characters. A font is a collection of glyphs. In practical terms, a font is a numbered set of glyphs, the numbers corresponding to code positions of the characters (represented by the glyphs). A font, at least in this sense, is entirely dependent on character code. It's this dependence that often causes the appearance of boxes (□) or other strange characters in text streams - a browser's fonts simply can't render the requested character code because it's not in that font or, more rarely, because it violates some display rule. It's therefore important to fully understand which character encodings are contained in which fonts.
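The difference between a code point and its encoded bytes is easy to see in a few lines of code. The following is a minimal Python sketch (Python is used purely for illustration; it is not CFMX code). The sample characters are chosen for this example, and `cp874` is Python's codec name for the MS-874 Thai code page.

```python
# A code point is the integer assigned to an abstract character.
latin_a = "A"          # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # CYRILLIC CAPITAL LETTER A: same look, different character
ko_kai = "\u0e01"      # THAI CHARACTER KO KAI, first letter of the Thai alphabet

print(ord(latin_a))     # 65
print(ord(cyrillic_a))  # 1040
print(ord(ko_kai))      # 3585

# An encoding turns that code point into bytes, and different encodings
# produce different bytes for the same character.
thai_codepage_bytes = ko_kai.encode("cp874")  # single-byte Thai code page
utf8_bytes = ko_kai.encode("utf-8")           # multi-byte Unicode encoding

print(thai_codepage_bytes.hex())  # a1
print(utf8_bytes.hex())           # e0b881
```

Note that the two "A"s compare as unequal even though most fonts draw them identically, which is exactly the look-versus-meaning distinction described above.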
Test, don't simply assume.

A script is a collection of related characters required to represent text in a particular language, for instance, Latin, Greek, Thai, Japanese, or Arabic. Note that one script might also be used in several languages. For example, the Arabic script is used to write Pashto, Urdu, Farsi (Persian), and of course Arabic.

A writing system is composed of a set of characters from one or more scripts that are used to write a particular language. A writing system also includes the rules that govern character presentation. For example, Thai has what are affectionately referred to as "jumping vowels."

Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language, even though a language setting is quite distinct from character issues. There are, however, some more or less "natural" relationships between languages and character encodings. Table 1 shows a partial list of these. There are several things to note from Table 1:
Common Character Encoding Issues

Variety also means choice. Languages with more than one character encoding are especially troublesome, as it's generally impossible to forensically determine which encoding was originally used. You can end up with text data encoded in one character encoding but displayed in another. This happens quite often with text data that has passed through many sets of hands, losing the original character encoding metadata along the way. It can also occur when no character encoding "hint" is included in a Web page and a browser's default doesn't match the original character set.

In HTML the hint is normally provided by the charset parameter of the HTTP Content-Type header, which a page can also declare with a META tag:

<META http-equiv=Content-Type content="text/html; charset=caveDwellingCodePage">

where caveDwellingCodePage is the character encoding you require. This should be declared as early as possible in the header section of your Web page. You should also note that the W3C has chosen to use charset as a synonym for character encoding. For XHTML compliance you would write the tag in lowercase, quote every attribute value, and add a slash to the end of the tag:

<meta http-equiv="Content-Type" content="text/html; charset=caveDwellingCodePage" />

While CFMX will happily ignore this meta-header, I would still urge you to include it for the sake of spidering robots and other content-indexing programs, as well as accessibility software. It also provides a hardcoded artifact as to what the original character encoding intentions were. In CFMX the CFPROCESSINGDIRECTIVE, CFCONTENT, and CFHEADER tags (and the SETENCODING function) provide this hinting. XML hinting is usually done with an encoding pseudo-attribute in the XML declaration at the start of a document:

<?xml version="1.0" encoding="UTF-8" ?>

There is also another, more subtle, character encoding pitfall. Some character encodings masquerade as related but when examined in detail are in fact not related.
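To see how text encoded in one character encoding ends up displayed in another, here is a minimal Python sketch of a lost encoding hint. The sample string, the choice of `cp874` (Python's name for the MS-874 Thai code page) as the original encoding, and ISO-8859-1 as the wrongly assumed one are all assumptions made for this illustration.

```python
# "Thai language" written in Thai; an illustrative sample string.
original = "ภาษาไทย"

# The author saves the text using the Thai code page.
stored_bytes = original.encode("cp874")

# The encoding hint is lost along the way, so the reader's software
# falls back to ISO-8859-1 and decodes the very same bytes differently.
garbled = stored_bytes.decode("iso-8859-1")  # mojibake: 'ÀÒÉÒä·Â'
print(garbled == original)  # False

# The bytes themselves were never damaged. Decoding them with the
# encoding that was actually used recovers the text.
recovered = stored_bytes.decode("cp874")
print(recovered == original)  # True
```

The recovery step only works because this sketch knows the original encoding. Without that hint the same bytes are equally "valid" in many code pages, which is why the metadata matters so much.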
For example, the Windows Latin-1 character encoding (Windows-1252) is quite often mislabeled by Web developers as ISO-8859-1 on the Internet, but in actual fact it is a superset of ISO-8859-1. The extra characters provided by the Windows superset will confuse browsers that actually treat the text as ISO-8859-1, whether you told them to via charset hinting or they simply used it as a default character encoding. It's not just the Windows OS; the Mac OS also has a few similar issues. Its Roman character encoding is quite often labeled as ISO-8859-1 even though it predates that ISO encoding by several years. It does not have exactly the same character repertoire, and many of the characters it does share with ISO-8859-1 actually have different code points. Even the Mac Latin-1 or Mac Mail character encoding, an attempt at aligning the Mac OS Roman repertoire with ISO-8859-1, is not quite equivalent, but it is very often labeled as if it were.

Finally, as most of the character encodings listed in Table 1 are codepage encodings and can contain only 256 code points, you cannot mix languages within the same text stream, as these encodings overlap (commonly in the last 128 code points). If you think this isn't a common occurrence, just look at this article; so far it has mixed Japanese, Thai, and English. Another point to consider, from an I18N perspective, is the good practice of allowing users to manually swap languages and of showing each language choice in that language (Thai written in Thai script, for example). It's the little things that count, after all.

BIDI Concepts

First off, why is it bidirectional? Aren't Arabic and Hebrew scripts written in just the one direction? No, actually they're not. Numbers embedded in these scripts are in fact written LTR just as in Western European text (the most significant digit is first or left-most; 100 is not written as 001 in BIDI scripts) even though the remainder of the text is written RTL.
You will very often also find languages written in LTR scripts mixed in with RTL scripts (transliteration of proper or place names can be confusing and sometimes impossible; more often than not these aren't localized and are simply dumped "as is" into the RTL text stream). This is what makes the whole page BIDI. For further spice, note that in Arabic, mathematical expressions are written RTL, even though numbers within the equations are still written LTR. As you can see, BIDI text handling is quite complicated, so much so that Flash and several other products still don't properly handle BIDI text at all or need to handle it as a special case (PDF document creation, for example, with the excellent iText Java PDF library, www.lowagie.com/iText).

Next, in what scripts (recall that languages don't have direction) would you normally write BIDI text? Table 2 shows a few examples, though you will most likely encounter only Arabic and/or Hebrew (the rest are included mainly to show off how well-read the author is). Table 3 shows a larger list of commonly localized languages, their scripts, and the script's direction.

Ideograph languages (Chinese, Japanese, Korean, or CJK, for instance) are often quite "flexible" in their script's direction. For the most part these are written LTR or TTB (top-to-bottom); you might also find them written RTL (and very often when TTB). Chinese-language newspapers are classic examples of this kind of directional elasticity: one page may combine LTR, TTB (with the vertical columns RTL), and RTL text. Makes my head spin.

In Arabic and Hebrew scripts there are three conventions for the order in which text is encoded, two of which are most commonly encountered:

In HTML the DIR attribute specifies the base direction (LTR, RTL) of directionally neutral text (which Unicode defines as text not having an inherent directionality). For example <HTML DIR="RTL">.
You can also specify direction for several other HTML elements, including <TABLE>, <BODY>, <P>, etc. Tex Texin's Web site (www.i18nGuy.com/markup/right-to-left.html) has an excellent set of tips for writing RTL text in markup which can be summarized as: </div>) instead of the Unicode bidirectional control characters (LRE, RLE, etc.), which need to be embedded in the text stream and are somewhat harder to use.
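To show why the Unicode bidirectional control characters are harder to work with than markup, here is a small Python sketch using the standard `unicodedata` module. The `embed_ltr` helper and the sample string are hypothetical, introduced only for this illustration.

```python
import unicodedata

# The embedding controls named above, plus the character that closes them.
controls = {"LRE": "\u202a", "RLE": "\u202b", "PDF": "\u202c"}

for label, char in controls.items():
    # Prints the abbreviation, the code point, and the official Unicode name.
    print(label, f"U+{ord(char):04X}", unicodedata.name(char))


def embed_ltr(text: str) -> str:
    """Wrap text in LRE ... PDF so it is laid out left-to-right (hypothetical helper)."""
    return controls["LRE"] + text + controls["PDF"]


wrapped = embed_ltr("iText")

# The controls are invisible when rendered, yet they are real characters
# in the data: the string is two characters longer than it looks.
print(len("iText"), len(wrapped))  # 5 7
```

Because these characters travel inside the text itself, they affect string lengths, comparisons, and searches, whereas a DIR attribute stays in the markup where it is easy to see and change.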
While we're on the BIDI topic, it's also important to understand that the BIDI concept applies to the whole Web page layout, not just the text content. Visual page flow will also need to be RTL to convey the same meaning to BIDI users. In LTR languages, the most important information is usually placed in the upper-left corner of the screen/page; in RTL languages it is the upper-right corner that's most important. Perhaps more important and very often overlooked, graphics - especially navigation graphics - will be understood by these same users to have an RTL meaning. Figure 1 provides a simple example of this. In RTL languages, which button in Figure 1 do you think will skip to the end? Remember that RTL graphics should be mirror images of their LTR counterparts.

How CSS Can Help

CSS is often used to control changes in fonts, font sizes, and line heights when the language changes in a G11N application. As a real-world example of this, consider Simplified versus Traditional Chinese. Users tend to prefer different fonts for each character encoding, even though they may be using many of the same characters. In theory there are four ways of accomplishing this (see W3C FAQ: Styling using the lang attribute: www.w3.org/International/questions/qa-css-lang.html):

1. the :lang() pseudo-class selector (XHTML)

I'm going to ignore the first three methods, simply because most browsers currently do not fully support them. For future reference, the W3C recommends the first method, the CSS2 language pseudo-class selector :lang(). Since we have to make do with the world as we find it, let's examine an example of the generic class or id selector approach shown in Figure 2. Use the following styling:

body {font-family: "Times New Roman", serif; }

Note: The xml:lang and lang attributes are added to allow for expected future support.

The concept is simple. We add a generic class for each language we want to support.
We can then easily "tune up" or "skin" the text presentation per language using fonts, sizes, etc. Besides the extra code required, this method also has the disadvantage of having to explicitly define each and every possible language/locale we wish to support. If we wanted to supply larger-size fonts for Australians (en-AU) and Canadians (en-CA), we would have to explicitly define two classes for that; otherwise they would inherit text properties from the BODY selector. If the lang |= "..." selector, which matches the beginning value for an attribute, actually worked in all browsers, we could simply define a class, en-BadEyesight, which would then match en-AU and en-CA. As with any application of CSS, you might encounter issues with browser versions, but I count these as trivial compared to trying to handle this using normal HTML formatting elements.

Just Use Unicode

The preceding section on character encoding should have put you off your G11N feed. The only surefire cure for that is of course Unicode. Using Unicode simplifies things tremendously. You only have to deal with one encoding (UTF-8) on the front end and back end. Mojibake might then indeed become another type of Krispy Kreme donut. Even BIDI issues become simpler, as Hebrew and Arabic characters have a direction and it becomes fairly straightforward to embed LTR text in RTL text streams, though directionally neutral characters between so-called "direction runs" can still need attention.

Furthermore, standards bodies like the World Wide Web Consortium (W3C) now expect all new RFCs to use Unicode for text encoding. National governments, for example India's, also back Unicode.

I think it's also important to point out how short this section on using Unicode is in comparison with all the "stuff and nonsense" dealing with code page encodings. Just using Unicode does actually simplify things a great deal, and I for one could do with a lot more simplification.
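Both claims, one encoding for every language and characters that carry their own direction, can be checked with a short Python sketch. The sample string and the characters inspected are chosen for this illustration; `cp874` is Python's codec name for the MS-874 Thai code page, and `unicodedata.bidirectional` reports the directional class the Unicode standard assigns to a character.

```python
import unicodedata

# English, Thai, and Japanese in a single text stream.
mixed = "English \u0e44\u0e17\u0e22 \u65e5\u672c\u8a9e"

# UTF-8 carries all three languages and round-trips without loss.
utf8_bytes = mixed.encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == mixed
print(round_trip_ok)  # True

# A code page encoding cannot: the Thai code page has no Japanese characters.
try:
    mixed.encode("cp874")
    codepage_ok = True
except UnicodeEncodeError:
    codepage_ok = False
print(codepage_ok)  # False

# Each Unicode character has a directional class of its own.
samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "digit 1": "1",
    "space": " ",
}
classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}
print(classes)
# {'Latin A': 'L', 'Hebrew alef': 'R', 'Arabic alef': 'AL', 'digit 1': 'EN', 'space': 'WS'}
```

The digit and the space show why mixed text still needs care: the digit has its own class (European number) and the space is a neutral (whitespace), so the surrounding characters and the base direction decide how each is laid out.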
Not that everything's beer and pretzels with Unicode; there is some controversy surrounding it. At one time, Unicode was branded as a Western imperialist cultural plot because of its attempts to consolidate CJK characters, the so-called "Han consolidation." (In all fairness, nobody was suggesting consolidating all the "A"s spread across various languages; perhaps the fact that they are spread across various languages is one reason not to, though the rule "Thou shalt not disturb existing encodings" might also be at work.) This has lately shifted to the idea that Unicode is some sort of Microsoft world-domination conspiracy, though these people should then find it curious that companies like Sun and Oracle, considered Microsoft's "mortal enemies," are also Unicode Consortium members. I've always liked to think of Unicode like the Borg: "resistance is futile," so why bother?

Conclusion

The rubber meets the road in the next article, entitled "In the Year 2525: Cultural Aspects of G11N." That article will deal with handling date, currency, and numeric formatting, calendars, collation (sorting), and so on. If you've been looking for some CF code, you'll see it there.