

Do You Want Coffee with That Mojibake?

Character encodings and CFMX

This is the second in a series of articles on globalizing ColdFusion MX (CFMX) applications. This article examines character encodings and CFMX, BIDI (bidirectional text), the use of Cascading Style Sheets (CSS) in application globalization (G11N), and why we should all just use Unicode. Space is limited so I'm going to assume that you've read the first article, which covered globalization concepts and terminology.

No, mojibake isn't a new kind of Krispy Kreme donut. Mojibake, or "文字化け" in Japanese, literally meaning "ghost characters" or "disguised characters," is a term that has crept into the G11N field and is often used to indicate gibberish text that has become corrupted because of bad or missing character encoding. For instance, "文字化け" itself becomes the mojibake "c$BJ8;z2=$1c(J" when the character encoding is messed up (I plucked this example from some e-mail correspondence). Yes, this issue arises often enough that somebody coined a term for it.
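
You can reproduce that exact kind of corruption yourself. Here's a quick sketch in Python (used purely for illustration, since this series is about CFMX): ISO-2022-JP wraps Japanese text in escape sequences, and when the invisible ESC bytes are eaten by a transport that isn't 8-bit clean, only ASCII gibberish remains.

```python
word = "文字化け"  # "mojibake" written in Japanese

# ISO-2022-JP brackets Japanese text with invisible ESC sequences:
raw = word.encode("iso2022_jp")

# Strip the ESC bytes (as a mangled mail gateway might) and the
# familiar gibberish from the example above appears:
print(raw.replace(b"\x1b", b"").decode("ascii"))  # $BJ8;z2=$1(B
```

Note the surviving bytes are exactly the "J8;z2=$1" run in the e-mail example quoted above.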

Why do we have to worry about these sorts of things? Pornography aside, text is by far the most commonly used data type in Web applications. It should be obvious then that it's critical that people are able to understand the content your Web application is delivering (otherwise what's the point?). The key to this is making sure Web applications, Web servers, database back ends, and users' browsers are all in agreement regarding character encoding. With that in mind, the purpose of this article is to:

  • Explain what character encodings are
  • Provide some background to the more common character-encoding issues
  • Explain the sometimes tricky business of BIDI text
  • Indicate how CSS can help you develop G11N Web applications
  • Convince you, come hell or high water, to just use Unicode rather than trying to deal with all the various character encodings on a case-by-case basis
One thing that makes understanding these issues difficult is the plethora of oftentimes conflicting terminology in use today, even within some of the "standards" bodies. While I'm reasonably certain that someone, somewhere, will object, I think the terms I chose to use here are common and plausible enough for the purposes of this article.

What Are Character Encodings?
Let's begin by dissecting, in a simple-minded way, human language into its component parts, beginning with the simplest, characters.

Characters, Glyphs, and Other Sea Creatures
I suppose it might be useful to think of a character as an "atom" within a "molecule" of text content like a word. But you really have to think of a character in the abstract, as an entity without regard to its appearance (whether it's set in roman, italic, bold, or some decorative typeface - it's still an "a"). The Unicode Consortium (www.unicode.org) defines an abstract character as a unit of information used for the organization, control, or representation of textual data. The Unicode Consortium's "character encoding model" (Unicode Technical Report 17, www.unicode.org/unicode/reports/tr17) defines three basic concepts:

A character repertoire is simply a set of distinct abstract characters; some folks refer to this as a "character set." In practice, a character repertoire usually corresponds to an alphabet (your ABCs) or a symbol set (musical notation, for instance). Note that a character repertoire can contain characters that look the same in some presentations, such as Latin uppercase A and Cyrillic uppercase A, but which are in fact logically distinct. Once again, you need to separate the way a character looks from what it actually represents.

A character code is a mapping from a set of abstract characters to a set of nonnegative (but not necessarily consecutive) integers - the abstract character made real to computers, if you will. Each abstract character's mapped integer is called its "code point." For example, in Unicode the code point for "A" is 65; the code point for "ก" (ko kai), the first letter of the Thai alphabet, is 3585.
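
To make the abstraction concrete, here's a quick sketch in Python (chosen purely for illustration; its ord() and chr() built-ins convert between characters and Unicode code points):

```python
# A character's code point is just a nonnegative integer:
print(ord("A"))   # 65   - Latin capital A
print(ord("ก"))   # 3585 - ko kai, the first letter of the Thai alphabet
print(chr(65))    # A    - and back again, from code point to character
```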

A character encoding is a method or algorithm for presenting characters in digital form by mapping sequences of characters' code numbers to sequences of bytes. For example, in the MS-874 (Thai) encoding "ก" has a code point of 161; that same code point is assigned to "¡" in the Latin-1 encoding (that's an inverted exclamation point, by the way) and to "Ў" (Cyrillic capital letter short U) in the Windows Cyrillic encoding.
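
Decoding the same byte under each encoding shows the overlap directly; a sketch in Python (its codec names cp874, latin-1, and cp1251 correspond to the MS-874 Thai, Latin-1, and Windows Cyrillic encodings just mentioned):

```python
raw = bytes([161])  # the single byte 0xA1

# One byte, three entirely different characters:
print(raw.decode("cp874"))    # ก - Thai ko kai
print(raw.decode("latin-1"))  # ¡ - inverted exclamation point
print(raw.decode("cp1251"))   # Ў - Cyrillic capital letter short U
```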

The visual representations of characters are called glyphs. You need to understand that text presentations, such as fonts, are applied to glyphs and not to the abstract characters. A font is a collection of glyphs - in practical terms, a numbered set of glyphs whose numbers correspond to the code positions of the characters the glyphs represent. A font, at least in this sense, is entirely dependent on a character code. It's this dependence that often causes the appearance of boxes (□) or other strange characters in text streams - a browser's fonts simply can't render the requested character code because it's not in that font or, more rarely, it violates some display rule. It's therefore important to fully understand which character encodings are covered by which fonts. Test, don't simply assume.

A script is a collection of related characters required to represent text in a particular language, for instance, Latin, Greek, Thai, Japanese, or Arabic. Note that one script might also be used in several languages. For example, Arabic script is used in Pashto, Urdu, Farsi (Persian), and of course Arabic. A writing system is composed of a set of characters from one or more scripts that are used to write a particular language. A writing system also includes the rules that govern character presentation. For example, Thai has what are affectionately referred to as "jumping vowels" such as "เ" (sara e), as used in the Thai word for nothing, "เปล่า" (transliterated into English as plao), which jumps in front of the consonant "ป" (por pla) but is pronounced as if it hadn't (that is, "plao" instead of "aopl"). Another example is the writing direction (left-to-right or right-to-left, for instance) a particular script uses - languages don't have a direction; only the scripts used in their writing systems do.

Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language, even though a language setting is quite distinct from character issues. There are, however, some more or less "natural" relationships between languages and character encodings. Table 1 shows a partial list of these.

There are several things to note from Table 1:

  • The sheer number of character encodings
  • The fact that the same character encoding is used in several languages
  • The fact that many languages, Japanese for example, might be referenced by more than one character encoding - which compounds the confusion of the previous point. I especially enjoy this one on projects with tight deadlines.
To me these all just spell trouble in a G11N application. It's this kind of exuberant variety that causes mojibake and other headaches.

Common Character Encoding Issues
The large variety of character encodings means globalized applications based on them must go the extra mile to manage them. This management effort must, by necessity, extend from the back-end database through to the pages delivered to the client's browser - a daunting and potentially expensive task. In many instances character encoding-based applications rule out back-end database consolidation; rather than a small number of databases to manage, you could well end up with one database per character encoding. Depending on the database technology used, it could also mean rolling out one Web server per character encoding (a common occurrence with desktop databases such as MS Access, for example). Obviously, economics eventually forces a database change, but often too late.

Variety also means choice. Languages with more than one character encoding are especially troublesome, as it's generally impossible to forensically determine which encoding was originally used. You can end up with text data encoded in one character encoding but displayed in another. This happens quite often with text data that has passed through many sets of hands, with the original character encoding metadata lost along the way. It can also occur when no character encoding "hint" is included in a Web page and a browser's default doesn't match the original encoding. In HTML the hint is normally provided by the charset parameter of the HTTP Content-Type header:

<META http-equiv="Content-Type" content="text/html; charset=caveDwellingCodePage">

where caveDwellingCodePage is the character encoding you require. This should be declared as early as possible in the head section of your Web page. You should also note that the W3C has chosen to use charset as a synonym for character encoding. For XHTML compliance you would simply add a slash to the end of that tag:

<META http-equiv="Content-Type" content="text/html; charset=caveDwellingCodePage" />

While CFMX will happily ignore this meta-header, I would still urge you to include it for the sake of spidering robots and other content-indexing programs, as well as accessibility software. It also provides a hardcoded artifact as to what the original character encoding intentions were. In CFMX the CFPROCESSINGDIRECTIVE, CFCONTENT, and CFHEADER tags (and the SETENCODING function) provide this hinting. XML hinting is usually done with an encoding pseudo-attribute in the XML declaration at the start of a document:

<?xml version="1.0" encoding="UTF-8" ?>
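
What happens when no hint survives and a consumer guesses wrong? A minimal Python sketch of the failure mode (illustrative only: UTF-8 bytes misread under a Latin-1 default):

```python
page = "café".encode("utf-8")  # what the server actually sent

# A consumer that guesses Latin-1 turns each multi-byte UTF-8
# sequence into gibberish - mojibake again:
print(page.decode("latin-1"))  # cafÃ©
print(page.decode("utf-8"))    # café - correct, given the right hint
```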

There is also another, more subtle, character encoding pitfall. Some character encodings masquerade as related but, when examined in detail, are in fact not. For example, the Windows Latin-1 character encoding (Windows-1252) is quite often mislabeled by Web developers as ISO-8859-1 on the Internet, but in actual fact it is a superset of ISO-8859-1. The extra characters provided by the Windows superset will confuse browsers that actually treat the text as ISO-8859-1, whether you told them to via charset hinting or it is simply handled as a default character encoding. It's not just the Windows OS; the Mac OS has a few similar issues. Its Roman character encoding is quite often labeled as ISO-8859-1 even though it predates that ISO encoding by several years. It does not have exactly the same character repertoire, and many of the characters it does share with ISO-8859-1 have different code points. Even the Mac Latin-1 or Mac Mail character encoding, an attempt at aligning the Mac OS Roman repertoire with ISO-8859-1, is not quite equivalent, but it is very often labeled as if it were.
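
The Windows-1252/ISO-8859-1 divergence is confined to bytes 0x80-0x9F, where Windows-1252 places printable characters (curly quotes and the like) and ISO-8859-1 has only invisible control codes. A quick Python check (illustrative):

```python
smart_quote = b"\x93"  # a byte from the range where the two differ

print(smart_quote.decode("cp1252"))         # " - left curly quote, U+201C
print(repr(smart_quote.decode("latin-1")))  # '\x93' - an invisible control code
```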

Finally, as most of the character encodings listed in Table 1 are codepage encodings and can contain only 256 code points, you cannot mix languages within the same text stream, as these encodings overlap (commonly in the last 128 code points). If you think this isn't a common occurrence, just look at this article; so far it has mixed Japanese, Thai, and English. Another point to consider, from an I18N perspective, is the good practice of allowing users to manually swap languages and showing each language choice (Thai) in that language (ไทย). It's the little things that count, after all.
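
A sketch in Python makes the point (illustrative only): no single code page can hold this article's mix of scripts, while UTF-8 takes it in stride.

```python
mixed = "plao เปล่า 文字化け"  # Latin, Thai, and Japanese in one stream

# Each single-script code page rejects some of the text:
for codec in ("latin-1", "cp874", "shift_jis"):
    try:
        mixed.encode(codec)
        print(codec, "worked (it shouldn't have!)")
    except UnicodeEncodeError:
        print(codec, "cannot encode this mixed stream")

# Unicode's UTF-8 encodes and round-trips the whole thing:
assert mixed.encode("utf-8").decode("utf-8") == mixed
```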

BIDI Concepts
BIDI, or bidirectional text, can be somewhat difficult to understand, especially for folks used to text in one, usually left-to-right (LTR), direction. I can only skim the surface of this complex subject here (for instance, I'm going to skip clean over Arabic script's special ligature and shaping features, the so-called "national" or "Hindi" digit shapes, Hebrew's five "Final Form" consonants, and directionally neutral characters such as spaces and punctuation, among other things), but hopefully it will be enough for a basic grasp of BIDI issues. Why bother if it's so complicated? Because more than 500 million people in the Middle East, Central/South Asia, and Africa use languages with bidirectional scripts. These languages include Arabic, Farsi (Persian), Azerbaijani, Urdu, Punjabi, Pashto, Hebrew, and Yiddish. If you recall from the first article, the Middle East region is also experiencing more than 100% growth in Internet usage.

First off, why is it bidirectional? Aren't Arabic and Hebrew scripts written in just the one direction? No, actually they're not. Numbers embedded in these scripts are written LTR just as in Western European text (the most significant digit is first, or left-most; 100 is not written as 001 in BIDI scripts) even though the remainder of the text is written RTL. You will also very often find languages written in LTR scripts mixed in with RTL scripts (transliteration of proper or place names can be confusing and sometimes impossible; more often than not these aren't localized and are simply dumped "as is" into the RTL text stream). This is what makes the whole page BIDI. For further spice, note that in Arabic, mathematical expressions are written RTL, even though the numbers within them are still written LTR. As you can see, BIDI text handling is quite complicated - so much so that Flash and several other products still don't properly handle BIDI text at all, or need to handle it as a special case (PDF document creation, for example, with the excellent iText Java PDF library, www.lowagie.com/iText).

Next, in what scripts (recall that languages don't have direction) would you normally write BIDI text? Table 2 shows a few examples, though you will most likely encounter only Arabic and/or Hebrew (the rest are included mainly to show off how well-read the author is). Table 3 shows a larger list of commonly localized languages, their scripts, and the script's direction.

Ideograph languages (Chinese, Japanese, and Korean - CJK - for instance) are often quite "flexible" in their script's direction. For the most part these are written LTR or TTB (top-to-bottom); you might also find them written RTL (very often when TTB). Chinese-language newspapers are classic examples of this kind of directional elasticity: one page may combine LTR, TTB (with the vertical columns running RTL), and RTL text. Makes my head spin.

In Arabic and Hebrew scripts there are three conventions for the order in which text is encoded, two of which are most commonly encountered:

  • Logical order: Text is stored in memory in the same order it would be spoken or typed. Characters have an inherent direction attribute that is used by a display algorithm to determine the most likely display order for the corresponding glyphs.
  • Visual order: Text is stored line-by-line in left-to-right display order (that is, the Arabic and Hebrew nonnumeric text is encoded in reverse order). This is characteristically found in text data created by older systems.
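
The "inherent direction attribute" used in logical order comes from the Unicode Character Database and is easy to inspect; a sketch using Python's standard unicodedata module (purely illustrative):

```python
import unicodedata

# Each character carries a bidirectional category that the display
# algorithm uses when laying out logical-order text:
print(unicodedata.bidirectional("A"))  # L  - strong left-to-right (Latin)
print(unicodedata.bidirectional("א"))  # R  - strong right-to-left (Hebrew alef)
print(unicodedata.bidirectional("1"))  # EN - European number
print(unicodedata.bidirectional(" "))  # WS - whitespace, directionally neutral
```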

In HTML the DIR attribute specifies the base direction (LTR, RTL) of an element's text, which is what determines how directionally neutral characters (those Unicode defines as having no inherent directionality) are laid out. For example, <HTML DIR="RTL">. You can also specify direction for several other HTML elements, including <TABLE>, <BODY>, <P>, etc. Tex Texin's Web site (www.i18nGuy.com/markup/right-to-left.html) has an excellent set of tips for writing RTL text in markup, which can be summarized as:

  • Use the HTML element, not the BODY element, to set the overall document direction.
  • Use character encodings that employ logical, not visual, ordering, such as Unicode, Windows-1255, Windows-1256, ISO-8859-6-i, and ISO-8859-8-i. Don't use the visually ordered ISO-8859-6, ISO-8859-8, ISO-8859-6-e, and ISO-8859-8-e. See RFC 1556 for more information.
  • Use markup (the dir attribute, as in <div dir="ltr" lang="th"> </div>) instead of the Unicode bidirectional control characters (LRE, RLE, etc.), which need to be embedded in the text stream and are somewhat harder to use.

While we're on the BIDI topic, it's also important to understand that the BIDI concept applies to the whole Web page layout, not just the text content. Visual page flow will also need to be RTL to convey the same meaning to BIDI users. In LTR languages the most important information is usually placed in the upper-left corner of the screen/page; in RTL it's the upper-right that's most important. Perhaps more important, and very often overlooked, graphics - especially navigation graphics - will be understood by these same users to have an RTL meaning. Figure 1 provides a simple example of this. In RTL languages, which button in Figure 1 do you think will skip to the end? Remember that RTL graphics should be mirror images of their LTR counterparts.

How CSS Can Help
I'm assuming you have a basic understanding of CSS mechanics (because that's about what I have). While CSS is something of a hot issue these days, the G11N world has long looked to CSS in developing global Web applications. Take, for example, the HTML <FONT> element - knowing what you now know about character encodings and fonts, doesn't it make sense to use a few semantically appropriate CSS selectors instead of a wheelbarrow full of <FONT> elements?

CSS is often used to control changes in fonts, font sizes, and line heights when the language changes in a G11N application. As a real-world example of this, consider Simplified versus Traditional Chinese. Users tend to prefer different fonts for each, even though many of the same characters may be in use. In theory there are four ways of accomplishing this (see the W3C FAQ "Styling using the lang attribute": www.w3.org/International/questions/qa-css-lang.html):

1.  the :lang() pseudo-class selector (CSS2)
2.  a [lang|="..."] selector that matches the beginning of the value of a language attribute
3.  a [lang="..."] selector that exactly matches the value of a language attribute
4.  a generic class or id selector

I'm going to ignore the first three methods, simply because most browsers currently do not fully support them. For future reference, the W3C recommends the first method, the CSS2 language pseudo-class selector :lang(). Since we have to make do with the world as we find it, let's examine an example of the generic class or id selector approach, shown in Figure 2.

Use the following styling:

body {font-family: "Times New Roman", serif;}
.ar {font-family: "Traditional Arabic", serif; font-size: 12px;}
.zht {font-family: PMingLiU, MingLiU, serif;}
.zhs {font-family: SimSun-18030, SimHei, serif;}
.din {font-family: "Doulos SIL", serif;}
.th {font-family: "Angsana New", serif; font-size: 14px;}

Note: The xml:lang and lang attributes are added to allow for expected future support.

The concept is simple. We add a generic class for each language we want to support. We can then easily "tune up" or "skin" the text presentation per language using fonts, sizes, etc. Besides the extra code required, this method has the disadvantage of having to explicitly define each and every language/locale we wish to support. If we wanted to supply larger fonts for Australians (en-AU) and Canadians (en-CA), we would have to define two classes for exactly that; otherwise they would inherit text properties from the body selector. If the [lang|="..."] selector, which matches the beginning of an attribute's value, actually worked in all browsers, we could instead write a single [lang|="en"] rule - an en-BadEyesight style, if you will - that would match both en-AU and en-CA.

As with any application of CSS, you might encounter issues with browser versions, but I count these as trivial compared to trying to handle this using normal HTML formatting elements.

Just Use Unicode
At one time, after a particularly frustrating week dealing with character encoding issues, I was going to have this section's title, "Just Use Unicode," tattooed on my forehead. My wife and kids couldn't quite see the sense in that, so I had to forgo the experience. Nonetheless, I can't put this any straighter: Just Use Unicode.

The preceding section on character encoding should have put you off your G11N feed. The only surefire cure for that is, of course, Unicode. Using Unicode simplifies things tremendously: you only have to deal with one encoding (UTF-8) on the front end and back end. Mojibake might then indeed become another type of Krispy Kreme donut.
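
A short Python sketch of the payoff (illustrative only): one encoding, UTF-8, carries every script used in this article, and it leaves plain ASCII bytes untouched.

```python
text = "plao เปล่า 文字化け ¡Ў"  # every script from this article, one stream

# One encoding everywhere: encode on the back end, decode on the
# front end, and nothing is lost in between.
raw = text.encode("utf-8")
assert raw.decode("utf-8") == text

# UTF-8 is ASCII-compatible, so legacy ASCII content needs no conversion:
assert "plao".encode("utf-8") == b"plao"
print("round trip OK")
```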

Even BIDI issues become simpler, as Hebrew and Arabic characters carry an inherent direction, and it becomes fairly straightforward to embed LTR text in RTL text streams - though directionally neutral characters sitting between so-called "direction runs" (a space between an LTR run and an RTL run, for instance) still require some inline markup to make their placement clear.

Furthermore, standards bodies are on board: the IETF now expects new RFCs to support Unicode (UTF-8) for text, and the World Wide Web Consortium (W3C) builds its specifications on Unicode. National governments, for example India's, also back Unicode.

I think it's also important to point out how short this section on using Unicode is in comparison with all the "stuff and nonsense" dealing with code page encodings. Just using Unicode does actually simplify things a great deal, and I for one could do with a lot more simplification.

Not that everything's beer and pretzels with Unicode; there is some controversy surrounding it. At one time Unicode was branded as a Western imperialist cultural plot because of its attempts to consolidate CJK characters, the so-called "Han unification." (In all fairness, nobody was suggesting consolidating all the "A"s spread across various languages; the fact that they were already spread across various encodings might be one reason not to, though the rule "Thou shalt not disturb existing encodings" might also be at work.) This has lately shifted to the idea that Unicode is some sort of Microsoft world-domination conspiracy, though these people should find it curious that companies like Sun and Oracle, considered Microsoft's "mortal enemies," are also Unicode Consortium members. I've always liked to think of Unicode like the Borg: "resistance is futile," so why bother?

You can find a simple example of some of the things discussed in this article in Listing 1. There is a bewildering variety of character encodings in use today, which can often lead to gibberish text due to bad or missing encoding information. This situation can be greatly simplified by using Unicode.

The rubber meets the road in the next article, entitled "In the Year 2525: Cultural Aspects of G11N." That article will deal with handling date, currency, and numeric formatting, calendars, collation (sorting), and so on. If you've been looking for some CF code, you'll see it there.

More Stories By Paul Hastings

Paul Hastings, who after nearly 20 years of IT work is now a perfectly fossilized geologist, is CTO at Sustainable GIS, an agile consulting firm specializing in Geographic Information Systems (GIS) technology, ColdFusion Internet and intranet applications for the environment and natural resource markets, and of course globalization. Paul is based in Bangkok, Thailand, but says that's not nearly as exciting as it sounds.
