It's Not Called The 'World Wide Web' for Nothing
Build and deploy your CF applications for a global market
Mar. 9, 2004 12:00 AM
Much of the future growth of the Internet will occur in areas where the dominant language is not English. The convergence of this growing area of opportunity with the new features offered by CFMX makes CFMX an excellent choice for globalized applications.
This is the first in a series of articles on globalizing ColdFusion MX (CFMX) applications. Why CFMX? CFMX is really the first version of CF that allows developers to "go to town" with globalized applications. Let me list the three main reasons why I believe this is so:
- It's based on Java, so we can now finally use Unicode for text data (don't worry about what that means right now; just understand that it's really darned important - I'll get to what it means in a bit).
- It's based on Java, so when required we can simply reach down and make use of native Java classes to supplement CF's functionality. This is especially important for locales outside the officially supported CF subset (again, don't worry about what a "locale" is, I'll get to that shortly).
- It's based on Java, so we can more easily make use of the widely available Java globalization libraries like IBM's excellent ICU4J (http://oss.software.ibm.com/icu4j) for things like non-Gregorian calendars, holidays, locale specifics, and so on.
Why now? There is an ever-increasing number of communications on e-mail lists, developer forums, and blogs requesting information about globalization. Quite often the people requesting this information don't know they are dealing with globalization issues; sometimes all they want is for their CFMX application to start speaking French instead of "Martian." These communications have risen enough above the "background noise" level that CFDJ executive editor Jamie Matusow and technical editors Ray Camden and Simon Horwith thought it was a good idea to do some communicating of their own.
The purpose of this article is to:
- Present you with the "why" of globalizing CFMX applications (basically, I hope to entice you into jumping aboard the globalization bandwagon);
- Introduce you to the terminology used in this field (so you too can jargonize your proposals and pepper your e-mail with acronyms and abbreviations);
- Start you thinking about some of the more important concepts of CFMX globalization, so you can begin planning your own launch onto the world's stage. Globalization does indeed take a fair bit of planning and consideration to save yourself long-term grief.
Since this is more of an introductory and informational article, don't expect much in the way of CF code. I'll reserve the code for later articles that will deal with the specifics of globalizing CFMX applications.
Why Globalize?
The primary answer to that question, like most things in life, is of course money. Let me lay out some facts and figures. It's been predicted that online transactions in 2004 will reach $6.8 trillion with the U.S. having about 47% of that. While 47% is quite a decent chunk, that's actually a significant (and continuing) decline in share from 74% in 2000. That means there are $3.6 trillion in online transactions going on outside the U.S. - and that amount is growing. More important, in my opinion, those transactions aren't being conducted in one monolithic market. They're spread out among dozens, perhaps hundreds, of countries, most of which don't use English and almost none of which use U.S. dollars as their currency.
Since the financial volume of online transactions has a lot to do with per capita income (i.e., Americans simply have more income to spend online), which will eventually become more evenly distributed, it might be useful to examine raw global Internet usage (or eyeballs). Figure 1 shows Internet usage by region. As you can see, Internet users outside North America number more than double those within that region. The Asian region is the current Internet user champion.
If we examine Internet usage growth from 2000 to 2003 (see Figure 2), we see a similar trend - the fastest growth is now found outside North America. The Middle East, of all places, has seen more than 100% growth during this period. Perhaps this is something to keep an eye on; I'm fairly sure the marketing mavens are doing so.
Figure 3 illustrates Internet penetration as a percentage of total population by region. From this data it's clear that while North America is fast approaching total saturation (I can attest to that fact from personal experience; even my long-suffering dear old Mom now makes regular use of the Internet), there's still huge potential in the rest of the world, even in Europe. I'm no marketing expert, but it seems to me this is something significant to consider.
If this information doesn't exactly bring tears to your eyes, perhaps you might consider this simple fact: there are more Spanish speakers living in the U.S. than there are Canadians in Canada. The 38 million Hispanics in the U.S. (estimated by the Census Bureau) have a purchasing power of around $675 billion. You don't have to go to Hong Kong to understand the need for globalization; just take a look around your back yard.
Let's look at one last Internet statistic that further illustrates this point: weblogs (blogs). The NITLE Blog Census shows similar trends in the blog space. Of the 1.7 million blogs surveyed, about 62% were considered to be in English. I find it particularly interesting that the next most popular blogging languages are Portuguese, Farsi (Persian), and Polish. I'm even more surprised to see that there are over 3,000 blogs in Esperanto.
Summing up, it appears that most future Internet growth will be occurring outside the North American region and will involve heterogeneous locales. To me, this just cries out for globalized application development. As CF developers, we all understand how important the Internet is; we just need to focus on where it's important.
Terminology
Now that we've covered the need for globalized applications, it's time I defined some of the jargon I've been tossing around and introduce some of those acronyms I promised.
Internationalization
Internationalization (a.k.a. I18N, an abbreviation for the 18 letters between the "I" and "N" in internationalization) refers to the design and development of a CFMX application so that its core functionality is not based on a single locale or language. You can think of I18N as making an application locale or language neutral by removing all obstacles to it being deployed in more than one locale.
Localization
Localization (a.k.a. L10N, an abbreviation for the 10 letters between the "L" and "N" in localization) is the post-I18N process of adapting an application to a specific locale without any changes to its source code. L10N is the process of applying a locale or language "skin" to an already I18N CFMX application.
Globalization
Globalization (a.k.a. G11N, an abbreviation for the 11 letters between the "G" and "N" in globalization) is sometimes used as a synonym for I18N (mainly by Microsoft), but to me it's L10N implemented across several locales after I18N.
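The counting rule behind these three abbreviations is simple enough to write down. The following is only an illustrative sketch in Java; the class and method names are made up for this example and are not part of CFMX or the JDK.

```java
// Hypothetical helper for illustration only.
class Numeronym {
    // Keeps the first and last letters and replaces everything in
    // between with a count of the letters that were dropped.
    static String abbreviate(String word) {
        if (word.length() < 3) {
            return word; // nothing to drop
        }
        int dropped = word.length() - 2;
        char first = Character.toUpperCase(word.charAt(0));
        char last = Character.toUpperCase(word.charAt(word.length() - 1));
        return "" + first + dropped + last;
    }

    public static void main(String[] args) {
        String[] terms = {"internationalization", "localization", "globalization"};
        for (String term : terms) {
            // Prints I18N, L10N, and G11N respectively.
            System.out.println(term + " -> " + abbreviate(term));
        }
    }
}
```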
Locale
Locale (nope, no nifty abbreviation) refers to the most elementary part of globalization. Locales are languages and other cultural norms (calendars; date, number, and currency formatting; spelling; measurement systems; page sizes; and so forth) specific to a geographic region. HTML and XML both rather plainly define locales as "(language, country)," that is, primarily as a language identifier. Java and ColdFusion (in a roundabout fashion) include other cultural information that is specific to locale.
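To make the idea of a locale as "language plus cultural norms" a little more concrete, here is a minimal Java sketch. It assumes a reasonably modern JDK; the class name LocaleFormatting and the sample value are invented for illustration. It formats the same number for two locales and shows the language and country parts of a Thai locale.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleFormatting {
    // Formats a number using the conventions of the given locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        // A locale is identified by a language code and a country code.
        Locale thai = new Locale.Builder().setLanguage("th").setRegion("TH").build();
        System.out.println(thai.getLanguage() + " / " + thai.getCountry()); // th / TH

        // Same value, different grouping and decimal separators.
        System.out.println(formatNumber(1234567.5, Locale.US));      // 1,234,567.5
        System.out.println(formatNumber(1234567.5, Locale.GERMANY)); // 1.234.567,5
    }
}
```

Dates, currencies, and calendars vary by locale in the same way, which is why the locale has to travel with every formatting call rather than being assumed.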
Codepage Encodings
Codepage encodings date from a time when software developers lived in caves and did all their coding by torchlight (in my opinion at least), so I am loath to even mention them, though for the sake of thoroughness I will. A codepage is also known by various other names, including encoding, charset, character set, coded character set (CCS), graphic character set, and character map. Microsoft defines a codepage encoding as a list of selected character codes in a certain order.
Codepages are defined for specific languages or groups of languages that share common writing systems. A codepage (in Windows at least) can contain only 256 code points; because of this, the codes in one codepage overlap with those of another (usually in the upper 128 code points) from one language to the next. You simply cannot mix languages (say Thai and French) in the same text stream because of this overlap.
Things become even more complicated when dealing with Asian languages such as Chinese, Japanese, and Korean (which you will often see abbreviated as CJK), since these languages contain far more than 256 characters. DBCS (double-byte character sets) were created to handle these languages while still being built from byte values that are each limited to 256 possibilities. Each DBCS character is represented by a pair of code points (a double byte), which rendering systems need to interpret as a single character. Even with all this mess, DBCS codepages still overlap. All of this severely limits the usefulness of codepage encoding for globalized CFMX applications.
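The overlap problem is easy to demonstrate. The sketch below, written in Java as an illustration only, decodes the same single byte with a Western European encoding and with a Thai one. It assumes the JDK in use ships the optional TIS-620 charset (most full JDK installations do); the class name CodepageOverlap is made up for this example.

```java
import java.nio.charset.Charset;

class CodepageOverlap {
    // Decodes one byte using the named single-byte encoding.
    static String decode(int byteValue, String charsetName) {
        byte[] bytes = { (byte) byteValue };
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same byte value means different things in different codepages.
        String asLatin = decode(0xE9, "ISO-8859-1"); // e with acute accent
        String asThai = decode(0xE9, "TIS-620");     // Thai tone mark mai tho
        System.out.printf("ISO-8859-1: U+%04X%n", (int) asLatin.charAt(0)); // U+00E9
        System.out.printf("TIS-620:    U+%04X%n", (int) asThai.charAt(0));  // U+0E49
    }
}
```

A byte stream carries no record of which codepage it was written in, so text stored this way can only be read correctly if the encoding is known from somewhere else.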
Unicode
Unicode is an encoding standard that provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode is especially well suited to the Internet, since the global nature of the Internet requires applications/solutions that work in any language. The Internet standards bodies have recognized this fact: the Internet Engineering Task Force (IETF) expects protocols defined in new RFCs (Requests for Comments) to support Unicode for text, and World Wide Web Consortium (W3C) standards such as XML are built on it. Unicode is the de facto character encoding standard for all major computer companies, while ISO 10646 is the corresponding worldwide standard approved by all ISO member countries. Not to worry, these two standards have identical character stocks and binary representations. Unicode: don't leave home without it.
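As a small illustration of "a unique number for every character," the Java sketch below puts a French letter and a Thai letter in the same string, lists their Unicode code points, and counts the bytes needed to store them as UTF-8. It assumes Java 8 or later; the class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

class UnicodeDemo {
    // Lists each character's code point in the usual U+XXXX notation.
    static String codePoints(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // e with acute accent followed by the Thai letter ko kai,
        // two scripts that no single codepage could hold together.
        String mixed = "\u00e9\u0e01";
        System.out.println(codePoints(mixed)); // U+00E9 U+0E01
        System.out.println(utf8Length(mixed)); // 5 (two bytes plus three bytes)
    }
}
```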
Coordinated Universal Time
Oddly enough, Coordinated Universal Time is abbreviated as "UTC." For all intents and purposes it is the same as Greenwich Mean Time (GMT), which is the global standard for time established in October 1884 at the International Meridian Conference when delegates from 25 nations (including Hawaii and Liberia) met in Washington, DC, and agreed on a system. GMT/UTC refers to time kept on the Greenwich meridian (longitude zero). This time scale is kept by several time laboratories around the world (for instance, the U.S. Naval Observatory) and is determined using highly precise atomic clocks, rather than by observing astronomical phenomena.
CFMX Globalization Concepts
The final section of this article will introduce some of the more important concepts associated with developing G11N CFMX applications. Later articles will more fully develop these ideas with CF code examples.
Unicode
I really can't state this any more plainly; it's critical that every aspect of your CFMX application be Unicode capable - anything less will eventually be fatal. You simply cannot expect to efficiently deploy and manage a globalized CFMX application using codepage encodings. Your database should be fully Unicode capable; if it's not, dump it. This, however, should not be a major issue, as most of the modern big-iron databases such as MS SQL Server, Oracle, and their brethren - as well as some of the more popular desktop databases such as MS Access 2000 - fully support Unicode. Finally, since UTF-8 is CFMX's default encoding, it's just common sense to use Unicode. In practical terms, this means you won't have to worry about inserting/retrieving text data from your database, displaying it, or transferring to other formats (XML, etc.).
Locale
It's vital that your CFMX application be able to transparently determine a user's locale - though it's good practice to always provide a method for a user to manually swap locales - and provide locale-specific functionality such as text translations and object (date/time/numeric/currency) formatting. Locales are, after all, the basic building blocks of globalized CFMX applications. Your application needs to get a user's locale right.
While CFMX supports a hefty subset of the available Java locales, it's just that, a subset. For instance, it doesn't support the locale in which I live (Thailand, th_TH). Because of this, your application should be prepared to make use of Java locale-based classes to supplement the missing locales. For the sake of consistency you should therefore also use Java-style locale notation such as "en_US" (two-letter language code followed by two-letter country code, sometimes followed by a variant code) instead of the CF style of "English (US)". Since switching between CF and Java locale-specific functions requires extra overhead and management, I suggest using the Java-based functions until CFMX supports all Java locales.
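A rough idea of what working with Java-style locale identifiers looks like is sketched below in plain Java. The LocaleIds class and its methods are hypothetical helpers written for this illustration, not CFMX functions; the sketch handles only the language and country parts and ignores variants.

```java
import java.util.Arrays;
import java.util.Locale;

class LocaleIds {
    // Parses identifiers such as "th_TH" or "en" into a java.util.Locale.
    // Variant codes are deliberately left out of this sketch.
    static Locale parse(String id) {
        String[] parts = id.split("_");
        Locale.Builder builder = new Locale.Builder().setLanguage(parts[0]);
        if (parts.length > 1) {
            builder.setRegion(parts[1]);
        }
        return builder.build();
    }

    // Reports whether the running JVM has locale data for this locale.
    static boolean isInstalled(Locale locale) {
        return Arrays.asList(Locale.getAvailableLocales()).contains(locale);
    }

    public static void main(String[] args) {
        Locale thai = parse("th_TH");
        System.out.println(thai + " installed: " + isInstalled(thai));
        System.out.println(thai.getDisplayName(Locale.ENGLISH));
    }
}
```

Keeping the identifier as a plain "language_COUNTRY" string makes it easy to store in a session or a database and to hand to Java formatting classes later.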
String Handling
If you're used to building applications for one locale, you probably have hard-coded application text and its presentation right alongside your CF code. While that might work for one locale, this approach will surely fail once your application needs to handle several locales.
Without a strict separation of your application text from text presentation and CF code, you will end up with one set of source code per locale - a management nightmare. Translation management is another often overlooked G11N task that benefits from the full separation of application text; handing a translator a file full of CF code is just asking for trouble. The normal (i.e., Java) method for handling this is through the use of resource bundles (per-locale files holding key/value pairs of application text). You can see an example of this technique in Ray Camden's article, "Add Localization to Your Web Site" (CFDJ, Vol. 6, issue 2).
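The resource bundle idea can be sketched in a few lines of Java. To keep the example self-contained, the bundles below are loaded from in-memory strings rather than from per-locale .properties files, and the lookup order is written out by hand; the Messages class, the keys, and the sample translations are all invented for illustration and are not taken from the article cited above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class Messages {
    private static final Map<String, ResourceBundle> BUNDLES = new HashMap<>();

    static {
        // In a real application each entry would be its own .properties file.
        register("", "greeting=Hello\nfarewell=Goodbye\n");
        register("fr", "greeting=Bonjour\nfarewell=Au revoir\n");
        register("th_TH", "greeting=\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35\n");
    }

    private static void register(String localeId, String properties) {
        try {
            BUNDLES.put(localeId, new PropertyResourceBundle(new StringReader(properties)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Tries "language_COUNTRY", then "language", then the default bundle.
    static String text(Locale locale, String key) {
        String language = locale.getLanguage();
        String[] candidates = { language + "_" + locale.getCountry(), language, "" };
        for (String id : candidates) {
            ResourceBundle bundle = BUNDLES.get(id);
            if (bundle != null && bundle.containsKey(key)) {
                return bundle.getString(key);
            }
        }
        return key; // a missing translation stays visible instead of failing
    }

    public static void main(String[] args) {
        System.out.println(text(Locale.FRANCE, "greeting")); // Bonjour
        System.out.println(text(Locale.US, "farewell"));     // Goodbye
    }
}
```

The application code only ever refers to keys such as "greeting"; translators work on the key/value text alone and never see program logic.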
Calendars
Most CF developers are familiar with the Gregorian calendar (though perhaps unaware of its designation). While it's used in most English-speaking countries, it's certainly not the only calendar in use. A fully globalized CFMX application must handle non-Gregorian calendars such as Buddhist, Chinese, Hijri, Japanese, and Hebrew calendars. Besides day and month names varying from calendar to calendar, year numbering will also vary. For instance, the Gregorian calendar year of 2004 is 2547 in the Buddhist Era in Thailand. The day a week starts with will also vary from calendar to calendar, as will the actual length of a "month," making date calculations problematic. If this all sounds a bit too much, I'm in complete agreement. I frequently use and strongly recommend using IBM's ICU4J (http://oss.software.ibm.com/icu4j) to handle calendars as well as holidays (calculating the Easter holiday, for instance, isn't trivial).
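For a feel of how year numbering and week conventions differ, here is a minimal Java sketch. Note the assumptions: it uses the java.time.chrono classes that ship with Java 8 and later (which postdate CFMX) rather than ICU4J, and the CalendarDemo class is invented for this illustration.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;
import java.time.temporal.WeekFields;
import java.util.Locale;

class CalendarDemo {
    // Year of the Buddhist Era for a Gregorian (ISO) date.
    static int buddhistYear(LocalDate isoDate) {
        return ThaiBuddhistDate.from(isoDate).get(ChronoField.YEAR_OF_ERA);
    }

    // Year within the current Japanese imperial era for a Gregorian date.
    static int japaneseEraYear(LocalDate isoDate) {
        return JapaneseDate.from(isoDate).get(ChronoField.YEAR_OF_ERA);
    }

    // The day a week starts on, according to the locale's conventions.
    static DayOfWeek firstDayOfWeek(Locale locale) {
        return WeekFields.of(locale).getFirstDayOfWeek();
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2004, 3, 9);
        System.out.println(buddhistYear(date));             // 2547
        System.out.println(japaneseEraYear(date));          // 16 (Heisei 16)
        System.out.println(firstDayOfWeek(Locale.US));      // SUNDAY
        System.out.println(firstDayOfWeek(Locale.GERMANY)); // MONDAY
    }
}
```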
Time Zones
Your application should store date/time data as UTC. This gives your application flexibility if you need to distribute it across many time zones or you have the need to "cast" data to time zones besides the CF server's.
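The "store as UTC, cast on display" approach might look like the following Java sketch. It assumes the java.time API from Java 8 or later (not available in CFMX itself), and the UtcStorage class and the sample timestamp are made up for illustration.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

class UtcStorage {
    // Converts a stored UTC instant to local wall-clock time in the given zone.
    static LocalDateTime toZone(Instant storedUtc, String zoneId) {
        return storedUtc.atZone(ZoneId.of(zoneId)).toLocalDateTime();
    }

    public static void main(String[] args) {
        // The value as it would be kept in the database: midnight UTC.
        Instant stored = Instant.parse("2004-03-09T00:00:00Z");

        // The same moment shown to users in two different time zones.
        System.out.println(toZone(stored, "Asia/Bangkok"));     // 2004-03-09T07:00
        System.out.println(toZone(stored, "America/New_York")); // 2004-03-08T19:00
    }
}
```

Because only one unambiguous value is stored, the server's own time zone never leaks into the data, and each user's zone is applied at the last moment.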
Conclusion
The Internet's future growth will most likely take place in regions outside North America, which is fast approaching total saturation. The user base in these growth areas is not one homogeneous locale but consists of dozens - perhaps hundreds - of locales. A globalized CFMX application will therefore need to efficiently handle locales and provide locale-specific functionality to be successful in these "new" international markets.
The next article in this series will examine character encodings and CFMX in detail. It aims to deal with the different kinds of encoding, BIDI (bidirectional text), and the use of CSS in G11N applications, and to go into further detail on why we should all just use Unicode.
Resources
Forrester Projects $6.8 Trillion for 2004: www.forrester.com/ER/Press/Release/0,1769,277,FF.html
Internet World Usage Stats: www.internetworldstats.com/stats.htm
Bilingual ads hit airwaves (San Antonio Express-News): http://news.mysanantonio.com/story.cfm?xla=saen&xlb=110&xlc=1078719
NITLE Weblog Census: www.blogcensus.net
Character sets and codepages: www.microsoft.com/typography/unicode/cscp.htm
What is Unicode: www.unicode.org/standard/WhatIsUnicode.html
National Weather Service Climate Glossary: www.cpc.ncep.noaa.gov/products/outreach/glossary.html
What is Universal Time?: http://aa.usno.navy.mil/faq/docs/UT.html
About Paul Hastings
Paul Hastings, who after nearly 20 years of IT work is now a perfectly fossilized geologist, is CTO at Sustainable GIS, an agile consulting firm specializing in Geographic Information Systems (GIS) technology, ColdFusion Internet and intranet applications for the environment and natural resource markets, and of course globalization. Paul is based in Bangkok, Thailand, but says that's not nearly as exciting as it sounds.