Chapter 15. Internationalization

As a developer, you belong to an international community. Whatever your nation, you share with other developers the problems of adapting software and applications for your own culture and others. Accommodating the needs of users in different cultures is called localization; making software easily localized is called internationalization.

Tools.h++ is made in the United States. It is internationalized in the sense that it provides the framework you need to localize fundamental aspects of different cultures, such as alphabets, languages, currencies, numbers, and date- and time-keeping notations. With Tools.h++, you write a single application you can ship to any country. When your application is executed, it will be able to process times, dates, strings, and currency in the native format.

While some aspects of internationalization are limited, a useful feature of Tools.h++ is that it imposes no policy. Tools.h++ gives you the freedom and flexibility to design your application to meet the needs of your clients' cultures and your own.

Localizing Alphabets with RWCString and RWWString

Localizing alphabets begins with allowing them to be represented. As mentioned in "Eight-bit Clean" in Chapter 2, Tools.h++ code is "8-bit clean" to accommodate the extended character set. All of the English alphabet is described in 7 bits, leaving the eighth free for umlauts, cedillas, and other diacritical marks and special characters. And because even 8 bits often isn't enough to represent all the character glyphs of various languages, Tools.h++ also allows two kinds of extensions: multibyte and wide-character encodings.

Multibyte encodings use a sequence of one or more bytes to represent a single character. (Typically the ASCII characters are still one byte long.) These encodings are compact, but may be inconvenient for indexing and substring operations. Wide character encodings, in contrast, place each character in a 16- or 32-bit integral type called a wchar_t, and represent a string as an array of wchar_t. Usually it is possible to translate a string encoded in one form into the other.
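To see why multibyte strings are awkward to index, consider the following minimal sketch. It assumes the multibyte encoding is UTF-8, which is only one of many encodings a host system might use, and the helper names countUtf8Chars and widenAscii are hypothetical, not part of Tools.h++.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts characters in a string ASSUMED to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted.
std::size_t countUtf8Chars(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Widens a pure-ASCII string, one byte per wchar_t.
// Non-ASCII input is treated as a programming error in this sketch.
std::wstring widenAscii(const std::string& ascii) {
    std::wstring wide;
    for (unsigned char b : ascii) {
        assert(b < 0x80);
        wide.push_back(static_cast<wchar_t>(b));
    }
    return wide;
}
```

A five-letter word with one accented letter occupies six bytes in this encoding, so the byte index of a character no longer matches its position in the word. In the wide form, every character occupies exactly one array element.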

The two efficient Tools.h++ string types, RWCString and RWWString, were discussed in Chapter 3. RWCString represents strings of 8-bit chars, with some support for multibyte strings. RWWString represents strings of wchar_t. Both provide access to Standard C Library support for local collation conventions with the member function collate() and the global function strXForm(). In addition, the library provides conversions between wide and multibyte representations. The wide- and multibyte-character encodings used are those of the host system.
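The Standard C Library collation support mentioned above consists of strcoll() and strxfrm(). The sketch below calls them directly; it is an assumption of this example, not a statement from the Class Reference, that collate() and strXForm() behave like thin wrappers over them. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// True if a sorts before b under the current C locale's collation rules.
bool collatesBefore(const std::string& a, const std::string& b) {
    return std::strcoll(a.c_str(), b.c_str()) < 0;
}

// Returns a sort key. Comparing two keys byte by byte gives the same
// ordering as comparing the original strings with strcoll().
std::string collationKey(const std::string& s) {
    // With a zero-length buffer, strxfrm() only reports the length needed.
    std::size_t needed = std::strxfrm(nullptr, s.c_str(), 0);
    std::vector<char> buf(needed + 1);
    std::size_t written = std::strxfrm(buf.data(), s.c_str(), buf.size());
    assert(written == needed);
    return std::string(buf.data(), written);
}
```

Sort keys pay off when the same strings are compared many times, as in a sort: the locale-dependent work is done once per string, and each later comparison is a plain byte comparison.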

But representation of alphabets can be even more complex. For example, is a character upper case, lower case, or neither? In a sorted list, where do you put the names that begin with accented letters? What about Cyrillic names? How are wide-character strings represented on byte streams? Standards bodies and corporate labs are addressing these issues, but the results are not yet portable. For the time being, Tools.h++ strives to make best use of what they provide.

Localizing Messages

To accommodate a user's language, a program must display titles, menu choices, and status messages in that language. Usually such text is stored in a message catalog or resource file, separate from program code, so it may be easily edited or replaced. Tools.h++ does not display titles or menus directly, but does return status messages when errors occur. By default, Tools.h++ makes no attempt to localize these messages. Instead, it provides an optional facility that allows error messages to be retrieved from your own catalog.

The facility can be used in one of four modes:

Mode              Define
No messaging      RW_NOMSG
Use catgets()     RW_CATGETS
Use gettext()     RW_GETTEXT
Use dgettext()    RW_DGETTEXT

These localization techniques and their documentation are specific to your platform. Once you discover what your system provides, you specify that mode for Tools.h++ by setting the appropriate switch in <rw/compiler.h> before compiling the library. If you have object code, this choice has already been made for you.

Function catgets() uses both a message set number and a message number within that set to look up a localized version of a message. The number for the message set to use is defined in the macro RW_MESSAGE_SET_NUMBER found in <rw/compiler.h>. Function gettext() uses the message itself. The messages and their respective message numbers are given in Appendix C.
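The lookup by set number and message number can be pictured with a small in-memory stand-in. This is only a sketch: the class MessageCatalog is hypothetical, it does not read real catalog files, and the message numbers in the usage note are invented for illustration rather than taken from Appendix C. It mirrors one property of catgets(): when no entry is found, the caller's default text is returned.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical in-memory catalog keyed by (set number, message number).
class MessageCatalog {
public:
    void add(int set, int number, const std::string& text) {
        assert(!text.empty());  // an empty translation is an error in this sketch
        messages_[std::make_pair(set, number)] = text;
    }

    // Returns the localized text, or the fallback when no entry exists.
    std::string lookup(int set, int number, const std::string& fallback) const {
        auto it = messages_.find(std::make_pair(set, number));
        return it == messages_.end() ? fallback : it->second;
    }

private:
    std::map<std::pair<int, int>, std::string> messages_;
};
```

For example, after catalog.add(1, 3, "Fichier introuvable"), the call catalog.lookup(1, 3, "File not found") returns the French text, while a lookup for a message number that was never added returns the English default unchanged.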

You will find information on using catgets(), gettext(), and dgettext() in the documentation that comes with your compiler.

Challenges of Localizing Currencies, Numbers, Dates, and Times

If you write applications for cultures other than your own, you will soon confront the challenges of representing currencies, numbers, dates, and times. Currencies vary in both unit value and notation. Numbers are written differently; for example, Europe and the United States use periods and commas in opposite ways. Often a program must display values in notations customary to both vendor and customer.

Scheduling, a common software function, involves time and calendar calculations. Local versions of the Gregorian calendar use different names for days of the week and months, and different ordering for the components of a date. Time may be represented according to a 12- or 24-hour clock, and further complicated by time zone conventions, like daylight-saving time (DST), that vary from place to place, or even year to year.

The Standard C Library provides <locale.h> to accommodate some of these different formats, but it is incomplete. It offers no help for conversion from strings to these types, and almost no help for conversions involving two or more locales. Common time zone facilities, such as those defined in POSIX.1 (see the Appendix), are similarly limited, usually offering no way to compute wall clock time for other locations, or even for the following year in the same location.

RWLocale and RWZone

Tools.h++ addresses these problems with the abstract classes RWLocale and RWZone. If you have used RWDate, you have already used RWLocale, perhaps unknowingly. Every time you convert a date or time to or from a string, a default argument carries along an RWLocale reference. Unless you change it, this is a reference to a global instance of a class derived from RWLocale, constructed at program startup to provide the effect of a C locale. To use RWLocale explicitly, you can construct your own instance and pass it in place of the default. Similarly, when you manipulate times, you can substitute your own instance for the default RWZone reference.

You can also install your own instance of RWLocale or RWZone as the global default. Many streams even allow you to install your RWLocale instance in the stream so that dates and times transferred on and off that stream are formatted or parsed accordingly, without any special arguments. This is called imbuing the stream, a process described in more detail in the next section.

In the following sections, let us look at some examples of how to localize various data using RWLocale and RWZone. Let us begin by constructing a date, today's date:

RWDate today = RWDate::now();

We can display it the usual way using ordinary C-locale conventions:

cout << today << endl;

But what if you're outside your home locale? Or perhaps you have set your environment variable LANG to fr[26], because you want French formatting. To display the date in your preferred format, you construct an RWLocale object:

RWLocale& here = *new RWLocaleSnapshot("");

Class RWLocaleSnapshot is the main implementation of the interface defined by RWLocale. It extracts the information it needs from the global environment during construction with the help of such Standard C Library functions as strftime() and localeconv(). The most straightforward way to use RWLocaleSnapshot is to pass it directly to the RWDate member function asString()[27]:

cout << today.asString('x', here) << endl;

There is, however, a more convenient way. You can install here as the global default locale so the insertion operator will use it:

RWLocale::global(&here);
cout << today << endl;
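The idea of extracting locale information once, at construction, can be sketched with the same Standard C Library functions named above. The struct CLocaleSnapshot below is hypothetical and far smaller than RWLocaleSnapshot; how the real class gathers its data internally is not shown in this chapter. The sketch never calls setlocale(), so it captures the default "C" locale.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <ctime>
#include <string>
#include <vector>

// Copies locale data out at construction, so later setlocale() calls
// do not change what this object reports.
struct CLocaleSnapshot {
    std::string decimalPoint;
    std::string thousandsSep;
    std::vector<std::string> weekdayNames;  // index 0 is Sunday

    CLocaleSnapshot() {
        const std::lconv* lc = std::localeconv();
        decimalPoint = lc->decimal_point;
        thousandsSep = lc->thousands_sep;
        for (int day = 0; day < 7; ++day) {
            std::tm probe = std::tm();
            probe.tm_wday = day;  // %A reads only tm_wday
            char buf[64];
            std::size_t n = std::strftime(buf, sizeof buf, "%A", &probe);
            assert(n > 0);
            weekdayNames.push_back(std::string(buf, n));
        }
    }
};
```

Calling setlocale(LC_ALL, "") before constructing the snapshot would capture the user's environment instead, which corresponds to passing the empty string to RWLocaleSnapshot.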

Dates

Now suppose you are American and want to format a date in German, but don't want German to be the default. Construct a German locale:

RWLocale& german = *new RWLocaleSnapshot("de");  // see footnote 26

You can format the same date for both local and German readers as follows:

cout << today << endl
     << today.asString('x', german) << endl;

See the definition of x in the entry for RWLocale in the Class Reference.

Would you like to read in a German date string? Again, the straightforward way is to call everything explicitly:

RWCString str;
cout << "enter a date in German: " << flush;
str.readLine(cin);
today = RWDate(str, german);
if (today.isValid())
   cout << today << endl;
 

Sometimes, however, you would prefer to use the extraction operator >>. Since the operator must expect a German-formatted date, and know how to parse it, you pass this information along by imbuing a stream with the German locale.

The following code snippet imbues the stream cin with the German locale, reads in and converts a date string from German, and displays it in the local format:

german.imbue(cin);
cout << "enter a date in German: " << flush;
cin >> today;  // read a German date!
if (today.isValid())
  cout << today << endl;

Imbuing is useful when many values must be inserted or extracted according to a particular locale, or when there is no way to pass a locale argument to the point where it will be needed. By using the static member function RWLocale::of(ios&), your code can discover the locale imbued in a stream. If the stream has not yet been imbued, of() returns the current global locale.[28]
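Standard C++ streams, which postdate the iostream library this chapter was written against, offer a comparable mechanism: imbue() attaches a std::locale to a stream and getloc() reports it back, much as RWLocale::of() does. The sketch below uses only that standard facility, not Tools.h++. The function name parseDottedDate and the fixed day.month.year pattern are assumptions made for illustration.

```cpp
#include <cassert>
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Extracts a day.month.year date through a stream imbued with loc.
// Returns false if the text does not match the pattern.
bool parseDottedDate(const std::string& text, const std::locale& loc,
                     std::tm& out) {
    std::istringstream in(text);
    in.imbue(loc);               // the stream now carries the locale
    assert(in.getloc() == loc);  // and can report it back
    out = std::tm();
    in >> std::get_time(&out, "%d.%m.%Y");
    return !in.fail();
}
```

Because the locale travels with the stream, the extraction code itself needs no locale argument, which is the situation imbuing is meant for.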

The interface defined by RWLocale handles more than dates. It can also convert times, numbers, and monetary values to and from strings. Each has its complications. Time conversions are complicated by the need to identify the time zone of the person who entered the time string, or the person who will read it. The mishmash of daylight-saving time jurisdictions can magnify the difficulty. Numbers are somewhat messy to format because their insertion and extraction operators (<< and >>) are already defined by <iostream.h>. For money, the main problem is that there is no standard internal representation for monetary values. Fortunately, none of these problems is overwhelming with Tools.h++.

Time

Let us first consider the time zone problem. We can easily see that there is no simple relationship between time zones and locales. All of Switzerland shares a single time zone, including daylight-saving time (DST) rules, but has four official languages: French, German, Italian, and Romansch. On the other hand, Hawaii and New York share a common language, but occupy time zones five hours apart—sometimes six hours apart, because Hawaii does not observe DST. Furthermore, time zone formulas have little to do with cultural formatting preferences. For these reasons, Tools.h++ uses a separate time zone object, rather than letting RWLocale subsume time zone responsibilities.

In Tools.h++, the class RWZone encapsulates knowledge about time zones. It is an abstract class, with an interface implemented in the class RWZoneSimple. Three instances of RWZoneSimple are constructed at startup to represent local wall clock time, local Standard time, and Universal time (GMT). Local wall clock time includes any DST in use. Whenever you convert an absolute time to or from a string, as in the class RWTime, an instance of RWZone is involved. By default, the local time is assumed, but you can pass a reference to any RWZone instance.

It's time for some examples! Imagine you had scheduled a trip from New York to Paris. You were to leave New York on December 20, 1993, at 11:00 p.m., and return on March 30, 1994, leaving Paris at 5:00 a.m., Paris time. What will the clocks show at your destination when you arrive?

First, let's construct the time zones and the departure times:

RWZoneSimple newYorkZone(RWZone::USEastern, RWZone::NoAm);
RWZoneSimple parisZone  (RWZone::Europe,    RWZone::WeEu);
RWTime leaveNewYork(RWDate(20, 12, 1993), 23,00,00, newYorkZone);
RWTime leaveParis  (RWDate(30,  3, 1994), 05,00,00, parisZone);

The flight is about seven hours long each way, so:

RWTime arriveParis  (leaveNewYork + long(7 * 3600));
RWTime arriveNewYork(leaveParis   + long(7 * 3600));

Now let's display the arrival times and dates according to their respective local conventions, French in Paris and American English in New York:

RWLocaleSnapshot french("fr");      // or vendor specific
cout << "Arrive' au Paris a`"
     << arriveParis.asString('c', parisZone, french)
     << ", heure local." << endl;
cout << "Arrive in New York at "
     << arriveNewYork.asString('c', newYorkZone)
     << ", local time." << endl;

The code works even though your flight crosses several time zones and arrives on a different day than it departed; even though, on the day of the return trip in the following year, France has already begun observing DST, but the U.S. has not. None of these details is visible in the example code above—they are handled silently and invisibly by RWTime and RWZone.
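To check the arithmetic by hand, the following sketch works with fixed offsets from Universal time instead of RWZone objects. The offsets are assumptions stated for this example: New York at 5 hours behind on both dates, Paris at 1 hour ahead in December and 2 hours ahead on March 30, 1994, when France was already observing DST. The names LocalClock and arrivalClock are hypothetical.

```cpp
#include <cassert>

struct LocalClock {
    int dayOffset;  // days after the departure date
    int hour;
    int minute;
};

// Offsets are minutes east of Universal time (negative for west).
LocalClock arrivalClock(int departHour, int departMinute,
                        int departUtcOffsetMin, int flightMin,
                        int arriveUtcOffsetMin) {
    assert(flightMin >= 0);
    int total = departHour * 60 + departMinute  // local departure time
              - departUtcOffsetMin              // converted to Universal time
              + flightMin                       // time in the air
              + arriveUtcOffsetMin;             // converted to destination time
    int day = total / 1440;
    int rem = total % 1440;
    if (rem < 0) {  // normalize when the result falls on an earlier day
        rem += 1440;
        --day;
    }
    return LocalClock{day, rem / 60, rem % 60};
}
```

Under these assumptions, arrivalClock(23, 0, -300, 420, 60) gives noon on the day after departure in Paris, and arrivalClock(5, 0, 120, 420, -300) gives 5:00 a.m. on the same day in New York. RWTime and RWZone spare you from looking up and hard-coding such offsets.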

All this is easy for places that follow Tools.h++ built-in DST rules for North America, Western Europe, and "no DST". But what about places that follow other rules, such as Argentina, where spring begins in September and summer ends in March? RWZoneSimple is table-driven; if the rule is simple enough, you can construct your own table of type RWDaylightRule, and specify it as you construct an RWZoneSimple. For example, imagine that DST begins at 2 a.m. on the last Sunday in September, and ends the first Sunday in March. Simply create a static instance of RWDaylightRule:

static RWDaylightRule sudAmerica =
   { 0, 0, TRUE, {8, 4, 0, 120}, {2, 0, 0, 120}};

(See the documentation for RWZoneSimple for details on what the numbers mean.) Then construct an RWZone object:

RWZoneSimple  ciudadSud( RWZone::Atlantic, &sudAmerica );

Now you can use ciudadSud just like you used parisZone or newYorkZone above.
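A rule such as "the last Sunday in September" must eventually be turned into a calendar date for each year. The sketch below shows one way to do that calculation in plain C++. It is not how RWZoneSimple is implemented, the function names are hypothetical, and months are numbered 1 to 12 here, which is a choice of this sketch and not necessarily the convention RWDaylightRule uses.

```cpp
#include <cassert>

// Day of week for a Gregorian date: 0 = Sunday ... 6 = Saturday
// (Sakamoto's method; month is 1-12).
int dayOfWeek(int year, int month, int day) {
    static const int offsets[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    assert(month >= 1 && month <= 12);
    if (month < 3) year -= 1;
    return (year + year / 4 - year / 100 + year / 400
            + offsets[month - 1] + day) % 7;
}

int daysInMonth(int year, int month) {
    static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    bool leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    return (month == 2 && leap) ? 29 : days[month - 1];
}

// Day of the month on which the given weekday first occurs.
int firstWeekdayOfMonth(int year, int month, int weekday) {
    int first = dayOfWeek(year, month, 1);
    return 1 + (weekday - first + 7) % 7;
}

// Day of the month on which the given weekday last occurs.
int lastWeekdayOfMonth(int year, int month, int weekday) {
    int length = daysInMonth(year, month);
    int last = dayOfWeek(year, month, length);
    return length - (last - weekday + 7) % 7;
}
```

For 1994, lastWeekdayOfMonth(1994, 9, 0) yields 25 and firstWeekdayOfMonth(1994, 3, 0) yields 6, so the imagined rule would start DST on September 25 and end it on March 6.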

But what about places where the DST rules are too complicated to describe with a simple table, such as Great Britain? There, DST begins on the morning after the third Saturday in April, unless that is Easter, in which case it begins the week prior! For such jurisdictions, you might best use standard time, properly labeled. If that just won't do, you can derive from RWZone and implement its interface for Britain alone. This strategy is much easier than trying to generalize a case to handle all possibilities including Britain, and it's smaller and faster besides.

The last time problem we will discuss here is that there is no standard way to discover what DST rules are in force for any particular place. In this the Standard C Library is no help; you must get the information you need from the local environment your application is running on, perhaps by asking the user.

One example of this problem is that the local wall clock time RWZone instance is constructed to use North American DST rules, if DST is observed at all. If the user is not in North America, the default local time zone probably performs DST conversions wrong, and you must replace it. If you are a user in Paris, for example, you could solve this problem as follows:

RWZone::local(new RWZoneSimple(RWZone::Europe, RWZone::WeEu));

If you look closely into <rw/locale.h>, you will find that RWDate and RWTime are never mentioned. Instead, RWLocale uses the Standard C Library type struct tm. RWDate and RWTime both provide conversions to this type, and you may prefer using it directly rather than using RWTime::asString(). For example, suppose you must write out a time string containing only hours and minutes; e.g., 12:33. The standard formats defined for strftime() and implemented by RWLocale don't include that option, but you can fake it. Here's one way:

RWTime now = RWTime::now();
cout << now.hour() << ":" << now.minute() << endl;

Without using various manipulators, this code might produce a string like 9:5. Here's another option:

RWTime now = RWTime::now();
cout << now.asString('H') << ":" << now.asString('M') << endl;

This produces 09:05.

In each of the previous examples, now is disassembled into component parts twice, once to extract the hour and once to extract the minute. This is an expensive operation. If you expect to work often with the components of a time or date, you may be better off disassembling the time only once:

RWTime now = RWTime::now();
struct tm tmbuf;
now.extract(&tmbuf);
const RWLocale& here = RWLocale::global();       // the default
                                                 // global locale
cout << here.asString(&tmbuf, 'H') << ":"
     << here.asString(&tmbuf, 'M') << endl;

Please note that if you work with years before 1901 or after 2037, you can't use RWTime because it does not have the required range.[29] You can use RWLocale to perform conversions for any time or date because struct tm operations are not so restricted.
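Because struct tm is an ordinary Standard C Library type, you can also hand it to strftime() yourself. The following sketch, with the hypothetical helper formatTm, formats the hours-and-minutes case discussed above in one call, and shows that a year outside the RWTime range is no obstacle at this level. It uses the current C locale rather than an RWLocale.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Formats already-disassembled time components with a strftime() pattern.
std::string formatTm(const std::tm& parts, const char* format) {
    char buf[64];
    std::size_t n = std::strftime(buf, sizeof buf, format, &parts);
    assert(n > 0);  // in this sketch, an empty result counts as an error
    return std::string(buf, n);
}
```

With tm_hour set to 9 and tm_min set to 5, formatTm(parts, "%H:%M") returns 09:05. With tm_year set to 150, that is, 150 years after 1900, formatTm(parts, "%Y") returns 2050.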

Numbers

RWLocale also provides you with an interface for conversions between strings and numbers, both integers and floating point values. RWLocaleSnapshot implements this interface, providing the full range of capabilities defined by the Standard C Library type struct lconv. The capabilities include using appropriate digit group separators, decimal "point", and currency notation. When converting from strings, RWLocaleSnapshot allows and checks the same digit group separators.

Unfortunately, stream operations of this class are clumsier than we might like, since the standard iostream library provides definitions for number insertion and extraction operators which cannot be overridden. Instead, we can use RWCString functions directly:

RWLocaleSnapshot french("fr");
double f = 1234567.89;
long i = 987654;
RWCString fs = french.asString(f, 2);
RWCString is = french.asString(i);
if (french.stringToNum(fs, &f) &&
   french.stringToNum(is, &i))           // verify conversion
  cout << "C:\t" << f << "\t" << i << endl
       << "French:\t" << fs << "\t" << is << endl;

Since the French use periods for digit group separators, and commas to separate the integer from the fraction part of a number, this code might display as:

C:      1.23457e+06      987654
French: 1.234.567,89     987.654

You will notice that numbers with digit group separators are easier to read.
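For comparison, Standard C++ streams that postdate this chapter let a locale facet control how the built-in number inserters format their output. The sketch below is not Tools.h++ code; the facet DotGroupedPunct and the function formatGrouped are hypothetical, and the separators are hard-coded to match the example above rather than read from any installed locale.

```cpp
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: '.' groups digits in threes, ',' marks the fraction.
class DotGroupedPunct : public std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return '.'; }
    char do_decimal_point() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats value in fixed notation with the given number of decimal places.
std::string formatGrouped(double value, int places) {
    assert(places >= 0);
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new DotGroupedPunct));
    out << std::fixed << std::setprecision(places) << value;
    return out.str();
}
```

Here formatGrouped(1234567.89, 2) produces 1.234.567,89 and formatGrouped(987654, 0) produces 987.654, the same strings as the French column above.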

Currency

Currency conversions are trickier than number conversions, mainly because there is no standard way to represent monetary values in a computer. We have adopted the convention that such values represent an integral number of the smallest unit of currency in use. For example, to represent a balance of $10.00 in the United States, you could say:

double sawbuck = 1000.;

This representation has the advantages of wide range, exactness, and portability. Wide range means you can exactly represent values from $0.00 up to and beyond $10,000,000,000,000.00—larger than any likely budget. Exactness means that, representing monetary values without fractional parts, you can perform arithmetic on them and compare the results for equality:

double price = 999.;                                     // $9.99
double penny = 1.;                                       //  $.01
assert(price + penny == sawbuck);

This comparison is not reliable if the values are represented naively, as in price = 9.99;, because most decimal fractions have no exact binary floating-point representation.
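The difference is easy to demonstrate, assuming the usual IEEE 754 double format. In the sketch below, sumsExactly and toMinorUnits are hypothetical helpers, and 100 minor units per major unit is an assumption that fits dollars and cents but not every currency.

```cpp
#include <cassert>
#include <cmath>

// True if a + b compares equal to expected, with no tolerance.
bool sumsExactly(double a, double b, double expected) {
    double sum = a + b;
    return sum == expected;
}

// Converts an amount in major units (dollars) to whole minor units (cents),
// rounding to the nearest cent.
double toMinorUnits(double amount) {
    assert(std::isfinite(amount));
    return std::round(amount * 100.0);
}
```

Ten cents plus twenty cents held as 10. and 20. sum exactly to 30., so sumsExactly(10., 20., 30.) is true. The same amounts held as fractions of a dollar do not: sumsExactly(0.1, 0.2, 0.3) is false, because none of the three values is exactly representable in binary.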

Portability means simply that double is a standard type, unlike common 64-bit integer or BCD representations. Of course, you can perform financial calculations on such other representations, but because you can always convert between them and double, they are all supported. In the future, RWLocale may directly support some other common representations as well.

Consider the following examples of currency conversions:

const RWLocale& here = RWLocale::global();
double sawbuck = 1000.;
RWCString tenNone  = here.moneyAsString(sawbuck, RWLocale::NONE);
RWCString tenLocal = here.moneyAsString(sawbuck, RWLocale::LOCAL);
RWCString tenIntl  = here.moneyAsString(sawbuck, RWLocale::INTL);
if (here.stringToMoney(tenNone,  &sawbuck) &&
    here.stringToMoney(tenLocal, &sawbuck) &&
    here.stringToMoney(tenIntl,  &sawbuck))   // verify conversion
  cout << sawbuck  << "  " << tenNone << "  "
       << tenLocal << "  " << tenIntl << "  " << endl;

In a United States locale, the code would display as:

1000  10.00  $10.00  USD 10.00
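The step from an integral number of cents to a printable string can be sketched without a locale at all. The function formatMinorUnits below is hypothetical and much simpler than moneyAsString(): it assumes 100 minor units per major unit, always uses a period before the fraction, and takes the currency notation as a plain prefix supplied by the caller.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Formats a non-negative, integral count of minor units (such as cents)
// as prefix + major.minor with two decimal places.
std::string formatMinorUnits(double minorUnits, const std::string& prefix) {
    assert(minorUnits >= 0 && minorUnits == std::floor(minorUnits));
    long long whole = static_cast<long long>(minorUnits);
    char buf[64];
    std::snprintf(buf, sizeof buf, "%lld.%02lld", whole / 100, whole % 100);
    return prefix + buf;
}
```

With sawbuck equal to 1000., the prefixes "", "$", and "USD " give 10.00, $10.00, and USD 10.00 respectively.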

A Note on Setting Environment Variables

As mentioned in the section on class RWTime, some compilers and operating systems, including the Windows operating systems, require you to set certain environment variables in order for a locale feature to work. Failure to do this can lead to great difficulties.

If you use Borland, MetaWare, Microsoft, Symantec, or Watcom, you must set your environment variable TZ to the appropriate time zone:

set TZ=PST8PDT

Check the documentation for your compiler and operating system for information on setting environment variables.



[26] Despite the existing standard for locale names, many vendors provide variant naming schemes. Check your vendor's documentation for details.

[27] The first argument of the function asString() is a character, which may be any of the format options supported by the Standard C Library function strftime().

[28] You can restore a stream to its unimbued condition with the static member function RWLocale::unimbue(ios&); note that this is not the same as imbuing it with the current global locale.

[29] Of course, if you are working on a 64-bit system, there is no practical upper limit to the dates you can use.