The Standard Facets


C++ Report, November/December 1997
Klaus Kreft & Angelika Langer


 
 

Internationalization means building into a program the potential for worldwide use. Nowadays, it is a common task in almost every product development process. Internationalization is supported in various forms by programming languages, operating systems, and development tools. Traditionally, internationalization is done by means of the standard C library or equivalent C APIs such as the Win32 NLSAPI on Microsoft platforms or the X/Open NLS support on Unix platforms. Naturally, the C++ standards committee did not want to stand back and included internationalization support in the standard C++ library: a locale class was added and its use was demonstrated by internationalizing the standard iostreams. In our last contribution to this column (see /1/) we discussed the architecture of standard locales. Here is a brief recap:

The standard C++ library provides an extensible framework for support of internationalization. Its main elements are locales and facets . A locale is a class that represents a container of facets; a facet is a class that contains information and provides functionality related to a certain aspect of internationalization. Access to a facet that is contained in a locale is via a template function called use_facet<facet>(loc) . The template argument facet is a facet class, and the function argument loc is a locale object; returned is a constant reference to the object of class facet contained in the locale.
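As a minimal sketch of this access pattern, the helper below fetches the ctype facet from a locale and uses it to convert one character. The function name to_upper_in is our own choice for illustration and not part of the library.

```cpp
#include <locale>

// Hypothetical helper: converts c to upper case according to the
// ctype facet contained in the locale loc.
char to_upper_in(const std::locale& loc, char c)
{
    // use_facet returns a constant reference to the facet object in loc.
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    return ct.toupper(c);
}
```

Called with std::locale::classic(), the helper applies the classic "C" rules, so a lower case ASCII letter is mapped to its upper case counterpart and any other character is returned unchanged.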

Last time we described the locale framework’s architecture in detail and discussed the design of the locale and facet classes. The standard library does not only provide the locale framework, but also contains a number of facet classes. In this article we explain which facets the standard already provides and what functionality they have. In a subsequent article we will demonstrate how one can use the locale framework to build and integrate a new user-defined, special-purpose facet.

Before we delve into the details of a certain standard facet, please have a look at the overview of the internationalization aspects the committee found important enough to be standardized. They are summarized in Table 1 .
 
Facet
    Functionality

ctype<charT>, ctype_byname<charT>
    character classification and conversion
collate<charT>, collate_byname<charT>
    string collation
codecvt<internT,externT,stateT>, codecvt_byname<internT,externT,stateT>
    code conversion
messages<charT>, messages_byname<charT>
    retrieving localized message strings from message catalogs
numpunct<charT>, numpunct_byname<charT>
    information about the format and punctuation of numeric and Boolean expressions
num_get<charT,InputIterator>
    parsing of character sequences that represent a numeric or Boolean value
num_put<charT,OutputIterator>
    generation of formatted character sequences that represent a numeric or Boolean value
moneypunct<charT,Inter>, moneypunct_byname<charT,Inter>
    information about the format and punctuation of monetary expressions
money_get<charT,InputIterator>
    parsing of character sequences that represent a monetary value
money_put<charT,OutputIterator>
    generation of formatted character sequences that represent a monetary value
time_get<charT,InputIterator>, time_get_byname<charT,InputIterator>
    parsing of character sequences that represent a date and/or time
time_put<charT,OutputIterator>, time_put_byname<charT,OutputIterator>
    generation of formatted character sequences that represent a date and/or time

Table 1: Standard Facets

Now let us see how these facets help to cope with cultural differences. The following sections discuss problem areas related to differences in language and alphabet and tasks concerning culture-dependent representations of numbers, monetary amounts, date and time. We take a look at the problem domain first and then describe how the standard facets address these problems.

Language and Alphabet.

Different ethnic groups use different languages . Hence the language is one of the most apparent differences between cultures. Even within a single country people might prefer different languages. The Swiss for example use French, Italian, and German.

Languages also differ in the alphabet they use. Here are a couple of examples of languages and their respective alphabets:
 
US English:  a-z, A-Z, and punctuation 
German:  a-z, A-Z, punctuation, and äöü ÄÖÜ ß
Greek:  α-ω, Α-Ω, and punctuation
Japanese:  US English characters, tens of thousands of Kanji characters, Hiragana, and Katakana

We want to spare you the details about character encodings and codesets that can be used to represent different alphabets. However, we want to discuss, at least briefly, the different possibilities to represent alphabets with a large number of characters. There are two possible approaches for encoding large alphabets: character encodings that mix characters of different size ( multibyte character encodings ); or character encodings where all characters are of the same size ( wide character encodings ). It is common practice when handling large alphabets to use wide character encodings inside the program and multibyte character encodings outside on the external device.

  • The internal character set inside the program has to allow fast and arbitrary access to each character in a sequence. This is a functionality that comes with wide character encodings.
  • The external character set is used for storing text data in a file or any other kind of external device. The main purpose is to keep the file size small. This is a functionality typically provided by multibyte character encodings.
A typical example is computer software for the Japanese market. A "Japanese" program might want to handle multibyte text files, encoded in JIS (= Japanese Industrial Standard) for instance. The program would internally use a wide character encoding, such as Unicode. Hence the program would need to convert between the Unicode and the JIS encoding whenever it performs an input or output operation. The interpretation of some multibyte encodings is relatively complicated, because their encoding is state dependent. JIS is an example; it uses special character sequences, so-called escape sequences, that switch between one- and two-byte modes as well as between different character sets. For this reason not only the encoding but also the current conversion state is needed for determining a character representation.

Different languages have different rules for sorting characters and words. These rules are called the collating sequence. The collating sequence specifies the ordering of individual characters and other rules for ordering. In software development the order of characters is often determined by the numeric value of the byte(s) representing a character. This is what we call ASCII rules in the examples below. This kind of ordering does not meet the requirements of any language's dictionary sorting. Here is an example of ASCII collation compared to language dictionary sorting:
 
English rules  ASCII rules
alien  American
American  Zulu
zebra  alien
Zulu  zebra

In an ASCII encoding the numerical values of upper case letters are smaller than the values of lower case letters. For this reason, all words with capital letters appear at the beginning of a list sorted according to ASCII rules.

In some languages certain groups of characters are clustered and treated like a single character for the purpose of sorting characters. In other languages it is the other way round; one character is treated as if it were actually two characters. Here is an example for one character treated as two:
 
German dictionary rules  ASCII rules
Muße  Musselin
Musselin  Muße

The German character ß, called sharp s, is treated as if it were two characters, namely ss.

Character classification and conversion.

Let's start our detailed examination of the standard facets with the ctype facet, defined by template <class charT> class ctype . Among other services, it provides the functionality to classify the characters of a character set. Criteria for this classification are provided as an enumerated bit set type, which is called mask. It is a nested type in ctype_base, the public base class of the ctype facet template. The values of mask and their semantics are listed in Table 2 .
 
 

mask value semantics
alpha alphabetic character
digit one of the characters that represent the decimal digits 0 - 9
cntrl control character
lower lower case character
print printable character
punct punctuation character
space white space character
upper upper case character
xdigit one of the characters that represent the hexadecimal digits 0 - f
alnum alpha or digit
graph alnum or punct

Table 2: Character Classification Criteria

Member functions of the ctype facet provide the functionality

  • to check if a certain character conforms to a certain criterion,
  • to determine all criteria each character from a range of characters conforms to,
  • to find the first element in a range of characters that conforms or does not conform to a certain criterion.
In addition to character classification the ctype facet provides means for character conversion. One type of character conversion is supported by the overloaded member functions toupper() and tolower(). They allow conversion of single characters or character ranges to their corresponding upper or lower case representation.

Another type of conversion that ctype supports is the conversion between ctype’s template character type charT and the built-in character type char . This functionality is provided by the member functions narrow() and widen(). For each function two overloaded versions exist; one that converts single characters and one that converts character ranges.

For efficiency reasons the standard requires that ctype<char> must be provided as a template specialization. Its implementation must be based on a table where the character encoding is the key and the value is a bit mask value of type ctype_base::mask. The bit mask values indicate all criteria to which the character conforms. For example, a lower case letter such as 'k' is associated with the bit mask value:

ctype_base::alpha | ctype_base::lower | ctype_base::print .

This table driven approach allows most of the member functions to be implemented as simple and efficient bit operations.
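The classification and conversion services of ctype can be sketched as follows. The helper names (count_digits, first_space, upper_copy, widen_copy) are hypothetical and chosen only for this illustration; the member functions they call, is(), scan_is(), toupper() and widen(), are those of the ctype facet.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Counts the characters in s that the locale classifies as digits.
std::size_t count_digits(const std::locale& loc, const std::string& s)
{
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (ct.is(std::ctype_base::digit, s[i]))   // single-character check
            ++n;
    return n;
}

// Returns the offset of the first white space character, or s.size().
std::size_t first_space(const std::locale& loc, const std::string& s)
{
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    const char* begin = s.data();
    const char* end = begin + s.size();
    // scan_is finds the first character in a range that matches the mask.
    return static_cast<std::size_t>(
        ct.scan_is(std::ctype_base::space, begin, end) - begin);
}

// Returns a copy of s converted to upper case (range version of toupper).
std::string upper_copy(const std::locale& loc, std::string s)
{
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    if (!s.empty())
        ct.toupper(&s[0], &s[0] + s.size());
    return s;
}

// Converts a char sequence to wchar_t using the range version of widen().
std::wstring widen_copy(const std::locale& loc, const std::string& s)
{
    const std::ctype<wchar_t>& wct = std::use_facet<std::ctype<wchar_t> >(loc);
    std::wstring w(s.size(), L'\0');
    if (!s.empty())
        wct.widen(s.data(), s.data() + s.size(), &w[0]);
    return w;
}
```

All four helpers take the locale as an argument, so the same code yields culture-dependent results once a locale other than the classic one is passed in.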
 
 

String collation.

The collate facet, defined by template <class charT> class collate , supports the comparison of strings according to language specific rules. Its member function:

int compare(const charT* low1, const charT* high1,
const charT* low2, const charT* high2) const;

returns an integer value that indicates the order of two character sequences [low1, high1) and [low2, high2) respectively:

1, indicates that the character sequence [low1,high1) is greater than the character sequence [low2,high2),

-1, indicates that the first sequence is less than the second, and

0, indicates that both sequences are equal.

The collate facet also supports the functionality to determine a hash value from a character sequence and a way to speed up the comparison of one character sequence against many others.

Note that the standard string operations are not internationalized. A string compare operation in class basic_string is a character-by-character comparison, which is not sufficient, for instance, when a single character must be interpreted as two characters, as is required in some languages. The consequence is that internationalized programs cannot use the respective member functions of the basic_string class template for comparing strings; the functionality of the collate facet is needed instead.
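A sketch of how the collate facet might be used in place of basic_string's own comparison is given below. The helper names compare_in and sorted_in are our own; the second helper relies on the fact that a locale object can be passed as a comparison predicate, because locale::operator() compares two strings through the locale's collate facet.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Compares a and b according to the collate facet of loc.
// The result is negative, zero, or positive (the facet returns -1, 0, or 1).
int compare_in(const std::locale& loc,
               const std::string& a, const std::string& b)
{
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}

// Returns the words sorted according to the collating rules of loc.
std::vector<std::string> sorted_in(const std::locale& loc,
                                   std::vector<std::string> words)
{
    // The locale itself serves as the "less than" predicate.
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

With the classic locale the result follows the ASCII rules from the table in the section on language and alphabet: American, Zulu, alien, zebra. A locale for English would be expected to produce dictionary order instead.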
 
 

Code conversion.

The codecvt facet, defined by template <class internT, class externT , class stateT> class codecvt , supports conversion between two character codesets. This is needed when the internal and the external character set of a program differ. The template parameters are:

  • internT , which is the character type, that is associated with the internal code set;
  • externT , which is the character type, that is associated with the external code set;
  • stateT , which is the state type, that is capable of holding the conversion state. It must be maintained during a conversion from external to the internal character set and vice versa.
The codecvt facet contains two types of member functions: those that provide information about the code conversion, and those that perform the conversion. An example of the first category is encoding(), which indicates whether a conversion is state dependent, whether the ratio between external characters consumed and internal characters produced is fixed, and, if it is fixed, what it is. Other operations of this category indicate if conversion is necessary at all, or they help to determine the length of the character sequence resulting from a certain input character sequence.

The member function in() is used for the conversion from the external to the internal character set, out() for the conversion from the internal to the external. Both functions take an input character sequence and convert it to an output character sequence. For state dependent conversions they also maintain the conversion state.
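The following sketch shows one possible way to call in() and out() on the codecvt<wchar_t,char,mbstate_t> facet of a locale: in() turns external char data into internal wchar_t data, and out() goes the other way. The helper names to_internal and to_external are assumptions made for this example, and the error handling is deliberately crude.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;

// External (char) sequence -> internal (wchar_t) sequence, using in().
std::wstring to_internal(const std::locale& loc, const std::string& external)
{
    const Converter& cvt = std::use_facet<Converter>(loc);
    std::mbstate_t state = std::mbstate_t();          // initial conversion state
    std::wstring internal(external.size(), L'\0');    // at most one wchar_t per byte
    if (external.empty())
        return internal;
    const char* from = external.data();
    const char* from_next = from;
    wchar_t* to = &internal[0];
    wchar_t* to_next = to;
    const Converter::result r = cvt.in(state, from, from + external.size(),
                                       from_next, to, to + internal.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed or was incomplete");
    internal.resize(static_cast<std::wstring::size_type>(to_next - to));
    return internal;
}

// Internal (wchar_t) sequence -> external (char) sequence, using out().
// A complete version for state dependent encodings would also call unshift().
std::string to_external(const std::locale& loc, const std::wstring& internal)
{
    const Converter& cvt = std::use_facet<Converter>(loc);
    std::mbstate_t state = std::mbstate_t();
    int per_char = cvt.max_length();                  // max bytes per wchar_t
    if (per_char < 1)
        per_char = 1;
    std::string external(internal.size() * per_char, '\0');
    if (internal.empty())
        return external;
    const wchar_t* from = internal.data();
    const wchar_t* from_next = from;
    char* to = &external[0];
    char* to_next = to;
    const Converter::result r = cvt.out(state, from, from + internal.size(),
                                        from_next, to, to + external.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed or was incomplete");
    external.resize(static_cast<std::string::size_type>(to_next - to));
    return external;
}
```

In the classic locale plain ASCII text should survive the round trip unchanged; for real multibyte encodings a named locale with a matching codecvt facet is needed.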
 
 

Message Catalogs.

The messages facet, defined by template <class charT> class messages , supports the retrieval of user-defined localized messages from message catalogs. Its interface allows a program to open and close a message catalog identified by a catalog name and to retrieve a message from an open catalog.

The upcoming C++-standard describes how message catalogs can be used via the messages facet’s interface. The syntax of message catalogs, as well as the way message catalogs have to be installed and maintained, are beyond the scope of the standard and implementation-specific.
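The open/get/close sequence could look roughly like the sketch below. The function name lookup_message and all of its parameters are hypothetical; which catalog names, set numbers, and message numbers are meaningful depends entirely on the implementation and on the catalogs installed on the system.

```cpp
#include <locale>
#include <string>

// Looks up a message in a catalog and returns fallback if the catalog
// cannot be opened or does not contain the requested message.
std::string lookup_message(const std::locale& loc,
                           const std::string& catalog_name,
                           int set_id, int msg_id,
                           const std::string& fallback)
{
    typedef std::messages<char> Messages;
    const Messages& msgs = std::use_facet<Messages>(loc);

    // open() returns a negative value if the catalog cannot be opened.
    const Messages::catalog cat = msgs.open(catalog_name, loc);
    if (cat < 0)
        return fallback;

    // get() returns the fallback string if the message is not found.
    const std::string text = msgs.get(cat, set_id, msg_id, fallback);
    msgs.close(cat);
    return text;
}
```

Passing a sensible default text as fallback keeps the program usable on systems where the catalog has not been installed.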
 
 

Representation of Numbers, Monetary Amounts, Date and Time.

Numbers are represented according to cultural conventions. For example, the symbol used for separation of the integer portion of a number from the fractional portion, the so-called radix character , can differ from country to country. In American English, this character is a period; in most European countries, it is a comma. Conversely, the symbol that groups numbers with more than three digits, the so-called thousands separator , is a comma in American English, and a period in much of Europe.

Even the grouping of digits varies. In American English, digits are grouped by threes. In Nepal for instance, the first group has three digits, all subsequent groups have two digits.
 
USA: 10,000,000.00
Germany:  10.000.000,00
Nepal:  1,00,00,000.00

Similarly, units of currency are represented in different ways in different cultures. The currency symbol can vary, as can its placement and the format of negative currency values. For example, there are two different ways of representing the same amount in US dollars:
 
domestic:  $ 99.99
international: USD 99.99

Here is an example that shows different cultural conventions for placing the currency symbol:
 
Germany:  49,99 DM
Japan:  ¥ 100

Obviously the representation of time and date depends on cultural conventions. The names and abbreviations for days of the week and months of the year vary with the language. Also, some countries use a 24-hour clock; others use a 12-hour clock. Even calendars differ; they are based on historical, seasonal, and astronomical events. The official Japanese calendar, for instance, is based on a historical event, the beginning of the reign of the current Emperor. Many countries, especially in the Western World, use the Gregorian calendar instead.

Here are examples of representations of the same date in different countries. They differ in the order of day, month, and year, the separators between those items, and the use or omission of items such as the weekday, which is absent from the long form of the Hungarian date.
 
Short Form Long Form
USA: 10/14/97 Tuesday, October 14, 1997
Germany: 14.10.97 Dienstag, 14. Oktober 1997
Italy:  14/10/97 martedì 14 ottobre 1997
Greece: 14/10/1997 Τρίτη, 14 Οκτωβρίου 1997
Hungary: 1997.10.14 1997. október 14.

Numeric and Boolean values.

The localization information and functionality related to numeric and Boolean expressions is handled by three standard facets:

  • numpunct , defined by template<class charT> class numpunct .
  • num_put , defined by template<class charT, class OutputIterator = ostreambuf_iterator<charT> > class num_put.
  • num_get , defined by template<class charT, class InputIterator = istreambuf_iterator<charT> > class num_get.
The numpunct facet contains the information about the format and punctuation of numeric and Boolean expressions. For example, the member functions decimal_point() and thousands_sep() return the characters that represent the radix separator and the thousands separator respectively. Other member functions provide the strings that represent the Boolean values true and false , or a pattern describing how digits are grouped when they form a number.
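As a small sketch, the information held by numpunct can be collected like this. The struct NumericConventions and the function conventions_of are hypothetical names introduced only for the example.

```cpp
#include <locale>
#include <string>

// Hypothetical aggregate holding what numpunct knows about a locale.
struct NumericConventions {
    char decimal_point;        // radix separator
    char thousands_sep;        // separator between digit groups
    std::string grouping;      // group sizes, rightmost group first
    std::string true_name;     // textual form of true
    std::string false_name;    // textual form of false
};

NumericConventions conventions_of(const std::locale& loc)
{
    const std::numpunct<char>& np = std::use_facet<std::numpunct<char> >(loc);
    NumericConventions c;
    c.decimal_point = np.decimal_point();
    c.thousands_sep = np.thousands_sep();
    c.grouping = np.grouping();
    c.true_name = np.truename();
    c.false_name = np.falsename();
    return c;
}
```

For the classic locale this yields a period as radix separator, a comma as thousands separator, an empty grouping (digits are not grouped), and the strings "true" and "false".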

Based on the information contained in the numpunct facet, the facets num_put and num_get provide the functionality to generate a formatted character sequence from a numeric or Boolean value, and the reverse functionality: parsing a character sequence to extract a numeric or Boolean value. num_put does the formatting, num_get the parsing. The second parameter of the num_put and num_get templates, OutputIterator and InputIterator respectively, specifies the iterator type used to access the character sequence. num_put provides overloaded versions of its member function put() for formatting the following types: bool, long, unsigned long, double, long double, void*. num_get provides overloaded versions of get() for storing the extracted value into the types: bool, long, unsigned long, unsigned int, unsigned short, float, double, long double, void*. At first it might look as though some versions, for instance put() for int or float, were missing. But the intention was to keep the interface of the standard library concise, and a value of type int can be handled by the version for long, just as a value of type float can be handled by the version for double.

One of the parameters to put() and get() is a reference to an ios_base object. The format flags contained in this object are used to determine the format specifications for formatting and parsing. The semantics of the flags is the same as in standard iostreams. In fact, the formatting layer of iostreams uses the num_put and num_get facets for its formatting and parsing. An otherwise necessary type conversion is avoided because the iostreams operations pass the format specifications to the facets in form of an ios_base object. The general use of num_put and num_get, however, is not limited by these design decisions; they may well be used in a context other than iostreams. The benefit of the integration of facets into standard iostreams is that i/o-operations for Boolean or numeric values are already internationalized.
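Because iostreams format numbers through num_put, which in turn consults numpunct, replacing the numpunct facet of a stream's locale is enough to change the output. The sketch below assumes C++11 (for override) and uses a hypothetical facet class GroupedPunct and helper format_number to reproduce the German and Nepalese examples shown earlier, without relying on any named locale being installed.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical numpunct facet with configurable punctuation and grouping.
class GroupedPunct : public std::numpunct<char> {
public:
    GroupedPunct(char point, char sep, const std::string& grouping)
        : point_(point), sep_(sep), grouping_(grouping) {}
protected:
    char do_decimal_point() const override { return point_; }
    char do_thousands_sep() const override { return sep_; }
    // Each char is a group size, starting with the rightmost group;
    // the last size is repeated for all remaining groups.
    std::string do_grouping() const override { return grouping_; }
private:
    char point_;
    char sep_;
    std::string grouping_;
};

// Formats value with two fractional digits using the given conventions.
std::string format_number(double value, char point, char sep,
                          const std::string& grouping)
{
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when it is
    // no longer referenced.
    out.imbue(std::locale(std::locale::classic(),
                          new GroupedPunct(point, sep, grouping)));
    out << std::fixed << std::setprecision(2) << value;
    return out.str();
}
```

For example, format_number(10000000.0, ',', '.', "\3") produces the German form 10.000.000,00, and format_number(10000000.0, '.', ',', "\3\2") produces the Nepalese form 1,00,00,000.00.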
 
 

Monetary values.

The localization information for monetary values is organized in a similar fashion to the localization information for numeric values. There is one facet that holds the localization-dependent information. Based on this facet are two facets that provide the functionality for formatting and parsing of character sequences that represent monetary values. The facets are:

  • moneypunct , defined by template<class charT, bool Inter = false> class moneypunct ,
  • money_put , defined by template<class charT, class OutputIterator = ostreambuf_iterator<charT> > class money_put ,
  • money_get , defined by template<class charT, class InputIterator = istreambuf_iterator<charT> > class money_get ,
The value template parameter Inter of type bool for moneypunct indicates if the currency symbol used is the international ( Inter=true ) or the domestic ( Inter=false ) one. money_get and money_put do not have this template parameter, because their member functions have a parameter that indicates at runtime which currency symbol should be used.

Like numpunct, moneypunct's member functions provide information about the grouping of digits and about the characters used as radix separator and as thousands separator. Additionally, moneypunct can tell how many digits are represented after the radix separator, which string forms the currency symbol, and how negative and positive monetary amounts are structured.

money_put contains two overloaded versions of the put() member function. One formats a value of type long double to a representation of a monetary value, the other takes a reference to a basic_string<charT> . money_get's overloaded member function get() does the reverse operation: it parses a character sequence that represents a monetary amount and stores the extracted value in either a long double or a basic_string<charT>.
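The interplay of moneypunct and money_put can be sketched with a user-defined moneypunct facet. The class DeutschmarkPunct and the helper format_money are hypothetical names, C++11 is assumed for override, and every relevant virtual function is overridden because the base class behavior differs between implementations. The amount passed to put() is expressed in the smallest currency unit, and the currency symbol is only written if the showbase flag is set.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical moneypunct facet describing amounts such as "49,99 DM".
class DeutschmarkPunct : public std::moneypunct<char, false> {
protected:
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
    std::string do_curr_symbol() const override { return "DM"; }
    std::string do_positive_sign() const override { return ""; }
    std::string do_negative_sign() const override { return "-"; }
    int do_frac_digits() const override { return 2; }
    pattern do_pos_format() const override
    { return make_pattern(value, space, symbol, sign); }
    pattern do_neg_format() const override
    { return make_pattern(sign, value, space, symbol); }
private:
    static pattern make_pattern(part a, part b, part c, part d)
    {
        pattern p;
        p.field[0] = static_cast<char>(a);
        p.field[1] = static_cast<char>(b);
        p.field[2] = static_cast<char>(c);
        p.field[3] = static_cast<char>(d);
        return p;
    }
};

// Formats an amount given in pfennigs (1/100 DM) as a monetary value.
std::string format_money(long double pfennigs)
{
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new DeutschmarkPunct));
    out.setf(std::ios_base::showbase);   // request the currency symbol
    const std::money_put<char>& mp =
        std::use_facet<std::money_put<char> >(out.getloc());
    // The second argument selects the international (true) or the
    // domestic (false) moneypunct facet at runtime.
    mp.put(std::ostreambuf_iterator<char>(out), false, out, ' ', pfennigs);
    return out.str();
}
```

With this facet, format_money(4999.0L) yields 49,99 DM, matching the German example given earlier.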
 
 

Date and time values.

Two facets handle the localization functionality for date and time:

  • time_get , defined as template<class charT, class InputIterator = istreambuf_iterator<charT> > class time_get , and
  • time_put , defined as template<class charT, class OutputIterator = ostreambuf_iterator<charT> > class time_put .
time_put allows formatting of information provided in a struct tm into a character sequence that represents a date and/or time. The formatting is performed according to format specifiers that have the same semantics as those used for strftime().

time_get provides several member functions that parse a character sequence and store the individual date and time components in a struct tm . Examples of such member functions are get_monthname() and get_weekday() , which extract from the character sequence a value representing a month or a weekday respectively and store it in a struct tm .
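Formatting with time_put and parsing with time_get might be sketched as follows. The helpers make_date, format_time, and parse_weekday are hypothetical names used only for this example.

```cpp
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Builds a struct tm for a calendar date; weekday counts from Sunday = 0.
std::tm make_date(int year, int month, int day, int weekday)
{
    std::tm t = std::tm();
    t.tm_year = year - 1900;
    t.tm_mon = month - 1;
    t.tm_mday = day;
    t.tm_wday = weekday;
    return t;
}

// Formats when according to a strftime-like pattern such as "%Y-%m-%d".
std::string format_time(const std::locale& loc, const std::tm& when,
                        const std::string& pattern)
{
    std::ostringstream out;
    out.imbue(loc);
    const std::time_put<char>& tp = std::use_facet<std::time_put<char> >(loc);
    tp.put(std::ostreambuf_iterator<char>(out), out, ' ', &when,
           pattern.data(), pattern.data() + pattern.size());
    return out.str();
}

// Parses a weekday name and returns tm_wday, or -1 if parsing failed.
int parse_weekday(const std::locale& loc, const std::string& text)
{
    std::istringstream in(text);
    in.imbue(loc);
    const std::time_get<char>& tg = std::use_facet<std::time_get<char> >(loc);
    std::ios_base::iostate err = std::ios_base::goodbit;
    std::tm result = std::tm();
    tg.get_weekday(std::istreambuf_iterator<char>(in),
                   std::istreambuf_iterator<char>(), in, err, &result);
    return (err & std::ios_base::failbit) ? -1 : result.tm_wday;
}
```

Purely numeric patterns such as "%Y-%m-%d" give the same result in every locale, whereas specifiers such as %A or %B produce the weekday and month names of the locale that is passed in.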
 
 

byname Facets

In the sections above we structured the description of the facets according to the way they address a certain localization aspect. However, there is another way to structure them:

  • There are facets that are independent of a certain localization environment, because they provide only functionality. The num_put, num_get, money_put, money_get facets fall into this category.
  • On the other hand a facet like numpunct depends on the localization environment, because it provides localization dependent information: the character representing the radix separator, the character representing the thousands separator, and so on.
For those facets that provide information rather than functionality the standard defines so-called byname facets. A byname facet is a derived facet that provides the same interface as its base class, but has a constructor that takes an additional const char* argument. This argument is the name that specifies a certain localization environment; hence the term "byname" facet. Syntax and semantics of these names are not defined by the standard, but are implementation specific. For example, the name "De_CH" on an X/Open system denotes the same localization environment as "german-swiss" on a Microsoft platform. Such names can be used not only to create facets; the locale, too, has a constructor that receives a name as an argument. This constructor creates a locale that contains byname facet objects constructed with this name. For example:

locale myLocale("En_US");

creates a locale that represents the US localization environment, and we can be sure that in the code shown below rs will be initialized with ‘.’ :

char rs = use_facet< numpunct<char> >(myLocale).decimal_point();
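Since locale names are implementation specific, a name that works on one platform may be rejected on another; the locale constructor reports an unknown name by throwing std::runtime_error. A defensive sketch, with the hypothetical helpers named_or_classic and decimal_point_of, could fall back to the classic locale:

```cpp
#include <locale>
#include <stdexcept>

// Returns the locale with the given name, or the classic "C" locale
// if the implementation does not know that name.
std::locale named_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Returns the radix character of the named locale (or of the fallback).
char decimal_point_of(const char* name)
{
    const std::locale loc = named_or_classic(name);
    return std::use_facet<std::numpunct<char> >(loc).decimal_point();
}
```

The only names that can be expected to work everywhere are "C" and the empty string "", which selects the user's preferred environment.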
 
 

Base Class Facets

After this discussion of the behavior of the byname facets, which are derived facet types, let's have a look at the behavior of the base class facets.

No further explanation is needed for num_put, num_get, money_put, money_get. As described above they define functionality rather than holding locale-sensitive information; and the base classes implement this functionality.

Some base class facets provide classic "C" behavior. Classic "C" means the way C functions used to behave before internationalization was added to the C standard. Facet base classes with classic "C" behavior are: ctype, collate, numpunct. Obviously, classic "C" only describes a behavior for the character type char . As we will see below, these three facets need to be provided for the character types char and wchar_t . The behavior for wchar_t is analogous to the classic "C" behavior for char . For instance, numpunct<wchar_t>::decimal_point() returns L'.' where numpunct<char>::decimal_point() returns '.' .

The base classes of the following facets have implementation defined behavior: messages, moneypunct, time_get and time_put. This is because the standards committee, as an international forum, did not want to dictate one nation's preference as a default for all other nations. For instance, there is no universally accepted pattern to represent a monetary amount. Therefore, they did not define a base class behavior.

In the case of code conversion, two codecvt base class facets must be provided by a standard compliant library. The facet codecvt<char,char,mbstate_t> is a degenerate one; it implements "no conversion", so that in() and out() behave very much like memcpy(). The behavior of codecvt<wchar_t,char,mbstate_t> is implementation defined.

Usually, interfaces with implementation defined behavior have to be avoided by users who strive for portability of their programs. Hence, one might wonder whether it is a problem that the base class behavior is implementation defined for some facets. The answer is: No, not really. In an internationalized application one will usually use the byname facets, because they provide localized information and functionality dependent on a specified cultural context. The behavior of a base class facet is of interest only when a new derived facet with a new behavior shall be implemented for an existing facet interface, and the existing base class behavior shall be reused, if possible. The byname facet objects are powerful and already provide support for all common localization environments. So the derivation of a new facet type is necessary only when an exotic behavior is needed. In such a case it is very likely that the new functionality must be implemented from scratch and cannot be built by reusing the base class behavior. Hence the base class behavior is almost irrelevant because most likely it will be overridden anyway.

Speaking of derivation and overriding functions: all standard facets follow the idiom that a non-virtual public member function calls a virtual protected member function, which implements the functionality. A derived class must then redefine the protected function, not the public one. The rationale behind this idiom is that a vendor might place code for system specific functionality in the public member function. A user who derives from such a class need not know about or bother with the system specific issues, but can simply provide the new functionality by overriding the protected member function. An example of system specific functionality put into the public member function is the use of a mutex for multi-thread support.
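To illustrate the idiom, here is a sketch of a facet derived from collate<char> that redefines only the protected virtual function do_compare(); clients keep calling the public, non-virtual compare(). The class CaseInsensitiveCollate and the helper sorted_ignoring_case are hypothetical, C++11 is assumed for override, and the case folding via std::tolower is a simplification that is adequate for ASCII text only.

```cpp
#include <algorithm>
#include <cctype>
#include <locale>
#include <string>
#include <vector>

// Hypothetical facet: compares character sequences ignoring case.
// Only do_compare() is replaced; do_transform() and do_hash() keep the
// base class behavior and are therefore not consistent with it.
class CaseInsensitiveCollate : public std::collate<char> {
protected:
    int do_compare(const char* low1, const char* high1,
                   const char* low2, const char* high2) const override
    {
        for (; low1 != high1 && low2 != high2; ++low1, ++low2) {
            const int a = std::tolower(static_cast<unsigned char>(*low1));
            const int b = std::tolower(static_cast<unsigned char>(*low2));
            if (a != b)
                return a < b ? -1 : 1;
        }
        if (low1 != high1) return 1;    // first sequence is longer
        if (low2 != high2) return -1;   // second sequence is longer
        return 0;
    }
};

// Sorts words with a locale whose collate facet was replaced.
std::vector<std::string> sorted_ignoring_case(std::vector<std::string> words)
{
    const std::locale loc(std::locale::classic(), new CaseInsensitiveCollate);
    // locale::operator() calls the public compare(), which in turn
    // calls the overridden do_compare().
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

Sorting the words Zulu, zebra, American, alien with this facet yields alien, American, zebra, Zulu, which is the order listed under "English rules" in the collation example earlier in the article.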
 
 

The standard requires that the facet classes and class templates shown in Table 3 must be provided by a standard compliant implementation. It is up to the vendor how they are provided: as templates, or as (partial) specializations.
 
ctype<char>, ctype<wchar_t>, ctype_byname<char>, ctype_byname<wchar_t>
collate<char>, collate<wchar_t>, collate_byname<char>, collate_byname<wchar_t>
messages<char>, messages<wchar_t>, messages_byname<char>, messages_byname<wchar_t>
codecvt<char,char,mbstate_t>, codecvt<wchar_t,char,mbstate_t>
numpunct<char>, numpunct<wchar_t>, numpunct_byname<char>, numpunct_byname<wchar_t>
template <class C, class InputIterator> num_get<C,InputIterator>
template <class C, class OutputIterator> num_put<C,OutputIterator>
template <bool Inter> moneypunct<char>, template <bool Inter> moneypunct<wchar_t>,
template <bool Inter> moneypunct_byname<char>, template <bool Inter> moneypunct_byname<wchar_t>
template <class InputIterator> money_get<char>, template <class InputIterator> money_get<wchar_t>
template <class OutputIterator> money_put<char>, template <class OutputIterator> money_put<wchar_t>
template <class InputIterator> time_get<char>, template <class InputIterator> time_get<wchar_t>,
template <class InputIterator> time_get_byname<char>, template <class InputIterator> time_get_byname<wchar_t>
template <class OutputIterator> time_put<char>, template <class OutputIterator> time_put<wchar_t>,
template <class OutputIterator> time_put_byname<char>, template <class OutputIterator> time_put_byname<wchar_t>

Table 3: Mandatory Facets in the Standard C++ Library
 
 
 

Summary

A standard compliant C++ library provides not only a framework for internationalization support, consisting of locale and facet classes, but also a number of standard facet classes. This article gave an overview of the functionality of the standard facets along with an idea of the problem domain addressed by that functionality. A subsequent article will show how the locale framework can be extended by adding new, non-standard facet types.

References
 
/1/  Klaus Kreft & Angelika Langer
The Locale Framework
C++ Report, September 1997
URL: < http://www.AngelikaLanger.com/Articles/C++Report/LocaleFramework/LocaleFramework.html >

 
 
 
 

 

  © Copyright 1995-2007 by Angelika Langer.  All Rights Reserved.    URL: < http://www.AngelikaLanger.com/Articles/C++Report/StandardFacets/StandardFacets.html  last update: 10 Aug 2007