5.1.6. Internationalization & Unicode

National strings are best stored in Unicode (NCHAR/LONG NVARCHAR) columns. There is no guarantee that values stored in narrow (VARCHAR/LONG VARCHAR) columns will be represented correctly. If the client application also uses Unicode, no internationalization conversions take place. Unfortunately, most current applications still use narrow characters.

The national character set defines how strings are converted from narrow to wide characters and back throughout Virtuoso. A character set is an array of 255 Unicode codes, one for each narrow code from 1 to 255 (zero is excluded), giving the location of each narrow character in the Unicode space. It has a "primary" or "preferred" name and a list of aliases.
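As an illustration of the mapping-table idea above, the following Python sketch builds a 255-entry table and uses it to widen a narrow string. It is not Virtuoso code: build_charset_table and narrow_to_wide are hypothetical helper names, and Python's built-in koi8_r codec stands in for a row of SYS_CHARSETS. Mapping narrow code 0 to U+0000 is an assumption.

```python
# Sketch only: these helpers are hypothetical and not part of Virtuoso.

def build_charset_table(codec_name):
    """Return 255 Unicode characters for narrow codes 1..255 (code 0 is omitted).

    Assumes the codec defines every byte value, as koi8_r does.
    """
    return [bytes([code]).decode(codec_name) for code in range(1, 256)]


def narrow_to_wide(narrow, table):
    """Map each narrow byte to its Unicode character via the table.

    Assumption: narrow code 0 always maps to U+0000.
    """
    return "".join("\x00" if b == 0 else table[b - 1] for b in narrow)


table = build_charset_table("koi8_r")
narrow = "Привет".encode("koi8_r")       # six narrow (single-byte) characters
print(len(table))                        # number of entries in the table
print(narrow_to_wide(narrow, table))     # the same text as wide characters
```

The table is indexed by narrow code minus one because the zero code is not stored, which matches the "255 (without the zero)" layout described above.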

Character sets in Virtuoso are kept inside the system table SYS_CHARSETS. Its layout is:

CREATE TABLE SYS_CHARSETS (
    CS_NAME varchar,                    -- The "preferred" charset name
    CS_TABLE long nvarchar,             -- the mapping table of length 255 Wide chars
    CS_ALIASES long varchar             -- serialized vector of aliases
);

The CS_NAME and CS_ALIASES columns are SELECTable by PUBLIC. To simplify retrieval of all official and unofficial names of character sets, Virtuoso provides the following function:

charsets_list()

There are a number of character set definitions preloaded in the SYS_CHARSETS table. Currently these are:

GOST19768-87
IBM437, IBM850, IBM855, IBM866, IBM874
ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15
KOI-0, KOI-7, KOI8-A, KOI8-B, KOI8-E, KOI8-F, KOI8-R, KOI8-U
MAC-UKRAINIAN
MIK
WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, WINDOWS-1257

New character sets can be defined using the following function:

charset_define()

User-defined character sets can be dropped by deleting the row from the SYS_CHARSETS table and restarting the server.

Virtuoso performs all translations in accordance with a "current charset". This is a connection attribute. It gets its value as follows:

1. If the client supplies a CHARSET ODBC Connect string attribute either from the DSN definition or as an argument to a SQLDriverConnect() call, Virtuoso searches for the name in SYS_CHARSETS and, if there is a match, that character set becomes the default.
2. If the database default character set ('Charset' parameter in the 'Parameters' section of virtuoso.ini) is defined, it becomes the default.
3. If neither of these conditions is met, Virtuoso uses ISO-8859-1 as the default character set; this maps each narrow character to the wide character with the same numeric value.
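The three-step selection above can be sketched in Python as follows. This is a model of the described order only, not the server's implementation. KNOWN_CHARSETS is a hypothetical in-memory stand-in for SYS_CHARSETS with illustrative aliases, the function names are invented for the example, and case-insensitive matching and the fall-through for an unknown name are assumptions.

```python
# Hypothetical stand-in for SYS_CHARSETS: preferred name -> list of aliases.
# The aliases shown are illustrative, not the ones shipped with Virtuoso.
KNOWN_CHARSETS = {
    "ISO-8859-1": ["LATIN1"],
    "KOI8-R": [],
    "WINDOWS-1251": ["CP1251"],
}


def lookup(name, charsets):
    """Return the preferred name for a name or alias, or None if unknown.

    Assumption: names are compared case-insensitively.
    """
    if not name:
        return None
    wanted = name.upper()
    for preferred, aliases in charsets.items():
        if wanted == preferred.upper() or wanted in (a.upper() for a in aliases):
            return preferred
    return None


def resolve_current_charset(connect_charset, ini_charset, charsets):
    """Apply the three rules in order: connect string, virtuoso.ini, ISO-8859-1.

    Assumption: a name that is not found simply falls through to the next rule.
    """
    return (lookup(connect_charset, charsets)
            or lookup(ini_charset, charsets)
            or "ISO-8859-1")


print(resolve_current_charset("cp1251", "KOI8-R", KNOWN_CHARSETS))
print(resolve_current_charset(None, "KOI8-R", KNOWN_CHARSETS))
print(resolve_current_charset(None, None, KNOWN_CHARSETS))
```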

At any time the user can explicitly set the character set either with a call to

SQLSetConnectAttr (HDBC, SQL_CHARSET (=5002), CharacterSetString, StringLength)

or by executing the interactive SQL command:

SET CHARSET='<name>|<alias>'

The current character set "preferred" name (as a string) is returned by the following system function:

current_charset()

Virtuoso has a default character set that is used if the client does not supply its own, and in some special cases such as XML Views and FOR XML AUTO statements.

The HTTP character set can be changed during an HTTP session using:

SET HTTP_CHARSET='<name>|<alias>'

Example:

     <?vsp
         set http_charset = 'ISO-CELTIC';
     ?>
     <html><body><h1>Cén chaoi 'bhfuil tú?</h1></body></html>
    

Virtuoso supports the following types of translations from Unicode characters to narrow characters:

  • String translation:

    If the Unicode character is part of the US-ASCII (0-127) character set, its value is used;
    Otherwise, if the character has a narrow mapping in the character set, that mapping is used;
    Otherwise, the narrow '?' is returned.
  • Command translation:

    If the Unicode character is part of the US-ASCII (0-127) character set, its value is used;
    Otherwise, if the character has a narrow mapping in the character set, that mapping is used;
    Otherwise, the character is escaped using the form \xNNNN (hexadecimal).
  • HTTP/XML translation:

    If the Unicode character is part of the US-ASCII (0-127) character set, its value is used after replacing the special symbols (<, >, & etc.) with their entity references;
    Otherwise, if the character has a narrow mapping in the character set, that mapping is used. The narrow character is then checked to see if it needs to be escaped;
    Otherwise, the character is escaped using the form &#DDDDDD; (decimal).
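The three translations listed above differ only in how they handle special symbols and characters that have no narrow mapping. The Python sketch below models the rules as stated; it is not the server's code. All function names are hypothetical, Python's koi8_r codec stands in for the current character set, only <, > and & are treated as special symbols, and the exact escape formatting (four uppercase hex digits, unpadded decimal) is an assumption.

```python
# Sketch of the three Unicode-to-narrow translations; all names are hypothetical.

XML_ENTITIES = {"<": b"&lt;", ">": b"&gt;", "&": b"&amp;"}  # assumed subset


def make_reverse_map(codec_name):
    """Build Unicode character -> narrow code (128..255) for a single-byte codec."""
    reverse = {}
    for code in range(128, 256):
        try:
            reverse[bytes([code]).decode(codec_name)] = code
        except UnicodeDecodeError:
            pass  # this narrow code has no Unicode mapping in the codec
    return reverse


def _translate(text, reverse, ascii_special, fallback):
    """Shared loop: ASCII first, then the charset mapping, then the fallback."""
    out = bytearray()
    for ch in text:
        if ch in ascii_special:
            out.extend(ascii_special[ch])
        elif ord(ch) < 128:
            out.append(ord(ch))
        elif ch in reverse:
            # Mapped codes are 128..255 here, so none needs further escaping.
            out.append(reverse[ch])
        else:
            out.extend(fallback(ch))
    return bytes(out)


def string_translation(text, reverse):
    """Unmappable characters become the narrow '?'."""
    return _translate(text, reverse, {}, lambda ch: b"?")


def command_translation(text, reverse):
    """Unmappable characters become \\xNNNN (hexadecimal)."""
    return _translate(text, reverse, {}, lambda ch: b"\\x%04X" % ord(ch))


def http_xml_translation(text, reverse):
    """Special symbols become entities; unmappable characters become &#D; (decimal)."""
    return _translate(text, reverse, XML_ENTITIES, lambda ch: b"&#%d;" % ord(ch))


reverse = make_reverse_map("koi8_r")
sample = "Привет € <b>"           # the euro sign has no mapping in KOI8-R
print(string_translation(sample, reverse))
print(command_translation(sample, reverse))
print(http_xml_translation(sample, reverse))
```

Keeping one shared loop and passing the fallback as a function makes it visible that the rules share their first two steps and diverge only in the last one.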

Character Set Use in ODBC/UDBC/CLI Clients

This section describes where translation is done in the case of an ODBC/UDBC/CLI client. These are described together because the Virtuoso CLI is the same as the ODBC/UDBC interface.

For the functions SQLPrepareW(), SQLExecDirectW(), and SQLNativeSqlW(), any Unicode arguments are converted to narrow strings using the command translation described above.

When doing the bindings

SQL_C_WCHAR -> SQL_xxx

and

SQL_Nxxx -> SQL_C_xxx (except SQL_C_WCHAR)

Virtuoso converts Unicode strings to narrow strings using the string translation described above.

Character Set Use in the ODBC/UDBC/CLI Server

The server uses the character set in the CAST operator when converting NCHAR/LONG NVARCHAR to any other type.

Character Set Use in the HTTP Server

The HTTP server appends a

charset=xxxx

attribute to the

Content-Type:

HTTP header field when returning the HTTP header to the client. This can be overridden by calling functions such as http_header().

The HTTP server uses the character set mainly to correctly format values using the http_value() function or its VSP equivalent <?= ... ?>. In these cases wide values and XML entities (the results of XML processing functions like xpath_contains()) are represented using the HTTP/XML translation rules described above. The same rules apply to results returned by the FOR XML directive, by XML Views, and for WebDAV content.

Character Set Use in the XML Processor

The Virtuoso embedded XML parser correctly processes all encodings defined in the SYS_CHARSETS table, as well as UTF-8.

Generation of SQL

The xpath() and xpath_contains() functions translate their expressions as follows:

Input Processing
Narrow strings are translated to Unicode as per the character set and then to UTF-8, which is the internal encoding used by the Virtuoso XML tools.
SQL Views and FOR XML directives take their values from narrow columns by first converting them to Unicode based on the database character set and then to UTF-8.
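The two-step input conversion (narrow to Unicode, then Unicode to UTF-8) can be pictured with the short Python sketch below. It only mirrors the order of the steps; narrow_to_utf8 is a hypothetical name, and Python's built-in codecs stand in for the character set definitions held in SYS_CHARSETS.

```python
def narrow_to_utf8(narrow, charset):
    """Convert a narrow string to UTF-8 by going through Unicode first."""
    wide = narrow.decode(charset)    # step 1: narrow -> Unicode, per the charset
    return wide.encode("utf-8")      # step 2: Unicode -> UTF-8


# The same character has different narrow codes in different character sets,
# but the UTF-8 result is identical once the correct charset is applied.
print(narrow_to_utf8("д".encode("koi8_r"), "koi8_r"))
print(narrow_to_utf8("д".encode("cp1251"), "cp1251"))
```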
Output Processing
Almost all the XML processors and generators return their values as type DV_XML_ENTITY (__tag() 230). If such a value's character representation is requested either by CAST or by http_value() then Virtuoso converts it to narrow characters using the HTTP/XML translation rules given above.
XPath expressions that return string values are returned as NCHAR values to the clients, which then convert them to narrow characters if needed.