5.1.6. Internationalization & Unicode
National strings are best stored in Unicode (NCHAR/LONG NVARCHAR) columns. There is no guarantee that values stored in narrow (VARCHAR/LONG VARCHAR) columns will be represented correctly. If the client application also uses Unicode, no internationalization conversions take place. Unfortunately, most current applications still use narrow characters.
The national character set defines how strings are converted from narrow to wide characters and back throughout Virtuoso. A character set is an array of 255 Unicode code points, one for each narrow character value except zero, giving the location of each narrow character in the Unicode space. It has a "primary" or "preferred" name and a list of aliases.
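As a rough illustration of the table described above, the following C sketch models a character set as 255 Unicode code points indexed by the narrow byte value minus one. The names `charset_t`, `charset_init_identity`, `narrow_to_wide` and `wide_to_narrow` are hypothetical and are not part of any Virtuoso API. The identity table stands in for a character set whose narrow values equal their Unicode values, and the linear reverse search is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one character set: 255 Unicode code points,
   one per narrow byte value 1..255 (byte 0 is not stored). */
typedef struct {
    const char *name;
    uint32_t table[255];
} charset_t;

/* Fill the table so that every narrow byte maps to the same Unicode value. */
void charset_init_identity(charset_t *cs, const char *name)
{
    assert(cs != NULL);
    cs->name = name;
    for (int i = 0; i < 255; i++)
        cs->table[i] = (uint32_t)(i + 1);
}

/* Narrow -> wide: byte 0 maps to itself, bytes 1..255 go through the table. */
uint32_t narrow_to_wide(const charset_t *cs, unsigned char c)
{
    return c == 0 ? 0 : cs->table[c - 1];
}

/* Wide -> narrow: search the table for the code point. Returns 1 and stores
   the byte on success, 0 when the code point has no narrow equivalent. */
int wide_to_narrow(const charset_t *cs, uint32_t wc, unsigned char *out)
{
    if (wc == 0) {
        *out = 0;
        return 1;
    }
    for (int i = 0; i < 255; i++) {
        if (cs->table[i] == wc) {
            *out = (unsigned char)(i + 1);
            return 1;
        }
    }
    return 0;
}
```

A caller would initialize a `charset_t`, overwrite individual `table` entries for a non-identity character set, and then use the two conversion functions in either direction.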
Character sets in Virtuoso are kept inside the system table SYS_CHARSETS. Its layout is:
CREATE TABLE SYS_CHARSETS (
    CS_NAME    varchar,        -- The "preferred" charset name
    CS_TABLE   long nvarchar,  -- the mapping table of length 255 Wide chars
    CS_ALIASES long varchar    -- serialized vector of aliases
);
The CS_NAME and CS_ALIASES columns are SELECTable by PUBLIC. To simplify retrieval of all official and unofficial names of character sets, Virtuoso provides the following function:
There are a number of character set definitions preloaded in the SYS_CHARSETS table. Currently these are:
|IBM437, IBM850, IBM855, IBM866, IBM874|
|ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15|
|KOI-0, KOI-7, KOI8-A, KOI8-B, KOI8-E, KOI8-F, KOI8-R, KOI8-U|
|WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, WINDOWS-1257|
New character sets can be defined using the following function:
User-defined character sets can be dropped by deleting the row from the SYS_CHARSETS table and restarting the server.
Virtuoso performs all translations in accordance with a "current charset". This is a connection attribute. It gets its value as follows:
|1. If the client supplies a CHARSET ODBC Connect string
attribute either from the DSN definition or as an argument to a
|2. If the database default character set ('Charset' parameter in the 'Parameters' section of virtuoso.ini) is defined, it becomes the default.|
|3. If neither of these conditions is met, then Virtuoso uses ISO-8859-1 as the default character set; this maps the narrow chars as wide using equality.|
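The fallback order in the list above can be sketched as a small C function. This is only an illustration: `resolve_charset` is a hypothetical name, and the sketch assumes that a CHARSET value supplied by the client takes precedence over the database default and that an absent value is passed as NULL or an empty string.

```c
#include <stddef.h>

/* Treat NULL and "" as "not supplied". */
int has_value(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* client_charset: CHARSET connect string attribute, if any (assumed input).
   ini_charset:    'Charset' from the 'Parameters' section of virtuoso.ini.
   Returns the name the connection would start with under these assumptions. */
const char *resolve_charset(const char *client_charset, const char *ini_charset)
{
    if (has_value(client_charset))
        return client_charset;   /* 1. supplied by the client */
    if (has_value(ini_charset))
        return ini_charset;      /* 2. database default */
    return "ISO-8859-1";         /* 3. built-in fallback */
}
```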
At any time the user can explicitly set the character set either with a call to
SQLSetConnectAttr (HDBC, SQL_CHARSET (=5002), CharacterSetString, StringLength)
or by executing the interactive SQL command:
The current character set "preferred" name (as a string) is returned by the following system function:
Virtuoso has a default character set that gets used if the client does not supply its own and in some special cases, like XML Views and FOR XML AUTO statements.
The HTTP character set can be changed during an HTTP session using:
<?vsp set http_charset = 'ISO-CELTIC'; ?>
<html><body><h1>Cén chaoi 'bhfuil tú?</h1></body></html>
Virtuoso supports the following types of translations from Unicode characters to narrow characters:
String translation: if the Unicode character is part of the US-ASCII (0-127) character set, its value is used; otherwise, if it has a mapping to a narrow character in the character set, that mapping is used; otherwise the narrow '?' is returned.
Command translation: if the Unicode character is part of the US-ASCII (0-127) character set, its value is used; otherwise, if it has a mapping to a narrow character in the character set, that mapping is used; otherwise the character is escaped using the form \xNNNN (hexadecimal).
HTTP/XML translation: if the Unicode character is part of the US-ASCII (0-127) character set, its value is used after replacing the special symbols (<, >, & etc.) with their entity references; otherwise, if it has a mapping to a narrow character in the character set, that mapping is used and the narrow character is then checked to see whether it needs to be escaped; otherwise the character is escaped using the form &#DDDDDD; (decimal).
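The three rules above differ only in what happens to special or unmappable characters, which the following C sketch tries to make visible. All names here (`charset_t`, `lookup_narrow`, `translate_replace`, `translate_hex_escape`, `translate_xml`) are hypothetical. The sketch also assumes upper-case hex digits, unpadded decimal references, and that only `<`, `>` and `&` need entity references; the text does not specify these details.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical charset model: Unicode code points for narrow bytes 1..255. */
typedef struct { uint32_t table[255]; } charset_t;

/* Return the narrow byte mapped to wc, or 0 when there is no mapping. */
unsigned char lookup_narrow(const charset_t *cs, uint32_t wc)
{
    for (int i = 0; i < 255; i++)
        if (cs->table[i] == wc)
            return (unsigned char)(i + 1);
    return 0;
}

/* Shared first two steps: US-ASCII is used as-is, otherwise use the table. */
unsigned char ascii_or_mapped(const charset_t *cs, uint32_t wc)
{
    assert(wc != 0);  /* the table has no entry for zero */
    return wc < 128 ? (unsigned char)wc : lookup_narrow(cs, wc);
}

/* First rule: an unmappable character becomes '?'. */
int translate_replace(const charset_t *cs, uint32_t wc, char *out, size_t cap)
{
    unsigned char n = ascii_or_mapped(cs, wc);
    return snprintf(out, cap, "%c", n ? n : '?');
}

/* Second rule: an unmappable character becomes \xNNNN (hexadecimal). */
int translate_hex_escape(const charset_t *cs, uint32_t wc, char *out, size_t cap)
{
    unsigned char n = ascii_or_mapped(cs, wc);
    if (n)
        return snprintf(out, cap, "%c", n);
    return snprintf(out, cap, "\\x%04X", (unsigned)wc);
}

/* Third rule: special symbols become entity references and an unmappable
   character becomes a decimal character reference. */
int translate_xml(const charset_t *cs, uint32_t wc, char *out, size_t cap)
{
    unsigned char n = ascii_or_mapped(cs, wc);
    switch (n) {
    case 0:   return snprintf(out, cap, "&#%u;", (unsigned)wc);
    case '<': return snprintf(out, cap, "&lt;");
    case '>': return snprintf(out, cap, "&gt;");
    case '&': return snprintf(out, cap, "&amp;");
    default:  return snprintf(out, cap, "%c", n);
    }
}
```

Each function translates a single code point into a NUL-terminated fragment; a full string conversion would call one of them per character and concatenate the results.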
Character Set Use in ODBC/UDBC/CLI Clients
This section describes where translation takes place in an ODBC/UDBC/CLI client. These are described together because the Virtuoso CLI is the same as the ODBC/UDBC interface.
For the functions SQLExecDirectW() and SQLNativeSQLW(), any Unicode arguments are converted to narrow strings using the command translation described above.
When doing the bindings
SQL_C_WCHAR -> SQL_xxx
SQL_Nxxx -> SQL_C_xxx (except SQL_C_WCHAR)
Virtuoso converts Unicode strings to narrow strings using the string translation described above.
Character Set Use in the ODBC/UDBC/CLI Server
The server uses the character set in the CAST operator when converting NCHAR/LONG NVARCHAR to any other type.
Character Set Use in the HTTP Server
The HTTP server appends a
attribute to the
HTTP header field when returning the HTTP header to the client.
This can be overridden by calling functions such as
The HTTP server uses the character set mainly to format values correctly when they are output using the http_value() function or its VSP equivalent <?= ...>. In these cases wide values and XML entities (the result of XML processing functions such as xpath_contains()) are represented using the HTTP/XML translation rules described above. The same rules apply to results returned by the FOR XML directive, by XML Views, and to WebDAV content.
Character Set Use in the XML Processor
The Virtuoso embedded XML parser correctly processes all encodings defined in the SYS_CHARSETS table, as well as UTF-8.
Generation of SQL
xpath_contains() functions translate their
expressions as follows:
|Narrow strings are translated to Unicode according to the character set and then to UTF-8, which is the internal encoding used by the Virtuoso XML tools.|
|SQL Views and FOR XML directives take their values from narrow columns by first converting them to Unicode based on the database character set and then to UTF-8.|
|Almost all the XML processors and generators return their
values as type DV_XML_ENTITY (__tag() 230). If such a value's
character representation is requested either by CAST or by
|XPath expressions that return string values are returned as NCHAR values to the clients, which then convert them to narrow characters if needed.|
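The two-step conversion mentioned above for narrow strings (narrow to Unicode through the character set, then Unicode to UTF-8) can be sketched in C as follows. The function names `utf8_encode` and `narrow_to_utf8` are hypothetical, and the 255-entry table layout is the same illustrative model of a character set used in this section, not Virtuoso's internal structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8. Returns the number of bytes written
   (1..4), or 0 for surrogates and values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert a NUL-terminated narrow string to UTF-8 using a 255-entry table
   (entry i holds the code point for narrow byte i + 1). Returns the number
   of bytes written, excluding the terminating NUL, or (size_t)-1 when the
   output buffer is too small or a table entry cannot be encoded. */
size_t narrow_to_utf8(const uint32_t table[255], const char *src,
                      unsigned char *out, size_t cap)
{
    size_t used = 0;
    assert(cap > 0);  /* room for at least the terminating NUL */
    for (const unsigned char *p = (const unsigned char *)src; *p; p++) {
        unsigned char tmp[4];
        size_t n = utf8_encode(table[*p - 1], tmp);  /* narrow -> Unicode -> UTF-8 */
        if (n == 0 || used + n + 1 > cap)
            return (size_t)-1;
        for (size_t i = 0; i < n; i++)
            out[used++] = tmp[i];
    }
    out[used] = '\0';
    return used;
}
```

With an identity table, the narrow byte 0xE9 becomes U+00E9 and is emitted as the two bytes 0xC3 0xA9, while US-ASCII bytes pass through unchanged.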