9.5. Wide Character Identifiers

All Virtuoso schema columns are confined to 8-bit character fields. This will remain the case for backward compatibility and performance reasons. There are, however, two options for supporting non-ASCII identifier names:

Maintain an 8-bit system. All 8-bit codes that enter the system are passed through and read back according to the current database character set. This has the convenience of a one-to-one correspondence between the character length of an identifier and the length of its stored representation, so identifiers can be matched with single-character LIKE wildcards and similar constructs.

This works well only for languages that have single-byte encodings (such as Western European languages and Cyrillic). It does not work at all for Far East languages. It also depends on the database character set and does not allow identifiers to be composed from multiple character sets.

Store all identifiers as UTF-8 encoded Unicode strings. This allows seamless storage and retrieval of any character within the Unicode character space. The disadvantage is the variable-length character representation, which must be taken into account when comparing identifier names with LIKE.
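The trade-off between the two options can be pictured with a small Python sketch. It is only an illustration outside Virtuoso: the helper name storage_lengths, the sample identifiers and the choice of cp1251 as the database character set are all assumptions made for the example.

```python
# Illustration only: compares the length of an identifier in characters
# with its length in bytes under a single-byte charset and under UTF-8.
# The function name and the cp1251 charset are assumptions for this sketch.

def storage_lengths(identifier, charset):
    """Return (characters, bytes in the narrow charset, bytes in UTF-8)."""
    narrow = identifier.encode(charset)   # option 1: 8-bit storage
    utf8 = identifier.encode("utf-8")     # option 2: UTF-8 storage
    return len(identifier), len(narrow), len(utf8)


def fits_charset(identifier, charset):
    """True if every character of the identifier exists in the charset."""
    try:
        identifier.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# A Cyrillic name keeps one byte per character in cp1251,
# but needs two bytes per character in UTF-8.
print(storage_lengths("таблица", "cp1251"))

# A Far East name cannot be represented in cp1251 at all.
print(fits_charset("表", "cp1251"))
```

With 8-bit storage the byte count equals the character count, which is why single-character wildcards keep working there, while UTF-8 storage accepts any character at the cost of a variable byte length.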

Virtuoso supports both options. The choice is made with the SQL_UTF8_EXECS flag in the [Client] section of the Virtuoso INI file. Setting SQL_UTF8_EXECS = 1 enables UTF-8 identifier storage and retrieval; setting SQL_UTF8_EXECS = 0 disables it. The default is 0 (disabled), for backward compatibility.

	Note:
	Once a non-ASCII identifier has been stored under a particular setting of the SQL_UTF8_EXECS flag, changing the flag afterwards makes the stored identifier unreadable by normal means (it can still be read by special means).
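One way to picture why a changed flag makes identifiers unreadable is the following Python sketch. It is not Virtuoso code, and it assumes that the flag simply decides which codec is used to interpret the stored bytes. The helper read_back and the cp1251 character set are hypothetical.

```python
# Illustration only, under the assumption that the flag selects the codec
# used to interpret stored identifier bytes. read_back is a hypothetical
# helper, and cp1251 stands in for the database character set.

def read_back(stored, assume_utf8, database_charset="cp1251"):
    """Interpret stored identifier bytes under one of the two settings."""
    codec = "utf-8" if assume_utf8 else database_charset
    # errors="replace" keeps the sketch from raising on invalid sequences.
    return stored.decode(codec, errors="replace")


name = "таблица"
stored_as_utf8 = name.encode("utf-8")      # written with the flag set to 1
stored_as_narrow = name.encode("cp1251")   # written with the flag set to 0

# Reading with the setting that was used for writing round-trips the name.
print(read_back(stored_as_utf8, assume_utf8=True) == name)
print(read_back(stored_as_narrow, assume_utf8=False) == name)

# Reading with the other setting does not give the name back.
print(read_back(stored_as_utf8, assume_utf8=False) == name)
print(read_back(stored_as_narrow, assume_utf8=True) == name)
```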

When an SQL statement reaches the driver, it is expanded into Unicode: a narrow string (as passed to SQLExecDirect) is converted using the current database character set, while a wide string (as passed to SQLExecDirectW) is taken verbatim. The Unicode string is then encoded into UTF-8 and passed to the SQL parser. The SQL parser knows that it will receive UTF-8, so it takes that into account when parsing national character literals (N'<literal>') and "normal" literals ('<literal>'). It returns identifier names in UTF-8, however; these are then stored into the DBMS system tables or compared against them, depending on the type of statement.
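The conversion path described above can be sketched in Python as follows. This only models the data flow and is not the driver's code: the function statement_to_parser_bytes is hypothetical, a bytes value stands in for a narrow string, a str value stands in for a wide string, and cp1251 is an assumed database character set.

```python
# Illustration only: models the narrow/wide -> Unicode -> UTF-8 path.
# statement_to_parser_bytes is a hypothetical name; bytes play the role
# of a narrow string and str the role of a wide string.

def statement_to_parser_bytes(statement, database_charset="cp1251"):
    """Return the UTF-8 bytes that would be handed to the SQL parser."""
    if isinstance(statement, bytes):
        # Narrow input: expand to Unicode using the database charset.
        wide = statement.decode(database_charset)
    else:
        # Wide input: already Unicode, taken verbatim.
        wide = statement
    return wide.encode("utf-8")


wide_sql = 'SELECT * FROM "таблица"'
narrow_sql = wide_sql.encode("cp1251")

# Both entry points end up with the same UTF-8 text for the parser.
print(statement_to_parser_bytes(narrow_sql) == statement_to_parser_bytes(wide_sql))
```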

All returned identifiers will be translated from UTF-8 to Unicode when returned to the client, so the client should never actually see the UTF-8 encoding of the identifiers.

Representing a string in UTF-8 does not change the identifier parsing rules or the logic of SQL applications, since the SQL special characters (dot, quote, space and so on) are ASCII symbols and are each represented as a single byte in UTF-8.
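This property of UTF-8 can be checked with a short Python sketch: every byte of a multi-byte character is 0x80 or higher, so an ASCII dot found in the byte string is always a real dot. The helper names and the sample qualified name are made up for the example.

```python
# Illustration only: splitting a qualified name on the ASCII dot gives
# the same parts whether it is done on characters or on UTF-8 bytes.
# The helper names and the sample name are assumptions for this sketch.

def split_qualified_name_bytes(name_utf8):
    """Split UTF-8 encoded 'a.b.c' on the single-byte ASCII dot."""
    return name_utf8.split(b".")


def non_ascii_bytes_are_high(text):
    """True if no non-ASCII character contributes a byte below 0x80."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )


name = "дб.схема.表"
byte_parts = split_qualified_name_bytes(name.encode("utf-8"))

# The byte-level split decodes to the same parts as the character-level split.
print([part.decode("utf-8") for part in byte_parts] == name.split("."))
print(non_ascii_bytes_are_high(name))
```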

The upper and lower functions should be used with care when applied to identifiers. They receive narrow strings in UTF-8, so applying upper or lower to them may damage the UTF-8 encoding. Identifiers should therefore be converted explicitly to wide strings using the charset_recode function, changed to upper or lower case, and then translated back to UTF-8 using charset_recode again.
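The kind of damage meant here can be reproduced outside Virtuoso with a Python sketch. The function bytewise_upper_latin1 is a hypothetical model of an upper function that treats each byte as a Latin-1 character. It is not Virtuoso's implementation. The function safe_upper mirrors the recommended recode, change case, recode sequence, with Python's decode and encode standing in for charset_recode.

```python
# Illustration only. bytewise_upper_latin1 is a hypothetical model of a
# narrow upper() that assumes one byte per character (Latin-1 rules);
# safe_upper follows the decode -> change case -> encode sequence.

def bytewise_upper_latin1(data):
    """Upper-case each byte as if it were a Latin-1 character."""
    out = bytearray()
    for b in data:
        is_ascii_lower = 0x61 <= b <= 0x7A
        is_latin1_lower = 0xE0 <= b <= 0xFE and b != 0xF7
        out.append(b - 0x20 if is_ascii_lower or is_latin1_lower else b)
    return bytes(out)


def safe_upper(identifier_utf8):
    """Convert to a wide string, upper-case it, convert back to UTF-8."""
    wide = identifier_utf8.decode("utf-8")
    return wide.upper().encode("utf-8")


def is_valid_utf8(data):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


damaged = bytewise_upper_latin1("表".encode("utf-8"))
print(is_valid_utf8(damaged))                                # the encoding is broken
print(safe_upper("заказы".encode("utf-8")).decode("utf-8"))  # upper-cased correctly
```

In this model the lead byte 0xE8 of the three-byte sequence is rewritten to 0xC8, which leaves a stray continuation byte behind and makes the string invalid UTF-8.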

Single-character LIKE patterns applied to identifiers stored as narrow strings in system tables will generally not work, because a single character may be represented with up to 6 bytes in UTF-8. The exception is a single-character pattern that matches an ASCII character.
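The wildcard behaviour can be modelled in Python by translating a LIKE pattern into a regular expression and running it once over characters and once over UTF-8 bytes. This is a simplified sketch, not Virtuoso's LIKE implementation: the helpers like_to_regex, like_chars and like_bytes are hypothetical, and escape characters in patterns are not handled.

```python
import re

# Illustration only: a simplified LIKE matcher ('_' = one unit, '%' = any
# run of units) applied to characters and to UTF-8 bytes. All helper
# names are hypothetical and LIKE escape syntax is not supported.

def like_to_regex(pattern):
    """Translate a LIKE pattern into regular expression source text."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")
        elif ch == "%":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like_chars(value, pattern):
    """Match on characters: '_' consumes exactly one character."""
    return re.fullmatch(like_to_regex(pattern), value, re.DOTALL) is not None


def like_bytes(value, pattern):
    """Match on UTF-8 bytes: '_' consumes exactly one byte."""
    regex = re.compile(like_to_regex(pattern).encode("utf-8"), re.DOTALL)
    return regex.fullmatch(value.encode("utf-8")) is not None


name = "t\u00e4ble"  # 'ä' occupies two bytes in UTF-8

print(like_chars(name, "t_ble"))     # one character matches one '_'
print(like_bytes(name, "t_ble"))     # one '_' cannot cover two bytes
print(like_bytes("table", "t_ble"))  # ASCII characters still match
```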

9.5.1. UTF-8 Implementation Notes For ODBC

All wide functions that return an identifier, such as SQLDescribeColW and friends, return the correct wide literal. For their narrow counterparts, such as SQLDescribeCol, the UTF-8 string is first converted to a wide string and then to a narrow string using the current database character set. In addition, an extension to the ODBC standard has been implemented: all metadata functions that return result sets, such as SQLTables and SQLTablesW, return SQL_NVARCHAR instead of SQL_VARCHAR columns. This is not a problem for most applications, since they simply map the result to SQL_C_CHAR on retrieval, which makes the driver convert the wide string to the appropriate narrow string using the current database character set. It does cause problems for narrow applications, such as MS Query, that try to get identifiers not representable in the current narrow character set: all they get for each such character is the "untranslatable char" mark (currently a question mark).
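The effect on narrow applications can be imitated in Python. The sketch below models only the UTF-8 to wide to narrow conversion and does not call any ODBC function: identifier_to_narrow is a hypothetical helper, cp1252 is an assumed database character set, and Python's "replace" error handler is used because it also substitutes a question mark.

```python
# Illustration only: models the UTF-8 -> wide -> narrow conversion for a
# narrow client. identifier_to_narrow is a hypothetical helper and cp1252
# is an assumed database character set.

def identifier_to_narrow(identifier_utf8, database_charset="cp1252"):
    """Convert a stored UTF-8 identifier to a narrow string."""
    wide = identifier_utf8.decode("utf-8")
    # Characters missing from the charset become '?', similar to the
    # "untranslatable char" mark described in the text.
    return wide.encode(database_charset, errors="replace")


stored = "Straße_表".encode("utf-8")

# 'ß' exists in cp1252 and survives; '表' does not and turns into '?'.
print(identifier_to_narrow(stored))
```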
