A Few Preliminary Conventions for Loadable Encodings

alpha version, October 2000

Note: this document is still subject to heavy editing. For now, the only services of interest to an application programmer are:

+CXStringEncoding *encodingWithUid(XStringEncoding uid);

Returns an encoding object for the given XStringEncoding number (presumably obtained from some other class's service).

+CXStringEncoding *encodingWithMIBenum(int mibEnum);

Returns an encoding object for the given MIB number, or nil if there is none.

+CXStringEncoding *encodingWithCodePage(int codePage);

Returns an encoding object for the given Microsoft code page number, or nil if there is none.

+CXStringEncoding *encodingWithGlobalName(CXString *name);

Returns an encoding object for the given global name, or nil if there is none. The global names are the standard ones, used in the MIME, HTML or WML protocols for specifying encodings.

+CXArray *availableEncodingLocalizedNamesAndNumbersFor(CXArray *list,id notFound=nil);

Returns an array in the same format as CXString's availableEncodingLocalizedNamesAndNumbers does. You select the wanted encodings by strings in the list: "#<number>" means a UID, "$<number>" a MIBenum, "CP<number>" a code page; any other string is taken as a global name. If notFound is not nil, it is used whenever no encoding can be found for an entry. A nil list means "all".
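The selector syntax above can be illustrated by a small classifier. This is only a sketch in plain C++: the names Selector, SelectorKind and classifySelector are hypothetical and not part of the framework, and the rule that a malformed "#", "$" or "CP" entry falls back to a global name (as well as the case-sensitive "CP" prefix) is an assumption.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper types; they are not part of the documented API.
enum class SelectorKind { Uid, MibEnum, CodePage, GlobalName };

struct Selector {
    SelectorKind kind;
    long number;       // meaningful for Uid, MibEnum and CodePage
    std::string name;  // meaningful for GlobalName
};

// True if s is a non-empty run of decimal digits.
static bool allDigits(const std::string &s) {
    if (s.empty()) return false;
    for (unsigned char c : s)
        if (!std::isdigit(c)) return false;
    return true;
}

// Classifies one list entry: "#<number>" is a UID, "$<number>" a MIBenum,
// "CP<number>" a code page; anything else is treated as a global name.
// Assumption: an entry with a prefix but no valid number is a global name.
Selector classifySelector(const std::string &s) {
    if (s.size() > 1 && s[0] == '#' && allDigits(s.substr(1)))
        return {SelectorKind::Uid, std::atol(s.c_str() + 1), ""};
    if (s.size() > 1 && s[0] == '$' && allDigits(s.substr(1)))
        return {SelectorKind::MibEnum, std::atol(s.c_str() + 1), ""};
    if (s.size() > 2 && s.compare(0, 2, "CP") == 0 && allDigits(s.substr(2)))
        return {SelectorKind::CodePage, std::atol(s.c_str() + 2), ""};
    return {SelectorKind::GlobalName, 0, s};
}
```

For instance, classifySelector("CP1250") yields a CodePage selector with number 1250, while classifySelector("ISO-8859-2") yields a GlobalName selector.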

-XStringEncoding uid;

For an encoding object found by any of the methods described above, returns the appropriate XStringEncoding number, to be used in the CXString services.

-int mibEnum;

The MIB number of an encoding object. Can be zero if the appropriate MIB number is not known.

-int codePage;

The Microsoft code page number of an encoding object. Can be zero if the appropriate code page number is not known or does not exist.

-CXString *preferredGlobalName;

The preferred global name of the encoding. Emit it wherever a global name of the current encoding is needed (for example, in generated HTML, MIME or WML data).

-CXString *localizedName;

A human-readable name of the encoding. Can be used to present the encoding to the user.


Encoding Attributes

Each encoding -- regardless of whether it is hardcoded into a framework (like the Epoc Latin1 and perhaps a few other trivial ones) or loaded from an encoding plugin -- has the following attributes:

name
    meaning

UID
    A unique number of the encoding. The MIBenum (see below) could have served this purpose had it been used from the very beginning; switching now is impossible, because it would be incompatible with the existing X.soft and Enfour UIDs.
    X.soft reserves all the UIDs from zero up to 10000. The higher UIDs can be freely used by Enfour; other developers who want to implement their own encodings should register with Enfour for free UIDs. Since the encoding UIDs are used for encodings only, they need not be registered with Symbian, and they may freely and harmlessly coincide with any Epoc UID.

MIBenum
    The MIBenum number of the encoding (see ftp.isi.edu/in-notes/iana/assignments/character-sets). It can be used by application developers to select the appropriate encoding programmatically.
    If the MIBenum is unknown, this value can be zero.

codepage
    Since the Microsoft "code pages" are used quite frequently, the code page number of the encoding is available. It can be used by application developers to select the appropriate encoding programmatically.
    If the encoding does not correspond to any Microsoft code page, the value is zero.

global names
    A list of globally known names of the encoding. They can be used by application developers to select the appropriate encoding programmatically, in applications like an HTML, MIME or WML browser.
    The first of the names has the special position of the preferred one: developers would use it to represent the encoding whenever HTML, WML or MIME data is generated.
    For a list of the standard names of many encodings, see again ftp.isi.edu/in-notes/iana/assignments/character-sets.
    If there are no such names for some encoding, the list can be empty.

character size
    The size of each character in the encoding, in bytes. If different characters are encoded in different numbers of bytes, the value is zero.
    Programmers can use this information to, e.g., select all 8-bit encodings.

maximum character size
    The size in bytes of the longest character representation. In the (quite hypothetical) case of an encoding that can represent a character by an arbitrary number of bytes, the value is zero.

localized name
    The human-readable name of the encoding. It will be used in the user interface whenever the user is about to select an encoding manually.
    Programmers should never rely on any particular contents of this value: whilst each encoding has some default setting for it, it can be freely changed by the system-wide locale support.
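To make the attribute set concrete, here is a minimal sketch of how it might be mirrored in a plain C++ record, together with two of the uses mentioned above (selecting all 8-bit encodings, and picking the preferred global name). The struct and function names are hypothetical; the real attributes are exposed through the encoding object's methods.

```cpp
#include <string>
#include <vector>

// Hypothetical plain-data mirror of the attributes listed above.
struct EncodingAttributes {
    unsigned uid;                          // unique encoding number
    int mibEnum;                           // 0 if unknown
    int codePage;                          // 0 if there is no matching code page
    std::vector<std::string> globalNames;  // first entry is preferred; may be empty
    int characterSize;                     // bytes per character, 0 if it varies
    int maximumCharacterSize;              // 0 if unbounded
    std::string localizedName;             // for display only, never for logic
};

// Selects the encodings whose characters always occupy exactly one byte.
std::vector<EncodingAttributes> eightBitOnly(const std::vector<EncodingAttributes> &all) {
    std::vector<EncodingAttributes> out;
    for (const auto &e : all)
        if (e.characterSize == 1) out.push_back(e);
    return out;
}

// Returns the preferred (first) global name, or an empty string if none exists.
std::string preferredGlobalName(const EncodingAttributes &e) {
    return e.globalNames.empty() ? std::string() : e.globalNames.front();
}
```

Note that the selection relies on the character size only, never on the localized name, in line with the warning above.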

Encoding Services

Note: THIS API WILL BE CHANGED, since we recently developed a new, much better stream-based interface with Enfour (who provide the Far East encodings).

In principle, an encoding converts characters to and from Unicode. To allow for greater flexibility, a few more services are provided:

Conversions and character sizes

These services are character-based in principle, though they can work over a general buffer. The reason is that they are quite low-level; the upper-level services (like "convert a buffer from one encoding to another") can easily be implemented on top of them.

That is also why plain buffer/index access is used instead of the Epoc descriptors: for such low-level services it seems quite reasonable.

Note: on the other hand, were the more luxurious whole-buffer-at-once services implemented directly at the encoding level, they could be a tad more efficient. I guess the difference would be negligible, but the question is open to discussion.

-int characterSizeInBytesInBufferatIndex(const void *buf,int index);

The bytes at the address buf+index are checked, and the number of them that represent one character in the encoding is returned. If the data are malformed (uninterpretable), an exception is generated (thus the method never returns zero); the exception's userInfo then contains the index of the first "bad" byte as XNUM2OBJ.

Naturally, 8-bit encodings would check nothing and just return 1; analogously, Unicode would just return 2.
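The behaviour described above can be sketched in plain C++ as follows. The 8-bit and Unicode cases come straight from the text; UTF-8 is used here only as a familiar example of a variable-width encoding (it is not mentioned in this document), and only its lead byte is inspected. The exception type MalformedData is a hypothetical stand-in for the framework exception and its userInfo.

```cpp
#include <stdexcept>

// Hypothetical exception type standing in for the framework exception whose
// userInfo carries the index of the first bad byte.
struct MalformedData : std::runtime_error {
    int badIndex;
    explicit MalformedData(int index)
        : std::runtime_error("malformed character data"), badIndex(index) {}
};

// 8-bit encodings check nothing: every character is one byte.
int characterSizeEightBit(const void *, int) { return 1; }

// 16-bit Unicode, as described in the text: every character is two bytes.
int characterSizeUnicode(const void *, int) { return 2; }

// Illustration only: UTF-8 as a variable-width encoding. Only the lead byte
// is inspected here; continuation bytes are not validated in this sketch.
int characterSizeUtf8(const void *buf, int index) {
    const unsigned char lead = static_cast<const unsigned char *>(buf)[index];
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    throw MalformedData(index);  // never returns zero, as required above
}
```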

-unichar convertToUnicodeCharInBufferatIndex(const void *buf,int &index);

The bytes at the address buf+index are checked, and the Unicode character they represent is returned. If the data are malformed (uninterpretable), an exception is generated (see above).

-int characterSizeInBytesForUnicodeChar(unichar cc);

Tells "how big a buffer would I need to convert cc to the encoding". It can return zero, which means the given character is not convertible to the encoding.

-int convertUnicodeChartoBufferatIndex(unichar cc,void *buf,int &index);

Actually converts the character cc into the encoding. The result is placed at the address buf+index; it is presumed that there is enough free space. The index is moved to point to the first free byte after the stored character, and the number of stored bytes is returned.

If the character is not convertible to the encoding, the method does nothing and returns zero.
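As an illustration of this contract, here is a minimal sketch for ISO 8859-1 (Latin1), where Unicode characters up to 0xFF map to the byte of the same value. It is written as free C++ functions rather than methods; the function names and the 16-bit unichar typedef are assumptions. The last function shows how an upper-level service might be layered on top; silently skipping inconvertible characters is just one possible policy chosen for the sketch.

```cpp
#include <cstdint>

typedef std::uint16_t unichar;  // assumption: unichar is a 16-bit unsigned type

// How many bytes cc needs in Latin1: 1, or 0 if it is not convertible.
int sizeInLatin1(unichar cc) { return cc <= 0xFF ? 1 : 0; }

// Stores cc at buf+index, advances index past the stored byte, and returns
// the number of bytes stored. Does nothing and returns 0 if not convertible.
int convertToLatin1(unichar cc, void *buf, int &index) {
    if (cc > 0xFF) return 0;
    static_cast<unsigned char *>(buf)[index] = static_cast<unsigned char>(cc);
    ++index;
    return 1;
}

// Upper-level sketch: encodes count characters into dst and returns the
// number of bytes written. Inconvertible characters are skipped (a policy
// chosen for this sketch only). dst must have room for count bytes.
int encodeLatin1(const unichar *src, int count, void *dst) {
    int index = 0;
    for (int i = 0; i < count; ++i)
        convertToLatin1(src[i], dst, index);
    return index;
}
```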

-int convertUnicodeChartoDescriptor(unichar cc,TDes &des);

As an exception, a descriptor-based API variant is available here: the converted bytes are appended to the given descriptor, and the number of appended bytes is returned (again, it can be zero in the case of an inconvertible character).

The reason is that in this case (and only in this case) a conversion between buffers and descriptors is non-trivial, and a direct implementation of this method might well be more efficient than the default implementation, which internally uses convertUnicodeChartoBufferatIndex.

Access to characters

Since a character can be encoded by a variable number of bytes, a set of methods that help to find the character boundaries is useful. They can all be implemented using characterSizeInBytesInBufferatIndex (and such a default implementation is in fact available), but it is quite possible to implement them directly and much more efficiently.

-int numberOfCharactersInBufferlength(const void *buf,int len);

The service just counts the characters in the given buffer. If the buffer is well formed but ends with an "unfinished" character (for example, a Unicode buffer of odd length), the number of complete characters is returned as a negative number. This way, a negative result signals that the buffer ends in an incomplete character.

Note that an exception can still be generated, in case the buffer contains a malformed character.
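A default implementation along these lines could look as follows. This is a sketch only: the character-size service is passed in as a callback (the hypothetical SizeFn type), and it is assumed that the callback can determine the size of a character from its first bytes even when the character is cut off by the end of the buffer.

```cpp
#include <functional>

// Hypothetical stand-in for characterSizeInBytesInBufferatIndex.
using SizeFn = std::function<int(const void *buf, int index)>;

// Counts the characters in buf[0..len). If the buffer ends with an
// unfinished character, the count of complete characters is negated.
int numberOfCharacters(const void *buf, int len, const SizeFn &sizeAt) {
    int index = 0;
    int count = 0;
    while (index < len) {
        const int size = sizeAt(buf, index);    // may throw on malformed data
        if (index + size > len) return -count;  // unfinished trailing character
        index += size;
        ++count;
    }
    return count;
}
```

With a fixed two-byte size callback, a 4-byte buffer gives 2 and a 5-byte buffer gives -2. One consequence of the convention is that a buffer holding nothing but an unfinished character yields 0, the same as an empty buffer.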

-int indexInBufferlengthofCharacterNo(const void *buf,int len,int charno);

The index of the first byte of the charno-th character is returned (zero means the 1st character, etc.). If the buffer contains fewer than charno+1 characters, -1 is returned. The len value can be zero, in which case the buffer is supposed to be long enough.

If the len value is nonzero, the charno value can be negative, which means the characters are counted from the end (charno -1 means the last character in the buffer, -2 the previous one, and so forth). If the buffer contains an unfinished character at the end, it is ignored, and the previous (last complete) one is considered to be the last one.

Caveat: do not use this service unless really needed. Character counting, especially backward counting, can be extremely slow with some encodings.
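The following sketch shows one way the default implementation might work, and also why backward counting is costly: in a variable-width encoding the character starts can in general only be found by scanning forward from the beginning. The SizeFn callback is a hypothetical stand-in for characterSizeInBytesInBufferatIndex; returning -1 for a negative charno with a zero len, and not counting an unfinished trailing character in the forward direction, are assumptions.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for characterSizeInBytesInBufferatIndex.
using SizeFn = std::function<int(const void *buf, int index)>;

// Returns the byte index of the charno-th character, or -1 if there is none.
// len == 0 means "the buffer is long enough" (forward counting only).
int indexOfCharacter(const void *buf, int len, int charno, const SizeFn &sizeAt) {
    if (charno >= 0) {
        int index = 0;
        for (int n = 0;; ++n) {
            if (len != 0 && index >= len) return -1;         // ran out of data
            const int size = sizeAt(buf, index);
            if (len != 0 && index + size > len) return -1;   // unfinished character
            if (n == charno) return index;
            index += size;
        }
    }
    if (len == 0) return -1;  // assumption: counting from the end needs a length

    // Backward counting: collect the start of every complete character first.
    std::vector<int> starts;
    int index = 0;
    while (index < len) {
        const int size = sizeAt(buf, index);
        if (index + size > len) break;  // ignore an unfinished trailing character
        starts.push_back(index);
        index += size;
    }
    const int pos = static_cast<int>(starts.size()) + charno;
    return pos >= 0 ? starts[pos] : -1;
}
```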

-int indexInBufferofCharacterBeforeIndex(const void *buf,int index);

Given a nonzero byte index into a buffer, this service finds the rightmost character that begins at a lesser index. In other words, if the byte index points to the start of a character, the position of the previous character is returned; if it points inside a character, the position of that character is returned.

Please read the "Caveat" above, which fully applies to this service.
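A default implementation based on the character-size service might be sketched like this. The SizeFn callback is a hypothetical stand-in for characterSizeInBytesInBufferatIndex, and the function assumes a nonzero index into well-formed data. Because it has to walk from the start of the buffer, its cost grows with the index, which is the reason for the caveat.

```cpp
#include <functional>

// Hypothetical stand-in for characterSizeInBytesInBufferatIndex.
using SizeFn = std::function<int(const void *buf, int index)>;

// Returns the start of the rightmost character that begins before index.
// Precondition: index > 0 and the data before index are well formed.
int indexOfCharacterBefore(const void *buf, int index, const SizeFn &sizeAt) {
    int start = 0;
    for (;;) {
        const int next = start + sizeAt(buf, start);  // start of the following character
        if (next >= index) return start;
        start = next;
    }
}
```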

Loadable Encoding Plugins

In principle, there are three types of encoding plugin:

To support all of these, the following mechanism is used:

(a) The encoding plugins are placed in the "/System/XConversion/Encodings" folder on any Epoc disk.

(b) Any file with the suffix "table" is checked; it should contain exactly 512 bytes, which form a table from the 8-bit encoding to (little-endian) Unicode. In addition, the file name should have the following format:

<localizedName>.<decimal UID>[@<global name>]*[#<decimal code page>][$<decimal MIBenum>].table

The square brackets mark optional parts; the asterisk means "zero or more". That way, the encoding attributes can be fully set.

(c) Any file with a suffix "dll" and UID2 0x10000e25 should contain an Enfour encoding converter DLL (the API provided by Enfour). Again, the attributes are encoded in the file name:

<localizedName>.[@<global name>]*[#<decimal code page>][$<decimal MIBenum>].dll

Note that this time the UID need not be encoded in the name, for it is part of the DLL (as the UID3).

(d) Finally, to support the Enfour converters in Enfour folders, and at the same time to offer an alternative to encoding the information in file names, there is a fourth possibility: any file with the suffix "link" should contain exactly two lines. The first line contains the encoding attributes, specified exactly as in the file names described above (only the suffix .table or .dll may be omitted). The second line contains a path to the file whose contents define the encoding.

(e) In future, any DLL with a UID2 different from 0x10000e25 could potentially be used for the X.soft encoding DLLs, once the API is fully specified.
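The ".table" naming scheme and table layout from item (b) can be illustrated by the sketch below. All names in it are hypothetical. It assumes that the localized name contains no dot, that global names contain none of the characters '@', '#' and '$', and that the 512 bytes hold 256 little-endian 16-bit entries indexed by the 8-bit character value; none of this is spelled out in the text.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical record for the attributes carried by a ".table" file name.
struct TableFileAttributes {
    std::string localizedName;
    unsigned long uid = 0;
    std::vector<std::string> globalNames;
    int codePage = 0;  // 0 if not given
    int mibEnum = 0;   // 0 if not given
};

// Parses "<localizedName>.<UID>[@<global name>]*[#<code page>][$<MIBenum>].table".
// Returns false if the name does not follow the format.
bool parseTableFileName(const std::string &fileName, TableFileAttributes &out) {
    const std::string suffix = ".table";
    if (fileName.size() <= suffix.size() ||
        fileName.compare(fileName.size() - suffix.size(), suffix.size(), suffix) != 0)
        return false;
    const std::string stem = fileName.substr(0, fileName.size() - suffix.size());
    const std::string::size_type dot = stem.find('.');
    if (dot == std::string::npos) return false;
    out = TableFileAttributes();
    out.localizedName = stem.substr(0, dot);

    std::string::size_type pos = dot + 1;
    // Reads the text from pos up to the next '@', '#' or '$' (or the end).
    auto field = [&]() -> std::string {
        std::string::size_type end = stem.find_first_of("@#$", pos);
        if (end == std::string::npos) end = stem.size();
        const std::string text = stem.substr(pos, end - pos);
        pos = end;
        return text;
    };
    auto isNumber = [](const std::string &s) {
        return !s.empty() && s.find_first_not_of("0123456789") == std::string::npos;
    };

    const std::string uid = field();
    if (!isNumber(uid)) return false;
    out.uid = std::strtoul(uid.c_str(), nullptr, 10);
    while (pos < stem.size() && stem[pos] == '@') {
        ++pos;
        out.globalNames.push_back(field());
    }
    if (pos < stem.size() && stem[pos] == '#') {
        ++pos;
        const std::string cp = field();
        if (!isNumber(cp)) return false;
        out.codePage = std::atoi(cp.c_str());
    }
    if (pos < stem.size() && stem[pos] == '$') {
        ++pos;
        const std::string mib = field();
        if (!isNumber(mib)) return false;
        out.mibEnum = std::atoi(mib.c_str());
    }
    return pos == stem.size();
}

// Looks up the Unicode value of 8-bit character ch in a 512-byte table.
// Assumption: entry i occupies bytes 2*i (low) and 2*i+1 (high).
unsigned tableLookup(const unsigned char table[512], unsigned char ch) {
    return table[2 * ch] | (table[2 * ch + 1] << 8);
}
```

As an illustrative (made-up) example, the file name "Central European.10001@ISO-8859-2@latin2#28592$5.table" would parse into UID 10001, two global names with "ISO-8859-2" preferred, code page 28592 and MIBenum 5.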

Preliminary version from Sep 16th, 2000, by oc

Copyright © 1999-2000 X.soft, all rights reserved