[JDEV] about encoding

Thu Mar 8 03:27:01 CST 2001

On Thu, Mar 08, 2001 at 04:28:23PM -0800, Jau-Lung Huang wrote:
> Hi...All:
> 
>       Recently I try to use jabber in Chinese(Big5 Encoding) Environment.
> After some testing,winjab and JabberIM works fine in Big5.But
> JabberApplet and KVM is not.I ever think that maybe we can change all to 
> Unicode Environment to solve this problem. But It Still Can't solve all
> problem.Because Some Environment still are Big5 or their native language
> encoding. So we think that maybe add such attribute to Jabber Protocol
> to solve the problem.
> like this:
> <message encoding="big5">msg content</message>

  No, this won't work!

    http://www.w3.org/TR/REC-xml#charencoding

--------------
It is a fatal error if an XML entity is determined (via default, encoding
declaration, or higher-level protocol) to be in a certain encoding but
contains octet sequences that are not legal in that encoding. It is also
a fatal error if an XML entity contains no encoding declaration and its
content is not legal UTF-8 or UTF-16.
--------------

> is it a good way to integrate  the Unicode and other native encoding ?
> welcome any idea or suggestion
> Thanks.

  Strictly speaking, this is reaching the limits of the XML specification.
From a theoretical point of view, an XML entity cannot mix encodings, which
implies that you cannot pass a message in a different encoding than
the rest of the Jabber session. At least I guarantee that all implementations
based on a conformant parser and using a single instance of that parser
for the full session will choke (expat/libxml/... included) with an XML
well-formedness error. That's one of the limitations of the approach taken
in the Jabber design. Note also that this limitation is not just for
Asian character sets, but also applies to the whole series of ISO Latin
encodings. For example, if the session starts with

<?xml version="1.0"?>

i.e. without specifying an encoding, it will then be assumed to be UTF-8 or
UTF-16 by the parser, and

<message>là</message>

i.e. using an ISO-8859-1 character, must break with a well-formedness error too.
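
To make the failure concrete, here is a minimal sketch in Python using the
expat bindings from its standard library. It models a session as a series of
byte chunks fed to one parser instance; the feed_session() helper and the
bare <stream> wrapper element are my own illustrative assumptions, not part
of the Jabber protocol or of any client.

```python
import xml.parsers.expat


def feed_session(chunks):
    """Feed byte chunks to a single expat parser, as a session would.

    Returns the index of the first chunk the parser rejects,
    or None if every chunk was accepted.
    """
    parser = xml.parsers.expat.ParserCreate()
    for index, chunk in enumerate(chunks):
        try:
            # False: more data may follow, the document is not finished.
            parser.Parse(chunk, False)
        except xml.parsers.expat.ExpatError:
            return index
    return None


# No encoding declaration, so the parser assumes UTF-8 (or UTF-16).
header = b'<?xml version="1.0"?><stream>'

# "l\u00e0" is the string "la" with a grave accent on the "a".
utf8_msg = "<message>l\u00e0</message>".encode("utf-8")
latin1_msg = "<message>l\u00e0</message>".encode("iso-8859-1")

ok_session = feed_session([header, utf8_msg])
bad_session = feed_session([header, utf8_msg, latin1_msg])
```

Here ok_session is None, while bad_session is 2: the parser accepts the
header and the UTF-8 message, then rejects the ISO-8859-1 one because the
lone 0xE0 octet is not legal UTF-8.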

In practice, encoding the whole session in UTF-8 is what makes the most sense:
you don't have to infringe the XML specification, you stay within reasonable
bounds, all parsers are supposed to handle UTF-8, and unlike UTF-16 it doesn't
add an extra penalty to all the protocol-related messages, which ensures
cheaper processing on the servers (not to be neglected!).

What it means is that while you can accept user input in whatever charset is
most reasonable, you must convert it to UTF-8 before including it in a Jabber
message. Fortunately, UTF-8 can cover the whole Unicode range and it is easy
to find encoders/decoders for UTF-8: for example, iconv() is available on
UNIX systems and a library can be used on Windows. I have a pointer in the
libxml FAQ:
   http://xmlsoft.org/FAQ.html#Compilatio
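
As a rough sketch of that conversion step, here it is in Python, whose
built-in codecs play the role iconv() plays in C. The to_jabber_message()
helper and the bare <message> element are hypothetical simplifications
taken from the example in this thread, not a real client API.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_jabber_message(raw: bytes, source_encoding: str) -> bytes:
    """Convert user input from its native charset to a UTF-8 message.

    Hypothetical helper: decodes the raw octets, escapes XML special
    characters, and encodes the resulting element as UTF-8.
    """
    text = raw.decode(source_encoding)
    return ("<message>" + escape(text) + "</message>").encode("utf-8")


# Two Chinese characters (U+4E2D U+6587) plus markup-like text,
# as octets a Big5 environment would hand to the client.
big5_input = "\u4e2d\u6587 <ok>".encode("big5")

stanza = to_jabber_message(big5_input, "big5")

# A conformant parser reads the result back without any encoding hint.
round_trip = ET.fromstring(stanza).text
```

The point of the design is that the native encoding never reaches the wire:
only the client needs to know about Big5, and the receiving side can decode
the UTF-8 text into whatever its own environment requires.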

Daniel

P.S.: the fact that some clients accept mixed encodings within a Jabber
      session is a clear violation of the XML specification. It's
      non-compliant and exposes their authors to some possibly very negative
      feedback from the XML community (this happened to WAP; the Jabber
      community should make sure that their applications don't violate
      the spec).

-- 
Daniel Veillard      | Red Hat Network http://redhat.com/products/network/
veillard at redhat.com  | libxml Gnome XML toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/