[JDEV] [INFO] i18n? (fwd)

Sun Jan 2 19:30:01 CST 2000

Eliot Landrum wrote:

> Might someone have a more technical / authoritative answer than what I can
> give?
>
> ---------- Forwarded message ----------
> Date: Thu, 30 Dec 1999 15:32:54 -0500
> From: Constantin Riabitsev <tech at nicodemusproject.com>
> To: info at jabber.org
> Subject: [INFO] i18n?
>
> Hi guys!
>
> Just found out about Jabber, spent all evening looking through the
> docs and DTD's and realized that there's no trace of any
> internationalization stuff. People communicate in more than one
> encoding, and I think it would be wise to incorporate the standard
> i18n features into the DTD's. You know, attributes like
> charset="koi8-r" or dir="ltr"...

[SNIP]

> charset         #IMPLIED        "us-ascii"
> dir             #IMPLIED        "ltr"
>
> ).
>
> The reason why this is important is because there are sometimes
> several typeset standards for some language. E.g. Russian Cyrillic
> has two widespread standards -- win1251 (windows platforms) and
> koi8-r (*nix platforms) and it is sometimes impossible to use IM
> clients between these two unless the client can re-code from one
> into another.
>
> Using the i18n parameters, the client will know which encoding the
> messages come in and it will be able to recode them (if this
> capability is built into it).
>
> Example of an <iq> query reply:
>
> <iq from="user at server.ru" type="result">
>   <query xmlns="jabber:iq:info">
>         <name>Ivan Petrov</name>
>         <email>petrov at server.ru</email>
>         <i18n charset="win-1251" dir="ltr"/>
>   </query>
> </iq>
>
> This will tell my Linux client that before I can understand what
> Ivan Petrov writes me, it will need to apply the win1251->koi8-r
> recoding routines.
>
> Hope this is useful.. :)
> Let me know what you think about this idea.

I personally think this is a really poor way to go. I think this has come up
before, but I'll quickly mention a few things.

First of all, I really don't think sending in an arbitrary charset is good
for a robust system. The data should be all converted to Unicode (UTF-8 or
UTF-16) by the client when data is sent, then decoded by the receiving
client. The main concern about this is speed, but even the standard
Windows system calls to convert to Unicode are so fast that for a
communications protocol the difference will not matter. Making programs &
communications protocols more complex pretty much increases the likelihood of
bugs and more problems.
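
To make that concrete, here is a minimal Python sketch of the round trip,
assuming UTF-8 as the wire format. The function names to_wire and from_wire
are mine and purely illustrative; the codec names are the ones Python ships
for the two Russian encodings mentioned in the quoted message.

```python
# Sketch only: the sender converts its local encoding to UTF-8 before
# sending, and the receiver converts UTF-8 to its own local encoding.
# to_wire / from_wire are hypothetical names, not part of Jabber.

def to_wire(local_bytes: bytes, local_encoding: str) -> bytes:
    """Sender side: local encoding -> Unicode -> UTF-8 bytes."""
    return local_bytes.decode(local_encoding).encode("utf-8")


def from_wire(wire_bytes: bytes, local_encoding: str) -> bytes:
    """Receiver side: UTF-8 bytes -> Unicode -> local encoding."""
    return wire_bytes.decode("utf-8").encode(local_encoding)


# The Russian word for "hello", written with escapes to keep this ASCII-safe.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"

sent = to_wire(text.encode("cp1251"), "cp1251")   # Windows-side sender
received = from_wire(sent, "koi8_r")              # *nix-side receiver

# The receiver sees the same text without ever knowing about CP-1251.
assert received.decode("koi8_r") == text
```

Neither side needs to know what encoding the other one uses locally; each
only knows its own encoding and UTF-8.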

Also, this turns a one-to-one problem into a many-to-many one. If all data
is sent in Unicode, then any given client only needs to understand Unicode
plus its own encoding. With your proposal, all clients potentially need to
understand all encodings for all languages. Ouch!

Imagine I'm chatting with someone in Japan, and you are chatting with the
same person. I speak English and you speak Russian. With your proposal you
and I might need to decode Shift-JIS, or he might need to convert from
CP-1252, ISO-8859-1, KOI8-R, CP-1251, Shift-JIS, JIS... What about when some
people start communicating in Klingon; do we need to get all clients patched
and re-released?

However, if all communication is in Unicode, my client just needs to do
Unicode<-->ISO-8859-1, you just need to do Unicode<-->KOI8-R, and he just
needs to do Unicode<-->Shift-JIS. Or better yet, I might just process
everything in Unicode and thus be able to see your Russian and his Japanese
with only minor problems. MS Office does this, and so does any Java
application.
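
A rough way to see the difference is to count conversion routines. The
sketch below is only arithmetic under a simplifying assumption of mine: every
encoding may need to be converted to every other one directly, versus every
encoding being converted only to and from Unicode. The list is the set of
encodings named above.

```python
# Sketch only: count the directed conversion routines needed across all
# clients, with and without Unicode as the common wire format.

ENCODINGS = ["CP-1252", "ISO-8859-1", "KOI8-R", "CP-1251", "Shift-JIS", "JIS"]


def direct_converters(n: int) -> int:
    """Every encoding converted directly to every other encoding."""
    return n * (n - 1)


def pivot_converters(n: int) -> int:
    """Every encoding converted only to Unicode and back."""
    return 2 * n


n = len(ENCODINGS)
direct = direct_converters(n)   # grows quadratically with n
pivot = pivot_converters(n)     # grows linearly with n
assert direct == 30 and pivot == 12
```

Adding a seventh encoding adds twelve routines in the direct case and two in
the Unicode case, and in the Unicode case only the clients that actually use
the new encoding have to change.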

Now, what probably is needed is Language. For example, you can be sending
Russian encoded in Unicode, and I might have a font with Cyrillic characters
set to display for Russian language data.
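
As a sketch of how a client might act on a language tag: XML already defines
an xml:lang attribute, and Python's ElementTree exposes it under the XML
namespace. The <message>/<body> element names, the font names, and the
pick_font helper are all made up for illustration and are not taken from any
Jabber DTD.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical per-language font preferences on my machine.
FONT_FOR_LANG = {"ru": "SomeCyrillicFont", "ja": "SomeJapaneseFont"}
DEFAULT_FONT = "SomeDefaultFont"


def pick_font(stanza_xml: str) -> str:
    """Choose a display font from the stanza's xml:lang, if it has one."""
    root = ET.fromstring(stanza_xml)
    lang = root.get(XML_LANG, "")
    primary = lang.split("-")[0].lower()   # "ja-JP" -> "ja"
    return FONT_FOR_LANG.get(primary, DEFAULT_FONT)


# Illustrative stanza; the body is Russian text written with escapes.
stanza = (
    '<message xml:lang="ru">'
    "<body>\u041f\u0440\u0438\u0432\u0435\u0442</body>"
    "</message>"
)
font = pick_font(stanza)
assert font == "SomeCyrillicFont"
```

The language tag says nothing about encoding; the text is Unicode either way,
and the tag only helps with display choices such as fonts.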

A few points to remember:

Using XML specifies i18n to some degree. Anything claiming to do XML needs to
at the very least handle Unicode (UTF-8 or UTF-16). So by specifying XML,
Jabber has already addressed i18n somewhat.

Microsoft switched to Unicode internally for all processing for their entire
Office suite some time ago. It was just better and cheaper to do so.

Microsoft even switched to Unicode for COM, way back. They realized the
problems of trying to do it any other way. All COM BSTRs are Unicode strings.

Java is all Unicode-based for its strings. XML and Java go hand-in-hand all
over the place.

The majority of all currently active languages in the world are already in
Unicode. Thus you have a single encoding that most people can use (Klingon is
not officially in yet, but has been submitted to the process).

XML also has i18n in it. Look through the spec for things like Unicode, the
'lang' attribute, etc.

By using Unicode, one does not have to change the client software every time
an encoding anywhere in the entire world changes (e.g. adding the Euro
character). It only needs to be updated when its own local encoding changes,
and then only if it is doing the conversion to Unicode itself and not
counting on the local OS to do that.
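
To illustrate the point above that anything claiming to do XML has to handle
UTF-8 and UTF-16, here is a small Python sketch using the standard library's
ElementTree parser. The <message>/<body> element names are just placeholders
of mine, not taken from the Jabber DTDs.

```python
import xml.etree.ElementTree as ET

# Illustrative document; the body is Russian text written with escapes.
doc = "<message><body>\u041f\u0440\u0438\u0432\u0435\u0442</body></message>"

# The same document serialized two ways. UTF-8 is the XML default, so no
# declaration is needed; the UTF-16 form carries a declaration and a
# byte-order mark (Python's "utf-16" codec adds the BOM).
as_utf8 = doc.encode("utf-8")
as_utf16 = ('<?xml version="1.0" encoding="UTF-16"?>' + doc).encode("utf-16")

body_from_utf8 = ET.fromstring(as_utf8).findtext("body")
body_from_utf16 = ET.fromstring(as_utf16).findtext("body")

# Both byte streams parse to the same Unicode text.
assert body_from_utf8 == body_from_utf16
```

The application code never mentions KOI8-R or CP-1251 here; the parser hands
back Unicode strings regardless of which of the two encodings was on the wire.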

I'm sure this will spark a flurry of different opinions, but it can be a
simple thing.

--
"My new computer's got the clocks, it rocks
But it was obsolete before I opened the box" - W.A.Y.