[JDEV] charsets (was: Protocol extension?)

Jon A. Cruz joncruz at geocities.com
Thu Jul 29 23:51:19 CDT 1999


"Lindsay F. Marshall" wrote:

>
> I suspect that the only place where anything is formally required to be
> ascii is hostnames, which have a limited character set (RFC
> anyone?) I mean, I would get pretty hacked off if my name used
> non-ascii characters and you told me I couldn't use it. Computers are
> meant to assist people, right? Ever looked in on a Japanese IRC
> channel? If a client can't/won't cope, then so what? It could (like
> some mailers) display a message saying that the message is encoded in
> a character set that is not supported and allow various options for
> viewing the data. (In fact very often 7-bit ascii works anyway) The
> really nasty things start when you get a community of overlapping
> groups that use different encodings - this certainly happens with
> Russian and with Japanese. Flagged encodings would be a joy for
> sorting out this kind of mess! Allow encoding attributes on anything
> that can have CDATA, and ignore them if you want (so long as this fact
> is documented that's OK by me). (when I say "you", I mean the abstract
> you that writes clients of course)
>
> >* typical; my netscape on Linux actually displays Kanji correctly.
>
> Yup, Linux does a pretty good job with Japanese and Korean. A damn sight
> better than other systems I won't name.

Please, please, please, please, please, please don't get into multiple
encodings.

Please.

There are all sorts of potential problems. Specifying UTF-8 as "the"
official encoding would simplify things greatly.

First of all, the XML spec (http://www.w3.org/TR/1998/REC-xml-19980210),
section 2.2, states:
"All XML processors must accept the UTF-8 and UTF-16 encodings of 10646;"

Thus, since we are basing things on XML, we can declare that if a client
can't handle UTF-8, then it has a broken XML implementation. (If it can't,
I can write the code for them. It's not too hard   :-)
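For what it's worth, here is a minimal sketch of how little it takes (my
own illustration, not part of any client; it just assumes a Java client
reading from an arbitrary byte stream):

    import java.io.*;

    public class Utf8Read {
        public static void main(String[] args) throws IOException {
            // Wrap any byte stream in a reader that decodes UTF-8
            // into Java's internal Unicode strings.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, "UTF-8"));
            String line;
            while ((line = in.readLine()) != null)
                System.out.println("read " + line.length() + " chars");
        }
    }

The UTF-8 decoder has shipped in the core class libraries since Java 1.1
(under the historical name "UTF8"), so no extra conversion tables are
needed.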

Any given client has all the information it needs to convert its own local
encoding to Unicode, and thus to UTF-8. But a different client may not
have the information needed to decode that local encoding. E.g. in talking
to someone from Japan, my English Windows machine knows how to display
Japanese Unicode, but it is missing the conversion tables for EUC-JP, and
would not be able to display his Japanese correctly.
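To make that concrete, a sketch (the byte values are a hypothetical EUC-JP
sample, the charset names are the standard Java ones, and it assumes the
JRE ships the EUC-JP converter):

    import java.io.UnsupportedEncodingException;

    public class ToUtf8 {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            // Hypothetical sample: "Nihon" (Japan) as the sender's
            // local EUC-JP bytes.
            byte[] eucJp = { (byte) 0xC6, (byte) 0xFC,
                             (byte) 0xCB, (byte) 0xDC };

            // The sender knows its own local encoding, so it can decode
            // to Unicode and re-encode as UTF-8 before transmitting.
            String unicode = new String(eucJp, "EUC-JP");
            byte[] utf8 = unicode.getBytes("UTF-8");
            System.out.println(utf8.length + " UTF-8 bytes");
        }
    }

A receiver without the EUC-JP tables (like my English Windows box) could
never perform that first decoding step, which is exactly why the
conversion belongs on the sending side.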

In doing the current version of COM, Microsoft recognized this and
declared that all COM strings are Unicode. Period. It doesn't even matter
that Windows 95 otherwise has almost no Unicode support (for the most
part, only two Win95 Unicode calls actually work): COM strings are
Unicode.

Also, remember that this is not just an international issue. For two
people in the US, one on a Mac and one on a Windows machine, the same
problem shows up (anyone remember the early web days when quotes would be
missing from web pages?). Throw Linux/Unix into the mix and you have three
different encodings for US English right off the bat. Ouch.
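To see why, a sketch (my own illustration; it assumes the JRE ships the
MacRoman converter, which lives in the extended charsets):

    public class QuoteDemo {
        public static void main(String[] args) throws Exception {
            // 0x93 is a left curly quote as written by Windows editors.
            byte[] b = { (byte) 0x93 };
            System.out.println(new String(b, "windows-1252")); // U+201C, curly quote
            System.out.println(new String(b, "x-MacRoman"));   // U+00EC, accented 'i'
            System.out.println(new String(b, "ISO-8859-1"));   // U+0093, unprintable control
        }
    }

Same byte, three different characters; that unprintable control is the
"missing quote" of the early web.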


Seriously, I want to ask: is there any reason not to use Unicode/UTF-8 as
the one and only encoding?




Now, as some people have pointed out, the encoding by itself does not
carry enough information. You also need the language, specified in a way
that preserves the pertinent detail.

Thankfully, the smart people who came up with the XML standard thought of
that. Section 2.12 addresses it with a special attribute named "xml:lang":
http://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag
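For example (a hypothetical fragment; the element names are made up, and
only the xml:lang attribute itself comes from the spec):

    <message>
      <body xml:lang="en">Hello</body>
      <body xml:lang="ja">こんにちは</body>
    </message>

The values are IETF RFC 1766 language tags, so "en", "en-US", and "ja" are
all legal, and a client that doesn't care can simply ignore the attribute.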

BTW, this scheme (language-country-region) is what Taligent added to Java
1.1 as java.util.Locale (they had many good engineers, including Dr. Mark
Davis, president of the Unicode Consortium). The same technology, but for
C/C++, is now available from IBM as a free library, the IBM Classes for
Unicode.

--
"My new computer's got the clocks, it rocks
But it was obsolete before I opened the box" - W.A.Y.





