[JDEV] request for ideas: RFC822 to JID mapping

Matthias Wimmer m at tthias.net
Sun Jul 28 12:07:05 CDT 2002


Hi David!

David Waite wrote:

>> Yeah, but I think this is a minor problem as most people won't use 
>> nodes with more than 64 bytes. BTW: I think nodes can have up to 256 
>> bytes and I think it's strange to limit it based on bytes instead of 
>> characters ... *g*)
>
> Oh, it's more complicated than that. One character can be composed of 
> multiple codepoints, which (in UTF-8 encoding) can be composed of 
> multiple bytes. What you probably meant was codepoints, which is even 
> a weirder place to stand than bytes - the computer has difficulty 
> using fixed-length fields for the username, and clients still have to 
> figure out how many characters can be represented based on the # of 
> codepoints.

Yes ... I meant codepoints. It's just a problem with my limited ability 
to express everything correctly when I write English.
I am German and we have letters with modifiers too. E.g. the letter "Ü" 
can be encoded as U+00DC or as U+0055 U+0308.
It would be nice to have logic in the Jabber server that notices that 
these two encodings are the same letter and treats nodes that differ 
only in which of the two encodings they use as identical. But this is 
hard to implement and I won't volunteer for that job. ;)
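
A minimal sketch of that comparison, assuming the server would normalize both nodes to Unicode form NFC before comparing them. The choice of NFC and the function name same_node are assumptions made for illustration, not something the Jabber specs prescribe:

```python
import unicodedata

# The two encodings of "Ü" mentioned above.
precomposed = "\u00dc"   # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
decomposed = "U\u0308"   # U+0055 followed by U+0308 COMBINING DIAERESIS


def same_node(a: str, b: str) -> bool:
    """Hypothetical helper: compare two nodes after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # False: the codepoints differ
print(same_node(precomposed, decomposed))  # True: same letter after NFC
```

A real server would have to apply the same normalization when a node is registered, not only when it is compared, otherwise both spellings could be registered as separate accounts.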

> (A hopefully correct example)
> 1 word in US7ASCII could be 8 bytes , 8 codepoints and would be 8 
> characters.
> 1 word in some Asian languages could be 1 character, 3 codepoints and 
> 12 bytes.
>
> The Chinese speaker has used 1/8th the # of characters as the English 
> speaker, but has conveyed the same amount of information.

That's true for Chinese. But there are other alphabets too. E.g. the 
Thai alphabet, which has codepoints U+0E?? (3 bytes per character in 
UTF-8) and uses (AFAIK) about the same number of letters per word.
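
To make the difference between codepoints and bytes concrete, here is a small sketch that counts both for an ASCII word, a German name and a Thai word. The sample strings and the function name sizes are chosen for illustration only:

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (number of codepoints, number of bytes in UTF-8)."""
    return len(text), len(text.encode("utf-8"))


ascii_word = "username"
german_name = "M\u00fcller"                           # "Müller"
thai_word = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"    # Thai greeting, U+0E?? only

print(sizes(ascii_word))   # (8, 8)
print(sizes(german_name))  # (6, 7)
print(sizes(thai_word))    # (6, 18)
```

With a limit counted in bytes, the Thai user can enter far fewer letters than the English user, although both type words of similar length.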
Another problem arises if you transfer Jabber in another encoding, let's 
say UCS-2. (I know that this is not valid according to the Jabber specs, 
but there are people that would like to do this and I can imagine that 
it might be done some time.) To check whether a node is valid (not too 
long) they have to convert the JID to UTF-8 first. It's probably too 
late to change it, but I still think it is a strange definition if you 
see it from the user's side. And if you implement a Jabber client and 
you offer the user a text field where he can enter his node, the client 
has to update the valid length of this text box based on the characters 
that the user has already entered.
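
A rough sketch of what such a client would have to do. The 256-byte limit is only the figure guessed earlier in this thread, and all names here (MAX_NODE_BYTES, remaining_bytes, from_ucs2_wire) are hypothetical. Python has no dedicated UCS-2 codec, so the sketch decodes with UTF-16 big-endian as a stand-in:

```python
MAX_NODE_BYTES = 256  # assumption: the limit mentioned earlier in the thread


def node_bytes(node: str) -> int:
    """Length of the node as it counts against the limit: bytes in UTF-8."""
    return len(node.encode("utf-8"))


def is_valid_length(node: str, limit: int = MAX_NODE_BYTES) -> bool:
    """A node must be non-empty and fit into the byte limit."""
    return 0 < node_bytes(node) <= limit


def remaining_bytes(node: str, limit: int = MAX_NODE_BYTES) -> int:
    """Bytes still available; a text field would recompute this per keystroke."""
    return limit - node_bytes(node)


def from_ucs2_wire(raw: bytes) -> str:
    """Decode a node received in a UCS-2-like encoding (UTF-16-BE stand-in)."""
    return raw.decode("utf-16-be")


# The same number of typed letters uses up the budget at different speeds.
print(remaining_bytes("abc", 10))                   # 7
print(remaining_bytes("\u00fc\u00fc\u00fc", 10))    # 4
print(remaining_bytes("\u0e2a\u0e27\u0e31", 10))    # 1

# A node that arrived in the other encoding is re-measured in UTF-8.
wire = "M\u00fcller".encode("utf-16-be")
print(is_valid_length(from_ucs2_wire(wire)))        # True
```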

> At least with bytes, it's (computationally) easy for everyone to figure 
> out what the limit is.

I think with codepoints it wouldn't have been much harder to implement 
the server and it would be easier to implement a user interface.
Also, bytes are very C-centric. If you use a language that supports 
Unicode (e.g. Java) you will most likely use a Unicode string to store a 
node, and the language will convert the UTF-8 string when it is read from 
the network. When you then check the length of a node to decide whether 
it is valid, you will always have to convert it back to UTF-8.

> Finally - I think it would be interesting to be able to 'limit' a 
> server to a subset of the full JID scheme with a server setting; 
> perhaps a subset which corresponds with a subset of RFC 2822. I know I 
> would probably turn this on just to guarantee that I can migrate the 
> user information storage and authentication mechanisms around to 
> systems which do not support unicode.

Maybe this could be an interesting feature, but I don't think I would 
use it. Even here in Germany, where there is no big need for Unicode node 
names (because we only use modified Latin characters), I have 20 users 
(out of about 14000) that have registered with non-ASCII characters in 
their ID (and even more with invalid IDs if you limited them to RFC822).

> Finally RFC 2822 allows for quoted literals, and quotes are not legal 
> in JIDs - so even declaring a subset you would still not be able 
> to get a 1:1 mapping without amending the JEP for JIDs. Since this 
> is for allowing email to users on the jabber server, having the local 
> JIDs as a subset of RFC2822 is fine.

For the mapping of JIDs to e-mail addresses I don't care about mail 
addresses that can't be mapped to JIDs. For mapping mail addresses to 
JIDs (a Jabber user has to address his e-mail) I can use other ways of 
quoting. For this direction it isn't that important to have nice JIDs 
because they can be translated by jabber:iq:gateway.


See you
    Matthias

-- 
Fon: +49-700 77007770		http://matthias-wimmer.de/
Fax: +49-89 312 88654		jabber://mawis@charente.de




