[jdev] Re: Get the length of the utf-8 sequence in Java
Chris Mullins
cmullins at winfessor.com
Thu Sep 9 15:52:46 CDT 2004
The algorithm below miscounts UTF-8 encoded code points greater than 0xFFFF. A Java String holds such a code point as two surrogate chars, and the loop counts 3 bytes for each of them (6 in total), whereas UTF-8 encodes the code point in 4 bytes.
See:
http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html
(Java before 1.5 doesn't support code points above 0xFFFF, so I'm not sure what would happen here if someone sent you one of those.)
The algorithm on that page looks like it will take care of what you're looking for.
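For illustration, here is a minimal sketch of a surrogate-aware count. It is not the code from the linked article. The names Utf8Length and utf8Length are made up for this example, and it assumes Java 5 or later for Character.isHighSurrogate and Character.isLowSurrogate. It counts 4 bytes for a well-formed surrogate pair and, unlike the loop quoted below, counts U+0000 as 1 byte, as standard UTF-8 does. An unpaired surrogate is counted as 3 bytes, which is an assumption, since encoders differ in how they treat it.

```java
public final class Utf8Length {
    private Utf8Length() {}

    /**
     * Returns the number of bytes needed to encode s as standard UTF-8,
     * without allocating a byte array.
     */
    public static int utf8Length(String s) {
        final int numChars = s.length();
        int numBytes = 0;
        for (int i = 0; i < numChars; i++) {
            char c = s.charAt(i);
            if (c <= 0x007F) {
                numBytes += 1;          // U+0000..U+007F
            } else if (c <= 0x07FF) {
                numBytes += 2;          // U+0080..U+07FF
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < numChars
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                numBytes += 4;          // surrogate pair: U+10000..U+10FFFF
                i++;                    // skip the low surrogate just consumed
            } else {
                // Rest of the BMP. Assumption: an unpaired surrogate is also
                // counted as 3 bytes; a real encoder may replace or reject it.
                numBytes += 3;
            }
        }
        return numBytes;
    }
}
```

For example, Utf8Length.utf8Length("a\u00E9\u20AC\uD83D\uDE00") gives 10 (1 + 2 + 3 + 4), where the quoted loop would give 12.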
--
Chris Mullins
-----Original Message-----
From: Cedric Vivier [mailto:cedricv at neonux.com]
Sent: Thu 9/9/2004 2:19 AM
To: jdev at jabber.org
Cc:
Subject: [jdev] Re: Get the length of the utf-8 sequence in Java
I do not believe Java has a method for this in the standard
library, but you could implement your own:
public int byte_length(String s) {
    int numchars = s.length();
    int numbytes = 0;
    for (int i = 0; i < numchars; i++) {
        int c = s.charAt(i);
        // 1 byte: U+0001..U+007F
        if ((c >= 0x0001) && (c <= 0x007F)) numbytes++;
        // 3 bytes: every char above U+07FF, surrogates included
        else if (c > 0x07FF) numbytes += 3;
        // 2 bytes: U+0080..U+07FF, and also U+0000
        else numbytes += 2;
    }
    return numbytes;
}
I have no idea if it would be faster than your current method though,
but it should be more memory-efficient at least.
--cedricv
_______________________________________________
jdev mailing list
jdev at jabber.org
https://jabberstudio.org/mailman/listinfo/jdev