[jdev] Re: Get the length of the utf-8 sequence in Java

Chris Mullins cmullins at winfessor.com
Thu Sep 9 15:52:46 CDT 2004


The algorithm below miscounts UTF-8 encoded code points greater than 0xFFFF, which take four bytes each in UTF-8.
 
See:
http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html
 
(Java before 1.5 doesn't support code points above 0xFFFF, though, so I'm not sure what would happen here if someone sent you one of those.)
 
The algorithm found on that page looks like it'll take care of what you're looking for. 
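 
For illustration, here is a minimal sketch of a count that also handles code points above 0xFFFF. It is not the code from that page. The class and method names (Utf8Length, utf8Length) are made up for this example, and it assumes standard UTF-8, where U+0000 is one byte. It uses only charAt(), so it should not need any 1.5-only API. A well-formed surrogate pair is counted as one 4-byte sequence. An unpaired surrogate is counted as 3 bytes, which is an assumption and may differ from what String.getBytes("UTF-8") produces for malformed input.

```java
public final class Utf8Length {
    private Utf8Length() {}

    /** Returns the number of bytes needed to encode s as standard UTF-8. */
    public static int utf8Length(String s) {
        int numBytes = 0;
        int n = s.length();

        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c <= 0x007F) {
                // U+0000..U+007F: one byte
                numBytes += 1;
            } else if (c <= 0x07FF) {
                // U+0080..U+07FF: two bytes
                numBytes += 2;
            } else if (c >= 0xD800 && c <= 0xDBFF
                    && i + 1 < n
                    && s.charAt(i + 1) >= 0xDC00
                    && s.charAt(i + 1) <= 0xDFFF) {
                // High surrogate followed by a low surrogate:
                // one code point above 0xFFFF, four bytes in UTF-8.
                numBytes += 4;
                i++; // skip the low surrogate
            } else {
                // Rest of the BMP (and, by assumption, unpaired surrogates):
                // three bytes
                numBytes += 3;
            }
        }
        return numBytes;
    }
}
```

With this, Utf8Length.utf8Length("\uD83D\uDE00") gives 4, where the quoted loop below would give 6 (3 for each half of the surrogate pair).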
 
-- 
Chris Mullins
 
 
 
-----Original Message----- 
From: Cedric Vivier [mailto:cedricv at neonux.com] 
Sent: Thu 9/9/2004 2:19 AM 
To: jdev at jabber.org 
Cc: 
Subject: [jdev] Re: Get the length of the utf-8 sequence in Java



	I do not believe Java's standard library has a method for this,
	but you could implement your own:
	
	
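	// Counts bytes using the 1-, 2- and 3-byte ranges only.
	public int byte_length(String s) {
	     int numchars = s.length();
	     int numbytes = 0;
	
	     for (int i = 0 ; i < numchars ; i++) {
	       int c = s.charAt(i);
	       // U+0001..U+007F: one byte
	       if ((c >= 0x0001) && (c <= 0x007F)) numbytes++;
	       // above U+07FF: three bytes
	       else if (c > 0x07FF) numbytes += 3;
	       // U+0080..U+07FF, and U+0000: two bytes
	       else numbytes += 2;
	     }
	
	     return numbytes;
	}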
	
	
	I have no idea if it would be faster than your current method though,
	but it should be more memory-efficient at least.
	
	
	--cedricv
	
	_______________________________________________
	jdev mailing list
	jdev at jabber.org
	https://jabberstudio.org/mailman/listinfo/jdev
	


