[JDEV] Why XML for everything?

Wed Sep 29 19:32:28 CDT 1999

--- John Price <linux-guru at gcfl.net> wrote:
> > > > But I don't understand the use of XML for ALL
> your server/client
> > > > messages.  I can understand it's use for the
> message itself.  It
> > > > would facilitate changes in font, color, etc.
> but why use it for
> > > > everything?  Seems like it would really
> increase the processing time
> > > > to parse all that...
> ...
> I still don't really have an answer to my original
> question (see above).
> Not trying to be a pest, but I just want to
> understand the ideas behide
> the decisions that have been made regarding Jabber.

There are two questions here: (1) What about the extra
parsing time? (2) What about the increased message
size?

(1) Increased parsing time
Exactly what will the increase be?  This is determined
by the complexity of the grammar, the number of input
tokens to be resolved, the average length of the input
tokens, and a few other factors.  XML can be parsed by
a simple push-down automaton that runs in linear time
with respect to total input tokens.  That is because
there is very little branching, and no ambiguity in
the grammar - basically you just process the tokens as
they come in, looking for 3 types: (a) open angle
bracket (b) close angle bracket (c) neither of the
above, i.e., a normal token.  Suppose you classify the
3 types with codes 0, 1, 2.  You can use a jump table
to implement the gramma - very fast.  Resolving the
stream of input characters into tokens is similarly
fast - it can all be done with lookups indexed by
character values and jump tables (handling 16 bit
characters could make this a little more complex, but
not much).  Looking up the tags in your symbol
dictionary could potentially dominate the process, but
not if you were to, say, use a tool to generate a
"perfect hash table", thus resolving the tags in
linear time with a very small "k".  The total number
of input characters that have to be processed has a
small effect on the final parsing time, but it's not
much at all - write a c program that just reads every
character in a file to find out just how fast it is. 
If it's still not fast enough for you (for some
hard-to-imagine reason), read the entire message into
a memory buffer with a single read operation and parse
out the charaters from memory using while (n--)
parse1(*p++); or something similar.  So parse time
just isn't really a problem, agreed?  BTW, this can
all be done using YACC and/or LEX, or equivalent. 
Check out JIKES (free from IBM) for an alternative,
ultra-modern approach.  Or just implement a
simple-minded, sloppy parser and you'll *still*
scarcely notice the parsing time, next to the time to
actually transfer the message across your network link
or modem interface, say.

(2) Increased message size
OK, maybe you have a point, but let's look at it
anyway.  If you go check out the formal defs of XML
you'll see, front and center, that compactness is
explicitly not a goal.  Why?  Hmmm.  Because if you
want compactness, use a compressor.  XML has other
(arguably more important) goals, like readability,
power and flexibility for example.  XML does compress
wonderfully - try it (I recommend bzip for your
tests). 
So, yes, perhaps we could design compression into the
jabber protocol - though I think the effort and
resulting increase in complexity would be hard to
justify... see below.

Even without compression, it's not a big deal.  To
convince yourself, go look at the examples provided in
the docs page.  Look at the unavoidable message
"payload", i.e., the text that actually gets sent plus
any required sequencing information, addressing, etc,
then divide by the sum of all the characters that were
sent to transfer the message.  What did you get?  A
factor of 3? 4? (To be honest, I don't know because I
haven't done it yet... ;)  Next question: does it
matter?  Did you send as many as 200 characters to say
"The quick brown fox jumps over the lazy dog"?  When
you sent that pic of yourself to the cute girl (err, I
mean cute *person*:) you sent ****17,000****
characters or so.  Being wrapped and unwrapped by
various (fixed size and bulky) headers along the route
the message took probably dwarfed that 4 fold factor
completely.  Remember, you still haven't played the
compression card.  Well, I'm beginning to ramble on a
bit... I'll stop now, but maybe you get my point.

In short: the overheads that some people worry about
with XML are actually miniscule when examined
critically.  Don't worry about.  Worry more about
flexibilty, power, robustness, readability (of the
messages), in other words, things that matter a whole
lot more in this application.  Also consider: many
encryption algorithms are close cousins of compression
algorithms (there are good theoretical reasons for
this) so your "bulky" XML message, after being
encrypted, will likely be scarcely different in size
from some different, cleverly coded, but
hard-to-implement-and-debug coding scheme.
__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com