[JDEV] Why XML for everything?

John Price linux-guru at gcfl.net
Wed Sep 29 21:54:15 CDT 1999


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 29 Sep 1999, Zoom Juice wrote:
> There are two questions here: (1) What about the extra
> parsing time? (2) What about the increased message
> size?
> 
> (1) Increased parsing time
> Exactly what will the increase be?  This is determined
> by the complexity of the grammar, the number of input
> tokens to be resolved, the average length of the input
> tokens, and a few other factors.  XML can be parsed by
> a simple push-down automaton that runs in linear time
> with respect to total input tokens.  That is because
> there is very little branching, and no ambiguity in
> the grammar - basically you just process the tokens as
> they come in, looking for 3 types: (a) open angle
> bracket (b) close angle bracket (c) neither of the
> above, i.e., a normal token.  Suppose you classify the
> 3 types with codes 0, 1, 2.  You can use a jump table
> to implement the grammar - very fast.  Resolving the
> stream of input characters into tokens is similarly
> fast - it can all be done with lookups indexed by
> character values and jump tables (handling 16 bit
> characters could make this a little more complex, but
> not much).  Looking up the tags in your symbol
> dictionary could potentially dominate the process, but
> not if you were to, say, use a tool to generate a
> "perfect hash table", thus resolving the tags in
> linear time with a very small "k".  The total number
> of input characters that have to be processed has a
> small effect on the final parsing time, but it's not
> much at all - write a C program that just reads every
> character in a file to find out just how fast it is. 
> If it's still not fast enough for you (for some
> hard-to-imagine reason), read the entire message into
> a memory buffer with a single read operation and parse
> out the characters from memory using while (n--)
> parse1(*p++); or something similar.  So parse time
> just isn't really a problem, agreed?  BTW, this can
> all be done using YACC and/or LEX, or equivalent. 
> Check out JIKES (free from IBM) for an alternative,
> ultra-modern approach.  Or just implement a
> simple-minded, sloppy parser and you'll *still*
> scarcely notice the parsing time, next to the time to
> actually transfer the message across your network link
> or modem interface, say.
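
For reference, the table-driven scan described in the quoted text might
look roughly like the C sketch below. It is only an illustration: the
names scan_stats, on_open, on_close, on_other and scan_buffer are made up
here, the "actions" just count characters instead of driving a real
parser, and only 8-bit characters are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Class codes from the quoted text: 0 = '<', 1 = '>', 2 = anything else. */
enum { CL_OPEN = 0, CL_CLOSE = 1, CL_OTHER = 2 };

/* Hypothetical counters standing in for real parser actions. */
struct scan_stats {
    size_t opens;
    size_t closes;
    size_t others;
};

static void on_open(struct scan_stats *s)  { s->opens++; }
static void on_close(struct scan_stats *s) { s->closes++; }
static void on_other(struct scan_stats *s) { s->others++; }

/* Jump table indexed by class code. */
static void (*const actions[3])(struct scan_stats *) = {
    on_open, on_close, on_other
};

/* Scan a whole message that is already in memory, one character at a time. */
static void scan_buffer(const unsigned char *p, size_t n, struct scan_stats *s)
{
    unsigned char char_class[256];  /* lookup indexed by character value */
    size_t i;

    assert(p != NULL && s != NULL);

    for (i = 0; i < 256; i++)
        char_class[i] = CL_OTHER;
    char_class['<'] = CL_OPEN;
    char_class['>'] = CL_CLOSE;

    /* Same shape as the "while (n--) parse1(*p++);" loop in the quote. */
    while (n--)
        actions[char_class[*p++]](s);
}
```

Calling scan_buffer() on a buffer filled by a single read() touches each
character exactly once, which is the linear-time behaviour being claimed.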

I guess I was comparing your XML messages to, say, a packet type
protocol, with a fixed header that had the type of message, maybe a
couple of other fields common to all packets, a 32-bit from ID
(dynamically assigned to each user on login), a 32-bit to ID, and a
variable-length data field.

So, for example, instead of a message like one of your example
messages:

<message>
        <from name='nickname'>fred</from>
        <thread>sdfa</thread>
        <priority>1</priority>
        <subject>Did you see that?</subject>
        <say>asdgf asdfkjasgoijqwert asdgaldgjkas</say>
</message>

You might have the following:

[packet type byte]
[1-2 other fixed bytes]
[4 bytes-from ID]
[4 bytes-to ID]
[data length byte]
Subject: Here's a subject if you want one.
Who needs threads?  I just want to send my buddy a message! :-)
Here you can have all kinds of XML (or HTML) attributes, fonts, etc...
asdgf asdfkjasgoijqwert asdgaldgjkas
<NULL>

Sure, it's somewhat more cryptic, but it's A LOT easier to parse
with a program than XML.  There is a defined order and length to
nearly every field.  You don't need a "perfect hash table," no token
parsing, and really only one jump table indexed by the packet type.
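
To make the fixed layout concrete, here is a rough C sketch of pulling
the fields out of such a packet. Several details are assumptions that
the layout above does not settle: exactly two "other" bytes, IDs in
big-endian (network) byte order, and a data field that directly follows
the length byte. The names struct packet, read_u32_be and parse_packet
are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* type(1) + other(2, assumed) + from ID(4) + to ID(4) + data length(1) */
enum { HDR_LEN = 1 + 2 + 4 + 4 + 1 };

struct packet {
    uint8_t        type;
    uint8_t        other[2];   /* the "1-2 other fixed bytes"; 2 assumed */
    uint32_t       from_id;
    uint32_t       to_id;
    uint8_t        data_len;
    const uint8_t *data;       /* points into the caller's buffer */
};

/* Read a 32-bit big-endian value (byte order is an assumption). */
static uint32_t read_u32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer is too short for the packet. */
static int parse_packet(const uint8_t *buf, size_t len, struct packet *out)
{
    assert(buf != NULL && out != NULL);

    if (len < HDR_LEN)
        return -1;

    out->type     = buf[0];
    out->other[0] = buf[1];
    out->other[1] = buf[2];
    out->from_id  = read_u32_be(buf + 3);
    out->to_id    = read_u32_be(buf + 7);
    out->data_len = buf[11];

    if (len - HDR_LEN < out->data_len)
        return -1;

    out->data = buf + HDR_LEN;
    return 0;
}
```

Every field sits at a known offset, so there is no tokenizing and no tag
lookup; the only variable part is the data field, whose length is given
up front.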

Plus, from what I can tell, there are no real "formal defs of XML."
It seems like a make-it-up-as-you-go protocol, which makes it VERY
expandable, but at the expense of added complication.  XML was
created as a solve-all-the-world's-problems structure.  You are trying
to get a message from point A to point B.  Why complicate it to
death?

One of your objectives is to keep the client simple...  To me, XML is
far from simple.

And XML complicates the server too.  If you are talking about
possibly thousands of messages being routed at a time, why make it
harder on the server?  With my example, the server hardly
parses the message at all.  It just looks at the packet type, then
looks at the To ID, and routes the packet to the other client (or
another server).  Can't get much simpler than that!
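
A rough C sketch of that routing step follows. It assumes the same
12-byte header as the example layout (two "other" bytes, big-endian
IDs), and the packet types PKT_MESSAGE and PKT_PING, the ROUTE_* codes
and all function names are hypothetical, chosen only to show a jump
table indexed by packet type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed offsets: type(1) + other(2) + from ID(4) puts the To ID at 7. */
enum { OFF_TYPE = 0, OFF_TO_ID = 7, MIN_LEN = 12 };

/* Hypothetical packet types and result codes. */
enum { PKT_MESSAGE = 0, PKT_PING = 1, PKT_TYPE_COUNT = 2 };
enum { ROUTE_DROP = -1, ROUTE_LOCAL = -2 };

/* A handler returns the destination ID, or a negative ROUTE_* code. */
typedef int64_t (*route_fn)(const uint8_t *pkt);

/* Messages are forwarded to the To ID; the payload is never examined. */
static int64_t route_message(const uint8_t *pkt)
{
    const uint8_t *p = pkt + OFF_TO_ID;
    return ((int64_t)p[0] << 24) | ((int64_t)p[1] << 16) |
           ((int64_t)p[2] << 8)  |  (int64_t)p[3];
}

/* Pings are answered by this server rather than forwarded. */
static int64_t route_ping(const uint8_t *pkt)
{
    (void)pkt;
    return ROUTE_LOCAL;
}

/* The one jump table, indexed by packet type. */
static const route_fn route_table[PKT_TYPE_COUNT] = {
    route_message, route_ping
};

static int64_t route_packet(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);

    if (len < MIN_LEN || pkt[OFF_TYPE] >= PKT_TYPE_COUNT)
        return ROUTE_DROP;

    return route_table[pkt[OFF_TYPE]](pkt);
}
```

The server's per-packet work here is one bounds check, one table lookup
and four byte reads, regardless of how long the data field is.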


> (2) Increased message size
> OK, maybe you have a point, but let's look at it
> anyway.  If you go check out the formal defs of XML
> you'll see, front and center, that compactness is
> explicitly not a goal.  Why?  Hmmm.  Because if you
> want compactness, use a compressor.  XML has other
> (arguably more important) goals, like readability,
> power and flexibility for example.  XML does compress
> wonderfully - try it (I recommend bzip for your
> tests). 
> So, yes, perhaps we could design compression into the
> jabber protocol - though I think the effort and
> resulting increase in complexity would be hard to
> justify... see below.

Message size directly determines 1) how much bandwidth the server
requires, and 2) how much memory it needs for a given number of
messages routed per time unit.

If I'm an ISP wanting to provide a message gateway, I'm VERY
interested in how much bandwidth your server is going to be using on
my link, not in whether the messages are easy to read.  Only the developers
would care about that.

And who says a packet protocol can't be flexible?

I agree the bzip idea is not a bad one for compression and
encryption...


- -- 
John Price <linux-guru at gcfl.net>

PGP key at http://www.gcfl.net/~linux-guru/publickey.txt

John's FreeDOS page -> http://www.gcfl.net/FreeDOS

AIM ID "GCFL Owner"
ICQ 24079586

- -----BEGIN GEEK CODE BLOCK-----
Version: 3.21
GE d-> s++:+ a C++ UL++++ P+ L+++> E- W+++ N+ o+ K- W--- O- M-- V--
PS-- PE+ Y+ PGP++> t+ 5 X++ R- tv+ b+ DI+ D+ G+ e++> h r+++ y+++
- ------END GEEK CODE BLOCK------

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 5.0i for non-commercial use
Charset: noconv

iQA/AwUBN/LCyrR4GidzvvE7EQKm0wCfflrKS5F/l/Wqlv3HkvRFswfBX74AoJ4N
2nTln32XQj7k5f/3xZ6mmDt7
=/4jW
-----END PGP SIGNATURE-----




