[jdev] parsing xml (xmpp) with ruby
Eric Will
rakaur at malkier.net
Sat Sep 27 14:28:21 CDT 2008
Hello World,
I am writing an XMPP (Jabber) server in Ruby. XMPP uses XML for its
protocol. This means I have to do a good deal of XML parsing, in Ruby.
Right now I am using REXML to parse the individual stanzas as they
come in. However, XMPP is streamed XML over a TCP socket, so the root
element is only opened once and everything after it is a series of
child stanzas. To parse a chunk without REXML complaining of
"multiple root elements", I have to wrap every incoming chunk of XMPP
in my own <root/> tag and then ignore that wrapper after REXML parses
it. I am currently unhappy with this approach.
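
A minimal sketch of the wrapping approach described above might look
like the following. The helper name parse_chunk and the sample stanzas
are hypothetical, and the sketch assumes each chunk read from the socket
contains only complete stanzas, which a real TCP read does not
guarantee.

```ruby
require 'rexml/document'

# Hypothetical helper: wrap a chunk of stanzas in a throwaway root so
# that REXML sees a single-rooted document, then return only the
# stanzas (the children of the wrapper).
def parse_chunk(chunk)
  doc = REXML::Document.new("<root>#{chunk}</root>")
  doc.root.elements.to_a
end

# Sample input (an assumption): two complete stanzas in one chunk.
chunk = "<message to='juliet@example.com'><body>hi</body></message>" \
        "<presence/>"

stanzas = parse_chunk(chunk)
stanzas.each { |stanza| puts stanza.name }
```

The wrapper element never leaves parse_chunk, so the rest of the code
only ever sees the stanza elements.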

Another option is to use REXML's stream parsing. I don't really like
this idea. It seems the only benefit of SAX(ish) parsing comes when
you're dealing with huge documents that you don't want to load into
memory. This isn't the case here. I get maybe 5-10 objects per parse. Most
of the people I've talked to in XMPP insist on using SAX (or something
like it, such as REXML's stream parsing). The other reason I don't
like REXML's stream parsing (or libxml's SAX) is because I have to
provide a class instance for it to use for the event-parsing, and this
class has to be a giant state machine, which seems wrong to me. I
don't want to have to write a complicated class to, in effect, parse
the XML myself when the XML parser should be doing this for me.
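
To make the state-machine concern concrete, here is a rough sketch of
the kind of listener that stream parsing calls for. The class name
StanzaListener and the sample stream are hypothetical. The sketch also
assumes the whole stream, including its closing tag, is available as a
single string, whereas a real server would feed the parser from the
socket as data arrives.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: rebuilds each stanza as a REXML::Element by
# tracking nesting depth. Depth 1 is the stream root, depth 2 a stanza.
class StanzaListener
  include REXML::StreamListener

  attr_reader :stanzas

  def initialize
    @stanzas = []   # completed stanzas
    @current = nil  # element currently being built
    @depth   = 0
  end

  def tag_start(name, attrs)
    @depth += 1
    return if @depth == 1 # the stream root itself is not a stanza

    element = REXML::Element.new(name)
    attrs.each { |key, value| element.add_attribute(key, value) }
    @current.add_element(element) if @current
    @current = element
  end

  def text(data)
    # Text outside of any stanza (e.g. whitespace) is ignored.
    @current.add_text(data) if @current
  end

  def tag_end(_name)
    @depth -= 1
    return if @current.nil?

    if @depth == 1
      @stanzas << @current # a whole stanza has just closed
      @current = nil
    else
      @current = @current.parent
    end
  end
end

# Sample input (an assumption): a complete, closed stream.
xml = "<stream:stream xmlns='jabber:client' " \
      "xmlns:stream='http://etherx.jabber.org/streams'>" \
      "<message to='juliet@example.com'><body>hi</body></message>" \
      "<presence/>" \
      "</stream:stream>"

listener = StanzaListener.new
REXML::Document.parse_stream(xml, listener)
listener.stanzas.each { |stanza| puts stanza.name }
```

Even this small version has to carry depth and "current element" state
between callbacks, which is the bookkeeping being objected to.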

The other option I've considered is to use hpricot for the incoming
parsing (since it's C, and way faster than REXML) and to continue
using REXML for generating the outgoing XML (I can't seem to figure
out how to do this in hpricot, if it's even possible). However, XMPP
requires well-formed XML, and hpricot does not check for
well-formedness (to the best of my knowledge). I also like xml-simple,
but it uses REXML underneath it all, so I'm left with the same issues.
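
For the outgoing side, generating a stanza with REXML can be sketched
roughly as follows. The addresses and message text are made up for
illustration.

```ruby
require 'rexml/document'

# Build an outgoing <message/> stanza element by element.
message = REXML::Element.new('message')
message.add_attribute('to', 'juliet@example.com')
message.add_attribute('type', 'chat')
message.add_element('body').text = 'hello'

# Serialize the stanza to a string that can be written to the socket.
outgoing = message.to_s
puts outgoing
```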

My real question is: is there a GOOD REASON to switch from the scheme I
currently use? A number of people seem to think it's the "Wrong Thing"
to do, but I'm not quite sure what the "Right Thing" is. I don't think
it's SAX.

Thanks for any feedback.

-- rakaur