Using Python to parse XML is easier than it should be
A few months back, when I was just starting to poke around with Python, I saw this XKCD comic come through my RSS feed (my apologies if this clashes with the right-hand sidebar; maximizing your window might help):

XKCD
At the time, I thought it was sort of funny, more for the complete nerdiness of creating a pet from an Eee PC and a hamster ball than anything else. The kicker at the end about importing a soul was just icing.
I bring this up because in preparation for the Elsevier Article 2.0 Challenge coming up in September, I wanted to start spending more time learning how to handle XML files. Since Python has become my language of choice (ok, full honesty - it’s the only language I can speak at all really, and even then only in primitive grunts), I wanted to see how hard it would be to work up an XML parser. It’s really easy. You just have to import it.
import xml.etree.ElementTree as ET
I wrote a very very simple and short script just to make sure that it was as easy as I thought it was, and sure enough this is the case.
xmlparser.py
# Initialization
import xml.etree.ElementTree as ET

# Read in our XML file
infile = raw_input("Input XML file name > ")
xmltree = ET.ElementTree(file=infile)
rootelem = xmltree.getroot()

print "This should be root_element"
print rootelem.tag

print "This should print two subelement tags"
for subelement in rootelem:
    print subelement.tag

print "This should print out the content of the sub elements"
for subelement in rootelem:
    print subelement.text
And I used a self-generated test file, test.xml:
<root_element>
    <sub_element>This is a sub element</sub_element>
    <sub_element id="2">This is a sub element with the ID set to "2"</sub_element>
</root_element>
and the output pretty much matches what you would guess:
Input XML file name > test.xml
This should be root_element
root_element
This should print two subelement tags
sub_element
sub_element
This should print out the content of the sub elements
This is a sub element
This is a sub element with the ID set to "2"
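One thing the script above never touches is the id attribute on the second sub element. Here is a minimal sketch of how that could be read, assuming the same test file contents. It is written in Python 3 syntax (print as a function) and parses the XML from an inline string so it runs on its own; the names TEST_XML and describe_children are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Same content as test.xml, inlined so the sketch is self-contained.
TEST_XML = """<root_element>
    <sub_element>This is a sub element</sub_element>
    <sub_element id="2">This is a sub element with the ID set to "2"</sub_element>
</root_element>"""


def describe_children(xml_text):
    """Return (tag, id attribute or None, text) for each child of the root."""
    root = ET.fromstring(xml_text)
    # Element.get() returns None when the attribute is missing,
    # so the first sub element (which has no id) does not raise an error.
    return [(child.tag, child.get("id"), child.text) for child in root]


for tag, elem_id, text in describe_children(TEST_XML):
    print(tag, elem_id, text)
```

Using get() rather than indexing into the attrib dictionary is just a convenience here, since only one of the two sub elements carries an id.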
This took all of about 10 minutes to do… I’m still sort of stunned. I’m sure the programmers/Python jockeys are laughing right now, but c’est la vie I suppose.
I mean, it’s really almost frighteningly simple. Let’s try playing with Hamlet, available online in XML format of course. We can write a quick script to count how often Rosencrantz speaks:
#!/usr/bin/python

# Initialization
import xml.etree.ElementTree as ET

# Read in our XML file
infile = raw_input("Input XML file name > ")
xmltree = ET.ElementTree(file=infile)
rootelem = xmltree.getroot()

# Walk ACT -> SCENE -> SPEECH -> SPEAKER and count matching speakers
i = 0
act_list = rootelem.findall('ACT')
for act in act_list:
    scene_list = act.findall('SCENE')
    for scene in scene_list:
        speech_list = scene.findall('SPEECH')
        for speech in speech_list:
            speaker_list = speech.findall('SPEAKER')
            for speaker in speaker_list:
                if speaker.text == "ROSENCRANTZ":
                    i = i + 1
print i
It’s 49, in case you are wondering.
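The four nested loops can probably be collapsed. Below is a rough sketch of the same count using Element.iter(), which walks every descendant with a given tag. It is written in Python 3 syntax, and the SAMPLE_PLAY string is a tiny made-up stand-in for the real Hamlet file: only the ACT/SCENE/SPEECH/SPEAKER nesting mirrors the script above, and the lines are placeholders. The function name count_speeches is also just for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play; the real Hamlet XML is far larger.
SAMPLE_PLAY = """<PLAY>
    <ACT>
        <SCENE>
            <SPEECH><SPEAKER>ROSENCRANTZ</SPEAKER><LINE>placeholder</LINE></SPEECH>
            <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>placeholder</LINE></SPEECH>
        </SCENE>
        <SCENE>
            <SPEECH><SPEAKER>ROSENCRANTZ</SPEAKER><LINE>placeholder</LINE></SPEECH>
        </SCENE>
    </ACT>
</PLAY>"""


def count_speeches(root, name):
    """Count SPEAKER elements anywhere under root whose text equals name."""
    return sum(1 for speaker in root.iter("SPEAKER") if speaker.text == name)


play_root = ET.fromstring(SAMPLE_PLAY)
print(count_speeches(play_root, "ROSENCRANTZ"))
```

One caveat: iter() finds SPEAKER elements at any depth, while the nested findall() calls only follow the ACT, SCENE, SPEECH path. If a file has speeches outside that path, the two approaches could give different counts.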
I’m pretty excited by my experimentation with ElementTree so far. As usual I’ve got a ton to learn, but it’s great to know that this powerful tool was lurking inside of Python the whole time.


June 25th, 2008 at 8:34 am
ElementTree is pretty hot stuff. If you really want to be impressed with a Python module that can handle almost anything you throw at it, have a look at Mark Pilgrim’s fabulous Universal Feedparser. http://feedparser.org/
If you are doing any work with RSS/Atom/etc. this is a life saver.