Using Python to parse XML is easier than it should be

A few months back, when I was just starting to poke around with Python, I saw this XKCD comic come through my RSS feed (my apologies if this clashes with the right-hand sidebar; maximizing your window might help):
[Image: XKCD comic, "import soul"]
At the time, I thought it was sort of funny, more for the complete nerdiness of creating a pet from an Eee PC and a hamster ball than anything else. The kicker at the end about importing a soul was just icing.

I bring this up because in preparation for the Elsevier Article 2.0 Challenge coming up in September, I wanted to start spending more time learning how to handle XML files. Since Python has become my language of choice (ok, full honesty - it’s the only language I can speak at all really, and even then only in primitive grunts), I wanted to see how hard it would be to work up an XML parser. It’s really easy. You just have to import it.

import xml.etree.ElementTree as ET

I wrote a very simple, short script just to make sure that it was as easy as I thought, and sure enough, it was.

xmlparser.py

# Initialization
import xml.etree.ElementTree as ET

# Read in our XML file
infile = raw_input("Input XML file name > ")
xmltree = ET.ElementTree(file=infile)
rootelem = xmltree.getroot()

print "This should be root_element"
print rootelem.tag

print "This should print two subelement tags"
for subelement in rootelem:
	print subelement.tag

print "This should print out the content of the sub elements"
for subelement in rootelem:
	print subelement.text

And I used a self-generated test file, test.xml:

<root_element>
	<sub_element>This is a sub element</sub_element>
	<sub_element id="2">This is a sub element with the ID set to "2"</sub_element>
</root_element>

and the output pretty much matches what you would guess:

Input XML file name > test.xml
This should be root_element
root_element
This should print two subelement tags
sub_element
sub_element
This should print out the content of the sub elements
This is a sub element
This is a sub element with the ID set to "2"
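The test file sets an id attribute on the second sub element, but the script above never reads it. Here is a minimal sketch of how that could be done with Element.get(), which returns None when the attribute is missing. Unlike the script above, it is written for Python 3 (print as a function), and it parses the XML from an inline string with ET.fromstring() instead of asking for a file name. The names TEST_XML and describe_subelements are my own, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Inline copy of test.xml so the sketch runs without a file on disk
TEST_XML = """<root_element>
    <sub_element>This is a sub element</sub_element>
    <sub_element id="2">This is a sub element with the ID set to "2"</sub_element>
</root_element>"""


def describe_subelements(xml_text):
    """Return (tag, id attribute or None, text) for each child of the root."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.get("id"), child.text) for child in root]


for tag, elem_id, text in describe_subelements(TEST_XML):
    print(tag, elem_id, text)
```

The first sub element has no id, so its middle value comes back as None rather than raising an error.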

This took all of about 10 minutes to do… I’m still sort of stunned. I’m sure the programmers/Python jockeys are laughing right now, but c’est la vie, I suppose.

I mean, it’s really almost frighteningly simple.  Let’s try playing with Hamlet, available online in XML format of course.  We can write a quick script to count how often Rosencrantz speaks:

#!/usr/bin/python

# Initialization
import xml.etree.ElementTree as ET

# Read in our XML file
infile = raw_input("Input XML file name > ")
xmltree = ET.ElementTree(file=infile)
rootelem = xmltree.getroot()

i = 0
act_list = rootelem.findall('ACT')
for act in act_list:
	scene_list = act.findall('SCENE')
	for scene in scene_list:
		speech_list = scene.findall('SPEECH')
		for speech in speech_list:
			speaker_list = speech.findall('SPEAKER')
			for speaker in speaker_list:
				if speaker.text == "ROSENCRANTZ":
					i = i + 1

print i

It’s 49, in case you are wondering.
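The four nested loops can probably be collapsed, since findall() also accepts a simple path. Here is a rough sketch, written for Python 3, that walks the same ACT/SCENE/SPEECH/SPEAKER nesting in one call and tallies every speaker with collections.Counter. The SAMPLE_PLAY string is a tiny made-up stand-in, not the real Hamlet file. Its PLAY root and LINE tags are assumptions on my part, and count_speakers is just a name I picked.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature play that mimics the ACT/SCENE/SPEECH/SPEAKER nesting.
# The PLAY root and LINE elements are assumptions, not taken from the real file.
SAMPLE_PLAY = """<PLAY>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>ROSENCRANTZ</SPEAKER><LINE>placeholder</LINE></SPEECH>
      <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>placeholder</LINE></SPEECH>
      <SPEECH><SPEAKER>ROSENCRANTZ</SPEAKER><LINE>placeholder</LINE></SPEECH>
    </SCENE>
  </ACT>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>GUILDENSTERN</SPEAKER><LINE>placeholder</LINE></SPEECH>
    </SCENE>
  </ACT>
</PLAY>"""


def count_speakers(root):
    """Tally SPEAKER text for every speech nested as ACT/SCENE/SPEECH."""
    speakers = root.findall("ACT/SCENE/SPEECH/SPEAKER")
    return Counter(speaker.text for speaker in speakers)


counts = count_speakers(ET.fromstring(SAMPLE_PLAY))
print(counts["ROSENCRANTZ"])
```

A Counter returns 0 for names it has never seen, so asking about a character who never speaks does not raise an error.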

I’m pretty excited about my experimentation with ElementTree so far. As usual, I’ve got a ton to learn, but it’s great to know that this powerful tool was lurking inside of Python the whole time.

One Response to “Using Python to parse XML is easier than it should be”

  1. zack Says:

    ElementTree is pretty hot stuff. If you really want to be impressed with a Python module that can handle almost anything you throw at it, have a look at Mark Pilgrim’s fabulous Universal Feedparser. http://feedparser.org/

    If you are doing any work with RSS/Atom/etc. this is a life saver.
