How to Process XML with ElementTree in Python

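ElementTree is a library for working with XML in Python. It has been part of the Python standard library since version 2.5.
With this library you can parse XML into a form that is easy to use inside a program, and you can also generate XML files. This post summarizes how to parse XML and use the result.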

Setup

Python 2.5 and later ship with ElementTree, so there is no need to obtain a separate package.
It can be used simply by importing it as follows.

from xml.etree.ElementTree import *

If you are using a version older than Python 2.5, download the package from

http://effbot.org/zone/element-index.htm

and import it as follows.

from elementtree.ElementTree import *
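
If the same script has to run on both kinds of interpreter, the two imports can be combined. The following is only a minimal sketch: the alias ET is a naming choice made here, not something the original examples use, and the fallback branch assumes the standalone elementtree package has been installed.

```python
# Prefer the standard-library module; fall back to the standalone package
# only on interpreters older than Python 2.5.
try:
    import xml.etree.ElementTree as ET
except ImportError:
    # Assumption: the standalone package from effbot.org is installed.
    import elementtree.ElementTree as ET

root = ET.fromstring('<window width="1920"/>')
print(root.tag)
```

With the alias, the functions are called as ET.fromstring and ET.parse instead of the bare names that the wildcard import provides.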

Sample data

The explanations below use an XML file with the following content as the sample.

<window width="1920">
	<title font="large" color="red">sample</title>
	<buttonbox>
		<button>OK</button>
		<button>Cancel</button>
	</buttonbox>
</window>

Creating an Element

The central object in ElementTree is the Element. The example below shows how to create one.

# Create from a string
xmlString = '<window width="1920"><title font="large" color="red">sample</title><buttonbox><button>OK</button><button>Cancel</button></buttonbox></window>'
elem = fromstring(xmlString) # get the root element (an Element)

# Create from a file
tree = parse("sample.xml") # the return value is an ElementTree
elem = tree.getroot() # get the root element (an Element)

To create an Element from a string containing XML, call the fromstring function. This is the approach you will usually take when parsing data fetched from a web API or a similar source.
To read data from an XML file, use the parse function. It returns an ElementTree, a wrapper class used for reading XML from and writing XML to files. For now it is enough to remember that calling parse and then the getroot method, as shown above, gives you the root Element.
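
To see the difference between the two return types side by side, here is a minimal sketch. It assumes the ET alias for the module, and it uses io.StringIO as a stand-in for sample.xml so that no file is needed on disk; both are choices made for this illustration.

```python
import io
import xml.etree.ElementTree as ET

XML_TEXT = (
    '<window width="1920">'
    '<title font="large" color="red">sample</title>'
    '<buttonbox><button>OK</button><button>Cancel</button></buttonbox>'
    '</window>'
)

# fromstring() parses text and returns the root Element directly.
root_from_text = ET.fromstring(XML_TEXT)

# parse() accepts a file name or a file-like object and returns an
# ElementTree wrapper; getroot() then gives the root Element.
tree = ET.parse(io.StringIO(XML_TEXT))
root_from_file = tree.getroot()

print(type(tree).__name__)
print(root_from_text.tag, root_from_file.tag)
```

Either way the result is the same kind of root Element, so the rest of the examples work regardless of which route was taken.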

Referencing data

How to read the tag name and the attributes of an element.

# Get the tag of the element
print(elem.tag)
# Get an attribute
print(elem.get("width"))
# Get an attribute, with a default for when it is missing
print(elem.get("height", "1200"))
# Get the list of attribute names
print(elem.keys())
# Get a list of (attribute, value) tuples
print(elem.items())

Output
window
1920
1200
['width']
[('width', '1920')]
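
Attribute values always come back as strings, which is easy to overlook when the value looks numeric. The sketch below, written against a shortened copy of the sample and using the ET alias (an assumption of this illustration), converts them explicitly and also reads all attributes at once through the attrib dictionary.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<window width="1920">'
    '<title font="large" color="red">sample</title>'
    '</window>'
)

# Attribute values are strings, so convert them before doing arithmetic.
width = int(root.get("width"))
# "1200" is only the fallback passed to get(); the XML has no height attribute.
height = int(root.get("height", "1200"))

# .attrib exposes every attribute of an element as a dictionary.
title_attrs = dict(root.find("title").attrib)

print(width * height)
print(title_attrs)
```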

Searching for elements

How to search for elements that match a condition.

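# Return the first element that matches the condition
print(elem.find(".//buttonbox").tag)
# Return all elements that match the condition as a list
for e in elem.findall(".//button"):
    print(e.tag)
# Return the text of the first element that matches the condition
print(elem.findtext(".//button"))
# findtext written out in two steps looks like this
print(elem.find(".//button").text)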

Output
buttonbox

button
button

OK
OK

The arguments to find, findall, and findtext accept XPath-style expressions (ElementTree supports a limited subset of XPath). See the following page for details.
http://effbot.org/zone/element-xpath.htm
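
A few more path forms are worth trying against the sample. This is a minimal sketch that assumes Python 2.7 or later (attribute predicates are not available in older ElementTree releases) and the ET alias; the checkbox tag is hypothetical and deliberately absent from the sample.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<window width="1920">'
    '<title font="large" color="red">sample</title>'
    '<buttonbox><button>OK</button><button>Cancel</button></buttonbox>'
    '</window>'
)

# A path without ".//" follows direct children one step at a time.
labels = [b.text for b in root.findall("buttonbox/button")]

# An attribute predicate keeps only elements whose attribute has that value.
large_title = root.find("title[@font='large']")

# find() returns None when nothing matches, so check before reading .text;
# findtext() takes a default value instead.
missing = root.find(".//checkbox")
fallback = root.findtext(".//checkbox", default="(none)")

print(labels)
print(large_title.text)
print(missing, fallback)
```

Checking for None matters in practice: calling .text on the result of a find that matched nothing raises an AttributeError.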

Iterating over elements

How to visit the elements in the XML one at a time.

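# Visit every element
# (iter() replaces getiterator(), which was removed in Python 3.9;
#  on Python 2.6 and earlier use getiterator() instead)
for e in elem.iter():
    print(e.tag)
# Restrict by tag name
for e in elem.iter("button"):
    print(e.tag)
# Get the list of child elements (direct children only, not recursive)
for e in list(elem):
    print(e.tag)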

Output
window
title
buttonbox
button
button

button
button

title
buttonbox
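
The flat listing above loses the nesting. As a rough sketch of how the direct-child iteration can be combined with recursion to keep it, the hypothetical helper walk below (not part of ElementTree) yields each element together with its depth; the ET alias is likewise an assumption of this illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<window width="1920">'
    '<title font="large" color="red">sample</title>'
    '<buttonbox><button>OK</button><button>Cancel</button></buttonbox>'
    '</window>'
)

def walk(elem, depth=0):
    """Yield (depth, element) pairs in document order."""
    yield depth, elem
    # Iterating over an element visits its direct children only.
    for child in elem:
        for item in walk(child, depth + 1):
            yield item

# Indent each tag by two spaces per nesting level.
lines = ["  " * depth + e.tag for depth, e in walk(root)]
print("\n".join(lines))
```

The elements come out in the same order as with the built-in iteration, with the depth available as extra information.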

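Accessing the parent from a child element

An Element does not keep a reference to its parent, so the relationship between child and parent has to be tracked separately. The official manual introduces two approaches: defining a generator, and building a map that stores the relationship.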

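# Approach using a generator
# (iter() replaces getiterator(), which was removed in Python 3.9)
def iterparent(elem):
    for parent in elem.iter():
        for child in parent:
            yield parent, child

for p, c in iterparent(elem):
    print(c.tag + ":" + p.tag)

# Approach that builds a child -> parent map
parent_map = dict((c, p) for p in tree.iter() for c in p)
# Iteration order follows insertion order on Python 3.7+; older versions may differ
for child, parent in parent_map.items():
    print(child.tag + ":" + parent.tag)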

Output
title:window
buttonbox:window
button:buttonbox
button:buttonbox

title:window
buttonbox:window
button:buttonbox
button:buttonbox
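
Once the map exists, it can be followed repeatedly to climb from any element to the root, or used to remove an element from its parent. The following is a minimal sketch; ancestors is a hypothetical helper name introduced here, and the ET alias is an assumption of this illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<window width="1920">'
    '<title font="large" color="red">sample</title>'
    '<buttonbox><button>OK</button><button>Cancel</button></buttonbox>'
    '</window>'
)

# child -> parent map, built the same way as in the example above
parent_map = dict((c, p) for p in root.iter() for c in p)

def ancestors(elem, parent_map):
    """Return the tags from elem up to the root, innermost first."""
    path = [elem.tag]
    while elem in parent_map:
        elem = parent_map[elem]
        path.append(elem.tag)
    return path

cancel = root.findall(".//button")[1]
path = "/".join(reversed(ancestors(cancel, parent_map)))
print(path)

# remove() must be called on the parent, which the map provides.
parent_map[cancel].remove(cancel)
remaining = [b.text for b in root.iter("button")]
print(remaining)
```

Note that the map is a snapshot: after elements are added or removed it has to be rebuilt before it is used again.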

That's all.

hikm's blog