Parsing a Local HTML File with XPath

"We work in the darkness to serve the light." (Assassin's Creed)

Problem

Extract the information from the following HTML structure:
<div class="list">
    <div class="item">
        <p><label>姓名:</label><span>坤坤</span></p>
        <p><label>类型:</label><span>human</span></p>
        <p><label>年龄:</label><span>24</span></p>
    </div>
    <div class="item">
        <p><label>姓名:</label><span>伊娃</span></p>
        <p><label>类型:</label><span>机器人</span></p>
        <p><label>生产日期:</label><span>2019-01-01</span></p>
    </div>
    <div class="item">
        <p><label>姓名:</label><span>豆豆</span></p>
        <p><label>类型:</label><span>动物</span></p>
        <p><label>年龄:</label><span>3</span></p>
    </div>
    <div class="item">
        <p><label>姓名:</label><span>晗晗</span></p>
        <p><label>类型:</label><span>人类</span></p>
        <p><label>年龄:</label><span>21</span></p>
    </div>
</div>
and generate the following data structure:
{
    'human': [{ age: 24, name: '坤坤' }, { age: 21, name: '晗晗' }],
    'robot': [{ dateTime: '2019-01-01', name: '伊娃' }],
    'animal': [{ age: 3, name: '豆豆' }],
}

Solution

import json
from lxml import etree

# Read the local HTML file as bytes and decode it as UTF-8
with open("index.html", "rb") as f:
    html = f.read().decode('utf-8')

# Build an element tree that can be queried with XPath
tree = etree.HTML(html)

human = []
robot = []
animal = []
for item in tree.xpath('//div[@class="item"]'):
    # Each item has three <p> rows: name, type, and age (or production date)
    name = item.xpath('./p[1]/span/text()')[0]
    label = item.xpath('./p[2]/span/text()')[0]
    value = item.xpath('./p[3]/span/text()')[0]
    if 'human' in label or '人类' in label:
        human.append({'age': int(value), 'name': name})
    elif '机器人' in label:
        # Robots carry a production date instead of an age
        robot.append({'dateTime': value, 'name': name})
    else:
        animal.append({'age': int(value), 'name': name})

content = {'human': human, 'robot': robot, 'animal': animal}

res = json.dumps(content, indent=4, ensure_ascii=False)
print(res)

Output:

{
    "human": [
        {
            "age": 24,
            "name": "坤坤"
        },
        {
            "age": 21,
            "name": "晗晗"
        }
    ],
    "robot": [
        {
            "dateTime": "2019-01-01",
            "name": "伊娃"
        }
    ],
    "animal": [
        {
            "age": 3,
            "name": "豆豆"
        }
    ]
}
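
The positional lookups in the solution (p[1], p[2], p[3]) rely on every item listing its fields in the same order. As a rough sketch of a less order-sensitive variant, the snippet below reads each label/value pair and maps it through two lookup tables. The names TYPE_KEYS, FIELD_KEYS, SAMPLE, and parse_items are hypothetical and not part of the original solution, and the HTML is inlined (shortened to two items) so the example runs without index.html.

```python
from lxml import etree

# Hypothetical lookup tables: type text -> output key, label text -> field name
TYPE_KEYS = {'human': 'human', '人类': 'human', '机器人': 'robot', '动物': 'animal'}
FIELD_KEYS = {'姓名': 'name', '年龄': 'age', '生产日期': 'dateTime'}

# Shortened copy of the sample markup, inlined for a self-contained run
SAMPLE = '''
<div class="list">
  <div class="item">
    <p><label>姓名:</label><span>坤坤</span></p>
    <p><label>类型:</label><span>human</span></p>
    <p><label>年龄:</label><span>24</span></p>
  </div>
  <div class="item">
    <p><label>姓名:</label><span>伊娃</span></p>
    <p><label>类型:</label><span>机器人</span></p>
    <p><label>生产日期:</label><span>2019-01-01</span></p>
  </div>
</div>
'''

def parse_items(html):
    """Group items by type, reading fields by label rather than by position."""
    tree = etree.HTML(html)
    result = {'human': [], 'robot': [], 'animal': []}
    for item in tree.xpath('//div[@class="item"]'):
        record, kind = {}, None
        for p in item.xpath('./p'):
            # string() returns the concatenated text of the selected element
            label = p.xpath('string(./label)').strip().rstrip(':：')
            value = p.xpath('string(./span)').strip()
            if label == '类型':
                kind = TYPE_KEYS.get(value)
            elif label in FIELD_KEYS:
                key = FIELD_KEYS[label]
                record[key] = int(value) if key == 'age' else value
        if kind is not None:  # skip items whose type is not in TYPE_KEYS
            result[kind].append(record)
    return result

result = parse_items(SAMPLE)
print(result)
```

Items with an unrecognized type are skipped here rather than falling through to the animal list; whether that is preferable depends on how trustworthy the input is.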

Approach

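open() reads the local HTML file as bytes, which are then decoded as UTF-8
etree.HTML() builds the element tree used for XPath queries
xpath() extracts the required information
json.dumps() serializes the dictionary to a JSON string; indent sets the number of spaces per indentation level, and ensure_ascii=False keeps Chinese characters from being escaped as \uXXXX sequences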
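
To make the last step concrete, here is a small sketch comparing the default behaviour of json.dumps() with ensure_ascii=False and indent. The record dictionary is made-up sample data, not taken from index.html.

```python
import json

record = {'name': '坤坤', 'age': 24}  # illustrative sample data

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are
readable = json.dumps(record, ensure_ascii=False)

# indent=4 additionally pretty-prints with four spaces per nesting level
pretty = json.dumps(record, ensure_ascii=False, indent=4)

print(escaped)
print(readable)
print(pretty)
```

All three strings decode back to the same dictionary with json.loads(); the options only affect how the text looks.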