We work in the darkness to serve the light. (Assassin's Creed)
Problem

Extract the information from the following HTML structure. The labels 姓名, 类型, 年龄 and 生产日期 mean name, type, age and production date.

```html
<div class="list">
    <div class="item">
        <p><label>姓名:</label><span>坤坤</span></p>
        <p><label>类型:</label><span>human</span></p>
        <p><label>年龄:</label><span>24</span></p>
    </div>
    <div class="item">
        <p><label>姓名:</label><span>伊娃</span></p>
        <p><label>类型:</label><span>机器人</span></p>
        <p><label>生产日期:</label><span>2019-01-01</span></p>
    </div>
    <div class="item">
        <p><label>姓名:</label><span>豆豆</span></p>
        <p><label>类型:</label><span>动物</span></p>
        <p><label>年龄:</label><span>3</span></p>
    </div>
    <div class="item">
        <p><label>姓名:</label><span>晗晗</span></p>
        <p><label>类型:</label><span>人类</span></p>
        <p><label>年龄:</label><span>21</span></p>
    </div>
</div>
```
Then produce the following data structure:

```
{
    'human': [{ age: 24, name: '坤坤' }, { age: 21, name: '晗晗' }],
    'robot': [{ dateTime: '2019-01-01', name: '伊娃' }],
    'animal': [{ age: 3, name: '豆豆' }],
}
```
Solution

```python
import json

from lxml import etree

# Read the local HTML file as bytes and decode it as UTF-8.
with open("index.html", "rb") as f:
    html = f.read().decode("utf-8")
tree = etree.HTML(html)

human = []
robot = []
animal = []

for item in tree.xpath('//div[@class="list"]/div[@class="item"]'):
    name = item.xpath("./p[1]/span/text()")[0]
    label = item.xpath("./p[2]/span/text()")[0]
    # Third field: an age for humans and animals, a production date for robots.
    value = item.xpath("./p[3]/span/text()")[0]
    if "human" in label or "人类" in label:
        human.append({"age": int(value), "name": name})
    elif "机器人" in label:
        robot.append({"dateTime": value, "name": name})
    else:
        animal.append({"age": int(value), "name": name})

content = {"human": human, "robot": robot, "animal": animal}

# indent=4 pretty-prints; ensure_ascii=False keeps Chinese characters readable.
res = json.dumps(content, indent=4, ensure_ascii=False)
print(res)
```

Output:

```
{
    "human": [
        {
            "age": 24,
            "name": "坤坤"
        },
        {
            "age": 21,
            "name": "晗晗"
        }
    ],
    "robot": [
        {
            "dateTime": "2019-01-01",
            "name": "伊娃"
        }
    ],
    "animal": [
        {
            "age": 3,
            "name": "豆豆"
        }
    ]
}
```
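The solution above picks fields by position (`p[1]`, `p[2]`, `p[3]`) and classifies with an if/elif chain. As an alternative, here is a minimal sketch that keys off the label text and uses lookup tables instead. The names `TYPE_TO_GROUP`, `LABEL_TO_KEY` and `group_items` are hypothetical and not part of the original post, and the sketch assumes the (label, value) pairs have already been extracted from each item.

```python
import json
from collections import defaultdict

# Hypothetical lookup tables (assumptions, not from the original post).
TYPE_TO_GROUP = {"human": "human", "人类": "human", "机器人": "robot", "动物": "animal"}
LABEL_TO_KEY = {"姓名": "name", "年龄": "age", "生产日期": "dateTime"}


def group_items(items):
    """Group items by type; each item is a list of (label, value) pairs."""
    grouped = defaultdict(list)
    for pairs in items:
        record = {}
        group = None
        for label, value in pairs:
            label = label.rstrip("::")  # drop a trailing colon, full-width or ASCII
            if label == "类型":
                group = TYPE_TO_GROUP[value]  # raises KeyError for an unknown type
            else:
                key = LABEL_TO_KEY[label]
                record[key] = int(value) if key == "age" else value
        grouped[group].append(record)
    return dict(grouped)


items = [
    [("姓名:", "坤坤"), ("类型:", "human"), ("年龄:", "24")],
    [("姓名:", "伊娃"), ("类型:", "机器人"), ("生产日期:", "2019-01-01")],
]
print(json.dumps(group_items(items), indent=4, ensure_ascii=False))
```

With lxml, the pairs for one item could be collected by reading `./label/text()` and `./span/text()` from each `./p` element. Looking fields up by label means the result no longer depends on the order of the `<p>` elements, and a new type only needs one more entry in the table.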
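Approach

1. `open()` reads the local HTML file as bytes, which are then decoded as UTF-8.
2. `etree.HTML()` builds the parsed tree that XPath queries run against.
3. XPath expressions extract the required fields.
4. `json.dumps()` serializes the dictionary to a JSON string; `indent` sets the number of spaces per indentation level, and `ensure_ascii=False` keeps Chinese characters as they are instead of converting them to `\uXXXX` escapes.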
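To illustrate the `json.dumps()` options mentioned above, here is a small standalone sketch. The `record` dictionary is sample data made up for the illustration, and only the standard library is used.

```python
import json

record = {"name": "坤坤", "age": 24}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep non-ASCII characters as-is
pretty = json.dumps(record, indent=4, ensure_ascii=False)

print(escaped)   # Chinese characters appear as \uXXXX escapes
print(readable)  # Chinese characters appear unchanged
print(pretty)    # one key per line, indented by four spaces
```

Both forms decode back to the same dictionary with `json.loads()`, so `ensure_ascii=False` only affects how readable the text is, not the data it carries.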