1. Task
Parse the table data on the following web page into a pandas DataFrame:
https://en.wikipedia.org/wiki/Harvard_University
2. Method
- Fetch the data
import requests
response = requests.get('https://en.wikipedia.org/wiki/Harvard_University')
- Get the table
from lxml import etree
html = etree.HTML(response.text)
# xpath() is a method of the parsed tree, not of the etree module
table = html.xpath('//table[@class="wikitable"]')[0]
- Parse the data in the table
# './/tr' also finds the rows when the page wraps them in a <tbody>
tr_array = table.findall('.//tr')
texts = []
for tr in tr_array:
    line = []
    for c in tr.iterchildren():
        line.append(c.text)
    texts.append(line)
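The row loop can be tried offline on a small inline table. The following is only a sketch under stated assumptions: `SAMPLE_HTML` and its values are made-up placeholders, `table_to_texts` is a hypothetical helper name, and it swaps `c.text` for `itertext()` so that a cell whose text sits inside a link is still captured.

```python
from lxml import etree

# Hypothetical HTML standing in for the downloaded page; values are placeholders.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th></th><th>Col A</th><th>Col B</th></tr>
  <tr><th><a href="#">Row 1</a></th><td>10%</td><td>N/A</td></tr>
  <tr><th>Row 2</th><td>20%</td><td>30%</td></tr>
</table>
"""


def table_to_texts(table):
    """Return the table as a list of rows, each a list of stripped cell strings."""
    texts = []
    for tr in table.findall('.//tr'):
        # itertext() joins text from nested tags (e.g. <a>), which c.text would miss.
        texts.append([''.join(c.itertext()).strip() for c in tr.iterchildren()])
    return texts


html = etree.HTML(SAMPLE_HTML)
table = html.xpath('//table[@class="wikitable"]')[0]
texts = table_to_texts(table)
print(texts)
```

With `c.text` alone, the "Row 1" cell would come back as `None`, because its text belongs to the nested `<a>` element rather than to the cell itself.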
- Extract the column names and the index from the text
col_names = texts[0][1:]
index_names = [t[0] for t in texts[1:]]
- Convert the data
values = []
for line in texts[1:]:
    row = []
    for v in line[1:]:
        v = v.strip()
        if v == 'N/A':
            v = None            # missing value
        elif v.endswith('%'):
            v = int(v[:-1])     # drop the trailing '%' and convert to int
        row.append(v)
    values.append(row)
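The same conversion rules can be factored into a small function, which makes them easy to check in isolation. This is a minimal sketch: `convert_cell`, `convert_rows` and the `sample` list are hypothetical names and placeholder data, not part of the original steps.

```python
def convert_cell(raw):
    """Convert one raw cell string: 'N/A' -> None, 'NN%' -> int, else stripped text."""
    v = raw.strip()
    if v == 'N/A':
        return None
    if v.endswith('%'):
        return int(v[:-1])
    return v


def convert_rows(texts):
    """Apply convert_cell to every data cell, skipping the header row and index column."""
    return [[convert_cell(v) for v in line[1:]] for line in texts[1:]]


# Hypothetical input shaped like `texts`; the numbers are placeholders.
sample = [['', 'Col A', 'Col B'], ['Row 1', ' 10%\n', 'N/A'], ['Row 2', '20%', '30%']]
values = convert_rows(sample)
print(values)  # [[10, None], [20, 30]]
```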
- Convert the data to a DataFrame
import pandas as pd
students = pd.DataFrame(values, columns=col_names, index=index_names)
- Data issues
The third column, U.S. Census, contains NaN, so its dtype is float64 (an int64 column cannot hold NaN).
>students.dtypes
Undergraduate int64
Graduate int64
U.S. Census float64
dtype: object
Convert the NaN values to 0 and cast the data to int:
dfclearn = students.fillna(0).astype('int64')
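Filling with 0 makes a missing value indistinguishable from a real 0%. As a possible alternative, assuming pandas 0.24 or later, the nullable `Int64` dtype keeps the gap as `<NA>` while still storing integers. The frame below is a hypothetical stand-in for `students` with placeholder numbers.

```python
import pandas as pd

# Hypothetical frame with one missing value; the numbers are placeholders.
df = pd.DataFrame({'Col A': [10, 20], 'Col B': [None, 30]},
                  index=['Row 1', 'Row 2'])

filled = df.fillna(0).astype('int64')   # the missing value becomes 0
nullable = df.astype('Int64')           # the missing value stays <NA>

print(filled.dtypes)
print(nullable.dtypes)
print(nullable)
```

Which variant fits depends on whether later calculations should treat the missing entry as zero or skip it.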