19. Data Wrangling (3)
Author: Chris Albon
Translator: 飞龙
License: CC BY-NC-SA 4.0
Finding Unique Values in a DataFrame
import pandas as pd
import numpy as np
raw_data = {'regiment': ['51st', '29th', '2nd', '19th', '12th', '101st', '90th', '30th', '193th', '1st', '94th', '91th'],
'trucks': ['MAZ-7310', np.nan, 'MAZ-7310', 'MAZ-7310', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'ZIS-150', 'Tatra 810', 'ZIS-150', 'ZIS-150'],
'tanks': ['Merkava Mark 4', 'Merkava Mark 4', 'Merkava Mark 4', 'Leopard 2A6M', 'Leopard 2A6M', 'Leopard 2A6M', 'Arjun MBT', 'Leopard 2A6M', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT'],
'aircraft': ['none', 'none', 'none', 'Harbin Z-9', 'Harbin Z-9', 'none', 'Harbin Z-9', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk']}
df = pd.DataFrame(raw_data, columns = ['regiment', 'trucks', 'tanks', 'aircraft'])
# View the first few rows
df.head()
|   | regiment | trucks | tanks | aircraft |
| --- | --- | --- | --- | --- |
| 0 | 51st | MAZ-7310 | Merkava Mark 4 | none |
| 1 | 29th | NaN | Merkava Mark 4 | none |
| 2 | 2nd | MAZ-7310 | Merkava Mark 4 | none |
| 3 | 19th | MAZ-7310 | Leopard 2A6M | Harbin Z-9 |
| 4 | 12th | Tatra 810 | Leopard 2A6M | Harbin Z-9 |
# Create a list of unique values
# by turning the pandas column into a set
list(set(df.trucks))
# [nan, 'Tatra 810', 'MAZ-7310', 'ZIS-150']
# Create a list of unique values in df.trucks
list(df['trucks'].unique())
# ['MAZ-7310', nan, 'Tatra 810', 'ZIS-150']
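Beyond listing the unique values, it is often useful to count them as well. Here is a minimal sketch on the same df; the counts follow directly from the data defined above.
# Count the number of distinct truck models (NaN is excluded by default)
df['trucks'].nunique()
# 3
# Count how many rows use each truck model, keeping the missing value
df['trucks'].value_counts(dropna=False)
# Tatra 810 appears 5 times, MAZ-7310 and ZIS-150 3 times each, and NaN once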
Geocoding and Reverse Geocoding
When working with geographic data, geocoding (converting a physical address or location into latitude and longitude) and reverse geocoding (converting latitude and longitude into a physical address or location) are common tasks.
Python offers a number of packages that make these tasks remarkably easy. In the tutorial below, I use pygeocoder, a wrapper around Google's geo-API, to do both geocoding and reverse geocoding.
First, we load the packages we want to use in the script: pygeocoder for the geographic functions, pandas for its dataframe structure, and numpy for its missing-value (np.nan) handling.
# Load packages
from pygeocoder import Geocoder
import pandas as pd
import numpy as np
Geographic data comes in many forms; in this case, we have a Python dictionary of five latitude and longitude strings, with each coordinate in a comma-separated pair.
# Create a dictionary of raw data
data = {'Site 1': '31.336968, -109.560959',
'Site 2': '31.347745, -108.229963',
'Site 3': '32.277621, -107.734724',
'Site 4': '31.655494, -106.420484',
'Site 5': '30.295053, -104.014528'}
While technically unnecessary, because I originally used R, I am a big fan of dataframes, so let's turn the dictionary of simulated data into a dataframe.
# Convert the dictionary into a pandas dataframe
df = pd.DataFrame.from_dict(data, orient='index')
# View the dataframe
df
|   | 0 |
| --- | --- |
| Site 1 | 31.336968, -109.560959 |
| Site 2 | 31.347745, -108.229963 |
| Site 3 | 32.277621, -107.734724 |
| Site 4 | 31.655494, -106.420484 |
| Site 5 | 30.295053, -104.014528 |
You can now see that we have a dataframe with five rows, each containing a string of latitude and longitude. Before we can work with the data, we need to 1) split the string into latitude and longitude and 2) convert them to floats. The code below does just that.
# Create two lists for the loop results to be placed in
lat = []
lon = []

# For each row in the variable
for row in df[0]:
    # Try to,
    try:
        # split the row by comma, convert to float,
        # and append everything before the comma to lat
        lat.append(float(row.split(',')[0]))
        # split the row by comma, convert to float,
        # and append everything after the comma to lon
        lon.append(float(row.split(',')[1]))
    # But if you get an error
    except:
        # append a missing value to lat
        lat.append(np.NaN)
        # append a missing value to lon
        lon.append(np.NaN)

# Create two new columns from lat and lon
df['latitude'] = lat
df['longitude'] = lon
Let's take a look at what we have now.
# View the dataframe
df
|   | 0 | latitude | longitude |
| --- | --- | --- | --- |
| Site 1 | 31.336968, -109.560959 | 31.336968 | -109.560959 |
| Site 2 | 31.347745, -108.229963 | 31.347745 | -108.229963 |
| Site 3 | 32.277621, -107.734724 | 32.277621 | -107.734724 |
| Site 4 | 31.655494, -106.420484 | 31.655494 | -106.420484 |
| Site 5 | 30.295053, -104.014528 | 30.295053 | -104.014528 |
Awesome. This is exactly what we want to see: one column of floats for latitude and one column of floats for longitude.
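As an aside (not part of the original tutorial), the same split can be done without an explicit loop. A minimal vectorized sketch using pandas string methods:
# Split each string on the comma into two columns and cast both to float
coords = df[0].str.split(',', expand=True).astype(float)
df['latitude'] = coords[0]
df['longitude'] = coords[1]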
For reverse geocoding, we feed a specific latitude and longitude pair (here, the first row, index 0) into pygeocoder's reverse_geocode function.
# Convert longitude and latitude to a location
results = Geocoder.reverse_geocode(df['latitude'][0], df['longitude'][0])
Now we can start pulling out the data we want.
# Print the latitude and longitude
results.coordinates
# (31.3372728, -109.5609559)
# Print the city
results.city
# 'Douglas'
# Print the country
results.country
# 'United States'
# Print the street address (if applicable)
results.street_address
# Print the state (administrative area)
results.administrative_area_level_1
# 'Arizona'
For geocoding, we need to submit a string containing an address or location (such as a city) to the geocoding function. However, not every string is formatted in a way that Google's geo-API can make sense of. We can check whether an input is valid using the .geocode().valid_address function.
# Verify that an address is valid (i.e. in Google's system)
Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607").valid_address
# True
Because the output is True, we now know this is a valid address and can therefore print its latitude and longitude coordinates.
# Print the latitude and longitude
results.coordinates
# (31.3372728, -109.5609559)
But more interestingly, once an address has been processed by the Google geo-API, we can parse it and easily pull out the street number, street name, and so on.
# Find the latitude and longitude of a specific address
result = Geocoder.geocode("7250 South Tucson Boulevard, Tucson, AZ 85756")
# Print the street number
result.street_number
# '7250'
# Print the street name
result.route
# 'South Tucson Boulevard'
And there you have it. Python makes this whole process easy, turning it into just a few minutes of analysis. Good luck!
Geolocating Cities and Countries
This tutorial creates a function that tries to take a city and country and return their latitude and longitude. When the city is unavailable (as is often the case), the latitude and longitude of the center of the country are returned instead.
from geopy.geocoders import Nominatim
geolocator = Nominatim()
import numpy as np
def geolocate(city=None, country=None):
    '''
    Inputs a city and a country, or just a country. Returns the latitude and
    longitude coordinates of the city if available, otherwise the latitude
    and longitude of the center of the country.
    '''
    # If the city exists
    if city != None:
        # Try to,
        try:
            # geolocate the city and country
            loc = geolocator.geocode(str(city + ',' + country))
            # and return the latitude and longitude
            return (loc.latitude, loc.longitude)
        # Otherwise
        except:
            # return a missing value
            return np.nan
    # If the city doesn't exist
    else:
        # Try to,
        try:
            # geolocate the center of the country
            loc = geolocator.geocode(country)
            # and return the latitude and longitude
            return (loc.latitude, loc.longitude)
        # Otherwise
        except:
            # return a missing value
            return np.nan
# Geolocate a city and country
geolocate(city='Austin', country='USA')
# (30.2711286, -97.7436995)
# Geolocate just a country
geolocate(country='USA')
# (39.7837304, -100.4458824)
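To run this over a whole table rather than one pair at a time, here is a hedged sketch using a hypothetical DataFrame of places (the column names are made up for the example):
import pandas as pd
# Hypothetical example data
places = pd.DataFrame({'city': ['Austin', None], 'country': ['USA', 'USA']})
# Apply geolocate row by row; each result is a (latitude, longitude) tuple or NaN
places['coords'] = places.apply(
    lambda row: geolocate(city=row['city'], country=row['country']), axis=1)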
Grouping Time Series with pandas
# Import required modules
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['german_army'] = np.random.randint(low=20000, high=30000, size=100)
df['allied_army'] = np.random.randint(low=20000, high=40000, size=100)
df.index = pd.date_range('1/1/2014', periods=100, freq='H')
df.head()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-01-01 00:00:00 | 28755 | 33938 |
| 2014-01-01 01:00:00 | 25176 | 28631 |
| 2014-01-01 02:00:00 | 23261 | 39685 |
| 2014-01-01 03:00:00 | 28686 | 27756 |
| 2014-01-01 04:00:00 | 24588 | 25681 |
# Truncate the dataframe
df.truncate(before='1/2/2014', after='1/3/2014')
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-01-02 00:00:00 | 26401 | 20189 |
| 2014-01-02 01:00:00 | 29958 | 23934 |
| 2014-01-02 02:00:00 | 24492 | 39075 |
| 2014-01-02 03:00:00 | 25707 | 39262 |
| 2014-01-02 04:00:00 | 27129 | 35961 |
| 2014-01-02 05:00:00 | 27903 | 25418 |
| 2014-01-02 06:00:00 | 20409 | 25163 |
| 2014-01-02 07:00:00 | 25736 | 34794 |
| 2014-01-02 08:00:00 | 24057 | 27209 |
| 2014-01-02 09:00:00 | 26875 | 33402 |
| 2014-01-02 10:00:00 | 23963 | 38575 |
| 2014-01-02 11:00:00 | 27506 | 31859 |
| 2014-01-02 12:00:00 | 23564 | 25750 |
| 2014-01-02 13:00:00 | 27958 | 24365 |
| 2014-01-02 14:00:00 | 24915 | 38866 |
| 2014-01-02 15:00:00 | 23538 | 33820 |
| 2014-01-02 16:00:00 | 23361 | 30080 |
| 2014-01-02 17:00:00 | 27284 | 22922 |
| 2014-01-02 18:00:00 | 24176 | 32155 |
| 2014-01-02 19:00:00 | 23924 | 27763 |
| 2014-01-02 20:00:00 | 23111 | 32343 |
| 2014-01-02 21:00:00 | 20348 | 28907 |
| 2014-01-02 22:00:00 | 27136 | 38634 |
| 2014-01-02 23:00:00 | 28649 | 29950 |
| 2014-01-03 00:00:00 | 21292 | 26395 |
# Shift the dataframe's index forward by 4 months and 5 days
df.index = df.index + pd.DateOffset(months=4, days=5)
df.head()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 00:00:00 | 28755 | 33938 |
| 2014-05-06 01:00:00 | 25176 | 28631 |
| 2014-05-06 02:00:00 | 23261 | 39685 |
| 2014-05-06 03:00:00 | 28686 | 27756 |
| 2014-05-06 04:00:00 | 24588 | 25681 |
# Lag the variables one hour (each row now shows the previous hour's value)
df.shift(1).head()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 00:00:00 | NaN | NaN |
| 2014-05-06 01:00:00 | 28755.0 | 33938.0 |
| 2014-05-06 02:00:00 | 25176.0 | 28631.0 |
| 2014-05-06 03:00:00 | 23261.0 | 39685.0 |
| 2014-05-06 04:00:00 | 28686.0 | 27756.0 |
# Lead the variables one hour (each row now shows the next hour's value)
df.shift(-1).tail()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-09 23:00:00 | 26903.0 | 39144.0 |
| 2014-05-10 00:00:00 | 27576.0 | 39759.0 |
| 2014-05-10 01:00:00 | 25232.0 | 35246.0 |
| 2014-05-10 02:00:00 | 23391.0 | 21044.0 |
| 2014-05-10 03:00:00 | NaN | NaN |
# Aggregate into days by summing the hourly observations
df.resample('D').sum()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 605161 | 755962 |
| 2014-05-07 | 608100 | 740396 |
| 2014-05-08 | 589744 | 700297 |
| 2014-05-09 | 607092 | 719283 |
| 2014-05-10 | 103102 | 135193 |
# Aggregate into days by taking the mean of the hourly observations
df.resample('D').mean()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 25215.041667 | 31498.416667 |
| 2014-05-07 | 25337.500000 | 30849.833333 |
| 2014-05-08 | 24572.666667 | 29179.041667 |
| 2014-05-09 | 25295.500000 | 29970.125000 |
| 2014-05-10 | 25775.500000 | 33798.250000 |
# Aggregate into days by taking the minimum of the hourly observations
df.resample('D').min()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 24882.0 | 31310.0 |
| 2014-05-07 | 25311.0 | 30969.5 |
| 2014-05-08 | 24422.5 | 28318.0 |
| 2014-05-09 | 24941.5 | 32082.5 |
| 2014-05-10 | 26067.5 | 37195.0 |
# Aggregate into days by taking the median of the hourly observations
df.resample('D').median()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 24882.0 | 31310.0 |
| 2014-05-07 | 25311.0 | 30969.5 |
| 2014-05-08 | 24422.5 | 28318.0 |
| 2014-05-09 | 24941.5 | 32082.5 |
| 2014-05-10 | 26067.5 | 37195.0 |
# Aggregate into days by taking the first hourly observation
df.resample('D').first()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 28755 | 33938 |
| 2014-05-07 | 26401 | 20189 |
| 2014-05-08 | 21292 | 26395 |
| 2014-05-09 | 25764 | 22613 |
| 2014-05-10 | 26903 | 39144 |
# Aggregate into days by taking the last hourly observation
df.resample('D').last()
|   | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 28214 | 32110 |
| 2014-05-07 | 28649 | 29950 |
| 2014-05-08 | 28379 | 32600 |
| 2014-05-09 | 26752 | 22379 |
| 2014-05-10 | 23391 | 21044 |
# Aggregate into days by taking the first, last, highest, and lowest hourly observations
df.resample('D').ohlc()
|   | german_army |  |  |  | allied_army |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|   | open | high | low | close | open | high | low | close |
| 2014-05-06 | 28755 | 29206 | 20037 | 28214 | 33938 | 39955 | 23417 | 32110 |
| 2014-05-07 | 26401 | 29958 | 20348 | 28649 | 20189 | 39262 | 20189 | 29950 |
| 2014-05-08 | 21292 | 29786 | 20296 | 28379 | 26395 | 38197 | 20404 | 32600 |
| 2014-05-09 | 25764 | 29952 | 20738 | 26752 | 22613 | 39695 | 20189 | 22379 |
| 2014-05-10 | 26903 | 27576 | 23391 | 23391 | 39144 | 39759 | 21044 | 21044 |
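If none of the built-in summaries above fits, .agg() lets you request several statistics at once. A minimal sketch on the same df:
# Aggregate into days with several summary statistics per column
df.resample('D').agg(['min', 'mean', 'max'])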
Grouping Data by Time
On March 13, 2016, pandas version 0.18.0 was released, and the way resampling works changed significantly. This tutorial follows v0.18.0 and will not work with earlier versions of pandas.
First, let's load the modules we care about.
# Import required modules
import pandas as pd
import datetime
import numpy as np
Next, let's create some sample data that we can group by time. In this example, I create a dataframe with two columns and 365 rows: one column is a date and the second is a numerical value.
# Create a datetime variable for today
base = datetime.datetime.today()
# Create a list of 365 days of datetime values
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
# Create a list of 365 numerical values
score_list = list(np.random.randint(low=1, high=1000, size=365))
# Create an empty dataframe
df = pd.DataFrame()
# Create a column from the datetime variable
df['datetime'] = date_list
# Convert the column to datetime type
df['datetime'] = pd.to_datetime(df['datetime'])
# Set the datetime column as the index
df.index = df['datetime']
# Create a column for the numerical score variable
df['score'] = score_list
# Let's take a look at the data
df.head()
|   | datetime | score |
| --- | --- | --- |
| datetime |   |   |
| 2016-06-02 09:57:54.793972 | 2016-06-02 09:57:54.793972 | 900 |
| 2016-06-01 09:57:54.793972 | 2016-06-01 09:57:54.793972 | 121 |
| 2016-05-31 09:57:54.793972 | 2016-05-31 09:57:54.793972 | 547 |
| 2016-05-30 09:57:54.793972 | 2016-05-30 09:57:54.793972 | 504 |
| 2016-05-29 09:57:54.793972 | 2016-05-29 09:57:54.793972 | 304 |
In pandas, the most common way to group by time is with the .resample() function. In v0.18.0 this function is two-stage, meaning that df.resample('M') creates a resample object to which we can then apply other functions (mean, count, sum, etc.).
# Group the data by month, and take the mean of each group (i.e. each month)
df.resample('M').mean()
|   | score |
| --- | --- |
| datetime |   |
| 2015-06-30 | 513.629630 |
| 2015-07-31 | 561.516129 |
| 2015-08-31 | 448.032258 |
| 2015-09-30 | 548.000000 |
| 2015-10-31 | 480.419355 |
| 2015-11-30 | 487.033333 |
| 2015-12-31 | 499.935484 |
| 2016-01-31 | 429.193548 |
| 2016-02-29 | 520.413793 |
| 2016-03-31 | 349.806452 |
| 2016-04-30 | 395.500000 |
| 2016-05-31 | 503.451613 |
| 2016-06-30 | 510.500000 |
# Group the data by month, and get the sum of each group (i.e. each month)
df.resample('M').sum()
|   | score |
| --- | --- |
| datetime |   |
| 2015-06-30 | 13868 |
| 2015-07-31 | 17407 |
| 2015-08-31 | 13889 |
| 2015-09-30 | 16440 |
| 2015-10-31 | 14893 |
| 2015-11-30 | 14611 |
| 2015-12-31 | 15498 |
| 2016-01-31 | 13305 |
| 2016-02-29 | 15092 |
| 2016-03-31 | 10844 |
| 2016-04-30 | 11865 |
| 2016-05-31 | 15607 |
| 2016-06-30 | 1021 |
There are many options for grouping. You can learn more about them in pandas' time series documentation, but for your convenience I have also listed them below.
| Value | Description |
| --- | --- |
| B | business day frequency |
| C | custom business day frequency (experimental) |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| BM | business month end frequency |
| CBM | custom business month end frequency |
| MS | month start frequency |
| BMS | business month start frequency |
| CBMS | custom business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A | year end frequency |
| BA | business year end frequency |
| AS | year start frequency |
| BAS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T | minutely frequency |
| S | secondly frequency |
| L | milliseconds |
| U | microseconds |
| N | nanoseconds |
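Any of these codes can be passed straight to .resample(). For example, a brief sketch using the weekly and quarterly codes on the same df:
# Mean score per week
df.resample('W').mean()
# Total score per quarter
df.resample('Q').sum()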
Grouping Data by Hour
# Import libraries
import pandas as pd
import numpy as np
# Create a time series of 2000 elements,
# one every five minutes, starting on 1/1/2000
time = pd.date_range('1/1/2000', periods=2000, freq='5min')
# Create a pandas series with random values between 0 and 100,
# using 'time' as the index
series = pd.Series(np.random.randint(100, size=2000), index=time)
# View the first few rows
series[0:10]
'''
2000-01-01 00:00:00 40
2000-01-01 00:05:00 13
2000-01-01 00:10:00 99
2000-01-01 00:15:00 72
2000-01-01 00:20:00 4
2000-01-01 00:25:00 36
2000-01-01 00:30:00 24
2000-01-01 00:35:00 20
2000-01-01 00:40:00 83
2000-01-01 00:45:00 44
Freq: 5T, dtype: int64
'''
# Group the data by the hour of the index, then aggregate by the mean
series.groupby(series.index.hour).mean()
'''
0 50.380952
1 49.380952
2 49.904762
3 53.273810
4 47.178571
5 46.095238
6 49.047619
7 44.297619
8 53.119048
9 48.261905
10 45.166667
11 54.214286
12 50.714286
13 56.130952
14 50.916667
15 42.428571
16 46.880952
17 56.892857
18 54.071429
19 47.607143
20 50.940476
21 50.511905
22 44.550000
23 50.250000
dtype: float64
'''
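The same idea extends to more than one time component. For example, a short sketch that groups the same series by weekday and hour together:
# Group by (day of week, hour of day) and average within each pair
series.groupby([series.index.dayofweek, series.index.hour]).mean()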
Grouping Rows
# Import modules
import pandas as pd
# Example dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
|   | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
# Create a grouping object. In other words,
# create an object that represents that particular grouping.
# Here, we group the pre-test scores by regiment.
regiment_preScore = df['preTestScore'].groupby(df['regiment'])
# Display the mean pre-test score of each regiment
regiment_preScore.mean()
'''
regiment
Dragoons 15.50
Nighthawks 15.25
Scouts 2.50
Name: preTestScore, dtype: float64
'''
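Grouping is not limited to a single key. A minimal sketch that groups the same scores by regiment and company at once:
# Mean pre-test score for every regiment/company pair
df['preTestScore'].groupby([df['regiment'], df['company']]).mean()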
Hierarchical Data in pandas
# Import modules
import pandas as pd
# Create dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
|   | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
# Set the hierarchical index but leave the columns in place
df = df.set_index(['regiment', 'company'], drop=False)
df
|   |   | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| regiment | company |   |   |   |   |   |
| Nighthawks | 1st | Nighthawks | 1st | Miller | 4 | 25 |
|   | 1st | Nighthawks | 1st | Jacobson | 24 | 94 |
|   | 2nd | Nighthawks | 2nd | Ali | 31 | 57 |
|   | 2nd | Nighthawks | 2nd | Milner | 2 | 62 |
| Dragoons | 1st | Dragoons | 1st | Cooze | 3 | 70 |
|   | 1st | Dragoons | 1st | Jacon | 4 | 25 |
|   | 2nd | Dragoons | 2nd | Ryaner | 24 | 94 |
|   | 2nd | Dragoons | 2nd | Sone | 31 | 57 |
| Scouts | 1st | Scouts | 1st | Sloan | 2 | 62 |
|   | 1st | Scouts | 1st | Piger | 3 | 70 |
|   | 2nd | Scouts | 2nd | Riani | 2 | 62 |
|   | 2nd | Scouts | 2nd | Ali | 3 | 70 |
# Set the hierarchical index to be regiment then company
df = df.set_index(['regiment', 'company'])
df
|   |   | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| regiment | company |   |   |   |
| Nighthawks | 1st | Miller | 4 | 25 |
|   | 1st | Jacobson | 24 | 94 |
|   | 2nd | Ali | 31 | 57 |
|   | 2nd | Milner | 2 | 62 |
| Dragoons | 1st | Cooze | 3 | 70 |
|   | 1st | Jacon | 4 | 25 |
|   | 2nd | Ryaner | 24 | 94 |
|   | 2nd | Sone | 31 | 57 |
| Scouts | 1st | Sloan | 2 | 62 |
|   | 1st | Piger | 3 | 70 |
|   | 2nd | Riani | 2 | 62 |
|   | 2nd | Ali | 3 | 70 |
# View the index
df.index
MultiIndex(levels=[['Dragoons', 'Nighthawks', 'Scouts'], ['1st', '2nd']],
labels=[[1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]],
names=['regiment', 'company'])
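With the hierarchical index in place, rows can be selected by one or both levels. A brief sketch on the same df:
# All rows for the Nighthawks regiment (outer level only)
df.loc['Nighthawks']
# The Dragoons' 1st company (both index levels)
df.loc[('Dragoons', '1st')]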
# Swap the levels in the index
df.swaplevel('regiment', 'company')
|   |   | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| company | regiment |   |   |   |
| 1st | Nighthawks | Miller | 4 | 25 |
|   | Nighthawks | Jacobson | 24 | 94 |
| 2nd | Nighthawks | Ali | 31 | 57 |
|   | Nighthawks | Milner | 2 | 62 |
| 1st | Dragoons | Cooze | 3 | 70 |
|   | Dragoons | Jacon | 4 | 25 |
| 2nd | Dragoons | Ryaner | 24 | 94 |
|   | Dragoons | Sone | 31 | 57 |
| 1st | Scouts | Sloan | 2 | 62 |
|   | Scouts | Piger | 3 | 70 |
| 2nd | Scouts | Riani | 2 | 62 |
|   | Scouts | Ali | 3 | 70 |
# Sum the data by regiment
df.sum(level='regiment')
|   | preTestScore | postTestScore |
| --- | --- | --- |
| regiment |   |   |
| Nighthawks | 61 | 238 |
| Dragoons | 62 | 246 |
| Scouts | 10 | 264 |
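The same per-regiment totals can also be produced with an explicit groupby on the index level, which is essentially what sum(level=...) does under the hood:
# Equivalent to df.sum(level='regiment')
df.groupby(level='regiment').sum()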