Pandas_Select_Data_where()
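Selecting values from a Series with a boolean vector generally returns only a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the where method, which is available on both Series and DataFrame.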
import pandas as pd
import numpy as np
dates = pd.date_range('2020-01-01',periods=5)
data = pd.DataFrame(np.random.randn(5,4), index=dates, columns=list('abcd'))
data
a b c d
2020-01-01 -1.017523 -0.838623 -0.284684 1.723855
2020-01-02 0.926578 -0.374901 -1.038738 -1.901277
2020-01-03 1.973570 -1.225851 -0.450821 -0.550839
2020-01-04 -0.456445 -0.557138 -0.227323 0.390099
2020-01-05 0.681782 -0.380826 0.989172 0.164163
Return only the selected rows
data[data.a>0]
out:
a b c d
2020-01-02 0.926578 -0.374901 -1.038738 -1.901277
2020-01-03 1.973570 -1.225851 -0.450821 -0.550839
2020-01-05 0.681782 -0.380826 0.989172 0.164163
where() keeps the original shape and fills the unselected rows with NaN
data.where(data.a>0)
out:
a b c d
2020-01-01 NaN NaN NaN NaN
2020-01-02 0.926578 -0.374901 -1.038738 -1.901277
2020-01-03 1.973570 -1.225851 -0.450821 -0.550839
2020-01-04 NaN NaN NaN NaN
2020-01-05 0.681782 -0.380826 0.989172 0.164163
data.where(data>0)
out:
a b c d
2020-01-01 NaN NaN NaN 1.723855
2020-01-02 0.926578 NaN NaN NaN
2020-01-03 1.973570 NaN NaN NaN
2020-01-04 NaN NaN NaN 0.390099
2020-01-05 0.681782 NaN 0.989172 0.164163
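The shape difference between boolean indexing and where() can be checked directly. The sketch below uses a small hand-written frame (the name df and its values are made up for illustration, not taken from the data above) so that the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed values, so the outcome does not depend on random data.
df = pd.DataFrame({"a": [-1.0, 2.0, 3.0], "b": [4.0, -5.0, 6.0]})

# Boolean indexing drops the rows where the condition is False.
subset = df[df["a"] > 0]

# where() keeps every cell and puts NaN where the condition is False.
same_shape = df.where(df > 0)

print(subset.shape)                         # (2, 2)
print(same_shape.shape)                     # (3, 2)
print(int(same_shape.isna().sum().sum()))   # 2
```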
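The other parameter
In the returned copy, where uses the optional other argument to replace the values for which the condition is False. Without other, those values become NaN.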
data.where(data>0, -data)
out:
a b c d
2020-01-01 1.017523 0.838623 0.284684 1.723855
2020-01-02 0.926578 0.374901 1.038738 1.901277
2020-01-03 1.973570 1.225851 0.450821 0.550839
2020-01-04 0.456445 0.557138 0.227323 0.390099
2020-01-05 0.681782 0.380826 0.989172 0.164163
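As a minimal sketch of the two common forms of other, the example below uses a hypothetical Series s: first a scalar replacement, then an element-wise replacement taken from an object of the same shape.

```python
import pandas as pd

# Hypothetical values chosen for illustration.
s = pd.Series([-2, 5, -1, 3])

# Scalar other: every value failing the condition becomes 0.
floored = s.where(s > 0, 0)

# Same-shaped other: each failing value is replaced by the value
# at the same position in -s, which yields the absolute value here.
absolute = s.where(s > 0, -s)

print(floored.tolist())   # [0, 5, 0, 3]
print(absolute.tolist())  # [2, 5, 1, 3]
```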
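The inplace parameter
By default, where returns a modified copy of the data. The optional inplace parameter lets you modify the original data instead of creating a copy.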
data
out:
a b c d
2020-01-01 -1.017523 -0.838623 -0.284684 1.723855
2020-01-02 0.926578 -0.374901 -1.038738 -1.901277
2020-01-03 1.973570 -1.225851 -0.450821 -0.550839
2020-01-04 -0.456445 -0.557138 -0.227323 0.390099
2020-01-05 0.681782 -0.380826 0.989172 0.164163
data.where(data>0, -data, inplace=True)
data
out:
a b c d
2020-01-01 1.017523 0.838623 0.284684 1.723855
2020-01-02 0.926578 0.374901 1.038738 1.901277
2020-01-03 1.973570 1.225851 0.450821 0.550839
2020-01-04 0.456445 0.557138 0.227323 0.390099
2020-01-05 0.681782 0.380826 0.989172 0.164163
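The difference between the default call and inplace=True can be sketched as follows, assuming a small hypothetical frame df. Note that the in-place call is used for its side effect only; its return value should not be assigned back to the variable.

```python
import pandas as pd

# Hypothetical values chosen for illustration.
df = pd.DataFrame({"x": [-1.0, 2.0], "y": [3.0, -4.0]})

# Default: a new object is returned and df itself is left untouched.
copy_result = df.where(df > 0, 0.0)
print(df["x"].tolist())           # [-1.0, 2.0]
print(copy_result["x"].tolist())  # [0.0, 2.0]

# inplace=True: df itself is modified.
df.where(df > 0, 0.0, inplace=True)
print(df["x"].tolist())           # [0.0, 2.0]
print(df["y"].tolist())           # [3.0, 0.0]
```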
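Difference from numpy.where()
The signature of DataFrame.where() differs from that of numpy.where(). Roughly, df.where(cond, other) is equivalent to np.where(cond, df, other), as the comparison below shows.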
data.where(data>1, 0) == np.where(data>1, data, 0)
out:
a b c d
2020-01-01 True True True True
2020-01-02 True True True True
2020-01-03 True True True True
2020-01-04 True True True True
2020-01-05 True True True True
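One practical difference is the return type: DataFrame.where() returns a labeled DataFrame, while numpy.where() returns a plain array without index or column labels. A small sketch, with a hypothetical frame df:

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels chosen for illustration.
df = pd.DataFrame({"a": [0.5, 2.0], "b": [3.0, 0.1]}, index=["r1", "r2"])

via_pandas = df.where(df > 1, 0)     # condition first, then the replacement
via_numpy = np.where(df > 1, df, 0)  # condition, value if True, value if False

print(type(via_pandas).__name__)  # DataFrame
print(type(via_numpy).__name__)   # ndarray

# The underlying values agree; only the container differs.
print(np.array_equal(via_pandas.to_numpy(), via_numpy))  # True
```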
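The axis parameter
where() also accepts an axis parameter, which controls how other is aligned when it has fewer dimensions than the data. Below, column a is aligned along the index, so each replaced cell takes the value of a from its own row. axis='index' and axis=0 are equivalent.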
data_2 = data.copy()
data_2.where(data_2 > 1, data_2.a, axis='index')
out:
a b c d
2020-01-01 1.017523 1.017523 1.017523 1.723855
2020-01-02 0.926578 0.926578 1.038738 1.901277
2020-01-03 1.973570 1.225851 1.973570 1.973570
2020-01-04 0.456445 0.456445 0.456445 0.456445
2020-01-05 0.681782 0.681782 0.681782 0.681782
data_2.where(data_2 > 1, data_2.a, axis=0)
out:
a b c d
2020-01-01 1.017523 1.017523 1.017523 1.723855
2020-01-02 0.926578 0.926578 1.038738 1.901277
2020-01-03 1.973570 1.225851 1.973570 1.973570
2020-01-04 0.456445 0.456445 0.456445 0.456445
2020-01-05 0.681782 0.681782 0.681782 0.681782
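The alignment direction is easier to see on a tiny frame. The sketch below assumes a hypothetical df and, for the column-wise case, a hypothetical Series of fill values keyed by column label.

```python
import pandas as pd

# Hypothetical values chosen for illustration.
df = pd.DataFrame({"a": [10, 20], "b": [1, 30], "c": [2, 3]})

# axis=0: other is matched against the row index, so a failing cell
# in row i is replaced by df["a"] at row i.
row_fill = df.where(df > 5, df["a"], axis=0)
print(row_fill["b"].tolist())  # [10, 30]
print(row_fill["c"].tolist())  # [10, 20]

# axis=1: other is matched against the column labels, so a failing cell
# in column k is replaced by the fill value for column k.
fills = pd.Series({"a": 0, "b": -1, "c": -2})
col_fill = df.where(df > 5, fills, axis=1)
print(col_fill["b"].tolist())  # [-1, 30]
print(col_fill["c"].tolist())  # [-2, -2]
```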
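Using a callable
where() accepts a callable for the condition, for other, or for both. The function must take one argument (the calling Series or DataFrame) and return output that is valid as the condition or as other, respectively.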
data_2.where(data_2 > 1, lambda x: x + 10)
out:
a b c d
2020-01-01 1.017523 10.838623 10.284684 1.723855
2020-01-02 10.926578 10.374901 1.038738 1.901277
2020-01-03 1.973570 1.225851 10.450821 10.550839
2020-01-04 10.456445 10.557138 10.227323 10.390099
2020-01-05 10.681782 10.380826 10.989172 10.164163
data_2.where(lambda x: x >1, lambda x: x + 10)
out:
a b c d
2020-01-01 1.017523 10.838623 10.284684 1.723855
2020-01-02 10.926578 10.374901 1.038738 1.901277
2020-01-03 1.973570 1.225851 10.450821 10.550839
2020-01-04 10.456445 10.557138 10.227323 10.390099
2020-01-05 10.681782 10.380826 10.989172 10.164163
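Callables are mostly useful in method chains, where the intermediate result has no name to refer to. A minimal sketch, assuming a hypothetical frame and a column b created earlier in the same chain:

```python
import pandas as pd

# Hypothetical values chosen for illustration.
df = pd.DataFrame({"a": [1, 2, 3]})

result = (
    df.assign(b=lambda d: d["a"] * 10)         # b is [10, 20, 30]
      .where(lambda d: d > 2, lambda d: -d)    # both callables receive the chained frame
)

print(result["a"].tolist())  # [-1, -2, 3]
print(result["b"].tolist())  # [10, 20, 30]
```

Both lambdas are called with the frame produced by assign(), so the condition can see column b even though df itself never had it.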
mask()
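mask() is the inverse of where(): it replaces the values for which the condition is True, rather than those for which it is False.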
data_2.mask(data_2 > 1)
out:
a b c d
2020-01-01 NaN 0.838623 0.284684 NaN
2020-01-02 0.926578 0.374901 NaN NaN
2020-01-03 NaN NaN 0.450821 0.550839
2020-01-04 0.456445 0.557138 0.227323 0.390099
2020-01-05 0.681782 0.380826 0.989172 0.164163
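The inverse relationship can be verified by comparing mask() with where() applied to the negated condition. The frame df below is hypothetical.

```python
import pandas as pd

# Hypothetical values chosen for illustration.
df = pd.DataFrame({"a": [0.5, 1.5], "b": [2.5, 0.2]})
cond = df > 1

masked = df.mask(cond)       # NaN where cond is True
inverted = df.where(~cond)   # NaN where the negated cond is False

# equals() treats NaN values in the same position as equal.
print(masked.equals(inverted))  # True

# Like where(), mask() takes an optional replacement value.
capped = df.mask(cond, 1.0)
print(capped["a"].tolist())  # [0.5, 1.0]
print(capped["b"].tolist())  # [1.0, 0.2]
```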