Data and Features 2

Wrangling Your Data
It might not have been collected by the person placed in charge of doing so. There might have been a mechanical failure at the point of the sensor. Perhaps the dog ate it. Or maybe it never really even existed in the first place! Whatever the cause, it's not uncommon for datasets to come with some missing data.

When you are working with large datasets, it would be great if every sample had measurements recorded for each feature. But in reality, this almost never happens. In fact, you might not even find a single sample free of missing data. Annoying as this is, simply ignoring missing data usually isn't an option, as it can wreak havoc during your analysis if not handled properly. Missing data that isn't accounted for might lead you to erroneous conclusions about your samples by producing incorrect sums and means, and even by skewing distributions.

Pandas represents missing data internally using NumPy's np.nan. Had Python's None been used instead, there would be ambiguous cases where you actually wished to store None and could no longer differentiate it from a missing record. Pandas provides you with a few basic methods for mitigating missing data, which work on both series and dataframe objects.
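
Before deciding how to mitigate anything, it helps to see how much missing data you actually have. A minimal sketch, using a made-up dataframe and column names purely for illustration:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'Height': [1.62, np.nan, 1.75],
                    'Weight': [55.0, 70.0, np.nan]})

toy.isnull().sum()  # number of missing values in each column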

Any time a nan is encountered, replace it with a scalar value:

df.my_feature.fillna( df.my_feature.mean() )  # fill one feature's nans with that feature's mean
df.fillna(0)                                  # fill every nan in the dataframe with 0

When a nan is encountered, replace it with the immediate, previous, non-nan value. Be mindful about which axis you perform this on. You also have the ability to specify an optional limit on how far you want the fill to go, or to run the fill in reverse (bfill):

df.fillna(method='ffill')  # fill the values forward
df.fillna(method='bfill')  # fill the values in reverse
df.fillna(method='ffill', limit=5)  # only fill up to 5 consecutive nans

Fill in nans by interpolating over them with the non-nan values that come immediately before and after. You can select the interpolation method you'd like to use, such as nearest, cubic, spline, and more. If your nans occur at the start or end of your list, interpolation will not be able to help you:

df.interpolate(method='polynomial', order=2)
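
To see what interpolation actually does, here is a quick sketch on a throwaway series using the default linear method (the values are made up):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])
s.interpolate()                              # 1.0, 2.0, 3.0, 5.0, 7.0 - interior nans get filled
pd.Series([np.nan, 2.0, 4.0]).interpolate()  # NaN, 2.0, 4.0 - a leading nan has nothing to interpolate from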

Dropping Data

You should always first try to fill in missing data rather than deleting it. This is so important that we've included a link in the Dive Deeper section that provides a very comprehensive argument and explanation for this. But if all else fails and you've given up on rectifying your nans, you can always remove the sample or column completely, so that it no longer negatively impacts your analysis. This should only ever be used as a last resort:

df = df.dropna(axis=0)  # remove any row with nans
df = df.dropna(axis=1)  # remove any column with nans

# Drop any row that doesn't have at least 4 non-NaN values in it:
df = df.dropna(axis=0, thresh=4)
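
Since thresh is easy to misread, here is a quick sketch of what it keeps and what it drops, using a throwaway dataframe:

import numpy as np
import pandas as pd

toy = pd.DataFrame([[1, 2, 3, 4],
                    [1, np.nan, np.nan, np.nan],
                    [1, 2, np.nan, 4]])

toy.dropna(axis=0, thresh=3)  # keeps rows 0 and 2 (at least 3 non-NaN values), drops row 1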

There may be cases where you want to get rid of non-nan values. For instance, if your dataset has a column you don't need:

# Axis=1 for columns
df = df.drop(labels=['Features', 'To', 'Delete'], axis=1)

You might also want to prune duplicate records if samples cannot have identical properties. Be careful though! To get rid of duplicate records you should tell Pandas which features are to be examined, because by default it compares every column; if any column holds a value that is unique to each sample (an identifier that was loaded as a regular column rather than as the index, for instance), Pandas won't find any 'duplicates' unless you limit your search to a subset of your dataframe's features:

df = df.drop_duplicates(subset=['Feature_1', 'Feature_2'])

Removing duplicate samples will cause gaps to occur in your index count. You can interpolate to fill those holes where appropriate, or alternatively you can reindex your dataframe:

df = df.reset_index(drop=True)

The drop=True parameter tells Pandas not to keep a backup copy of the original index. Most, if not all, of the above methods return a copy of your dataframe. This is useful because you can chain methods:

df = df.dropna(axis=0, thresh=2).drop(labels=['ColA'], axis=1).drop_duplicates(subset=['ColB', 'ColC']).reset_index()

However there may be times where you want these operations to work in-place on the dataframe calling them, rather than returning a new dataframe. Pass inplace=True as a parameter to any of the above methods to get that working.
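
For example, the earlier dropna call could modify the dataframe directly:

df.dropna(axis=0, thresh=2, inplace=True)  # alters df itself and returns None

Note that with inplace=True there is nothing to assign back; writing df = df.dropna(..., inplace=True) would overwrite your dataframe with None.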

More Wrangling

Pandas will automatically attempt to figure out the best data type to use for each series in your dataset. Most of the time it does this flawlessly, but other times it fails horribly! The .read_html() method in particular is notorious for defaulting all series data types to Python objects. You should check, and double-check, the actual type of each column in your dataset to avoid unwanted surprises:

>>> df.dtypes

Date        object
Name        object
Gender      object
Height      object
Weight      object
Age         object
Job         object

If your data types don't look the way you expected, explicitly convert them to the desired types using the .to_datetime(), .to_numeric(), and .to_timedelta() methods:

>>> df.Date = pd.to_datetime(df.Date, errors='coerce')
>>> df.Height = pd.to_numeric(df.Height, errors='coerce')
>>> df.Weight = pd.to_numeric(df.Weight, errors='coerce')
>>> df.Age = pd.to_numeric(df.Age, errors='coerce')
>>> df.dtypes

Date        datetime64[ns]
Name        object
Gender      object
Height      float64
Weight      float64
Age         int64
Job         object

Take note of how .to_numeric() properly converts to decimal or integer depending on the data it finds. The errors='coerce' parameter instructs Pandas to enter a NaN at any field where the conversion fails.
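
A quick sketch of that coercion on a throwaway series (the values are made up):

import pandas as pd

pd.to_numeric(pd.Series(['172', '181cm', '169']), errors='coerce')  # 172.0, NaN, 169.0 - the unparseable entry becomes NaN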

After fixing up your data types, let's say you want to see all the unique values present in a particular series. Call the .unique() method on it to view a list, or alternatively, if you'd like to know how many times each of those unique values is present, call .value_counts(). Both are meant to be called on an individual series rather than on the entire dataframe:

>>> df.Age.unique()

array([7, 33, 27, 40, 22], dtype=int64)


>>> df.Age.value_counts()

22     5
33     2
40     2
7      1
27     1
dtype: int64

There are many other possible data munging and wrangling tasks, many of which can be applied easily and generically to any dataset. We've referenced a site detailing almost 40 such operations for you to further explore in the Dive Deeper section. However, some wrangling tasks require that you look closer at your data. For instance, if you survey users with a series of 1-10 ranked questions and a user enters all 5's or all 1's, chances are they were not being completely honest. Another example would be a user entering January 1, 1970 as their birthdate because you required them to enter something but they did not want to disclose the information. To further improve the accuracy of your datasets, always be on the lookout for these sorts of issues.
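
A minimal sketch of how you might flag such records, assuming hypothetical column names Q1 through Q5 for the survey answers and Birthdate for the date of birth:

import pandas as pd

survey_cols = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5']                   # hypothetical survey columns

same_answer = df[survey_cols].nunique(axis=1) == 1             # every answer identical, e.g. all 5's or all 1's
placeholder_dob = df.Birthdate == pd.Timestamp('1970-01-01')   # the classic placeholder birthdate

suspect = df[same_answer | placeholder_dob]                    # records worth a closer look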
