Wrangling Your Data
It might not have been collected by the person placed in charge of doing so. There might have been a mechanical failure at the point of the sensor. Perhaps the dog ate it. Or maybe it never really even existed in the first place! Whatever the cause, it's not uncommon for datasets to come with some missing data.
When you are working with large datasets, it would be great if every sample had measurements recorded for each feature. But in reality, this almost never happens. In fact, you might not even find a single sample free of missing data. Annoying as this is, simply ignoring missing data usually isn't an option, as it can wreak havoc if not handled properly during your analysis. If not accounted for, missing data might lead you to erroneous conclusions about your samples, producing incorrect sums and means and even skewing distributions.
Pandas represents missing data internally using NumPy's np.nan. Had Python's None been used instead, there would be ambiguous collision cases whenever you actually wished to store None, since you could no longer differentiate it from a missing record. Pandas provides you with a few basic methods for mitigating missing data, which work on both series and dataframe objects.
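Before mitigating anything, it helps to know how much data is actually missing. As a quick sketch (assuming your dataframe is named df), you can count the nans per feature with .isnull():
df.isnull().sum()        # number of nans in each column
df.isnull().sum().sum()  # total number of nans in the entire dataframe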
Any time a nan is encountered, replace it with a scalar value:
df.my_feature.fillna(df.my_feature.mean())  # fill one feature's nans with that feature's mean
df.fillna(0)  # fill every nan in the dataframe with 0
When a nan is encountered, replace it with the immediately preceding non-nan value. Be mindful about which axis you perform this on. You also have the ability to specify an optional limit on how far you want the fill to go, or to run the fill in reverse (bfill):
df.fillna(method='ffill') # fill the values forward
df.fillna(method='bfill') # fill the values in reverse
df.fillna(method='ffill', limit=5)  # fill forward, but only up to 5 consecutive nans in a gap
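To make the fill direction concrete, here is a minimal sketch on a throwaway series (assuming pandas is imported as pd and numpy as np; the values are made up purely for illustration):
>>> s = pd.Series([1, np.nan, np.nan, 4])
>>> s.fillna(method='ffill')
0    1.0
1    1.0
2    1.0
3    4.0
dtype: float64
>>> s.fillna(method='bfill')
0    1.0
1    4.0
2    4.0
3    4.0
dtype: float64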
Fill in nans by interpolating over them, using the non-nan values that come immediately before and after. You can select the interpolation method you'd like to use, such as nearest, cubic, spline and more. If your nans occur at the start or end of your series, interpolation will not be able to help you:
df.interpolate(method='polynomial', order=2)
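For instance, with the default linear method, a nan sandwiched between known values is computed from its neighbors, while a nan at the very start has nothing before it to work from (a toy sketch, again assuming pd and np are imported):
>>> s = pd.Series([np.nan, 1.0, np.nan, 3.0])
>>> s.interpolate()
0    NaN
1    1.0
2    2.0
3    3.0
dtype: float64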
Dropping Data
You should always first try to fill in missing data rather than deleting it. This is so important that we've included a link in the dive deeper section that provides a very comprehensive argument and explanation for this. But if all else fails and you've given up on rectifying your nans, you can always remove the sample or column completely, so that it no longer negatively impacts your analysis. This should only ever be used as a last resort:
df = df.dropna(axis=0) # remove any row with nans
df = df.dropna(axis=1) # remove any column with nans
# Drop any row that doesn't have at least 4 non-NaN values within it:
df = df.dropna(axis=0, thresh=4)
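The thresh parameter is easy to misread: it is the minimum number of non-nan values a row (or column) needs in order to survive the drop. A small illustration with made-up values:
df2 = pd.DataFrame([[1, 2, 3, 4], [1, np.nan, np.nan, np.nan]])
df2.dropna(axis=0, thresh=2)  # keeps only the first row; the second has just 1 non-nan value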
There may be cases where you want to get rid of non-nan values. For instance, if your dataset has a column you don't need:
# Axis=1 for columns
df = df.drop(labels=['Features', 'To', 'Delete'], axis=1)
You might also want to prune duplicate records if samples cannot have identical properties. Be careful though! By default, Pandas examines every column when looking for duplicates, so if your dataframe contains a column of unique values (an identifier, for example, or a saved copy of an auto-generated index), no two samples will ever match exactly. In that case, limit the search to a subset of your dataframe's features:
df = df.drop_duplicates(subset=['Feature_1', 'Feature_2'])
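To see why the subset parameter matters, here is a tiny, made-up example where a unique ID column hides otherwise identical samples:
df3 = pd.DataFrame({'ID': [0, 1], 'Feature_1': ['a', 'a'], 'Feature_2': [10, 10]})
df3.drop_duplicates()                                   # finds nothing, because the ID column always differs
df3.drop_duplicates(subset=['Feature_1', 'Feature_2'])  # drops the second, duplicated sample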
Removing duplicate samples will cause gaps to occur in your index count. You can leave those gaps in place where that is appropriate, or alternatively you can reset your dataframe's index:
df = df.reset_index(drop=True)
The drop=True parameter tells Pandas not to keep the original index as a backup column in your dataframe. Most, if not all, of the above methods return a copy of your dataframe. This is useful because you can chain methods:
df = df.dropna(axis=0, thresh=2).drop(labels=['ColA'], axis=1).drop_duplicates(subset=['ColB', 'ColC']).reset_index()
However there may be times where you want these operations to work in-place on the dataframe calling them, rather than returning a new dataframe. Pass inplace=True as a parameter to any of the above methods to get that working.
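For example, the thresh-based row drop from above could be written so that it modifies df directly (note that with inplace=True the method returns None rather than a new dataframe, so it can't be chained):
df.dropna(axis=0, thresh=4, inplace=True)  # modifies df in place; nothing to re-assign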
More Wrangling
Pandas will automatically attempt to figure out the best data type to use for each series in your dataset. Most of the time it does this flawlessly, but other times it fails horribly! The .read_html() method in particular is notorious for defaulting all series data types to Python objects. You should check and double-check the actual type of each column in your dataset to avoid unwanted surprises:
>>> df.dtypes
Date object
Name object
Gender object
Height object
Weight object
Age object
Job object
If your data types don't look the way you expected, explicitly convert them to the desired type using Pandas' to_datetime(), to_numeric(), and to_timedelta() functions:
>>> df.Date = pd.to_datetime(df.Date, errors='coerce')
>>> df.Height = pd.to_numeric(df.Height, errors='coerce')
>>> df.Weight = pd.to_numeric(df.Weight, errors='coerce')
>>> df.Age = pd.to_numeric(df.Age, errors='coerce')
>>> df.dtypes
Date datetime64[ns]
Name object
Gender object
Height float64
Weight float64
Age int64
Job object
Take note how to_numeric properly converts to floating point or integer depending on the data it finds. The errors='coerce' parameter instructs Pandas to enter a NaN at any field where the conversion fails.
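As a quick, throwaway illustration of that coercion behavior:
>>> pd.to_numeric(pd.Series(['180', 'six feet', '172.5']), errors='coerce')
0    180.0
1      NaN
2    172.5
dtype: float64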
After fixing up your data types, let's say you want to see all the unique values present in a particular series. Call the .unique() method on it to view an array of them, or alternatively, if you'd like to know how many times each of those unique values is present, you can call .value_counts(). Either method works with series, but neither will function if called on a dataframe:
>>> df.Age.unique()
array([7, 33, 27, 40, 22], dtype=int64)
>>> df.Age.value_counts()
22    5
33    2
40    2
7     1
27    1
dtype: int64
There are many other possible data munging and wrangling tasks, many of which can be applied easily and generically to any dataset. We've referenced a site detailing almost 40 such operations for you to further explore in the Dive Deeper section. However, some wrangling tasks require you to look closer at your data. For instance, if you survey users with a series of 1-10 ranked questions, and a user enters all 5's or all 1's, chances are they were not being completely honest. Another example would be a user entering January 1, 1970 as their birthdate because you required them to enter something, but they did not want to disclose the information. In order to further improve the accuracy of your datasets, always be on the lookout for these sorts of issues.
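As a rough sketch of how you might surface those last two cases with boolean indexing (the Birthdate column and the survey question columns are hypothetical names, not part of the dataset above):
# Hypothetical column names, purely for illustration:
question_cols = ['Q1', 'Q2', 'Q3']                                  # the 1-10 ranked survey questions
placeholder_bdays = df[df.Birthdate == pd.Timestamp('1970-01-01')]  # suspicious placeholder birthdates
straight_liners = df[df[question_cols].nunique(axis=1) == 1]        # same answer given to every question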