《R for Data Science》第十二章 Tidy data 啃书知识点积累
参考链接:R for Data Science
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Tidy data
- 数据整理的三条准则:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
- 三条准则简化后的两条指令
- Put each dataset in a tibble.
- Put each variable in a column.
- 优秀dataset的例子:
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
- 需要整理的dataset例子
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # … with 6 more rows
table3
#> # A tibble: 6 x 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Spread across two tibbles
table4a # cases
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
table4b # population
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
- Exercises
- Q:Compute the rate for
table2
, andtable4a + table4b
.
(两种类型table均复原为table1)
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # … with 6 more rows
t2_cases <- table2 %>%
filter(type == "cases") %>%
rename(cases = count) %>%
arrange(country, year)
t2_population <- table2 %>%
filter(type == "population") %>%
rename(population = count) %>%
arrange(country, year)
(t2_cases_per_cap <- tibble(
country = t2_cases$country,
year = t2_cases$year,
cases = t2_cases$cases,
population = t2_population$population) %>%
mutate(cases_per_cap = (cases / population) * 10000))
table4a # cases
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
table4b # population
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
(table4c <-
tibble(
country = table4a$country,
`1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
`2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
))
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> <chr> <dbl> <dbl>
#> 1 Afghanistan 0.373 1.29
#> 2 Brazil 2.19 4.61
#> 3 China 1.67 1.67
# 也可以将table4a和table4b还原成table1再处理
# 先利用比较传统的方法,下文有更好的工具:pivot_longer()
t4a_1 <- table4a %>%
select(1:2) %>%
rename(cases = `1999`) %>%
mutate(year = 1999)
t4a_2 <- table4a %>%
select(c(1,3)) %>%
rename(cases = `2000`) %>%
mutate(year = 2000)
t4a <- rbind(t4a_1,t4a_2) %>%
arrange(country,year)
t4b_1 <- table4b %>%
select(1:2) %>%
rename(population = `1999`) %>%
mutate(year = 1999)
t4b_2 <- table4b %>%
select(c(1,3)) %>%
rename(population = `2000`) %>%
mutate(year = 2000)
t4b <- rbind(t4b_1,t4b_2) %>%
arrange(country,year)
(t4_cases_per <- tibble(
country = t4a$country,
year = t4a$year,
cases = t4a$cases,
population = t4b$population) %>%
mutate(cases_per_cap = (cases / population) * 10000))
Pivoting
- pivot_longer()
pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns.
简而言之就是一行有多个观测值
# 适用类型
table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
Take
table4a
:
the column names 1999 and 2000 represent values of theyear
variable, the values in the 1999 and 2000 columns represent values of thecases
variable, and each row represents two observations, not one.
解决策略:
- The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000.
- The name of the variable to move the column names to. Here it is year.
- The name of the variable to move the column values to. Here it’s cases.
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> 4 Brazil 2000 80488
#> 5 China 1999 212258
#> 6 China 2000 213766
year
and cases
do not exist in table4a so we put their names in quotes.
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
# 左连接两表
dplyr::left_join(tidy4a, tidy4b)
#> Joining, by = c("country", "year")
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <chr> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# 附另一个练习
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
preg %>%
pivot_longer(c(male,female),names_to = 'sex',values_to = 'count')
- pivot_wider()
Take
table2
:
an observation is a country in a year, but each observation is spread across two rows.
简而言之就是一列有多个变量
解决策略:
- The column to take variable names from. Here, it’s type.
- The column to take values from. Here it’s count.
# 适用类型
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # … with 6 more rows
table2 %>%
pivot_wider(names_from = type, values_from = count)
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
-
pivot_longer()
makes wide tables narrower and longer -
pivot_wider()
makes long tables shorter and wider
- 和gather spread的对比
- gather等价于pivot_longer
table4a
## A tibble: 3 x 3
# country `1999` `2000`
# * <chr> <int> <int>
# 1 Afghanistan 745 2666
# 2 Brazil 37737 80488
# 3 China 212258 213766
table4a %>%
pivot_longer(c(`1999`,`2000`),names_to = 'year',values_to = 'value')
# 等价于
table4a %>%
gather(c(`1999`,`2000`),key = 'year',value = 'value')
- spread等价于pivot_wider
table2
# # A tibble: 12 x 4
# country year type count
# <chr> <int> <chr> <int>
# 1 Afghanistan 1999 cases 745
# 2 Afghanistan 1999 population 19987071
# 3 Afghanistan 2000 cases 2666
# 4 Afghanistan 2000 population 20595360
# 5 Brazil 1999 cases 37737
# 6 Brazil 1999 population 172006362
# 7 Brazil 2000 cases 80488
# 8 Brazil 2000 population 174504898
# 9 China 1999 cases 212258
# 10 China 1999 population 1272915272
# 11 China 2000 cases 213766
# 12 China 2000 population 1280428583
table2 %>%
pivot_wider(names_from = type,values_from = count)
# 等价于
table2 %>%
spread(type,count)
# # A tibble: 6 x 4
# country year cases population
# <chr> <int> <int> <int>
# 1 Afghanistan 1999 745 19987071
# 2 Afghanistan 2000 2666 20595360
# 3 Brazil 1999 37737 172006362
# 4 Brazil 2000 80488 174504898
# 5 China 1999 212258 1272915272
# 6 China 2000 213766 1280428583
- Exercises
- Q: Why are pivot_longer() and pivot_wider() not perfectly symmetrical
问题的关键是列属性发生了丢失
可以再gather或者pivot_longer中指定参数
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(c(`2015`,`2016`), names_to = "year", values_to = "return",
names_ptype = list(year = double()))
stocks %>%
spread(key = "year", value = "return") %>%
gather(c(`2015`,`2016`), key = "year", value = "return", convert = TRUE)
- Q:有重复值的dataset如何使用pivot_wider
people <- tribble(
~name, ~names, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
# 添加一列以区别各行
people <- people %>%
group_by(name,names) %>%
mutate(num = row_number())
# 即可区分
people %>%
pivot_wider(names_from = names, values_from = values)
Separating and uniting
- Separate
separate()
pulls apart one column into multiple columns, by splitting wherever a separator character appears.
By default,separate()
will split values wherever it sees a non-alphanumeric character
table3
# # A tibble: 6 x 3
# country year rate
# * <chr> <int> <chr>
# 1 Afghanistan 1999 745/19987071
# 2 Afghanistan 2000 2666/20595360
# 3 Brazil 1999 37737/172006362
# 4 Brazil 2000 80488/174504898
# 5 China 1999 212258/1272915272
# 6 China 2000 213766/1280428583
table3 %>%
separate(rate, into = c("cases", "population"))
# # A tibble: 6 x 4
# country year cases population
# <chr> <int> <chr> <chr>
# 1 Afghanistan 1999 745 19987071
# 2 Afghanistan 2000 2666 20595360
# 3 Brazil 1999 37737 172006362
# 4 Brazil 2000 80488 174504898
# 5 China 1999 212258 1272915272
# 6 China 2000 213766 1280428583
- 相关参数设置
(1) sep 指定分隔符
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
若sep后跟的是数字则表示从第几个位置切开,右一是-1
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
#> # A tibble: 6 x 4
#> country century year rate
#> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
(2) convert 分割后解析成合适的类型
不设定则会保留chr类型
table5 <- table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
(3) extra 少数记录分隔符较多的情况
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"),extra = 'warn') # 默认值是drop
# # A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
# 1 a b c
# 2 d e f
# 3 h i j
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"),extra = 'drop')
# # A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
# 1 a b c
# 2 d e f
# 3 h i j
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"),extra = 'merge')
# # A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
# 1 a b c
# 2 d e f,g
# 3 h i j
(4) fill 少数记录分隔符较少的情况
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = 'warn') # 默认值是right
# # A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
# 1 a b c
# 2 d e NA
# 3 f g i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = 'left')
# # A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
# 1 a b c
# 2 NA d e
# 3 f g i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = 'right')
# # A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
# 1 a b c
# 2 d e NA
# 3 f g i
(5) remove 选择是否保留原列,unite中也可以设置
table3 %>%
separate(rate, into = c("cases", "population"),remove = F)
# # A tibble: 6 x 5
# country year rate cases population
# <chr> <int> <chr> <chr> <chr>
# 1 Afghanistan 1999 745/19987071 745 19987071
# 2 Afghanistan 2000 2666/20595360 2666 20595360
# 3 Brazil 1999 37737/172006362 37737 172006362
# 4 Brazil 2000 80488/174504898 80488 174504898
# 5 China 1999 212258/1272915272 212258 1272915272
# 6 China 2000 213766/1280428583 213766 1280428583
table5 %>%
unite(new, century, year,remove = F)
#> # A tibble: 6 x 3
# country new century year rate
# <chr> <chr> <chr> <chr> <chr>
# 1 Afghanistan 19_99 19 99 745/19987071
# 2 Afghanistan 20_00 20 00 2666/20595360
# 3 Brazil 19_99 19 99 37737/172006362
# 4 Brazil 20_00 20 00 80488/174504898
# 5 China 19_99 19 99 212258/1272915272
# 6 China 20_00 20 00 213766/1280428583
- unite
unite()
is the inverse ofseparate()
: it combines multiple columns into a single column.
# 默认的连接符是'_'
table5 %>%
unite(new, century, year)
#> # A tibble: 6 x 3
#> country new rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 19_99 745/19987071
#> 2 Afghanistan 20_00 2666/20595360
#> 3 Brazil 19_99 37737/172006362
#> 4 Brazil 20_00 80488/174504898
#> 5 China 19_99 212258/1272915272
#> 6 China 20_00 213766/1280428583
默认的连接符是'_',也可用sep指定
table5 %>%
unite(new, century, year, sep = "")
#> # A tibble: 6 x 3
#> country new rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
- extract
支持使用正则表达式,比separate更灵活
# example with separators
tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
extract(x, c("variable", "id"), regex = "([A-Z])_([0-9])")
#> # A tibble: 4 x 2
#> variable id
#> <chr> <chr>
#> 1 X 1
#> 2 X 2
#> 3 A 1
#> 4 A 2
# example with position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
extract(x, c("variable", "id"), regex = "([A-Z])([0-9])")
#> # A tibble: 4 x 2
#> variable id
#> <chr> <chr>
#> 1 X 1
#> 2 X 2
#> 3 Y 1
#> 4 Y 2
# example that separate could not parse
# 尤其是这个例子,separate只能按分隔符或者按数字位置切割
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
extract(x, c("variable", "id"), regex = "([A-Z]+)([0-9]+)")
#> # A tibble: 4 x 2
#> variable id
#> <chr> <chr>
#> 1 X 1
#> 2 X 20
#> 3 AA 11
#> 4 AA 2
Missing values
A value can be missing in one of two possible ways:
- Explicitly, i.e. flagged with NA.
- Implicitly, i.e. simply not present in the data.
下面这句解释非常哲学又令人豁然开朗:
- 用
values_drop_na
可以去除缺失值
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
#> # A tibble: 6 x 3
#> qtr year return
#> <dbl> <chr> <dbl>
#> 1 1 2015 1.88
#> 2 2 2015 0.59
#> 3 2 2016 0.92
#> 4 3 2015 0.35
#> 5 3 2016 0.17
#> 6 4 2016 2.66
- 用
complete
可以显示显式和隐式缺失
stocks %>%
complete(year, qtr)
# # A tibble: 8 x 3
# year qtr return
# <dbl> <dbl> <dbl>
# 1 2015 1 1.88
# 2 2015 2 0.59
# 3 2015 3 0.35
# 4 2015 4 NA
# 5 2016 1 NA
# 6 2016 2 0.92
# 7 2016 3 0.17
# 8 2016 4 2.66
可以用nesting
确定组合便于暴露NA
df <- tibble(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
df %>%
complete(group)
# # A tibble: 3 x 5
# group item_id item_name value1 value2
# <dbl> <dbl> <chr> <int> <int>
# 1 1 1 a 1 4
# 2 1 2 b 3 6
# 3 2 2 b 2 5
df %>%
complete(group,nesting(item_id, item_name))
# # A tibble: 4 x 5
# group item_id item_name value1 value2
# <dbl> <dbl> <chr> <int> <int>
# 1 1 1 a 1 4
# 2 1 2 b 3 6
# 3 2 1 a NA NA
# 4 2 2 b 2 5
- 用
fill
完成缺失值填充
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
treatment %>%
fill(person) # 默认是向下填充,可以用.direction指定方向
#> # A tibble: 4 x 3
#> person treatment response
#> <chr> <dbl> <dbl>
#> 1 Derrick Whitmore 1 7
#> 2 Derrick Whitmore 2 10
#> 3 Derrick Whitmore 3 9
#> 4 Katherine Burke 1 4
treatment %>%
fill(person,.direction = 'down') # 向下填充
treatment %>%
fill(person,.direction = 'up') # 向上填充
complete
中也有fill参数,是list类型可以指定不同列的参数填补
df %>%
complete(group,nesting(item_id, item_name), fill=list(value1=2, value2=3))
# # A tibble: 4 x 5
# group item_id item_name value1 value2
# <dbl> <dbl> <chr> <int> <int>
# 1 1 1 a 1 4
# 2 1 2 b 3 6
# 3 2 1 a 2 3
# 4 2 2 b 2 5
Case Study
用的是tidyr::who
数据集
It contains redundant columns, odd variable codes, and many missing values.
- The best place to start is almost always to gather together the columns that are not variables.
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
#> # A tibble: 76,046 x 6
#> country iso2 iso3 year key cases
#> <chr> <chr> <chr> <int> <chr> <int>
#> 1 Afghanistan AF AFG 1997 new_sp_m014 0
#> 2 Afghanistan AF AFG 1997 new_sp_m1524 10
#> 3 Afghanistan AF AFG 1997 new_sp_m2534 6
#> 4 Afghanistan AF AFG 1997 new_sp_m3544 3
#> 5 Afghanistan AF AFG 1997 new_sp_m4554 5
#> 6 Afghanistan AF AFG 1997 new_sp_m5564 2
#> # … with 7.604e+04 more rows
含义解释如下:
who2 <- who1 %>%
mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
# 这个地方书上写错了,应该拆分的是新形成的列否则who2没有意义
who3 <- who2 %>%
separate(names_from, c("new", "type", "sexage"), sep = "_")
who4 <- who3 %>%
select(-new, -iso2, -iso3)
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
# 完整步骤如下
who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
) %>%
mutate(
key = stringr::str_replace(key, "newrel", "new_rel")
) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
Non-tidy data
There are good reasons to use other structures;
Tidy data is not the only way.