[R语言] tidyr包数据整理《R for data science》 6

《R for Data Science》第十二章 Tidy data 啃书知识点积累
参考链接：R for Data Science

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Tidy data

数据整理的三条准则：

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

三条准则简化后的两条指令

Put each dataset in a tibble.
Put each variable in a column.

优秀dataset的例子：

table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

需要整理的dataset例子

table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

# Spread across two tibbles
table4a  # cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

table4b  # population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

- Exercises

Q：Compute the rate for table2, and table4a + table4b.

（两种类型table均复原为table1）

table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

t2_cases <- table2 %>% 
  filter(type == "cases") %>%
  rename(cases = count) %>%
  arrange(country, year)

t2_population <- table2 %>% 
  filter(type == "population") %>%
  rename(population = count) %>%
  arrange(country, year)

(t2_cases_per_cap <- tibble(
  country = t2_cases$country,
  year = t2_cases$year,
  cases = t2_cases$cases,
  population = t2_population$population) %>%
    mutate(cases_per_cap = (cases / population) * 10000))

table4a  # cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

table4b  # population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

(table4c <-
  tibble(
    country = table4a$country,
    `1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
    `2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
  ))
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#>   <chr>        <dbl>  <dbl>
#> 1 Afghanistan  0.373   1.29
#> 2 Brazil       2.19    4.61
#> 3 China        1.67    1.67

# 也可以将table4a和table4b还原成table1再处理
# 先利用比较传统的方法，下文有更好的工具：pivot_longer()

t4a_1 <- table4a %>% 
  select(1:2) %>% 
  rename(cases = `1999`) %>% 
  mutate(year = 1999)

t4a_2 <- table4a %>% 
  select(c(1,3)) %>% 
  rename(cases = `2000`) %>% 
  mutate(year = 2000)

t4a <- rbind(t4a_1,t4a_2) %>% 
  arrange(country,year)

t4b_1 <- table4b %>% 
  select(1:2) %>% 
  rename(population = `1999`) %>% 
  mutate(year = 1999)

t4b_2 <- table4b %>% 
  select(c(1,3)) %>% 
  rename(population = `2000`) %>% 
  mutate(year = 2000)

t4b <- rbind(t4b_1,t4b_2) %>% 
  arrange(country,year)

(t4_cases_per <- tibble(
  country = t4a$country,
  year = t4a$year,
  cases = t4a$cases,
  population = t4b$population) %>%
    mutate(cases_per_cap = (cases / population) * 10000))

Pivoting

- pivot_longer()

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

简而言之就是一行有多个观测值

# 适用类型
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

Take table4a:
the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.

解决策略：

The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000.
The name of the variable to move the column names to. Here it is year.
The name of the variable to move the column values to. Here it’s cases.

table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

year and cases do not exist in table4a so we put their names in quotes.

tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# 左连接两表
dplyr::left_join(tidy4a, tidy4b)
#> Joining, by = c("country", "year")
#> # A tibble: 6 x 4
#>   country     year   cases population
#>   <chr>       <chr>  <int>      <int>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

# 附另一个练习
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

preg %>% 
  pivot_longer(c(male,female),names_to = 'sex',values_to = 'count')

- pivot_wider()

Take table2:
an observation is a country in a year, but each observation is spread across two rows.

简而言之就是一列有多个变量

解决策略：

The column to take variable names from. Here, it’s type.
The column to take values from. Here it’s count.

# 适用类型
table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

table2 %>%
    pivot_wider(names_from = type, values_from = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

pivot_longer() makes wide tables narrower and longer
pivot_wider() makes long tables shorter and wider

- 和gather spread的对比

gather等价于pivot_longer

table4a
## A tibble: 3 x 3
#    country     `1999` `2000`
# * <chr>        <int>  <int>
# 1 Afghanistan    745   2666
# 2 Brazil       37737  80488
# 3 China       212258 213766

table4a %>% 
  pivot_longer(c(`1999`,`2000`),names_to = 'year',values_to = 'value')
# 等价于
table4a %>% 
  gather(c(`1999`,`2000`),key = 'year',value = 'value')

spread等价于pivot_wider

table2
# # A tibble: 12 x 4
#   country      year type            count
#   <chr>       <int> <chr>           <int>
# 1 Afghanistan  1999 cases             745
# 2 Afghanistan  1999 population   19987071
# 3 Afghanistan  2000 cases            2666
# 4 Afghanistan  2000 population   20595360
# 5 Brazil       1999 cases           37737
# 6 Brazil       1999 population  172006362
# 7 Brazil       2000 cases           80488
# 8 Brazil       2000 population  174504898
# 9 China        1999 cases          212258
# 10 China        1999 population 1272915272
# 11 China        2000 cases          213766
# 12 China        2000 population 1280428583

table2 %>% 
  pivot_wider(names_from = type,values_from = count)
# 等价于
table2 %>% 
  spread(type,count)
# # A tibble: 6 x 4
#   country      year  cases population
#   <chr>       <int>  <int>      <int>
# 1 Afghanistan  1999    745   19987071
# 2 Afghanistan  2000   2666   20595360
# 3 Brazil       1999  37737  172006362
# 4 Brazil       2000  80488  174504898
# 5 China        1999 212258 1272915272
# 6 China        2000 213766 1280428583

- Exercises

Q: Why are pivot_longer() and pivot_wider() not perfectly symmetrical

问题的关键是列属性发生了丢失

可以再gather或者pivot_longer中指定参数

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(c(`2015`,`2016`), names_to = "year", values_to = "return",
               names_ptype = list(year = double()))

stocks %>%
  spread(key = "year", value = "return") %>%
  gather(c(`2015`,`2016`), key = "year", value = "return", convert = TRUE)

Q：有重复值的dataset如何使用pivot_wider

people <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

# 添加一列以区别各行
people <- people %>% 
  group_by(name,names) %>% 
  mutate(num = row_number())

# 即可区分
people %>% 
  pivot_wider(names_from = names, values_from = values)

Separating and uniting

- Separate

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears.
By default, separate() will split values wherever it sees a non-alphanumeric character

table3
# # A tibble: 6 x 3
#   country      year rate             
# * <chr>       <int> <chr>            
# 1 Afghanistan  1999 745/19987071     
# 2 Afghanistan  2000 2666/20595360    
# 3 Brazil       1999 37737/172006362  
# 4 Brazil       2000 80488/174504898  
# 5 China        1999 212258/1272915272
# 6 China        2000 213766/1280428583

table3 %>% 
  separate(rate, into = c("cases", "population"))
# # A tibble: 6 x 4
#   country      year cases  population
#   <chr>       <int> <chr>  <chr>     
# 1 Afghanistan  1999 745    19987071  
# 2 Afghanistan  2000 2666   20595360  
# 3 Brazil       1999 37737  172006362 
# 4 Brazil       2000 80488  174504898 
# 5 China        1999 212258 1272915272
# 6 China        2000 213766 1280428583

相关参数设置

(1) sep 指定分隔符

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

若sep后跟的是数字则表示从第几个位置切开，右一是-1

table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
#> # A tibble: 6 x 4
#>   country     century year  rate             
#>   <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583

(2) convert 分割后解析成合适的类型
不设定则会保留chr类型

table5 <- table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

(3) extra 少数记录分隔符较多的情况

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"),extra = 'warn') # 默认值是drop
# # A tibble: 3 x 3
#   one   two   three
#   <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     f    
# 3 h     i     j

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"),extra = 'drop')
# # A tibble: 3 x 3
#   one   two   three
#   <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     f    
# 3 h     i     j

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"),extra = 'merge')
# # A tibble: 3 x 3
#   one   two   three
#   <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     f,g  
# 3 h     i     j

(4) fill 少数记录分隔符较少的情况

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = 'warn') # 默认值是right
# # A tibble: 3 x 3
#   one   two   three
#  <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     NA   
# 3 f     g     i 

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = 'left')
# # A tibble: 3 x 3
#   one   two   three
#   <chr> <chr> <chr>
# 1 a     b     c    
# 2 NA    d     e    
# 3 f     g     i 

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = 'right')
# # A tibble: 3 x 3
#   one   two   three
#  <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     NA   
# 3 f     g     i

(5) remove 选择是否保留原列，unite中也可以设置

table3 %>% 
  separate(rate, into = c("cases", "population"),remove = F)
# # A tibble: 6 x 5
#   country      year rate              cases  population
#   <chr>       <int> <chr>             <chr>  <chr>     
# 1 Afghanistan  1999 745/19987071      745    19987071  
# 2 Afghanistan  2000 2666/20595360     2666   20595360  
# 3 Brazil       1999 37737/172006362   37737  172006362 
# 4 Brazil       2000 80488/174504898   80488  174504898 
# 5 China        1999 212258/1272915272 212258 1272915272
# 6 China        2000 213766/1280428583 213766 1280428583

table5 %>% 
  unite(new, century, year,remove = F)
#> # A tibble: 6 x 3
#     country     new   century year  rate             
#     <chr>       <chr> <chr>   <chr> <chr>            
# 1 Afghanistan 19_99 19      99    745/19987071     
# 2 Afghanistan 20_00 20      00    2666/20595360    
# 3 Brazil      19_99 19      99    37737/172006362  
# 4 Brazil      20_00 20      00    80488/174504898  
# 5 China       19_99 19      99    212258/1272915272
# 6 China       20_00 20      00    213766/1280428583

- unite

unite() is the inverse of separate(): it combines multiple columns into a single column.

# 默认的连接符是'_'
table5 %>% 
  unite(new, century, year)
#> # A tibble: 6 x 3
#>   country     new   rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 19_99 745/19987071     
#> 2 Afghanistan 20_00 2666/20595360    
#> 3 Brazil      19_99 37737/172006362  
#> 4 Brazil      20_00 80488/174504898  
#> 5 China       19_99 212258/1272915272
#> 6 China       20_00 213766/1280428583

默认的连接符是'_'，也可用sep指定

table5 %>% 
  unite(new, century, year, sep = "")
#> # A tibble: 6 x 3
#>   country     new   rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

- extract

支持使用正则表达式，比separate更灵活

# example with separators
tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])_([0-9])")
#> # A tibble: 4 x 2
#>   variable id   
#>   <chr>    <chr>
#> 1 X        1    
#> 2 X        2    
#> 3 A        1    
#> 4 A        2

# example with position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])([0-9])")
#> # A tibble: 4 x 2
#>   variable id   
#>   <chr>    <chr>
#> 1 X        1    
#> 2 X        2    
#> 3 Y        1    
#> 4 Y        2

# example that separate could not parse
# 尤其是这个例子，separate只能按分隔符或者按数字位置切割
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z]+)([0-9]+)")
#> # A tibble: 4 x 2
#>   variable id   
#>   <chr>    <chr>
#> 1 X        1    
#> 2 X        20   
#> 3 AA       11   
#> 4 AA       2

Missing values

A value can be missing in one of two possible ways:

Explicitly, i.e. flagged with NA.

Implicitly, i.e. simply not present in the data.

下面这句解释非常哲学又令人豁然开朗：

用values_drop_na可以去除缺失值

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )
#> # A tibble: 6 x 3
#>     qtr year  return
#>   <dbl> <chr>  <dbl>
#> 1     1 2015    1.88
#> 2     2 2015    0.59
#> 3     2 2016    0.92
#> 4     3 2015    0.35
#> 5     3 2016    0.17
#> 6     4 2016    2.66

用complete可以显示显式和隐式缺失

stocks %>% 
  complete(year, qtr)
# # A tibble: 8 x 3
#   year   qtr return
#   <dbl> <dbl>  <dbl>
# 1  2015     1   1.88
# 2  2015     2   0.59
# 3  2015     3   0.35
# 4  2015     4  NA   
# 5  2016     1  NA   
# 6  2016     2   0.92
# 7  2016     3   0.17
# 8  2016     4   2.66

可以用nesting确定组合便于暴露NA

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6

df %>% 
  complete(group)
# # A tibble: 3 x 5
#    group item_id item_name value1 value2
#    <dbl>   <dbl> <chr>      <int>  <int>
# 1     1       1 a              1      4
# 2     1       2 b              3      6
# 3     2       2 b              2      5

df %>% 
  complete(group,nesting(item_id, item_name))
# # A tibble: 4 x 5
#    group item_id item_name value1 value2
#    <dbl>   <dbl> <chr>      <int>  <int>
# 1     1       1 a              1      4
# 2     1       2 b              3      6
# 3     2       1 a             NA     NA
# 4     2       2 b              2      5

用fill完成缺失值填充

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

treatment %>% 
  fill(person)  # 默认是向下填充，可以用.direction指定方向
#> # A tibble: 4 x 3
#>   person           treatment response
#>   <chr>                <dbl>    <dbl>
#> 1 Derrick Whitmore         1        7
#> 2 Derrick Whitmore         2       10
#> 3 Derrick Whitmore         3        9
#> 4 Katherine Burke          1        4

treatment %>% 
  fill(person,.direction = 'down') # 向下填充

treatment %>% 
  fill(person,.direction = 'up') # 向上填充

complete中也有fill参数，是list类型可以指定不同列的参数填补

df %>% 
  complete(group,nesting(item_id, item_name), fill=list(value1=2, value2=3))
# # A tibble: 4 x 5
#    group item_id item_name value1 value2
#    <dbl>   <dbl> <chr>      <int>  <int>
# 1     1       1 a              1      4
# 2     1       2 b              3      6
# 3     2       1 a              2      3
# 4     2       2 b              2      5

Case Study

用的是tidyr::who数据集

It contains redundant columns, odd variable codes, and many missing values.

The best place to start is almost always to gather together the columns that are not variables.

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
#> # A tibble: 76,046 x 6
#>   country     iso2  iso3   year key          cases
#>   <chr>       <chr> <chr> <int> <chr>        <int>
#> 1 Afghanistan AF    AFG    1997 new_sp_m014      0
#> 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
#> 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
#> 4 Afghanistan AF    AFG    1997 new_sp_m3544     3
#> 5 Afghanistan AF    AFG    1997 new_sp_m4554     5
#> 6 Afghanistan AF    AFG    1997 new_sp_m5564     2
#> # … with 7.604e+04 more rows

含义解释如下：

who2 <- who1 %>% 
  mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))

# 这个地方书上写错了，应该拆分的是新形成的列否则who2没有意义
who3 <- who2 %>% 
  separate(names_from, c("new", "type", "sexage"), sep = "_")

who4 <- who3 %>% 
  select(-new, -iso2, -iso3)

who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)

# 完整步骤如下
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

Non-tidy data

There are good reasons to use other structures;
Tidy data is not the only way.

最后编辑于：2020.04.21 16:51:18

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 205,132评论 6赞 478
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 87,802评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 151,566评论 0赞 338
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,858评论 1赞 277
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,867评论 5赞 368
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,695评论 1赞 282
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,064评论 3赞 399
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,705评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 42,915评论 1赞 300
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,677评论 2赞 323
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,796评论 1赞 333
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,432评论 4赞 322
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,041评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,992评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,223评论 1赞 260
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,185评论 2赞 352
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,535评论 2赞 343

[R语言] tidyr包 数据整理《R for data science》 6

Tidy data

- Exercises

Pivoting

- pivot_longer()

- pivot_wider()

- 和gather spread的对比

- Exercises

Separating and uniting

- Separate

- unite

- extract

Missing values

Case Study

Non-tidy data

推荐阅读更多精彩内容

[R语言] tidyr包数据整理《R for data science》 6