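Loading the data

To explore dplyr's basic data manipulation verbs, we will use nycflights13::flights. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics and is documented in ?nycflights13.

# Load the data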
library(nycflights13)
library(dplyr)
dim(flights)
dat <- flights
dat
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

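Single-table verbs

filter() selects rows.

arrange() reorders the rows.

select() selects columns.

rename() renames columns.

mutate() adds new columns.

summarise() collapses many values into a single value.

sample_n() and sample_frac() take random samples.

Filter rows with filter()

filter() lets you select a subset of the rows in a data frame. Like all the single-table verbs, its first argument is a tibble (or data frame). The second and subsequent arguments refer to variables in that data frame, and filter() keeps the rows for which every expression is TRUE.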

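# For example, we can select all flights on January 1st:
filter(dat, month == 1, day == 1)
# The same call written with the pipe; the result is stored in a and printed
a <- dat %>%
  filter(month == 1, day == 1)
a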
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# This is roughly equivalent to the following base R code
dat[dat$month == 1 & dat$day == 1, ]
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
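
The two approaches differ when the condition evaluates to NA, which does not happen for month and day but does for columns such as dep_delay. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical, not taken from nycflights13) showing that filter() drops such rows while base R logical indexing returns a row of NAs for each of them.

```r
library(dplyr)

# Hypothetical toy data: two of the five delays are missing
toy <- data.frame(
  id        = 1:5,
  dep_delay = c(10, NA, -3, 25, NA)
)

# filter() keeps only the rows where the condition is TRUE;
# rows where it is FALSE or NA are dropped
late_dplyr <- filter(toy, dep_delay > 5)

# Base R logical indexing returns an all-NA row for every NA in the condition
late_base <- toy[toy$dep_delay > 5, ]

# Wrapping the condition in which() makes base R drop the NA rows as well
late_which <- toy[which(toy$dep_delay > 5), ]

nrow(late_dplyr)
nrow(late_base)
late_which$id
```

For columns that may contain missing values, which() (or an explicit !is.na() term) is therefore needed to make the base R version behave like filter().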

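Arrange rows with arrange()

arrange() works similarly to filter(), except that instead of filtering or selecting rows it reorders them. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns: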

arrange(dat, year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# Use desc() to sort by a column in descending order:
arrange(dat, desc(arr_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     7    22     2257            759       898      121           1026
##  9  2013    12     5      756           1700       896     1058           2020
## 10  2013     5     3     1133           2055       878     1250           2215
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
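
To make tie-breaking and desc() concrete, here is a small sketch on made-up data (the toy tibble is hypothetical). It also illustrates that arrange() places missing values last, whether the sort is ascending or descending.

```r
library(dplyr)

# Hypothetical toy data with a tie (two delays of 5) and one missing delay
toy <- tibble(
  carrier   = c("AA", "UA", "AA", "UA", "DL"),
  arr_delay = c(5, NA, 30, 12, 5)
)

# Sort by delay; the tie at 5 is broken by carrier
by_delay <- arrange(toy, arr_delay, carrier)

# Largest delay first; the NA still ends up in the last row
by_delay_desc <- arrange(toy, desc(arr_delay))

by_delay$carrier
by_delay_desc$arr_delay
```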

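Select columns with select()

Often you work with large datasets that have many columns, of which only a few are actually of interest. select() lets you quickly zoom in on a useful subset, using operations (such as the year:day range below) that normally only work on numeric column positions: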

# Select columns by name
select(dat, year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# Select all columns between year and day (inclusive)
select(dat, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

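There are a number of helper functions you can use within select(), such as starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.

You can also rename variables with select() by using named arguments: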

select(dat, tail_num = tailnum)
## # A tibble: 336,776 x 1
##    tail_num
##    <chr>   
##  1 N14228  
##  2 N24211  
##  3 N619AA  
##  4 N804JB  
##  5 N668DN  
##  6 N39463  
##  7 N516JB  
##  8 N829AS  
##  9 N593JB  
## 10 N3ALAA  
## # ... with 336,766 more rows
# But because select() drops every variable that is not explicitly mentioned, this is not very useful. Use rename() instead:
rename(dat, tail_num = tailnum)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
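
The select() helpers mentioned above are not demonstrated in this post, so here is a brief sketch of how they might be used. The one-row toy tibble is hypothetical; it only borrows a few column names from flights.

```r
library(dplyr)

# Hypothetical one-row tibble reusing some flights column names
toy <- tibble(
  dep_time = 517L, dep_delay = 2,
  arr_time = 830L, arr_delay = 11,
  carrier  = "UA"
)

# Columns whose names start with "dep"
dep_cols <- names(select(toy, starts_with("dep")))

# Columns whose names end with "delay"
delay_cols <- names(select(toy, ends_with("delay")))

# Columns whose names contain an underscore
underscore_cols <- names(select(toy, contains("_")))

# Columns whose names match a regular expression
time_cols <- names(select(toy, matches("^(dep|arr)_time$")))

dep_cols
delay_cols
underscore_cols
time_cols
```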

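Add new columns with mutate()

Besides selecting sets of existing columns, it is often useful to add new columns that are functions of existing columns. This is the job of mutate():

mutate(dat,
  gain = arr_delay - dep_delay,     # new column: time made up in the air (minutes)
  speed = distance / air_time * 60  # new column: average speed (distance per hour)
)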
## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   gain <dbl>, speed <dbl>
# If you only want to keep the new variables, use transmute():
transmute(dat,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
## # A tibble: 336,776 x 2
##     gain gain_per_hour
##    <dbl>         <dbl>
##  1     9          2.38
##  2    16          4.23
##  3    31         11.6 
##  4   -17         -5.57
##  5   -19         -9.83
##  6    16          6.4 
##  7    24          9.11
##  8   -11        -12.5 
##  9    -5         -2.14
## 10    10          4.35
## # ... with 336,766 more rows
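
Note that in the transmute() call gain_per_hour is computed from gain, a column created earlier in the same call. The sketch below repeats this on a tiny made-up tibble (toy is hypothetical) so the arithmetic can be checked by hand.

```r
library(dplyr)

# Hypothetical toy data; air_time is in minutes, as in flights
toy <- tibble(
  arr_delay = c(11, -18),
  dep_delay = c(2, -6),
  air_time  = c(120, 90)
)

# mutate() keeps the existing columns and appends the new ones;
# gain_per_hour refers to gain, which is defined one line above it
with_gain <- mutate(toy,
  gain          = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# transmute() keeps only the columns it creates
only_gain <- transmute(toy, gain = arr_delay - dep_delay)

with_gain
only_gain
```

For the first row, gain is 11 - 2 = 9 minutes over a 2-hour flight, so gain_per_hour is 4.5.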

Summarise values with summarise()

The last verb is summarise(). It collapses a data frame to a single row.

summarise(dat,
  delay = mean(dep_delay, na.rm = TRUE)
)
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6
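
The na.rm = TRUE argument matters here because dep_delay contains missing values. As a minimal sketch on made-up data (the toy tibble is hypothetical), the following computes several summaries in one call and shows that the mean is NA unless missing values are removed.

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy <- tibble(dep_delay = c(2, 4, NA, -6))

stats <- summarise(toy,
  n_flights  = n(),                           # number of rows
  n_missing  = sum(is.na(dep_delay)),         # rows with a missing delay
  mean_naive = mean(dep_delay),               # NA, because one value is missing
  mean_delay = mean(dep_delay, na.rm = TRUE)  # mean of the three known values
)

stats
```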

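Randomly sample rows with sample_n() and sample_frac()

You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number of rows and sample_frac() for a fixed fraction.

sample_n(dat, 10) # randomly draw 10 rows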
## # A tibble: 10 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     5    20     2027           2031        -4     2325           2348
##  2  2013     4    25     1730           1700        30     2100           2030
##  3  2013     8     2     1347           1345         2     1625           1635
##  4  2013    10     7      608            615        -7      758            811
##  5  2013     4    22     2151           2005       106     2350           2200
##  6  2013     7    27      838            840        -2     1118           1135
##  7  2013     1    23     1601           1605        -4     1919           1925
##  8  2013     1     8      808            812        -4      949            950
##  9  2013     9     4      614            610         4      839            855
## 10  2013     5    21      831            834        -3     1040           1018
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
sample_frac(dat, 0.01) # randomly draw 1% of the rows
## # A tibble: 3,368 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     6    10     2146           2040        66     2321           2200
##  2  2013     4    10      657            700        -3      947            950
##  3  2013     3     9     1402           1335        27     1934           1936
##  4  2013     4    25      954           1000        -6     1158           1115
##  5  2013     4    20     1917           1920        -3     2025           2050
##  6  2013     3     2     1605           1520        45     1706           1627
##  7  2013     9    29     1159           1200        -1     1322           1334
##  8  2013    10    26     1300           1300         0     1444           1435
##  9  2013     5     5      556            600        -4      848            911
## 10  2013    12    14      904            905        -1     1131           1111
## # ... with 3,358 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
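
Because the rows are drawn at random, the output above will differ from run to run. A minimal sketch, assuming a made-up 20-row tibble (toy is hypothetical), of how set.seed() can make the sample reproducible:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = 1:20)

# Setting the same seed before each call yields the same sample
set.seed(2013)
s1 <- sample_n(toy, 5)
set.seed(2013)
s2 <- sample_n(toy, 5)
identical(s1, s2)

# 25% of 20 rows is 5 rows
quarter <- sample_frac(toy, 0.25)
nrow(quarter)
```

By default both functions sample without replacement, so no row appears twice in a sample.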

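Grouped operations

The dplyr verbs are useful on their own, but they become much more powerful when you apply them to groups of observations within a dataset. In dplyr you do this with the group_by() function, which breaks a dataset down into the specified groups of rows. When you then apply the verbs above to the resulting object, they are automatically applied "by group".
Grouping affects the verbs as follows:

Grouped select() is the same as ungrouped select(), except that the grouping variables are always retained.

Grouped arrange() is the same as ungrouped arrange(), unless you set .by_group = TRUE, in which case it orders first by the grouping variables.

mutate() and filter() are most useful in conjunction with window functions (such as rank(), or min(x) == x). They are described in detail in vignette("window-functions").

sample_n() and sample_frac() sample the specified number/fraction of rows in each group.

summarise() computes the summary for each group.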
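
The first two points can be sketched on a small made-up tibble (toy and its values are hypothetical): a grouped select() keeps the grouping column, and a grouped arrange() only respects the groups when .by_group = TRUE.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  carrier = c("UA", "AA", "UA", "AA"),
  flight  = c(4L, 3L, 2L, 1L),
  dist    = c(500, 700, 900, 1100)
)
by_carrier <- group_by(toy, carrier)

# carrier is kept although only dist was requested
# (dplyr prints a message about adding the grouping variable)
kept_cols <- names(select(by_carrier, dist))

# By default arrange() ignores the grouping
plain <- arrange(by_carrier, flight)$flight

# With .by_group = TRUE it sorts by carrier first, then by flight
grouped <- arrange(by_carrier, flight, .by_group = TRUE)$flight

kept_cols
plain
grouped
```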

# Group the flights by plane (tail number) and print the grouped tibble
by_tailnum <- group_by(dat, tailnum)
by_tailnum
## # A tibble: 336,776 x 19
## # Groups:   tailnum [4,044]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
delay <- summarise(by_tailnum,
  count = n(),                           # number of flights per plane
  dist = mean(distance, na.rm = TRUE),   # mean distance
  delay = mean(arr_delay, na.rm = TRUE)) # mean arrival delay

delay <- filter(delay, count > 20, dist < 2000)
library(ggplot2) # plot the result
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

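[Figure 01.png: mean arrival delay (delay) against mean distance (dist) per tail number, point size scaled by flight count, with a smoothed trend line]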
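
The grouped behaviour of summarise(), filter() and mutate() described in this section can also be checked by hand on a small example. This is only a sketch on made-up data (toy and its tail numbers are hypothetical), using max() and min_rank() as the window-style functions.

```r
library(dplyr)

# Hypothetical toy data: two planes with a few arrival delays each
toy <- tibble(
  tailnum   = c("N1", "N1", "N2", "N2", "N2"),
  arr_delay = c(5, 20, -3, 40, 12)
)
by_plane <- group_by(toy, tailnum)

# Grouped summarise(): one row per plane
per_plane <- summarise(by_plane, count = n(), delay = mean(arr_delay))

# Grouped filter(): keep each plane's largest delay
worst <- filter(by_plane, arr_delay == max(arr_delay))

# Grouped mutate(): rank delays within each plane, largest first
ranked <- mutate(by_plane, delay_rank = min_rank(desc(arr_delay)))

per_plane
worst
ranked
```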

dplyr data filtering: learning R packages (part 2)
