R Study Notes Summary 05.05
R Basics: Data Transformation with dplyr
install.packages("tidyverse")
install.packages("nycflights13")  # remember to install the packages first
library(nycflights13)  # flight data set
library(tidyverse)
?flights  # open the documentation for the data set
flights   # view the flight data
6. Working with group_by()
6.1 Grouping by multiple variables
daily <- group_by(flights,year,month,day)
(per_day <- summarise(daily,flights=n()))
(per_month <- summarise(per_day,flights=sum(flights)))
(per_year <- summarise(per_month,flights=sum(flights)))
# Output:
# A tibble: 365 x 4
# Groups: year, month [12]
year month day flights
<int> <int> <int> <int>
1 2013 1 1 842
2 2013 1 2 943
3 2013 1 3 914
4 2013 1 4 915
5 2013 1 5 720
6 2013 1 6 832
7 2013 1 7 933
8 2013 1 8 899
9 2013 1 9 902
10 2013 1 10 932
# ... with 355 more rows
> (per_month <- summarise(per_day,flights=sum(flights)))
`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
# A tibble: 12 x 3
# Groups: year [1]
year month flights
<int> <int> <int>
1 2013 1 27004
2 2013 2 24951
3 2013 3 28834
4 2013 4 28330
5 2013 5 28796
6 2013 6 28243
7 2013 7 29425
8 2013 8 29327
9 2013 9 27574
10 2013 10 28889
11 2013 11 27268
12 2013 12 28135
> (per_year <- summarise(per_month,flights=sum(flights)))
# A tibble: 1 x 2
year flights
<int> <int>
1 2013 336776
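Note: when a summary is rolled up level by level like this, sums and counts still give the correct result (the sum of the group sums is the overall sum). Rank-based statistics such as the median do not behave this way: the median of the group medians is generally not the overall median, so take care when summarising progressively.
6.2 Removing grouping with ungroup()
ungroup() removes the existing grouping and returns the data to an ungrouped state.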
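The difference between sums and medians is easy to check on a small example. The sketch below uses a made-up tibble (`toy`, with hypothetical columns `g` and `x`), not the flights data:

```r
library(dplyr)

# Hypothetical toy data: two groups of unequal size
toy <- tibble(
  g = c("a", "a", "a", "b"),
  x = c(1, 2, 3, 100)
)

per_group <- toy %>%
  group_by(g) %>%
  summarise(total = sum(x), med = median(x))

# Sums roll up exactly: the sum of the group sums is the overall sum
sum_of_sums <- sum(per_group$total)
overall_sum <- sum(toy$x)

# Medians do not: the median of the group medians differs from the overall median
median_of_medians <- median(per_group$med)
overall_median    <- median(toy$x)
```

Here `sum_of_sums` and `overall_sum` agree, while `median_of_medians` and `overall_median` do not.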
daily%>%
ungroup()%>%
summarise(flights=n())
# Output:
# A tibble: 1 x 1
flights
<int>
1 336776
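Compare this with the code in 6.1: because ungroup() removed the grouping, summarise() returns a single row counting all 336,776 flights instead of one row per day.
6.3 Grouped filters and mutates
group_by() is also useful together with functions other than summarise().
6.3.1 Combining group_by() with filter()
For example, find the flights with the longest arrival delays on each day. Because the data is grouped by day, the ranking is computed within each day, and the condition rank < 10 keeps the flights ranked below 10 (roughly the nine worst per day, not ten):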
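The effect of ungroup() can also be seen on a small example. This is a minimal sketch with a made-up tibble (`toy` is hypothetical); `group_vars()` is the dplyr helper that lists the current grouping columns:

```r
library(dplyr)

# Hypothetical toy data: five flights spread over four distinct (month, day) pairs
toy <- tibble(
  month = c(1, 1, 2, 2, 2),
  day   = c(1, 2, 1, 1, 2)
)

grouped <- group_by(toy, month, day)
group_vars(grouped)  # the grouping columns: "month" and "day"

# Grouped: one row per (month, day) combination
by_group <- summarise(grouped, flights = n(), .groups = "drop")

# Ungrouped: a single row for the whole table
overall <- grouped %>%
  ungroup() %>%
  summarise(flights = n())
```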
flights%>%
group_by(year,month,day) %>%
filter(rank(desc(arr_delay))<10)
# Output:
# A tibble: 3,306 x 19
# Groups: year, month, day [365]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 848 1835 853 1001 1950 851
2 2013 1 1 1815 1325 290 2120 1542 338
3 2013 1 1 1842 1422 260 1958 1535 263
4 2013 1 1 1942 1705 157 2124 1830 174
5 2013 1 1 2006 1630 216 2230 1848 222
6 2013 1 1 2115 1700 255 2330 1920 250
7 2013 1 1 2205 1720 285 46 2040 246
8 2013 1 1 2312 2000 192 21 2110 191
9 2013 1 1 2343 1724 379 314 1938 456
10 2013 1 2 1244 900 224 1431 1104 207
# ... with 3,296 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
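A note on the ranking condition: a strict comparison such as rank(...) < n keeps ranks 1 to n - 1, so an exact "top n" needs <=. The sketch below uses a made-up vector of delays to show this, and uses dplyr's min_rank(), which gives tied values the same (smallest) rank. It is one possible way to write a top-n filter, not the only one:

```r
library(dplyr)

# Hypothetical delays, already distinct so there are no ties
delays <- c(50, 40, 30, 20, 10)

ranks <- rank(desc(delays))  # the largest delay gets rank 1

# Strict "<" drops the boundary rank: only ranks 1 and 2 survive
below3 <- delays[ranks < 3]

# "<=" keeps exactly the three largest delays
top3 <- tibble(delay = delays) %>%
  filter(min_rank(desc(delay)) <= 3)
```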
Another example: keep all groups that are larger than a given threshold.
popular_dests <- flights %>%
group_by(dest) %>%
filter(n()>365)  # destinations with more than 365 flights
popular_dests  # view the table
# Output:
# A tibble: 332,577 x 19
# Groups: dest [77]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 517 515 2 830 819 11
2 2013 1 1 533 529 4 850 830 20
3 2013 1 1 542 540 2 923 850 33
4 2013 1 1 544 545 -1 1004 1022 -18
5 2013 1 1 554 600 -6 812 837 -25
6 2013 1 1 554 558 -4 740 728 12
7 2013 1 1 555 600 -5 913 854 19
8 2013 1 1 557 600 -3 709 723 -14
9 2013 1 1 557 600 -3 838 846 -8
10 2013 1 1 558 600 -2 753 745 8
# ... with 332,567 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
6.3.2 Combining group_by() with mutate()
popular_dests %>%
filter(arr_delay>0) %>%
mutate(prop_delay=arr_delay/sum(arr_delay)) %>%  # add a new column; sum() is computed per dest
select(year:day,dest,arr_delay,prop_delay)  # show only these columns
# Output:
# A tibble: 131,106 x 6
# Groups: dest [77]
year month day dest arr_delay prop_delay
<int> <int> <int> <chr> <dbl> <dbl>
1 2013 1 1 IAH 11 0.000111
2 2013 1 1 IAH 20 0.000201
3 2013 1 1 MIA 33 0.000235
4 2013 1 1 ORD 12 0.0000424
5 2013 1 1 FLL 19 0.0000938
6 2013 1 1 ORD 8 0.0000283
7 2013 1 1 LAX 7 0.0000344
8 2013 1 1 DFW 31 0.000282
9 2013 1 1 ATL 12 0.0000400
10 2013 1 1 DTW 16 0.000116
# ... with 131,096 more rows
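Because popular_dests is still grouped by dest, sum(arr_delay) inside mutate() is the total for each destination, so prop_delay is each flight's share of its destination's total delay. A minimal sketch with made-up values (the tibble `toy` is hypothetical) confirms that the shares add up to 1 within each group:

```r
library(dplyr)

# Hypothetical toy data with two destinations
toy <- tibble(
  dest      = c("IAH", "IAH", "MIA", "MIA", "MIA"),
  arr_delay = c(10, 30, 5, 15, 20)
)

shares <- toy %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay))  # sum() is taken per dest

# Within each destination the shares should total 1
check <- shares %>%
  summarise(total = sum(prop_delay))
```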
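Note: as covered earlier, the colon in year:day selects every column from year through day, including year and day themselves. Do not type a double colon (::) by mistake; that is a different operator, which accesses a function or object inside a package, as in dplyr::select.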
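The two operators can be compared side by side. This is a small sketch on a made-up one-row tibble (`toy` is hypothetical, with column names borrowed from flights):

```r
library(dplyr)

# Hypothetical one-row table
toy <- tibble(year = 2013, month = 1, day = 1, dep_time = 517, dest = "IAH")

# Single colon: a range of adjacent columns, from year through day
cols <- toy %>% select(year:day, dest)
names(cols)  # year, month, day, dest

# Double colon: call select() from the dplyr package explicitly
same <- dplyr::select(toy, year:day, dest)
```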
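Summary:
dplyr has many functions; the earlier sections introduced the five most basic ones:
- filter() picks rows by their values
- arrange() reorders rows, keeping all columns
- select() picks columns and drops the rest
- mutate() adds new variables
- summarise() collapses many values into a summary, and is often combined with group_by()
Combined with group_by(), all five functions become considerably more versatile.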
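As a recap, the five functions can be chained in a single pipeline together with group_by(). The sketch below runs on a made-up tibble; the values in `toy` and the derived `speed` column are illustrative assumptions, not figures from the flights data:

```r
library(dplyr)

# Hypothetical toy data (distance in miles, air_time in minutes)
toy <- tibble(
  dest     = c("IAH", "IAH", "MIA", "MIA", "ORD"),
  distance = c(1400, 1400, 1089, 1089, 719),
  air_time = c(227, 220, 160, NA, 110)
)

result <- toy %>%
  filter(!is.na(air_time)) %>%                  # keep rows with a known air time
  mutate(speed = distance / air_time * 60) %>%  # add a column: miles per hour
  select(dest, speed) %>%                       # keep only these columns
  group_by(dest) %>%
  summarise(mean_speed = mean(speed)) %>%       # one row per destination
  arrange(desc(mean_speed))                     # sort rows, fastest first
```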