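First, load the required packages. mutate() belongs to the dplyr package; here we load it through the tidyverse for convenience.
The tidyverse bundles a set of packages for data wrangling and plotting. Load it as follows: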
> library(tidyverse)
-- Attaching packages ------------------------ tidyverse 1.3.0 --
√ ggplot2 3.3.3 √ purrr 0.3.4
√ tibble 3.0.5 √ dplyr 1.0.3
√ tidyr 1.1.2 √ stringr 1.4.0
√ readr 1.4.0 √ forcats 0.5.1
-- Conflicts --------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
References
https://dplyr.tidyverse.org/reference/mutate_all.html
The textbook R for Data Science (《R数据科学》)
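The mutate function
mutate() adds columns to a data frame. By default, new columns are appended at the end of the dataset, and a new column can be used as soon as it is created, even later in the same mutate() call.
A simple example: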
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# Add a new column at the end
> mutate(iris, new_col = Petal.Length + Petal.Width) %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_col
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.6
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.7
5 5.0 3.6 1.4 0.2 setosa 1.6
6 5.4 3.9 1.7 0.4 setosa 2.1
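PS: %>% is the pipe operator. It passes the data on its left to the function on its right, which avoids nested function calls and makes the code easier to read.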
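The claim that a new column is usable right away can be checked with a short sketch. The column names petal_sum, petal_half, and ratio are made up for illustration, and the .before argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# petal_half refers to petal_sum, which is created in the same mutate() call
res <- iris %>%
  mutate(
    petal_sum  = Petal.Length + Petal.Width,
    petal_half = petal_sum / 2
  )
head(res, 3)

# Assumption: dplyr >= 1.0.0, where .before / .after override the
# default "append at the end" placement
res_front <- iris %>%
  mutate(ratio = Petal.Length / Petal.Width, .before = 1)
names(res_front)
```

Without .before or .after, both new columns land after Species, matching the default behavior described above.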
mutate() also has three variants:
mutate_at(); mutate_if(); mutate_all()
The official site explains the three suffixes as follows:
_all: affects every variable
_at: affects variables selected with a character vector or vars()
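_if: affects variables selected with a predicate function
In other words, _all targets every column, _at targets specific columns, and _if targets the columns that satisfy a condition.
Their usage is as follows: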
mutate_all(.tbl, .funs, ...)
mutate_if(.tbl, .predicate, .funs, ...)
mutate_at(.tbl, .vars, .funs, ..., .cols = NULL)
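To see how the three signatures differ in practice, here is a minimal sketch on a toy tibble. The data frame df and its columns are invented for illustration; only the three dplyr functions come from the text above.

```r
library(dplyr)

df <- tibble(id = c("a", "b"), x = c(1.26, 2.54), y = c(10.4, 20.6))

# _all: the function is applied to every column, including id
r_all <- df %>% mutate_all(as.character)

# _if: only columns for which the predicate is.numeric() returns TRUE
r_if <- df %>% mutate_if(is.numeric, round)

# _at: only the columns named in vars(); digits = 1 is forwarded through ...
r_at <- df %>% mutate_at(vars(x), round, digits = 1)

r_all
r_if
r_at
```

In each call the first argument after the data says which columns to touch (nothing for _all, a predicate for _if, a selection for _at), and the next argument says what to do with them.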
Let's walk through the examples given on the official site.
mutate_at
scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)
starwars %>% mutate_at(c("height", "mass"), scale2)
# A tibble: 87 x 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke S~ NA NA blond fair blue 19 male mascu~
2 C-3PO NA NA NA gold yellow 112 none mascu~
3 R2-D2 NA NA NA white, bl~ red 33 none mascu~
4 Darth ~ NA NA none white yellow 41.9 male mascu~
5 Leia O~ NA NA brown light brown 19 fema~ femin~
6 Owen L~ NA NA brown, gr~ light blue 52 male mascu~
7 Beru W~ NA NA brown light blue 47 fema~ femin~
8 R5-D4 NA NA NA white, red red NA none mascu~
9 Biggs ~ NA NA black light brown 24 male mascu~
10 Obi-Wa~ NA NA auburn, w~ fair blue-gray 57 male mascu~
# ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
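This applies scale2 to the height and mass columns. Every result is NA here because na.rm defaults to FALSE and both columns contain missing values.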
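To get usable values from this call, na.rm = TRUE can be forwarded to scale2 through the ... argument of mutate_at(). The sketch below first counts the missing values and then reruns the scaling; the object names na_counts and scaled are illustrative.

```r
library(dplyr)

scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)

# How many missing values does each column hold?
na_counts <- starwars %>%
  summarise(across(c(height, mass), ~ sum(is.na(.x))))
na_counts

# Extra arguments after the function are passed on to scale2
scaled <- starwars %>%
  mutate_at(c("height", "mass"), scale2, na.rm = TRUE)
scaled %>% select(name, height, mass) %>% head()
```

Rows that were missing to begin with stay NA; all other rows now hold standardized values.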
The following two commands are equivalent:
starwars %>% mutate_at(c("height", "mass"), scale2)
starwars %>% mutate(across(c("height", "mass"), scale2))
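PS: across() applies one or more functions to every selected column at once. Used inside mutate(), it does the job of mutate_at(), and since dplyr 1.0.0 it is the recommended replacement for the _at, _if, and _all variants.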
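The equivalence can be verified directly by comparing the transformed columns. This is a small check rather than a proof; na.rm = TRUE is added in both calls so that the comparison is made on real numbers instead of NA, and the names old_style and new_style are illustrative.

```r
library(dplyr)

scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)

old_style <- starwars %>%
  mutate_at(c("height", "mass"), ~ scale2(., na.rm = TRUE))

# across() accepts bare column names as well as quoted ones
new_style <- starwars %>%
  mutate(across(c(height, mass), ~ scale2(.x, na.rm = TRUE)))

all.equal(old_style$height, new_style$height)
all.equal(old_style$mass, new_style$mass)
identical(names(old_style), names(new_style))
```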
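Inside mutate_at(), vars() selects the columns (a minus sign excludes a column) and funs() wraps the function to apply. Note that funs() has been deprecated since dplyr 0.8.0; a bare function, a list(), or a ~ lambda does the same job.
For example: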
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# ~ log(.) gives the same result as the deprecated funs(log(.))
> mutate_at(iris, vars(-Species), ~ log(.)) %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1.629241 1.252763 0.3364722 -1.6094379 setosa
2 1.589235 1.098612 0.3364722 -1.6094379 setosa
3 1.547563 1.163151 0.2623643 -1.6094379 setosa
4 1.526056 1.131402 0.4054651 -1.6094379 setosa
5 1.609438 1.280934 0.3364722 -1.6094379 setosa
6 1.686399 1.360977 0.5306283 -0.9162907 setosa
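Besides excluding a column with a minus sign, vars() accepts the usual selection helpers. The sketch below shows two common forms; the object names petal_logged and sepal_logged are illustrative.

```r
library(dplyr)

# Select by name prefix
petal_logged <- iris %>%
  mutate_at(vars(starts_with("Petal")), log)

# Select a contiguous range of columns
sepal_logged <- iris %>%
  mutate_at(vars(Sepal.Length:Sepal.Width), log)

head(petal_logged, 3)
head(sepal_logged, 3)
```

Columns outside the selection are returned unchanged, so each call here only rewrites two of the four measurement columns.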
mutate_if
starwars %>% mutate_if(is.numeric, scale2, na.rm = TRUE)
# A tibble: 87 x 14
name height mass hair_color skin_color eye_color birth_year sex
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Luke Skyw~ -0.0678 -0.120 blond fair blue -0.443 male
2 C-3PO -0.212 -0.132 NA gold yellow 0.158 none
3 R2-D2 -2.25 -0.385 NA white, bl~ red -0.353 none
4 Darth Vad~ 0.795 0.228 none white yellow -0.295 male
5 Leia Orga~ -0.701 -0.285 brown light brown -0.443 fema~
6 Owen Lars 0.105 0.134 brown, grey light blue -0.230 male
7 Beru Whit~ -0.269 -0.132 brown light blue -0.262 fema~
8 R5-D4 -2.22 -0.385 NA white, red red NA none
9 Biggs Dar~ 0.249 -0.0786 black light brown -0.411 male
10 Obi-Wan K~ 0.220 -0.120 auburn, wh~ fair blue-gray -0.198 male
# ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
# species <chr>, films <list>, vehicles <list>, starships <list>
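Likewise, the following two lines do the same thing:
starwars %>% mutate_if(is.numeric, scale2, na.rm = TRUE)
starwars %>% mutate(across(where(is.numeric), ~ scale2(.x, na.rm = TRUE)))
where() picks out the numeric columns and across() applies the function to exactly those columns, which reproduces mutate_if(). The extra argument is wrapped in a ~ lambda because passing it to across() through ... has been deprecated since dplyr 1.1.0.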
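The predicate does not have to be a built-in like is.numeric; any function that returns a single TRUE or FALSE per column works. As a sketch, the hypothetical helper big_numeric below selects numeric columns whose mean exceeds 3.5, and is used with both mutate_if() and across(where()).

```r
library(dplyr)

# Hypothetical predicate: numeric columns with a mean above 3.5.
# && short-circuits, so mean() is never called on the factor column.
big_numeric <- function(x) is.numeric(x) && mean(x) > 3.5

with_if     <- iris %>% mutate_if(big_numeric, round)
with_across <- iris %>% mutate(across(where(big_numeric), round))

head(with_if, 3)
all.equal(with_if, with_across)
```

In iris only Sepal.Length and Petal.Length pass this predicate, so the other columns come back untouched.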
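To apply several functions to the same columns at once, wrap them in list(). With more than one function, the results are written to new columns instead of overwriting the original columns as before.
For example: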
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> iris %>% mutate_if(is.numeric, list(scale2, log)) %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_fn1
1 5.1 3.5 1.4 0.2 setosa -0.8976739
2 4.9 3.0 1.4 0.2 setosa -1.1392005
3 4.7 3.2 1.3 0.2 setosa -1.3807271
4 4.6 3.1 1.5 0.2 setosa -1.5014904
5 5.0 3.6 1.4 0.2 setosa -1.0184372
6 5.4 3.9 1.7 0.4 setosa -0.5353840
Sepal.Width_fn1 Petal.Length_fn1 Petal.Width_fn1 Sepal.Length_fn2
1 1.01560199 -1.335752 -1.311052 1.629241
2 -0.13153881 -1.335752 -1.311052 1.589235
3 0.32731751 -1.392399 -1.311052 1.547563
4 0.09788935 -1.279104 -1.311052 1.526056
5 1.24503015 -1.335752 -1.311052 1.609438
6 1.93331463 -1.165809 -1.048667 1.686399
Sepal.Width_fn2 Petal.Length_fn2 Petal.Width_fn2
1 1.252763 0.3364722 -1.6094379
2 1.098612 0.3364722 -1.6094379
3 1.163151 0.2623643 -1.6094379
4 1.131402 0.4054651 -1.6094379
5 1.280934 0.3364722 -1.6094379
6 1.360977 0.5306283 -0.9162907
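The functions in the list can also be named. Compare the column names with the output above: each new column now carries the function's name as a suffix.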
> iris %>% mutate_if(is.numeric, list(scale = scale2, log = log)) %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_scale
1 5.1 3.5 1.4 0.2 setosa -0.8976739
2 4.9 3.0 1.4 0.2 setosa -1.1392005
3 4.7 3.2 1.3 0.2 setosa -1.3807271
4 4.6 3.1 1.5 0.2 setosa -1.5014904
5 5.0 3.6 1.4 0.2 setosa -1.0184372
6 5.4 3.9 1.7 0.4 setosa -0.5353840
Sepal.Width_scale Petal.Length_scale Petal.Width_scale Sepal.Length_log
1 1.01560199 -1.335752 -1.311052 1.629241
2 -0.13153881 -1.335752 -1.311052 1.589235
3 0.32731751 -1.392399 -1.311052 1.547563
4 0.09788935 -1.279104 -1.311052 1.526056
5 1.24503015 -1.335752 -1.311052 1.609438
6 1.93331463 -1.165809 -1.048667 1.686399
Sepal.Width_log Petal.Length_log Petal.Width_log
1 1.252763 0.3364722 -1.6094379
2 1.098612 0.3364722 -1.6094379
3 1.163151 0.2623643 -1.6094379
4 1.131402 0.4054651 -1.6094379
5 1.280934 0.3364722 -1.6094379
6 1.360977 0.5306283 -0.9162907
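The same naming idea carries over to across(), which additionally offers a .names argument for controlling the pattern. This is a sketch assuming dplyr 1.0.0 or later; the toy tibble df and the pattern "{.fn}_of_{.col}" are made up for illustration.

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3), y = c(10, 100, 1000))

# Default naming: column name, underscore, function name
suffixed <- df %>%
  mutate(across(everything(), list(log10 = log10, sq = ~ .x^2)))

# Custom naming through the .names glue pattern
prefixed <- df %>%
  mutate(across(everything(), list(log10 = log10, sq = ~ .x^2),
                .names = "{.fn}_of_{.col}"))

names(suffixed)
names(prefixed)
```

The order of the generated columns can differ from what mutate_if() produces, so it is safer to refer to them by name than by position.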
mutate_all
The reference page gives few examples for mutate_all(), but according to its description it operates on every variable.
> a = matrix(rep(1:5,each =10),10) %>% as.data.frame()
> a
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3 4 5
4 1 2 3 4 5
5 1 2 3 4 5
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
# ~ sum(.) gives the same result as the deprecated funs(sum(.))
> mutate_all(a, ~ sum(.))
V1 V2 V3 V4 V5
1 10 20 30 40 50
2 10 20 30 40 50
3 10 20 30 40 50
4 10 20 30 40 50
5 10 20 30 40 50
6 10 20 30 40 50
7 10 20 30 40 50
8 10 20 30 40 50
9 10 20 30 40 50
10 10 20 30 40 50
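A slightly more practical use of mutate_all() is turning every column into shares of its own total. The data frame counts below is invented for illustration, and the across(everything()) form is shown next to it as the newer spelling of the same operation.

```r
library(dplyr)

counts <- data.frame(a = c(1, 3), b = c(2, 2), c = c(5, 15))

# Divide every value by its column total
shares <- counts %>% mutate_all(~ . / sum(.))

# Same operation written with across()
shares_across <- counts %>% mutate(across(everything(), ~ .x / sum(.x)))

shares
colSums(shares)
```

Because sum(.) is computed per column, each column of the result adds up to 1.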
One more note:
The .funs argument accepts a function you define yourself, as in the examples above; a list() of several functions; or a ~ fun(.) lambda.
starwars %>% mutate_at(c("height", "mass"), ~scale2(., na.rm = TRUE))
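The different ways of supplying a function can be compared side by side. The sketch below uses a toy tibble (df, invented for illustration) and sqrt so that the results are easy to check by eye.

```r
library(dplyr)

df <- tibble(a = c(1, 4, 9), b = c(16, 25, 36))

by_name   <- df %>% mutate_all(sqrt)                  # bare function
by_anon   <- df %>% mutate_all(function(x) sqrt(x))   # anonymous function
by_lambda <- df %>% mutate_all(~ sqrt(.))             # ~ lambda

# A named list mixes the styles and creates suffixed columns
by_list <- df %>% mutate_all(list(root = sqrt, double = ~ . * 2))

by_lambda
names(by_list)
```

The first three calls overwrite a and b with identical values; only the list form keeps the originals and adds new columns.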
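Summary
Unlike mutate(), which is mainly used to add new variables, its variants apply a function column by column. To work row by row or group by group instead, combine them with rowwise() or group_by().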
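As a rough illustration of the last point, the sketch below centers each numeric column within species using group_by(), and computes a per-row total using rowwise() with c_across(). It assumes dplyr 1.0.0 or later; the column name total is made up.

```r
library(dplyr)

# Group-wise: subtract each species' own mean from every numeric column
centered <- iris %>%
  group_by(Species) %>%
  mutate(across(where(is.numeric), ~ .x - mean(.x))) %>%
  ungroup()

# Row-wise: add up the four measurements of each flower
row_totals <- iris %>%
  rowwise() %>%
  mutate(total = sum(c_across(where(is.numeric)))) %>%
  ungroup()

head(centered, 3)
head(row_totals, 3)
```

Calling ungroup() at the end returns an ordinary data frame, so later steps are not silently evaluated per group or per row.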