1 The tidyverse system
https://www.math.pku.edu.cn/teachers/lidf/docs/Rbook/html/_Rbook/summary-manip.html#summm-tidyv
(Full version)
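Loading the tidyverse package automatically attaches readr, dplyr, tidyr and the other core tidyverse packages. The magrittr package itself is not attached, but its pipe operator %>% is available because dplyr and tidyr re-export it: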
library(tidyverse)
The examples below use the following data on one class of students, saved in the file class.csv:
name,sex,age,height,weight
Alice,F,13,56.5,84
Becka,F,13,65.3,98
Gail,F,14,64.3,90
Karen,F,12,56.3,77
Kathy,F,12,59.8,84.5
Mary,F,15,66.5,112
Sandy,F,11,51.3,50.5
Sharon,F,15,62.5,112.5
Tammy,F,14,62.8,102.5
Alfred,M,14,69,112.5
Duke,M,14,63.5,102.5
Guido,M,15,67,133
James,M,12,57.3,83
Jeffrey,M,13,62.5,84
John,M,12,59,99.5
Philip,M,16,72,150
Robert,M,12,64.8,128
Thomas,M,11,57.5,85
William,M,15,66.5,112
Read it in as a tibble:
d.class <- read_csv(
  "class.csv",
  col_types = cols(
    .default = col_double(),
    name = col_character(),
    sex = col_factor(levels = c("M", "F"))
  ))
This data frame has 19 observations and the following 5 variables:
- name
- sex
- age
- height
- weight
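As a quick check of the import step, the sketch below reads a three-row excerpt of the data from an inline string, so it runs without class.csv, and inspects the resulting column types. It assumes readr 2.0 or later for the `I()` literal-data wrapper; the names `csv_text` and `d.small` are introduced here for illustration only.

```r
library(readr)

# A three-row excerpt of class.csv, supplied as literal text via I()
# (assumes readr >= 2.0; csv_text and d.small are hypothetical names)
csv_text <- "name,sex,age,height,weight
Alice,F,13,56.5,84
Becka,F,13,65.3,98
Alfred,M,14,69,112.5
"

d.small <- read_csv(
  I(csv_text),
  col_types = cols(
    .default = col_double(),
    name = col_character(),
    sex = col_factor(levels = c("M", "F"))
  )
)

dim(d.small)            # number of rows and columns
sapply(d.small, class)  # class of each column
levels(d.small$sex)     # factor levels, in the order given in col_factor()
```

Declaring the levels explicitly fixes their order as "M", "F" regardless of which value appears first in the file.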
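The NHANES extension package for R provides a larger example data frame, NHANES, which can be regarded as a random sample of the US population excluding hospitalized patients. It has 10000 observations and 76 variables, covering personal health and nutrition. It is intended for teaching only and is not suitable as data for rigorous research. For details of the original data see http://www.cdc.gov/nchs/nhanes.htm. Load the NHANES data frame: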
library(NHANES)
data(NHANES)
print(dim(NHANES))
## [1] 10000 76
print(names(NHANES))
## [1] "ID" "SurveyYr" "Gender" "Age"
## [5] "AgeDecade" "AgeMonths" "Race1" "Race3"
## [9] "Education" "MaritalStatus" "HHIncome" "HHIncomeMid"
## [13] "Poverty" "HomeRooms" "HomeOwn" "Work"
## [17] "Weight" "Length" "HeadCirc" "Height"
## [21] "BMI" "BMICatUnder20yrs" "BMI_WHO" "Pulse"
## [25] "BPSysAve" "BPDiaAve" "BPSys1" "BPDia1"
## [29] "BPSys2" "BPDia2" "BPSys3" "BPDia3"
## [33] "Testosterone" "DirectChol" "TotChol" "UrineVol1"
## [37] "UrineFlow1" "UrineVol2" "UrineFlow2" "Diabetes"
## [41] "DiabetesAge" "HealthGen" "DaysPhysHlthBad" "DaysMentHlthBad"
## [45] "LittleInterest" "Depressed" "nPregnancies" "nBabies"
## [49] "Age1stBaby" "SleepHrsNight" "SleepTrouble" "PhysActive"
## [53] "PhysActiveDays" "TVHrsDay" "CompHrsDay" "TVHrsDayChild"
## [57] "CompHrsDayChild" "Alcohol12PlusYr" "AlcoholDay" "AlcoholYear"
## [61] "SmokeNow" "Smoke100" "Smoke100n" "SmokeAge"
## [65] "Marijuana" "AgeFirstMarij" "RegularMarij" "AgeRegMarij"
## [69] "HardDrugs" "SexEver" "SexAge" "SexNumPartnLife"
## [73] "SexNumPartYear" "SameSex" "SexOrientation" "PregnantNow"
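The variable ID is the subject identifier and SurveyYr is the survey year; the same subject may have data in more than one survey year. The variables include demographic data such as gender, age, race and income; basic physical examination data such as weight, height, pulse and blood pressure; more detailed health data such as diabetes status, depression, pregnancy and number of children borne; and behavioural data such as exercise habits, alcohol use and sexual activity. This teaching data set was first used by Michelle Dalrymple of Cashmere High School and Chris Wild of the University of Auckland, New Zealand.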
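2 Selecting row subsets with filter()
A row subset can be selected by row index, for example d.class[8:12,]. The function head() returns the first several rows of a data frame, and tail() returns the last several rows.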
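The index-based selections described above can be sketched on a small stand-in table. The first six rows of the class data are typed in directly so the example is self-contained; `d.demo` is a hypothetical name used only here, and `slice()` is shown as the dplyr counterpart of row indexing.

```r
library(tibble)
library(dplyr)

# First six rows of the class data, entered by hand for a self-contained sketch
d.demo <- tribble(
  ~name,   ~sex, ~age, ~height, ~weight,
  "Alice", "F",  13,   56.5,    84,
  "Becka", "F",  13,   65.3,    98,
  "Gail",  "F",  14,   64.3,    90,
  "Karen", "F",  12,   56.3,    77,
  "Kathy", "F",  12,   59.8,    84.5,
  "Mary",  "F",  15,   66.5,    112
)

rows_2_to_4 <- d.demo[2:4, ]        # rows 2 to 4 by row index
first_two   <- head(d.demo, 2)      # first two rows
last_two    <- tail(d.demo, 2)      # last two rows
sliced      <- slice(d.demo, 2:4)   # dplyr equivalent of d.demo[2:4, ]

rows_2_to_4$name   # "Becka" "Gail" "Karen"
```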
From d.class, select the girls aged 13 or younger:
d.class %>%
  filter(sex == "F", age <= 13) %>%
  knitr::kable()
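To make explicit how filter() combines its conditions, the following sketch applies the same selection in three ways to a five-row excerpt of the class data. The table `d.mini` and the result names are hypothetical and exist only for this illustration.

```r
library(tibble)
library(dplyr)

# Five rows taken from the class data (name, sex and age only)
d.mini <- tibble(
  name = c("Alice", "Gail", "Karen", "James", "Thomas"),
  sex  = factor(c("F", "F", "F", "M", "M"), levels = c("M", "F")),
  age  = c(13, 14, 12, 12, 11)
)

# Conditions separated by commas are combined with logical AND
res_comma <- d.mini %>% filter(sex == "F", age <= 13)

# The same selection written with an explicit & operator
res_and <- d.mini %>% filter(sex == "F" & age <= 13)

# Base R equivalent using a logical row index
res_base <- d.mini[d.mini$sex == "F" & d.mini$age <= 13, ]

res_comma$name   # "Alice" "Karen"
```

One difference worth keeping in mind: filter() drops rows for which the condition evaluates to NA, whereas base R logical indexing returns a row of NA values for them.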