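Statistical distributions and descriptive statistics
Probability functions are typically used to generate simulated data with known characteristics and to compute probability values inside user-written statistical functions. Each function name begins with one of four prefixes:
- d = density function: the probability density of the random variable at a given value under the distribution
- p = distribution function: the probability that the random variable takes a value less than or equal to a given point (the integral of the density up to that point)
- q = quantile function: the value at a specified quantile of the distribution
- r = random generation (random deviates): a specified number of random values drawn from the distribution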
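The prefixes are tied together: p is the integral of d, and q inverts p. A minimal sketch for the standard normal distribution, using only base R (the variable names are illustrative, and the tolerance is a loose assumption about the accuracy of numerical integration):

```r
# p is the integral of d: area under the density to the left of 1.96
area <- integrate(dnorm, lower = -Inf, upper = 1.96)$value
cdf  <- pnorm(1.96)

# q inverts p: feeding the probability back returns the original point
z <- qnorm(cdf)

print(all.equal(area, cdf, tolerance = 1e-4))
print(all.equal(z, 1.96))
```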
# Evaluate the standard normal density on a grid of points from -3 to 3
> x <- pretty(c(-3,3), 30)
> dnorm(x)
[1] 0.004431848 0.007915452 0.013582969 0.022394530 0.035474593 0.053990967 0.078950158
[8] 0.110920835 0.149727466 0.194186055 0.241970725 0.289691553 0.333224603 0.368270140
[15] 0.391042694 0.398942280 0.391042694 0.368270140 0.333224603 0.289691553 0.241970725
[22] 0.194186055 0.149727466 0.110920835 0.078950158 0.053990967 0.035474593 0.022394530
[29] 0.013582969 0.007915452 0.004431848
# Area under the standard normal curve to the left of z = 1.96
> pnorm(1.96)
[1] 0.9750021
# 0.9 quantile of a normal distribution with mean 500 and standard deviation 100
> qnorm(.9, mean=500, sd=100)
[1] 628.1552
# Generate 50 normal random numbers with mean 50 and standard deviation 10
> rnorm(50, mean=50, sd=10)
[1] 43.19326 67.19558 53.12459 49.13804 49.49784 50.18141 60.02579 38.84806 47.05036 33.10094
[11] 53.04860 61.74979 38.86004 53.86042 73.50134 58.71853 58.41579 47.24429 37.81980 55.60278
[21] 44.88429 46.91278 52.49478 41.42598 38.77225 56.51503 56.60067 67.63822 49.80543 53.74116
[31] 60.52062 61.17447 29.86609 58.37057 66.00415 33.32716 72.76983 47.78504 68.31046 62.89388
[41] 45.45358 67.27605 65.82000 48.16882 62.26423 51.98385 45.67497 48.98096 41.50788 70.69757
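Because rnorm() draws new values on every call, the numbers above will differ from run to run. Fixing the seed with set.seed() makes a draw reproducible. The seed value 1234 below is an arbitrary choice for illustration, not something taken from the original example.

```r
set.seed(1234)                      # arbitrary seed, chosen for illustration
a <- rnorm(5, mean = 50, sd = 10)

set.seed(1234)                      # resetting the seed replays the same sequence
b <- rnorm(5, mean = 50, sd = 10)

print(identical(a, b))
```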
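R was developed primarily for statistical analysis, and many of its built-in functions make such analysis straightforward.
Descriptive statistics
R offers an embarrassment of riches for computing descriptive statistics. Let's start with the functions included in the base installation. For numeric variables, the summary() function returns the minimum, first quartile (25th percentile), median, mean, third quartile (75th percentile), and maximum. Alternatively, the describe() function in the Hmisc package also produces descriptive statistics.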
> myvars <- c("mpg", "hp", "wt")
> summary(mtcars[myvars])
      mpg              hp              wt
 Min.   :10.40   Min.   : 52.0   Min.   :1.513
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581
 Median :19.20   Median :123.0   Median :3.325
 Mean   :20.09   Mean   :146.7   Mean   :3.217
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610
 Max.   :33.90   Max.   :335.0   Max.   :5.424
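summary() reports a fixed set of statistics. When other statistics are needed, sapply() can apply a user-written function to each column. A minimal sketch, where basic_stats is a hypothetical helper name and the choice of statistics is only an example:

```r
# Hypothetical helper: sample size, mean, and standard deviation of a numeric vector
basic_stats <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]      # drop missing values before computing
  c(n = length(x), mean = mean(x), sd = sd(x))
}

myvars <- c("mpg", "hp", "wt")
stats <- sapply(mtcars[myvars], basic_stats)  # one column of results per variable
print(stats)
```

The result is a matrix with one row per statistic and one column per variable.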
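Alternatively, use aggregate() to obtain descriptive statistics by group:
# aggregate() takes three arguments: the data, the grouping variables (passed as a list), and the function to apply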
> aggregate(mtcars[myvars], by=list(am=mtcars$am), mean)
  am      mpg       hp       wt
1  0 17.14737 160.2632 3.768895
2  1 24.39231 126.8462 2.411000
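aggregate() also has a formula interface, which some readers may find easier to scan: the variables to summarize go on the left of ~ and the grouping variable on the right. A sketch of the same grouped means, assuming the built-in mtcars data (by_am is an illustrative name):

```r
# Formula interface: summarized variables on the left, grouping variable on the right
by_am <- aggregate(cbind(mpg, hp, wt) ~ am, data = mtcars, FUN = mean)
print(by_am)
```

With this form the grouping column is named after the variable automatically, so no named list is required.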
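To apply a more complex function to each group, for example one that returns several statistics at once, use the by() function.
# Three arguments: the data, the grouping factor, and the function to apply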
> by(mtcars[myvars], mtcars$am, summary)
mtcars$am: 0
      mpg              hp              wt
 Min.   :10.40   Min.   : 62.0   Min.   :2.465
 1st Qu.:14.95   1st Qu.:116.5   1st Qu.:3.438
 Median :17.30   Median :175.0   Median :3.520
 Mean   :17.15   Mean   :160.3   Mean   :3.769
 3rd Qu.:19.20   3rd Qu.:192.5   3rd Qu.:3.842
 Max.   :24.40   Max.   :245.0   Max.   :5.424
------------------------------------------------------------------------
mtcars$am: 1
      mpg              hp              wt
 Min.   :15.00   Min.   : 52.0   Min.   :1.513
 1st Qu.:21.00   1st Qu.: 66.0   1st Qu.:1.935
 Median :22.80   Median :109.0   Median :2.320
 Mean   :24.39   Mean   :126.8   Mean   :2.411
 3rd Qu.:30.40   3rd Qu.:113.0   3rd Qu.:2.780
 Max.   :33.90   Max.   :335.0   Max.   :3.570
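by() is not limited to built-in functions such as summary(); it accepts any function that takes a data frame. A sketch with a hypothetical helper, mean_sd, that returns the mean and standard deviation of every column within each group:

```r
myvars <- c("mpg", "hp", "wt")

# Hypothetical helper: mean and standard deviation for every column of a data frame
mean_sd <- function(d) sapply(d, function(x) c(mean = mean(x), sd = sd(x)))

res <- by(mtcars[myvars], mtcars$am, mean_sd)
print(res)
```

Each element of the result is a small matrix for one level of am, with statistics in rows and variables in columns.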
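Frequency tables and contingency tables
Frequency tables and contingency tables of categorical variables are central to statistical analysis, and R's built-in functions make them easy to produce. The examples below use the Arthritis data set from the vcd package.
The table() function generates a simple frequency table.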
> library(vcd)
Loading required package: grid
> with(Arthritis, table(Improved))
Improved
  None   Some Marked
    42     14     28
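By default table() leaves missing values out of the counts. A sketch with a small made-up vector (not the Arthritis data) showing how the useNA argument gives NA its own cell:

```r
# Made-up responses for illustration; NA marks a missing answer
resp <- factor(c("None", "Some", NA, "Marked", "None", NA),
               levels = c("None", "Some", "Marked"))

counts_default <- table(resp)                   # NA values are excluded
counts_with_na <- table(resp, useNA = "ifany")  # NA is counted in its own cell

print(counts_default)
print(counts_with_na)
```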
The prop.table() function converts these frequencies to proportions.
# Multiply by 100 to get percentages
> prop.table(with(Arthritis, table(Improved))) * 100
Improved
    None     Some   Marked
50.00000 16.66667 33.33333
The xtabs() function builds a contingency table from a formula.
> xtabs(~ Treatment+Improved, data=Arthritis)
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
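# xtabs() extends naturally to more than two dimensions; the order of the variables in the formula sets the order of the dimensions (~ dim1 + dim2 + dim3 ...)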
> xtabs(~ Treatment+Sex+Improved, data=Arthritis)
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5
The margin.table() function produces marginal frequency tables.
# 2 selects the second dimension (the columns, Improved)
> margin.table(xtabs(~ Treatment+Improved, data=Arthritis), 2)
Improved
  None   Some Marked
    42     14     28
# 1 selects the first dimension (the rows, Treatment)
> margin.table(xtabs(~ Treatment+Improved, data=Arthritis), 1)
Treatment
Placebo Treated
     43      41
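For a two-way table, the marginal totals are simply the row and column sums. A sketch that checks this on the built-in mtcars data, used here only so that the example needs no extra package (the variable names are illustrative):

```r
tab <- xtabs(~ am + cyl, data = mtcars)  # two-way table of transmission by cylinders

row_totals <- margin.table(tab, 1)       # sum over columns: one total per am level
col_totals <- margin.table(tab, 2)       # sum over rows: one total per cyl level

print(all(as.vector(row_totals) == rowSums(tab)))
print(all(as.vector(col_totals) == colSums(tab)))
```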
The addmargins() function adds the marginal sums directly to a contingency table.
> addmargins(xtabs(~ Treatment+Improved, data=Arthritis))
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84
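Both prop.table() and addmargins() accept a margin argument, so they can be combined to show row percentages with a single Sum column. A sketch on the built-in mtcars data, again chosen only to keep the example free of extra packages:

```r
tab <- xtabs(~ am + cyl, data = mtcars)

row_prop <- prop.table(tab, 1)             # proportions within each row (rows sum to 1)
row_pct  <- addmargins(row_prop, 2) * 100  # add a Sum column only, then convert to percent

print(round(row_pct, 1))
```

Passing 2 to addmargins() sums across the columns, which appends a Sum column and leaves out the Sum row that would otherwise total the row percentages.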
The ftable() function prints multidimensional contingency tables in a compact and attractive form.
> mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)
> ftable(mytable)
                 Improved None Some Marked
Treatment Sex
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5
# Percentage of patients with None, Some, and Marked improvement within each Treatment × Sex combination
> ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100
                 Improved       None       Some     Marked        Sum
Treatment Sex
Placebo   Female           59.375000  21.875000  18.750000 100.000000
          Male             90.909091   0.000000   9.090909 100.000000
Treated   Female           22.222222  18.518519  59.259259 100.000000
          Male             50.000000  14.285714  35.714286 100.000000
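ftable() also lets you choose which variables form the rows and which form the columns through its row.vars and col.vars arguments. A sketch using the built-in Titanic table rather than the Arthritis data, so that it runs without additional packages (flat is an illustrative name):

```r
# Titanic is a built-in 4-way table: Class x Sex x Age x Survived
flat <- ftable(Titanic, row.vars = c("Class", "Sex"), col.vars = "Survived")
print(flat)
```

Variables not named in either argument (here Age) are summed over, so the flat table has one row per Class and Sex combination.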
Excerpted from R in Action (Second Edition).