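A Chance Encounter
By chance in early May, my overseas supervisor learned that I was teaching myself R and recommended a few courses to me. One of them, a Data workshop announced by a unit at the university, caught my eye. Having learned that it was a free course with limited places, open to PhD students, I signed up anyway, thinking it was worth a try. After the registration deadline I also emailed the organisers and was told I would get a reply within the following week. At the end of May I received formal confirmation for two courses, and learned that many other applicants were still on the waiting list. I felt lucky to have such a hard-won opportunity to learn and resolved to make the most of it.
Introduction to Data Carpentry
Data Carpentry is very popular abroad. It is something like a training event that volunteers organise regularly for students and researchers. This particular event was a data analysis course run by several KCL staff for wet-lab PhD students. This first course was aimed at students with no prior R experience but an urgent need for data analysis. It was taught as a small class of 40, with two instructors taking turns to explain and demonstrate live. Breakfast and lunch were provided free of charge, along with course materials and refreshments. Most participants were life sciences students, and the venue was a Seminar Room at the well-known hospital on the Guy's Campus. The course ran over two days. Day one started from the familiar Excel, covered the problems and pitfalls of entering and processing research data in Excel, and then introduced R with a short primer. Day two started with some common data frame operations and then introduced two commonly used packages: tidyverse and ggplot2.
Impressions of the Course
The course was compact and rich in content, and it really is an excellent introduction to R for beginners. Because it was taught entirely in English and mostly demonstrated on a Mac, I still found the second day hard going despite having some self-taught background. After three months of living in the UK I could follow the instructors' explanations, but from the back row I could not see the code very clearly, and I was not used to typing code or using keyboard shortcuts, so the later parts were difficult and I need to review promptly after class. Among the 40 students I came across two who seemed to be Chinese, but since both spoke fluent English I felt too shy to address them in Chinese and could not be sure. The other students were very active in class and quick to respond. A few men sitting near me worked on their own data analysis and slides while listening, and I admired their efficiency. A woman sitting next to me kept up with the instructors completely and helped me a lot. We also chatted about research life during the breaks, which made me feel I should put in more effort.
Class Notes: Day 1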
1. Data organization in spreadsheets
1.1 Don'ts in spreadsheets
DON'T:
- modify your raw data. Always make a duplicate copy and work on that.
- combine multiple values in one cell (units, numbers, etc.)
- make calculations in the sheet. When you export it, your formulae are not exported.
- colour-code things. The computer does not care.
DO:
- export as a text-based file (csv or txt) so that R can read it.
1.2 Names for columns:
- do not use spaces
- use UpperCaseLikeThis
- use Underscore_to_separate_words
1.3 Dealing with missing values:
- Do not use 0, because 0 is data sometimes
- NA is the best
- blank cells also work
- Do not combine columns
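As a rough illustration of why NA or a blank cell is safer than 0, the sketch below reads a small CSV with base R's read.csv. The column names and values are made up for this example; they are not from the course data.

```r
# A tiny, made-up CSV: one weight is blank, one is written as NA,
# and one is a genuine measurement of 0.
csv_text <- "sample_id,weight_g
s1,12.5
s2,
s3,NA
s4,0"

# na.strings states explicitly which entries count as missing.
samples <- read.csv(text = csv_text, na.strings = c("NA", ""))

is.na(samples$weight_g)                # TRUE for the blank and the NA rows
mean(samples$weight_g, na.rm = TRUE)   # the real 0 still counts as data
```

If the missing values had been typed as 0, R would have no way to tell them apart from the real zero in the last row.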
1.4 Dealing with Dates
- Use built-in functions
- create new columns to split the date into year, month and day
- use the formula =YEAR(cell), clicking on the cell you want to split
- double click on the right bottom of the cell where the little cross appears. It will apply the formula all the way down
- reconstructing the date: =DATE(cell1; cell2; cell3) and this reconstructs the date based on the year, month and day
- string format: a succession of numbers
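The notes above describe the spreadsheet workflow. The same split-and-rebuild idea can be sketched in R, assuming the date is already stored as a Date object; the date itself is made up.

```r
# A made-up date, stored as a proper Date object rather than as text.
sampling_date <- as.Date("2019-05-28")

# Split it into year, month and day (the spreadsheet YEAR step and its siblings).
year  <- as.integer(format(sampling_date, "%Y"))
month <- as.integer(format(sampling_date, "%m"))
day   <- as.integer(format(sampling_date, "%d"))

# Rebuild the date from its three parts (the spreadsheet DATE step).
rebuilt <- as.Date(sprintf("%04d-%02d-%02d", year, month, day))
rebuilt == sampling_date
```

Keeping year, month and day in separate columns avoids depending on how a spreadsheet chooses to display or reinterpret a date.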
2. Introduction to R
2.1 R and RStudio
- R allows you to handle large datasets. It has lots of 'packages'.
- RStudio is a program that lets you work with R in a more user-friendly environment.
- Every 'window' gives you information. The upper left corner is where you can write your script: you write your instructions like a lab protocol. The Console is the window in which you can execute your commands. The upper right is the Environment. Bottom right includes files, plots, packages and help.
- Pipeline: one script after the other that takes you through all the actions that you need to do to deal with your data.
2.2 Advantages of R
- it is free
- it has thousands of functions built in, so that you don't have to write them yourself!
- it is much more user-friendly than other programs
- there is a large community to ask questions (and you will get an answer!)
2.3 Tips to start using R
- Be very organised. Make sub-folders that organise your project (data, outputs, figures, scripts).
- A path shows you the way: this is a series of folders and subfolders to show you where your documents are.
- Start a New Project: always do this whenever you are starting something new.
- Start a new R Script: this is where you will type all of your commands, i.e. your script.
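The folder layout suggested above can be created from R itself. This is only a sketch: the project name and file name are hypothetical, and a temporary directory is used so that nothing real is touched.

```r
# Use a temporary directory so the example does not touch real files.
project_dir <- file.path(tempdir(), "my_project")   # hypothetical project name

# One sub-folder per kind of file, as suggested in the notes.
sub_folders <- c("data", "outputs", "figures", "scripts")
for (folder in sub_folders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# A path is just the chain of folders that leads to a file.
file.path(project_dir, "data", "raw_data.csv")   # hypothetical file name
list.files(project_dir)
```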
2.4 You can change the appearance of RStudio
- Windows: Tools --> Global Options
- Mac: in the 'RStudio' menu choose 'Preferences'
2.5 Object
- <- is the assign operator. This is how we assign a value to an object.
- Shortcut: Windows/Linux: "Alt" + "-" | Mac: "Option" + "-".
- Object names: with underscores, meaningful, and do not start with a number.
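A minimal sketch of assignment, with made-up object names and values:

```r
weight_kg <- 55                # assign the value 55 to the object weight_kg
weight_lb <- 2.2 * weight_kg   # use the object in a calculation

weight_kg <- 100               # re-assigning weight_kg ...
weight_lb                      # ... does not update weight_lb, which keeps its old value
```

An object holds the value it was given at the time of assignment. It is not linked to the objects used to compute it, unlike a spreadsheet formula.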
2.6 Useful Commands
- getwd(): shows the working directory, i.e. where you are on your computer
- setwd(): sets the working directory
- ls(): lists the objects that are in your 'workspace'
- rm(): removes an object. THERE IS NO WAY OF RECOVERING IT!!
- sqrt(): square root
- round(): rounds a number to whatever number of decimals you want/need
- length(): tells you the number of values in a vector
- class(): tells you the type of object that you are dealing with
- str(): tells you the structure of the object
- ?function_name: gives you information about that function
- print(): prints the value on the screen
- ( ): wrapping an assignment in parentheses also prints the value on the screen
- #: starts a comment, which lets you annotate your script
- mean(): calculates the mean of a vector of numbers
- args(function): tells you the arguments of a function
- c(): combines values into one vector
- [ ]: subsets elements from vectors. Indexing starts at 1.
- !: means 'not' (the logical opposite)
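A short session that tries out several of these commands. The values are made up, and getwd()/setwd() are left out because their results depend on the machine.

```r
weights <- c(50, 60.5, 65, 82)   # c() combines made-up values into a vector

length(weights)              # number of values in the vector
class(weights)               # type of the object
str(weights)                 # compact view of its structure
mean(weights)                # mean of the values
sqrt(16)                     # square root
round(3.14159, digits = 2)   # round to 2 decimal places
args(round)                  # show the arguments that round() accepts

(first_two <- weights[1:2])  # parentheses print the result of the assignment

ls()                         # list the objects in the workspace
rm(first_two)                # remove an object; this cannot be undone
exists("first_two")          # check whether the object is still there
```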
2.7 Type of Data
- character - text
- numeric - numbers
- integer - numbers without decimals
- double - numbers with decimals
- logical - TRUE or FALSE
- In R there is a hierarchy for converting between these types of data when they are mixed in one vector: logical → numeric → character ← logical
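The types and the conversion hierarchy can be checked directly in the Console. A small sketch with made-up values:

```r
class("wet-lab")   # text
class(3.5)         # a number
typeof(3.5)        # numbers with decimals are stored as double
typeof(2L)         # the L suffix asks for an integer
class(TRUE)        # logical

# Mixing types in one vector converts everything to the most general type.
num_logical  <- c(1, 2, TRUE)   # TRUE becomes the number 1
num_char     <- c(1, 2, "a")    # the numbers become text
char_logical <- c("a", TRUE)    # TRUE becomes the text "TRUE"

class(num_logical)
class(num_char)
class(char_logical)
```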
2.8 Vectors
- This is another type of object in R.
- This is just a series of values that you put together in an object using c().
2.9 Functions
- A function is a command that executes some action on your input.
- A function has 'arguments': the things you pass to the function so that it is executed with your particular parameters.
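As a small sketch of how arguments work, round() can be called with its default, with a named argument, or with the argument matched by position:

```r
round(3.14159)               # digits is left at its default of 0
round(3.14159, digits = 2)   # argument given by name
round(3.14159, 2)            # same call, argument matched by position

# Named arguments can be given in any order.
round(digits = 2, x = 3.14159)
```

Naming arguments makes a script easier to read later, which is useful when the script serves as a record of the analysis.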
2.10 Subsetting vectors
- To extract values from a vector, use [ ].
Conditional subsetting
- AND: &
- OR: |
- Equal to: ==
- Greater than or equal to: >=
- Less than or equal to: <=
- Greater than: >
- Less than: <
- %in%: tests whether each element is found in a set of values
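The operators above can be combined inside [ ]. A minimal sketch with made-up vectors:

```r
animals  <- c("mouse", "rat", "dog", "cat")   # made-up example data
weight_g <- c(21, 34, 39, 54, 55)

animals[2]          # second element (indexing starts at 1)
animals[c(3, 2)]    # several elements, in the order requested

weight_g > 50                               # one TRUE/FALSE per element
weight_g[weight_g > 50]                     # keep values where the test is TRUE
weight_g[weight_g < 30 | weight_g > 50]     # OR: either condition
weight_g[weight_g >= 30 & weight_g <= 50]   # AND: both conditions

animals %in% c("rat", "cat", "duck")            # is each element in the set?
animals[animals %in% c("rat", "cat", "duck")]   # keep only those that are
```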
2.11 Missing Data
- Missing data are represented as NA in a vector.
- na.rm = TRUE: ignore the missing data in a calculation
- x[!is.na(x)]: extracts the elements of x that are not missing
- na.omit(): returns the object with incomplete cases removed
- x[complete.cases(x)]: returns only the complete cases
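A short sketch of these options on a made-up vector with one missing value:

```r
heights <- c(2, 4, 4, NA, 6)   # made-up values with one missing entry

mean(heights)                  # the missing value propagates to the result
mean(heights, na.rm = TRUE)    # ignore the missing value

heights[!is.na(heights)]           # elements that are not missing
na.omit(heights)                   # same values, with a record of what was dropped
heights[complete.cases(heights)]   # complete cases only
```

For a single vector the three approaches keep the same values. complete.cases() becomes more useful on a data frame, where it flags the rows that have no missing value in any column.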