第一次用scanpy分析单细胞数据

In May 2017, this started out as a demonstration that Scanpy would allow to reproduce most of Seurat’s guided clustering tutorial (Satija et al., 2015).

We gratefully acknowledge Seurat’s authors for the tutorial! In the meanwhile, we have added and removed a few pieces.

The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics (here from this webpage). On a unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.

[1]: mkdir data
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
mkdir write
cd回到家目录</pre>

Note

Download the notebook by clicking on the Edit on GitHub button. On GitHub, you can download using the Raw button via right-click and Save Link As. Alternatively, download the whole scanpy-tutorial repository.

Note

In Jupyter notebooks and lab, you can see the documentation for a python function by hitting SHIFT + TAB. Hit it twice to expand the view.

[2]:
import numpy as np
import pandas as pd
import scanpy as sc
[3]:
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')</pre>

scanpy==1.6.0 anndata==0.7.5.dev7+gefffdfb umap==0.4.2 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.1 statsmodels==0.11.0 python-igraph==0.7.1 leidenalg==0.7.0
[4]:
results_file = 'write/pbmc3k.h5ad' # the file that will store the analysis results</pre>
Read in the count matrix into an ``AnnData[<https://anndata.readthedocs.io/en/latest/anndata.AnnData.html](%3Chttps://anndata.readthedocs.io/en/latest/anndata.AnnData.html)>__ object, which holds many slots for annotations and different representations of the data. It also comes with its own HDF5 file format:.h5ad.

[5]:
adata = sc.read_10x_mtx(
'data/filtered_gene_bc_matrices/hg19/', # the directory with the .mtx file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True) # write a cache file for faster subsequent reading</pre>

... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad
[6]:
adata.var_names_make_unique() # this is unnecessary if using var_names='gene_ids' in sc.read_10x_mtx
[7]:
adata
[7]:
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'</pre>

Preprocessing

Show those genes that yield the highest fraction of counts in each single cell, across all cells.

[8]:
sc.pl.highest_expr_genes(adata, n_top=20, )

normalizing counts per cell
finished (0:00:00)</pre>

Louvain Group	Markers	Cell Type
0	IL7R	CD4 T cells
1	CD14, LYZ	CD14+ Monocytes
2	MS4A1	B cells
3	CD8A	CD8 T cells
4	GNLY, NKG7	NK cells
5	FCGR3A, MS4A7	FCGR3A+ Monocytes
6	FCER1A, CST3	Dendritic Cells
7	PPBP	Megakaryocytes

	0	1	2	3	4	5	6	7
0	RPS12	LYZ	CD74	CCL5	NKG7	LST1	HLA-DPA1	PF4
1	LDHB	S100A9	CD79A	NKG7	GZMB	FCER1G	HLA-DPB1	SDPR
2	RPS25	S100A8	HLA-DRA	B2M	GNLY	AIF1	HLA-DRA	GNG11
3	RPS27	TYROBP	CD79B	CST7	CTSW	COTL1	HLA-DRB1	PPBP
4	RPS6	FTL	HLA-DPB1	GZMA	PRF1	FCGR3A	CD74	NRGN

	0_n	0_p	1_n	1_p	2_n	2_p	3_n	3_p	4_n	4_p	5_n	5_p	6_n	6_p	7_n	7_p
0	RPS12	3.642456e-222	LYZ	1.007060e-252	CD74	3.043536e-182	CCL5	3.896273e-119	NKG7	4.689070e-95	LST1	5.650219e-85	HLA-DPA1	5.422417e-21	PF4	4.722886e-10
1	LDHB	3.242464e-216	S100A9	3.664292e-248	CD79A	6.860832e-170	NKG7	1.170992e-97	GZMB	2.381363e-89	FCER1G	1.697236e-81	HLA-DPB1	7.591860e-21	SDPR	4.733899e-10
2	RPS25	1.394016e-196	S100A8	9.457377e-239	HLA-DRA	8.398068e-166	B2M	3.032705e-81	GNLY	9.322195e-87	AIF1	1.377723e-79	HLA-DRA	1.306768e-19	GNG11	4.733899e-10
3	RPS27	9.718451e-188	TYROBP	2.209430e-224	CD79B	1.171444e-153	CST7	1.129293e-78	CTSW	1.035081e-85	COTL1	9.684016e-78	HLA-DRB1	1.865104e-19	PPBP	4.744938e-10
4	RPS6	1.771786e-185	FTL	3.910903e-219	HLA-DPB1	6.167786e-148	GZMA	4.263559e-73	PRF1	3.364126e-85	FCGR3A	2.516161e-76	CD74	5.853161e-19	NRGN	4.800511e-10

第一次用scanpy分析单细胞数据

Preprocessing

Principal component analysis

Computing the neighborhood graph

Embedding the neighborhood graph

Clustering the neighborhood graph

Finding marker genes

推荐阅读更多精彩内容