L1000 data proceed through the processing pipeline outlined in the figure below. Briefly, the pipeline captures raw data from Luminex FlexMap 3D scanners as they are generated, deconvolutes 978 transcripts from only 500 Luminex bead colors, normalizes the data based on 80 invariant control genes, infers the expression of the non-measured transcripts, determines which genes are differentially expressed following a perturbation compared to controls, and generates composite signatures across biological replicates. Along the way, the data are subjected to rigorous quality control filters at both the sample and plate level.
Level 1
Level 1 - LXB - Raw fluorescent intensity (FI) values measured for every bead detected by Luminex scanners. The FI is proportional to the amount of amplicon bound to the bead, and hence also proportional to the transcript abundance of the genes that the bead interrogates. Each 384-well plate generates 384 LXB files, one per well, and each file contains a fluorescent intensity value for each bead observed in that well. Here, the data from each perturbagen treatment is referred to as a profile, experiment, or instance.
Level 2
Level 2 - GEX - Gene expression levels for the 978 landmark genes, deconvoluted from the measured fluorescent intensity values. (See supplementary information in Subramanian, et al., 2017 for details on peak deconvolution.) Here, the data from each perturbagen treatment is referred to as a profile, experiment, or instance.
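The idea behind deconvolution can be illustrated with a small sketch. Because 978 transcripts are read from 500 bead colors, the FI values for one bead color in one well are a mixture of two bead populations. The code below is only a simplified illustration, not the pipeline's peak-calling algorithm: it assumes the two genes sharing a color are represented by unequal numbers of beads, splits the FI values with plain one-dimensional 2-means clustering, and assigns the larger cluster to the gene with more beads. The function names `two_means_1d` and `deconvolute` are hypothetical.

```python
from statistics import median


def two_means_1d(values, iterations=50):
    """Split 1-D values into a low and a high cluster with plain 2-means."""
    lo_center, hi_center = min(values), max(values)
    lo, hi = list(values), []
    for _ in range(iterations):
        # Assign each value to the nearer center (ties go to the low cluster).
        lo = [v for v in values if abs(v - lo_center) <= abs(v - hi_center)]
        hi = [v for v in values if abs(v - lo_center) > abs(v - hi_center)]
        if not hi:  # all values identical: only one population is visible
            break
        new_lo, new_hi = sum(lo) / len(lo), sum(hi) / len(hi)
        if (new_lo, new_hi) == (lo_center, hi_center):
            break  # converged
        lo_center, hi_center = new_lo, new_hi
    return lo, hi


def deconvolute(bead_fi):
    """Return (majority_gene_fi, minority_gene_fi) for one bead color in one well.

    Assumption (not stated in the text above): the gene represented by more
    beads owns the larger cluster. Each gene's level is the cluster median.
    """
    lo, hi = two_means_1d(bead_fi)
    if not hi:
        level = median(lo)
        return level, level
    big, small = (lo, hi) if len(lo) >= len(hi) else (hi, lo)
    return median(big), median(small)


# Six beads near FI 100 and three beads near FI 1000 for the same color.
majority, minority = deconvolute([100, 110, 90, 105, 95, 98, 1000, 1010, 990])
```

See Subramanian, et al., 2017 for the peak deconvolution procedure actually used; the sketch only conveys why two expression values can be recovered from a single bead color.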
Level 3
Level 3a - NORM - Gene expression values (GEX, Level 2) are normalized to invariant gene set curves and quantile normalized across each plate. Here, the data from each perturbagen treatment is referred to as a profile, experiment, or instance.
Level 3b - INF - Expression values for 11,350 additional genes not directly measured in the L1000 assay are inferred from the normalized values of the 978 landmark genes.
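Two of the Level 3 operations can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the pipeline's implementation: `quantile_normalize` applies textbook quantile normalization across the wells of a plate (ties are not handled specially, and the preceding invariant-set scaling step is omitted), and `infer_gene` assumes that a non-measured gene is estimated as a linear combination of landmark values. Both function names, the linear form, and the example weights are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile normalize a plate.

    samples: list of equal-length expression vectors, one per well.
    Returns vectors that all share the same distribution of values.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across wells.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest value in this well.
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = reference[rank]
        normalized.append(out)
    return normalized


def infer_gene(landmarks, weights, intercept):
    """Assumed linear estimate of one non-measured gene from landmark values."""
    return intercept + sum(w * x for w, x in zip(weights, landmarks))


plate = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]  # two wells, three genes
norm = quantile_normalize(plate)            # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
estimate = infer_gene([1.0, 2.0], weights=[0.5, 0.25], intercept=1.0)  # 2.0
```

In the real data, each inferred gene would draw on all 978 landmark values, so a full profile grows from 978 measured values to 12,328 values after inference.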
Level 4
Level 4 - ZS - Z-scores for each gene based on Level 3 with respect to the entire plate population. This comparison of profiles to their appropriate population control generates a list of differentially expressed genes.
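As a rough illustration of plate-based z-scoring, the sketch below scores one gene across all wells of a plate. It assumes a robust z-score built from the median and the median absolute deviation (MAD); the text above only says "z-scores", so the robust form, the function name `robust_zscores`, and the `min_mad` floor are assumptions made for the example.

```python
from statistics import median


def robust_zscores(values, min_mad=0.1):
    """Z-score one gene across a plate using median and MAD.

    values: expression of a single gene in every well of the plate.
    min_mad: hypothetical floor that avoids division by a near-zero spread.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    # 1.4826 scales the MAD to match the standard deviation of a normal distribution.
    scale = 1.4826 * max(mad, min_mad)
    return [(v - center) / scale for v in values]


# Four unremarkable wells and one well where the gene is strongly induced.
z = robust_zscores([1.0, 2.0, 3.0, 4.0, 100.0])
```

With the median and MAD, a single extreme well barely shifts the reference population, so the induced well stands out with a large positive score while the others stay near zero.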
Level 5
Level 5 - MODZ - Replicate-collapsed z-score vectors based on Level 4. Replicate collapse generates one differential expression vector, which we term a signature. Connectivity analyses are performed on signatures.
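Replicate collapse can be sketched as a weighted average of the replicate z-score vectors. The example below assumes one common formulation of a moderated z-score: each replicate is weighted by its mean Spearman correlation with the other replicates, weights are floored at a small positive value, and the weights are normalized to sum to 1. The text above does not spell out the weighting, so this scheme, the `min_weight` value, and all function names are assumptions.

```python
def _ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return _pearson(_ranks(x), _ranks(y))


def modz(replicates, min_weight=0.01):
    """Collapse replicate z-score vectors into one signature.

    Returns (signature, weights). The weighting scheme is an assumption.
    """
    n = len(replicates)
    if n == 1:
        return list(replicates[0]), [1.0]
    raw = []
    for i in range(n):
        corr = [spearman(replicates[i], replicates[j]) for j in range(n) if j != i]
        raw.append(max(sum(corr) / len(corr), min_weight))
    total = sum(raw)
    weights = [w / total for w in raw]
    n_genes = len(replicates[0])
    signature = [
        sum(w * rep[g] for w, rep in zip(weights, replicates)) for g in range(n_genes)
    ]
    return signature, weights


# Two concordant replicates and one that is uncorrelated with them.
signature, weights = modz([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [2, 5, 3, 1, 4]])
```

Under this scheme, a replicate that disagrees with the others contributes little to the signature, which is the practical motivation for weighting rather than taking a plain mean.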
For levels 1 and 2, values are present for only the 978 landmark features. For levels 3-5, values are present for each of the 12,328 genes (978 landmark plus 11,350 inferred).
The code for the data processing pipeline is available in the cmapM GitHub repository. The procedure to replicate each step of the pipeline, along with sample data, is detailed here.