10X Genomics have been very successful in developing their gel-bead droplet technology for phased genome sequencing and more recently, single-cell 3'mRNA-seq. I've posted about their technology before (at AGBT2016, and March and November 2015) and based most of what I've written on discussion with 10X or from presentations by early access users. Now 10X have a paper up on the BioRxiv: Massively parallel digital transcriptional profiling of single cells. This describes their approach to single-cell 3'mRNA-seq in some detail and describes how you might use their technology in trying to better understand biology and complex tissues.
Technical performance of the GEMcode system: The paper is unfortunately based on the earlier GEMcode system rather than the latest Chromium, but the results are likely, though not definitely, going to be representative of what Chromium can deliver.
Technical performance was assessed using 1200 Human 293T or Mouse 3T3 cells, with 100,000 reads per cell. 71% of reads aligned to Human or Mouse genomes (38% and 33% respectively). Analysis of the UMIs allowed the authors to estimate the total number of cell-containing GEMs at just over 1000 (482 Human and 538 Mouse). Only 8 GEMs appeared to have Human and Mouse cells co-located, as assessed by GEM barcoded reads aligning to both genomes. It is not easy (if it is possible at all) to detect Human:Human or Mouse:Mouse doublets from alignment, so the 8 observed mixed-species GEMs (about 0.8%) understate the true figure, and the inferred doublet rate for this experiment was 1.6% (see figure 2a in the paper with multiplet GEMs as grey dots).
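A rough way to see where 1.6% could come from is sketched below. This is my own back-of-envelope reconstruction, not necessarily the authors' calculation: it assumes cells pair at random, so only the mixed-species share of doublets (2pq) is visible, and it uses 482 + 538 as the total number of cell-containing GEMs, as quoted above. The function name is made up for illustration.

```python
def inferred_multiplet_rate(n_human, n_mouse, n_mixed):
    """Return (observed, inferred) multiplet rates from a species-mixing run.

    Assumption (not from the paper): cells co-encapsulate at random, so the
    share of doublets that are visibly mixed-species is 2 * p * q.
    """
    total = n_human + n_mouse          # cell-containing GEMs, as quoted in the post
    p = n_human / total                # fraction of Human cells
    q = n_mouse / total                # fraction of Mouse cells
    visible_fraction = 2 * p * q       # Human:Mouse share of all doublets
    estimated_multiplets = n_mixed / visible_fraction
    return n_mixed / total, estimated_multiplets / total


observed, inferred = inferred_multiplet_rate(482, 538, 8)
print(f"observed {observed:.1%}, inferred {inferred:.1%}")
```

With a near 50:50 mix roughly half of all doublets are invisible, which is why the inferred rate is about double the observed one.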
The 1.6% multiplet (doublet, triplet, or higher) rate appears low, but as cell numbers increase so does the multiplet rate. The authors describe a linear relationship of multiplet rate to cell loading from 1000-10000 cells (Supplementary Fig. 1a), however it is not clear how this rate changes at 20k, 30k, 40k, or 50k (the maximum loading recommended). What the impact is on experiments I do not know - but this is an area several labs are focusing on. The multiplet rate "approximately followed a Poisson distribution" as assessed by imaging experiments (Supplementary Fig. 1b). In these a Nikon microscope equipped with a high-speed camera capable of capturing 4000 frames per second imaged GEMs as they were created. 28,000 frames were analysed for single-cell encapsulation (7 seconds of video, which only represents about 1.5% of the time your Chromium is actually making GEMs) but the multiplet rate was 16% higher than expected - I don't think the authors delve deeply enough into the reasons for this. Multiplets are likely to add significant noise to analysis of single-cell experiments. Every single-cell technology has to account for them, and cells like to stick together, so users probably can't rely on actually having a single-cell suspension in the first place.
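To get a feel for what "approximately Poisson" implies, here is a minimal sketch. It assumes the number of cells per GEM is Poisson with mean lam (average cells per droplet); lam is a hypothetical parameter I have not tried to map onto real cell loadings, since that needs the number of GEMs per run. Under that assumption the multiplet fraction among cell-containing GEMs is close to lam/2 at low loading, which is consistent with a roughly linear relationship, and it slowly drifts below that line as loading goes up.

```python
import math


def multiplet_fraction(lam):
    """Fraction of cell-containing droplets with 2+ cells, for Poisson mean lam."""
    p0 = math.exp(-lam)          # empty droplet
    p1 = lam * math.exp(-lam)    # exactly one cell
    return (1.0 - p0 - p1) / (1.0 - p0)


# lam values are illustrative only, not real 10X loading figures
for lam in (0.01, 0.05, 0.1, 0.2, 0.5):
    print(f"lam={lam:<5} multiplets={multiplet_fraction(lam):.2%}  lam/2={lam / 2:.2%}")
```

Sticky cells and imperfect suspensions break the independence assumption, so real multiplet rates can sit above this idealised curve.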
To further investigate this the authors also carried out mixing experiments with Human 293T (female & expressing XIST) and Jurkat cells (male & expressing CD3D). Figure 2e in the paper shows the PCA for these mixes at 100% 293T, 100% Jurkat, 50:50 or 10:90. The 50:50 mix shows a lot of cells in the space between the cell clusters; I'd suggest this indicates a higher multiplet rate in this experiment than the 1.6% suggested. But I could not see the cell loading density used, which may explain the higher number of apparent multiplets.
Cell capture efficiency: The rate of cell capture is important, especially where rare cell populations are being studied. 10X captures about 50% of the cells loaded into GEMs (Supplementary Tables 1&3), and whilst this could be increased it would come at the cost of a higher cell doublet/triplet rate. This might be a parameter users are willing to tweak depending on their needs, and it would be interesting to ask how many users would accept higher doublets in return for 80-90% cell capture rates. What we really need in a single-cell system is the ability to image cells in droplets so we can exclude empty drops, doublets and triplets; I'd be interested to know if anyone is working on something like this?
The level of cross-talk between cell barcodes was about 1% (see Online Methods) but it is not clear in the manuscript where this cross-talk comes from. If it is error in reading the cell barcodes then this could be reduced by sequencing longer, more error-tolerant barcodes, and a longer barcode read (if >25bp) would allow a proper error estimation of the index read. But if this is coming from molecular cross-over during the downstream library prep (which is going to happen to some degree) then fixing it will be much more difficult (see these papers to learn more about PCR chimeras and their effect on NGS: NAR 2012, NAR 1990, JBioChem 1990, NAR 1995).
83% of UMIs were associated with cell barcodes suggesting that cell-free RNA does not significantly affect the results - this is an issue scSeq users will have to consider carefully as the amount of cell-free RNA or DNA in a sample is likely to be highly variable, and it may be that experiments with artificially high levels might show us the failure mode in these sample types.
Transcript counting: With 100,000 reads per cell the authors report a median detection rate of 4500 genes or 27,000 transcripts with little bias for GC content or gene length. However, as this is a 3' assay I'd not expect a huge variation here, and this is something that would become much more important as 10X, and others, move to whole transcript assays. Clustering analysis was performed with Seurat (Satija et al., 2015).
SNV detection from scRNA-seq data: While deciphering population structure and discovering rare cells is great, many people will want to look for SNP/SNVs in their scRNA-seq data. The authors reported an analysis of a curated set of high quality SNVs observed in only 293T or Jurkat cells, but not both (see Online Methods). They showed that they could detect SNVs reliably, and that multiplet rates predicted from SNVs were highly correlated with those from gene expression analysis. The paper is confusing in suggesting that each cDNA generates 250bp of sequence for SNV detection, but the sequencing run generates only 98bp in read 1 from the cDNA (I'd like to understand this better or see this corrected in the final version if it is a typo).
scRNA-seq from frozen cells: In the discussion the authors make a strong statement about the ability to analyse frozen cells: "the ability of GemCode to generate faithful scRNA-seq profiles from cryopreserved samples enables its application to clinical samples". The frozen cells in question were fresh cells recovered from whole blood, cryopreserved and "gently thawed" one week later (see Online Methods). Only a small number of genes (57) showed greater than 2-fold upregulation (no down regulated genes were reported), suggesting that freezing cells is possible. However I suspect that the minimal freezing time and "gentle" protocols will put many users off relying on cell storage until more comprehensive evaluation is undertaken. The fact that they got such good results is encouraging: we're working on a project with patient material that needs to be processed immediately for best results. Right now we're bringing cells over from the hospital about one hour after collection and processing straight away, but this is not an efficient use of the technology when the plastic chip holds 8 samples and costs $150 each time.
A few words about sequencing 10X scRNA-seq libraries: In the paper the authors say that after GEMcode prep "libraries then undergo standard Illumina short-read sequencing" - there is nothing standard about the run type you need to do for 10X. It is a 98.14.8.10 format run - 98bp 1st read (mRNA), 14bp Index 1 (UMI), 8bp Index 2 (sample index), 10bp 2nd read (Cell index) - I hope I got that right!
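For anyone planning a run, the four read lengths listed above can be written down as a simple structure to check the cycle count you need from a kit. This is just my bookkeeping sketch using the assignments as given in this post (which I flagged as uncertain), not an Illumina or 10X configuration format.

```python
# Read segments as listed in the post: (read name, content, cycles)
read_structure = [
    ("Read 1", "cDNA (mRNA)", 98),
    ("Index 1", "UMI", 14),
    ("Index 2", "sample index", 8),
    ("Read 2", "cell index", 10),
]

# Dotted run-format shorthand and total sequencing cycles required
run_format = ".".join(str(cycles) for _, _, cycles in read_structure)
total_cycles = sum(cycles for _, _, cycles in read_structure)

print(run_format, total_cycles)
```

The total is 130 cycles, which is why a longer-than-necessary kit or a NextSeq/RapidRun ends up being the practical choice.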
10X sequencing does not fit easily into a core lab running HiSeq instruments due to the run configuration (we need 8 lanes of the same sample type). I suspect this is going to get much easier as we do more and more 10X sequencing, but for now we're either running longer reads than necessary, or using NextSeq/2500 RapidRuns. Chromium genomes can now be run on X Ten as PE150 with no modification. Hopefully single-cell RNA-seq will move to a more standard single-end run for differential gene expression, this would make life easier for my team, and reduce costs by around 40%.
Summary: All in all this paper explains many of the things potential users of 10X single-cell are looking to understand. I'm expecting papers to be coming thick and fast over the next six months now people have the instruments in their hands.
It is going to be interesting to see how 10X develop their chemistry, particularly for whole transcriptome single-cell, for copy-number and for applications like G&T-seq or scM&T-seq, or even ATAC-seq.
How will RainDance fight back with their own single cell methods? And how does this 3'mRNA-seq assay compare to Fluidigm's C1? Both of these are questions I look forward to seeing answered. Ultimately the more technologies we have for single-cell the better, as there are likely to be strengths and weaknesses in each. But I'd not be surprised if the one with the most open chemistry becomes dominant - this was part of Illumina/Solexa's success as it meant users could develop methods from a core technology.
PS: Supplementary Figure legends are available on BioRxiv, but not the figures - go figure! Online methods are also missing. Probably because the BioRxiv does not check if these have been submitted.