Analyzing Your Data
Overview
Teaching: 5 min
Exercises: 60 min
Questions
Can I analyze my own single cell RNA-Seq experiment?
Objectives
Identify a workflow for analyzing your data.
Be aware of key decision points where your choices influence analysis outcomes.
Understand that there are specific approaches that can be used to analyze very large datasets.
Know how to seek help with your work.
In this lesson we will dive into the data that you bring (bringing your own data is not required). We will not necessarily discuss new questions beyond those covered earlier in the course; rather, this lesson is an opportunity to see how these issues play out across a set of diverse datasets with different characteristics.
If you have your own data, you might set a goal of completing quality control, normalization, basic clustering, and identifying major cell clusters. If you are working with data from a publication, you might try to reproduce – as best you can – one of the main figures from the paper.
Points to keep in mind as you work with your data
- You can find a high-level overview of the scRNA-Seq workflow by reviewing lessons 3 and 4 of this course.
- If you have your own data, it may be helpful to find a published dataset from the same tissue; such a dataset can be valuable for confirming the cell types that you see in your own data.
- Your cell/gene filtering parameters may be quite different from those we used earlier in this course.
- The number of PCs that you select for dimensionality reduction is an important quantity and you may wish to examine results from more than one value to determine how they change.
- Your clustering resolution directly affects the clusters that you detect and can easily be altered to match your intuition regarding the heterogeneity of cells in your sample (see the sketch after this list).
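As a concrete illustration of the last two points, here is a minimal sketch (assuming a Seurat object `obj` that has already been normalized, scaled, and run through `RunPCA`; the `dims = 1:24` value is an arbitrary example) of how you might compare several clustering resolutions:

```r
library(Seurat)

# Build the neighbor graph once, then cluster at several resolutions.
# Seurat stores each result in a separate meta.data column
# (e.g. "RNA_snn_res.0.3").
obj <- FindNeighbors(obj, dims = 1:24)
for (res in c(0.1, 0.3, 0.5, 1.0)) {
  obj <- FindClusters(obj, resolution = res, verbose = FALSE)
}

# How many clusters did each resolution produce?
res_cols <- grep("_snn_res", colnames(obj[[]]), value = TRUE)
sapply(obj[[]][res_cols], function(x) length(unique(x)))
```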
A review of a typical Seurat scRNA-Seq analysis pipeline
```r
# `counts` is your count matrix and `metadata` your per-cell metadata
obj <- CreateSeuratObject(counts, project = 'my project',
                          meta.data = metadata, min.cells = 5) %>%
  PercentageFeatureSet(pattern = "^mt-", col.name = "percent.mt")
obj <- subset(obj, subset = nFeature_RNA > 250 & nFeature_RNA < 5000 & nCount_RNA > 500)
```
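Before committing to filtering thresholds like those above, it is worth plotting the QC metrics; a minimal sketch using the `obj` created above:

```r
# Distributions of genes per cell, UMIs per cell, and mitochondrial
# percentage help you choose sensible cutoffs for YOUR data.
VlnPlot(obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3, pt.size = 0)
```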
Analyze using the base Seurat workflow:
```r
obj <- NormalizeData(obj, normalization.method = "LogNormalize") %>%
  FindVariableFeatures(nfeatures = 2000) %>%
  ScaleData(vars.to.regress = c("percent.mt", "nCount_RNA")) %>%
  RunPCA(verbose = FALSE, npcs = 100)
```
Look at your PCs and decide how many to use for dimensionality reduction and clustering:
```r
ElbowPlot(obj, ndims = 100)
```
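The elbow plot is a visual judgment call. If you would like a quantitative heuristic to complement it, one common (but not definitive) approach is to look at cumulative variance explained:

```r
# Percent variance explained by each PC, from the PCA standard deviations
pct <- Stdev(obj, reduction = "pca")^2 /
  sum(Stdev(obj, reduction = "pca")^2) * 100

# First PC at which cumulative variance exceeds 90% -- one possible cutoff
which(cumsum(pct) > 90)[1]
```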
```r
# let's use 24 PCs
obj <- FindNeighbors(obj, reduction = 'pca', dims = 1:24, verbose = FALSE) %>%
  FindClusters(verbose = FALSE, resolution = 0.3) %>%
  RunUMAP(reduction = 'pca', dims = 1:24, verbose = FALSE)
```
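Once you have clusters, you will typically visualize them and look for marker genes to identify cell types. A hedged sketch of that next step (the `FindAllMarkers` parameters here are common defaults, not the only reasonable choices):

```r
# UMAP colored by cluster
DimPlot(obj, reduction = "umap", label = TRUE)

# Positive markers for every cluster
markers <- FindAllMarkers(obj, only.pos = TRUE, min.pct = 0.25,
                          logfc.threshold = 0.25)
head(markers)
```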
Computational efficiency for very large datasets
In this section of the workshop, everyone will have the chance to work with a dataset of their choice. You may end up choosing to work with a dataset that is much larger than the example data we have been working with. For the purposes of this workshop, we will consider a “very large” dataset to be one with at least hundreds of thousands, if not millions, of cells.
Not long ago it was very impressive to profile thousands of cells; today, larger labs and consortia routinely profile millions. From an analytical perspective, these very large datasets must be handled differently from smaller ones: conventional methods (e.g. those we have shown earlier in this course) can easily run into memory or disk-space errors.
Here we will mention a few possible solutions for working with very large datasets. This is not an exhaustive list but we think it is a reasonable menu of options:
- Downsample your data by randomly sub-sampling so that you can work with a smaller dataset of (say) 100k cells. While this ameliorates issues of computational efficiency, it is highly undesirable to discard a significant fraction of your data, and we do not recommend this approach unless you are doing a rough preliminary analysis of your data (a sketch follows this list).
- Combine very similar cells into mini-bulk samples in order to drastically reduce the number of cells. Methods available to do this include Metacells Ben-Kiki et al. 2022 and SuperCell Bilous et al. 2022. This is an appealing approach as scRNA-Seq data is sparse so it seems natural to form mini-bulks of similar cells. However, analyzing data processed in this manner can be challenging because it can be difficult to interpret differences between “metacells” that are formed from varying numbers of real cells.
- Use the `BPCells` package, which leverages dramatic improvements in low-level handling of the data (storing data using bitpacked compressed binary files and disk-backed analysis capabilities) to increase speed. This package is directly integrated with `Seurat` (see a vignette demonstrating this integration at this link), and benchmarks indicate vastly improved workflow efficiency, particularly with respect to RAM usage. There should be no downside to using this approach other than the need for disk space to write the additional binary files.
- Take advantage of `Seurat`'s "sketch-based" analysis methods. Sketching is a bit like smart subsampling of your data, but with several added bonuses. The idea was introduced in Hie et al. 2019 and seeks to subsample from a large scRNA-Seq dataset with the goal of preserving both common and rare cell types. This is a much better approach than random subsampling because rare cells could turn out to be among the most interesting cells in your data! The real game-changer is the introduction in `Seurat` version 5 of built-in methods for sketch analysis. This means that you can do the sketch-based sampling within Seurat, carry out a typical workflow (finding variable features, running PCA, clustering), and then extend the results back to ALL cells. There is an excellent `Seurat` vignette on the sketch-based analysis approach here. The only caveat is that some `Seurat` functions have not yet (fall 2024) been modified to support the sketch-based approach.
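Below is a hedged sketch of two of these options: simple random downsampling, and Seurat v5's `SketchData()` (see the Seurat sketch vignette for the complete workflow; the cell counts here are arbitrary examples):

```r
# Option 1: random downsampling (fast, but discards data -- see the
# caveats above)
set.seed(42)
keep <- sample(Cells(obj), size = min(1e5, ncol(obj)))
obj_small <- subset(obj, cells = keep)

# Option 2: sketch-based sampling (Seurat v5). Leverage-score sampling
# aims to retain rare cell types; assumes normalized data with variable
# features identified.
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- SketchData(obj, ncells = 50000, method = "LeverageScore",
                  sketched.assay = "sketch")
```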
Working With Your Own Data: Integration and Label Transfer
When analyzing your own data, you may find it useful to obtain someone else’s previously published data (or your lab’s unpublished data) in which cell clusters have already been identified. You might then “bootstrap” your own analyses off of this existing data. Assuming the data you are using is derived from the same (or a similar) tissue, and that there are no enormous differences in the technology used to profile the cells, you can use the cell labels identified in the other dataset to help identify your own clusters. This process is called “integration” when two or more datasets are merged in some way, and “label transfer” when the cell cluster labels are transferred from one dataset to another.
If you wish to try this process with your own data, you may find the Seurat Integration and Label Transfer vignette helpful.
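For example, here is a minimal label-transfer sketch (assuming `ref` is an annotated, normalized reference object with a `cell_type` metadata column -- a hypothetical column name -- and `query` is your normalized data):

```r
# Find anchors between reference and query, then transfer labels
anchors <- FindTransferAnchors(reference = ref, query = query, dims = 1:30)
predictions <- TransferData(anchorset = anchors, refdata = ref$cell_type,
                            dims = 1:30)
query <- AddMetaData(query, metadata = predictions)

# Predicted labels (and prediction scores) are now in the query meta.data
table(query$predicted.id)
```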
If You Don’t Have Your Own Data: Further Explorations of Our Liver Data
If you don’t have your own data you might choose to do some further exploration of the liver data we have been working with throughout the course. You can do anything you like - be creative and try to investigate any aspect that you are curious about! Here are a few suggestions of things that you could try:
- Change filtering parameters (e.g. `percent.mt` or number of reads) and see how it affects your analysis.
- Adjust the number of variable genes that you choose. Try varying across a wide range (e.g. 100 vs 10,000).
- Adjust the number of PCs you select.
- Adjust the clustering resolution.
- Subcluster some cells that you might be interested in to see if there is any “hidden” heterogeneity that was not apparent when viewing all cells.
- Change the normalization method you use. The `sctransform` method is a statistical modeling approach that uses the negative binomial distribution. For a vignette on using it within Seurat, see this link. A sketch of this swap follows the list.
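As one example, here is a hedged sketch of swapping in `SCTransform()`, which replaces the `NormalizeData()`/`FindVariableFeatures()`/`ScaleData()` steps of the standard workflow (the downstream parameters are carried over from above, not tuned):

```r
obj <- SCTransform(obj, vars.to.regress = "percent.mt", verbose = FALSE) %>%
  RunPCA(npcs = 100, verbose = FALSE) %>%
  FindNeighbors(dims = 1:24) %>%
  FindClusters(resolution = 0.3) %>%
  RunUMAP(dims = 1:24)
```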
Session Info
```r
sessionInfo()
```

```
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] knitr_1.48

loaded via a namespace (and not attached):
 [1] compiler_4.4.0    magrittr_2.0.3    cli_3.6.3         tools_4.4.0
 [5] glue_1.8.0        rstudioapi_0.16.0 vctrs_0.6.5       stringi_1.8.4
 [9] stringr_1.5.1     xfun_0.44         lifecycle_1.0.4   rlang_1.1.4
[13] evaluate_1.0.1
```
Key Points
There are excellent tools for helping you analyze your scRNA-Seq data.
Pay attention to the points we have stressed in this course, as well as those stressed in analysis vignettes you may find elsewhere.
There is a vibrant single cell community at JAX and online (e.g. Twitter) where you can seek out help.