Analyzing Your Data

Overview

Teaching: 5 min
Exercises: 60 min
Questions
  • Can I analyze my own single cell RNA-Seq experiment?

Objectives
  • Identify a workflow for analyzing your data.

  • Be aware of key decision points where your choices influence analysis outcomes.

  • Understand that there are specific approaches that can be used to analyze very large datasets.

  • Know how to seek help with your work.

In this lesson we will dive into the data that you bring, although bringing your own data is not required. We will not necessarily raise questions beyond those covered earlier in the course; rather, this lesson is an opportunity to see how those issues play out across a set of diverse datasets with different characteristics.

If you have your own data, you might set a goal of completing quality control, normalization, basic clustering, and identifying major cell clusters. If you are working with data from a publication, you might try to reproduce – as best you can – one of the main figures from the paper.

Points to keep in mind as you work with your data

A review of a typical Seurat scRNA-Seq analysis pipeline

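library(Seurat)
library(dplyr)   # provides the %>% pipe
# Build the object (genes must be detected in >= 5 cells) and add percent mitochondrial counts
obj <- CreateSeuratObject(counts, project = 'my project',
        meta.data = metadata, min.cells = 5) %>%
        PercentageFeatureSet(pattern = "^mt-", col.name = "percent.mt")
# Keep cells with 250-5000 detected genes and more than 500 total counts
obj <- subset(obj, subset = nFeature_RNA > 250 & nFeature_RNA < 5000 & nCount_RNA > 500)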
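
The filtering thresholds above are one of the key decision points, so it can help to check how many cells each criterion removes before committing to it. The following is a minimal sketch in base R; the `meta` table is simulated here and only stands in for the per-cell metadata of a real object (which could be obtained with `obj[[]]`), and the name `qc_table` is made up for illustration.

```r
# Simulated per-cell QC metrics; with real data, use the metadata of your
# Seurat object instead (e.g. meta <- obj[[]]).
set.seed(1)
meta <- data.frame(
  nFeature_RNA = round(rlnorm(1000, meanlog = 7.5, sdlog = 0.5)),
  nCount_RNA   = round(rlnorm(1000, meanlog = 8.5, sdlog = 0.7))
)

# Evaluate each filtering criterion separately, then all together.
pass_min_features <- meta$nFeature_RNA > 250
pass_max_features <- meta$nFeature_RNA < 5000
pass_min_counts   <- meta$nCount_RNA > 500
pass_all          <- pass_min_features & pass_max_features & pass_min_counts

# Number of cells surviving each criterion.
qc_table <- c(
  total             = nrow(meta),
  pass_min_features = sum(pass_min_features),
  pass_max_features = sum(pass_max_features),
  pass_min_counts   = sum(pass_min_counts),
  pass_all          = sum(pass_all)
)
print(qc_table)

# Quantiles give a sense of where the thresholds fall in the distribution.
print(quantile(meta$nFeature_RNA, probs = c(0.01, 0.5, 0.99)))
```

If one criterion removes a large share of the cells, that is a signal to revisit the threshold for your particular dataset rather than reuse the values shown here.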

Analyze using base Seurat workflow

obj <- NormalizeData(obj, normalization.method = "LogNormalize") %>% 
    FindVariableFeatures(nfeatures = 2000) %>% 
    ScaleData(vars.to.regress = c("percent.mt", "nCount_RNA")) %>%
    RunPCA(verbose = FALSE, npcs = 100)

Look at your PCs and decide how many to use for dimensionality reduction and clustering:

ElbowPlot(obj, ndims = 100)
# let's use 24 PCs
obj <- FindNeighbors(obj, reduction = 'pca', dims = 1:24, verbose = FALSE) %>%
    FindClusters(verbose = FALSE, resolution = 0.3) %>%
    RunUMAP(reduction = 'pca', dims = 1:24, verbose = FALSE)
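
Reading an elbow plot is a judgment call, and a numeric summary can complement it. The sketch below uses a made-up vector of PC standard deviations (`pc_sdev`); with a real object these values could be retrieved with `Stdev(obj, reduction = "pca")`. The 90% cutoff is an arbitrary assumption for illustration, not a recommended rule, and the percentages are relative to the PCs that were computed, not to the total variance in the data.

```r
# Hypothetical standard deviations for 100 PCs: a steep decline followed by
# a flat tail, mimicking the shape of a typical elbow plot.
pc_sdev <- c(seq(20, 4, length.out = 20), rep(2, 80))

# Share of variance carried by each PC, among the PCs that were computed.
pct_var <- 100 * pc_sdev^2 / sum(pc_sdev^2)
cum_var <- cumsum(pct_var)

# Heuristic (an assumption): the smallest number of PCs whose cumulative
# share of variance reaches 90%.
n_pcs <- which(cum_var >= 90)[1]
print(n_pcs)

# Plot the per-PC share, analogous to what ElbowPlot() displays.
plot(pct_var, xlab = "PC", ylab = "% variance (of computed PCs)")
abline(v = n_pcs, lty = 2)
```

Whatever number you settle on, it is worth re-running the clustering with a few more and a few fewer PCs to see whether the major clusters are stable.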

Computational efficiency for very large datasets

In this section of the workshop, everyone will have the chance to work with a dataset of their choice. You may end up choosing to work with a dataset that is much larger than the example data we have been working with. For the purposes of this workshop, we will consider a “very large” dataset to be one with at least hundreds of thousands, if not millions, of cells.

Not long ago it was very impressive to profile thousands of cells. Today larger labs and/or consortia are fairly routinely profiling millions of cells! From an analytical perspective, these very large datasets must be approached differently than smaller datasets. Very large datasets often cause memory or disk space errors when analyzed with conventional methods (e.g. the methods we have shown earlier in this course).
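
To get a feel for why memory becomes a problem, here is a rough back-of-the-envelope calculation in R. The gene count, cell count, and fraction of nonzero entries are hypothetical numbers chosen for illustration. The sparse estimate assumes the column-compressed layout used by the `dgCMatrix` class of the Matrix package (an 8-byte value plus a 4-byte row index per nonzero entry, and a 4-byte pointer per column).

```r
# Hypothetical dataset dimensions.
n_genes <- 20000
n_cells <- 1e6
frac_nonzero <- 0.05   # assumed share of nonzero counts

# Dense storage: every entry is an 8-byte double.
dense_gb <- n_genes * n_cells * 8 / 1024^3

# Sparse (column-compressed) storage: only nonzero entries are stored.
n_nonzero <- n_genes * n_cells * frac_nonzero
sparse_gb <- (n_nonzero * (8 + 4) + (n_cells + 1) * 4) / 1024^3

print(round(c(dense_gb = dense_gb, sparse_gb = sparse_gb), 1))
```

Even the sparse estimate covers only the raw counts. Normalized and scaled copies of the data, and intermediate results, add to this, which is why steps that run comfortably on a laptop with a few thousand cells can fail at this scale.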

Here we will mention a few possible solutions for working with very large datasets. This is not an exhaustive list but we think it is a reasonable menu of options:

Working With Your Own Data: Integration and Label Transfer

When analyzing your own data, you may find it useful to obtain previously published data (or your lab’s unpublished data) in which cell clusters have already been identified. You might then “bootstrap” your own analyses off of this existing data. Assuming the data you are using is derived from the same (or a similar) tissue, and that there are no enormous differences in the technology used to profile the cells, you can use the cell labels identified in the other dataset to try to identify your own clusters. This process can be called “integration” when two or more datasets are merged in some way, and “label transfer” when the cell cluster labels are transferred from one dataset to another.

If you wish to try this process with your own data, you may find the Seurat Integration and Label Transfer vignette helpful.
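
As a rough orientation, label transfer in Seurat can be sketched as a small wrapper function. The function name `transfer_labels`, the metadata column name `"celltype"`, and the choice of 30 dimensions are assumptions made for illustration. The sketch also assumes that both objects have been normalized and that PCA has already been run on the reference. Consult the vignette for the details that apply to your Seurat version.

```r
# Sketch of label transfer between two Seurat objects.
# `reference`: annotated object (normalized, PCA computed).
# `query`: your own object (normalized).
# `label_col`: hypothetical name of the metadata column holding the labels.
transfer_labels <- function(reference, query,
                            label_col = "celltype", dims = 1:30) {
  # Find pairs of similar cells ("anchors") that link the two datasets.
  anchors <- Seurat::FindTransferAnchors(
    reference = reference, query = query,
    dims = dims, reference.reduction = "pca"
  )
  # Reference labels as a vector named by cell.
  ref_labels <- setNames(reference@meta.data[[label_col]],
                         colnames(reference))
  # Predict a label (and a score) for every query cell.
  predictions <- Seurat::TransferData(
    anchorset = anchors, refdata = ref_labels, dims = dims
  )
  # Store the predictions (e.g. predicted.id) in the query metadata.
  Seurat::AddMetaData(query, metadata = predictions)
}
```

A call might look like `query <- transfer_labels(reference, query)`, after which `table(query$predicted.id)` summarizes the transferred labels. Treat the predictions as a starting point and check them against known marker genes in your own clusters.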

If You Don’t Have Your Own Data: Further Explorations of Our Liver Data

If you don’t have your own data you might choose to do some further exploration of the liver data we have been working with throughout the course. You can do anything you like - be creative and try to investigate any aspect that you are curious about! Here are a few suggestions of things that you could try:

Session Info

sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.48

loaded via a namespace (and not attached):
 [1] compiler_4.4.0    magrittr_2.0.3    cli_3.6.3         tools_4.4.0      
 [5] glue_1.8.0        rstudioapi_0.16.0 vctrs_0.6.5       stringi_1.8.4    
 [9] stringr_1.5.1     xfun_0.44         lifecycle_1.0.4   rlang_1.1.4      
[13] evaluate_1.0.0   

Key Points

  • There are excellent tools for helping you analyze your scRNA-Seq data.

  • Pay attention to points we have stressed in this course and points that are stressed in analysis vignettes that you may find elsewhere.

  • There is a vibrant single cell community at JAX and online (e.g. Twitter) where you can seek out help.