Welcome to the RNA-Seq world
Overview
Recently, Philippe Boileau joined the Dudoit lab and I had to think about which resources to point him to so that he could get up to speed in the field. Together, we built the following list of useful material. The resources fall into five broad (and largely overlapping) categories:
- How to produce data with sequencing
- How to find, access and pre-process data (including how to use bio-informatics tools)
- How to analyze the data: statistics
- How to use all of the above to conduct an analysis (with - of course - a strong emphasis on R tools)
- How to do all those analyses correctly: reproducible research
I’ll therefore try to list resources that help with each of these subjects. This will be a mix of (scientific) papers, books, package vignettes, personal tutorials, videos…
If you find any other relevant material, feel free to open an issue on the GitHub repo.
Before we begin
The resources below broadly assume that you have at least a basic knowledge of R, statistics and biology.
Otherwise, for R, I recommend (and I am far from the only one) Garrett Grolemund & Hadley Wickham’s R For Data Science. Additionally, the Statistical Computing Facility at Berkeley hosts a yearly R bootcamp. The material is on GitHub and is a great way to practice (this is how I really learned R, and I had the chance to help with the teaching in 2018).
Biological concepts you should be familiar with include (with links to the relevant Wikipedia pages) the Central Dogma and mRNA splicing. Keeping up to date with the current knowledge on major diseases is also crucial.
For statistics, I do not have any material to recommend.
Sequencing methodology
- RNA-Sequencing mostly relies on Illumina sequencing nowadays, but it’s useful to learn about other sequencing methods. I feel that videos are the best medium for understanding the mechanisms.
  - Sanger sequencing: the first method used for sequencing. The video below gives a 3-minute tutorial on how it works.
  - Illumina sequencing: also called Next Generation Sequencing (NGS) or short-read sequencing. The best resources I have found come from the Illumina website.
  - PacBio sequencing: long-read sequencing, the second generation of NGS. In short, longer reads with lower accuracy compared to Illumina reads.
  - Oxford Nanopore sequencing: even longer reads with even lower quality.
- Single-cell RNA-Sequencing is, of course, conducted at the single-cell level. The way the cells are pre-processed, separated and sequenced is method-dependent. A good overview of those methods can be found in Svensson et al.
- Many other types of ‘omics exist: the YouTube channel StatQuest gives a good intro to many of them (as well as clear explanations of many statistical analyses). Some ‘omics that are often collected alongside RNA-Seq are ChIP-Seq and ATAC-Seq.
Prepping the data for analysis
Obtaining data
The R GEOquery package from Bioconductor allows you to download data and sample information from NCBI. Most papers provide GEO accession numbers as a key to access their data, and this is usually enough.
Pre-processing data
The sequencing step produces reads at the cell level. The next question is how to map them to a reference genome to obtain transcript or gene expression levels, usually summarized as a count matrix (genes or transcripts by samples or cells). To obtain this matrix, several bioinformatics tools exist. One series of tools is TopHat, which was replaced by TopHat2 and then by HISAT, which was in turn upgraded to HISAT2. A popular alternative is the STAR algorithm.
Recently, tools that do not rely on full genome alignment have been proposed, including kallisto and Salmon. A recent comparison (not yet peer-reviewed) can be found here.
To run these tools, basic knowledge of shell scripting is needed. Good tutorials can be found here or here.
Also, sequencing data can be stored in many different formats. This page gives a very complete overview of the format types and lists tools to convert from one to another.
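To make the end product of this pre-processing step concrete, here is a minimal sketch of what "counting" means once each read has been assigned to a gene. The read assignments, cell names and gene names below are made up for illustration. Real pipelines use the aligners and quantifiers named above rather than code like this.

```python
from collections import Counter


def count_matrix(assignments):
    """Tally (sample, gene) read assignments into a genes x samples table."""
    counts = Counter(assignments)
    samples = sorted({sample for sample, _ in assignments})
    genes = sorted({gene for _, gene in assignments})
    # One row per gene, one column per sample; absent pairs count as 0.
    matrix = [[counts[(sample, gene)] for sample in samples] for gene in genes]
    return genes, samples, matrix


# Hypothetical aligner output: one (sample, gene) pair per mapped read.
reads = [
    ("cell1", "GeneA"), ("cell1", "GeneA"), ("cell1", "GeneB"),
    ("cell2", "GeneA"), ("cell2", "GeneC"), ("cell2", "GeneC"),
]

genes, samples, matrix = count_matrix(reads)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Each row is a gene and each column a sample (here, a cell), which is the layout most downstream analysis tools expect.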
Pre-processed data
Many datasets have been stored in a clean format in the recount2 project. This is an excellent place to start if you want to explore algorithms on real world and diverse datasets.
The conquer project has a shorter list of very well processed datasets, using Salmon.
The Bioconductor Project also has a lot of dataset packages.
Tools needed for analysis
Many excellent books exist to give a clear understanding of the many tools and concepts needed. Freedman’s book, Statistical Models: Theory and Practice, is a must-read, which I consider closer to a philosophical treatise. More applied, Susan Holmes and Wolfgang Huber’s new book, Modern Statistics for Modern Biology, is quite complete for tools used in genomics. Key concepts include dimensionality reduction, clustering and modelling. A deeper and mathier book is The Elements of Statistical Learning by Hastie, Tibshirani & Friedman.
A central concept in the high-dimensional statistics that characterize genomics is multiple hypothesis testing. Efron (2010) is a good reference with a clear Bayesian justification. This paper presents the concepts in a genomic setting.
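As a small illustration of multiple testing correction, the sketch below implements the Benjamini-Hochberg adjustment, one common way to control the false discovery rate when testing thousands of genes at once. The p-values are invented for the example, and the function name is my own. In R, `p.adjust(p, method = "BH")` computes the same adjustment.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, one per gene.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
adj = benjamini_hochberg(pvals)

# Genes declared significant at a 5% false discovery rate.
discoveries = [i for i, q in enumerate(adj) if q <= 0.05]
print(discoveries)
```

Note that five of the raw p-values are below 0.05, but only the two smallest survive the adjustment at the 5% level.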
The R language is used in many RNA-Seq applications. To improve your coding practices, I also recommend Advanced R by Hadley Wickham. The S4 section of the book will be of particular interest, given that Bioconductor strongly recommends using S4 objects and functions.
Applications to RNA-Seq
We’re finally getting to the RNA-Seq analysis part. To start, I advise beginning with a practical tutorial. The Bioconductor 2018 Workshop provides a few that I recommend. This other one is longer but more complete. This other tutorial that I wrote is shorter but can be a good place to start.
Once you have seen how to proceed, this recent review by Koen et al. is a good follow-up that very nicely recaps many of the concepts above and can point you to further resources.
Then, I recommend diving into methods papers. This list is partly inspired by the Leek group paper.
- Sequencing technology does not eliminate biological variability
- The first paper behind the limma package and the second, with voom
- edgeR
- Key need for normalization, especially because of batch effects
- Methods for normalization include scone, zinbWave and scVI.
- Seurat is a global framework of analysis
- Clustering is a key step of RNA-Seq analysis. This paper contrasts several methods on real and synthetic data.
- This paper proposes a nice overview of trajectory inference
- Gene Set Enrichment Analysis
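To give a feel for the last item, here is a minimal sketch of the running-sum enrichment score at the heart of Gene Set Enrichment Analysis. The gene names, scores and gene set are invented, and the function name and its `weight` parameter are my own choices for illustration. Assessing significance (done by permutation in the original method) is not shown.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score of gene_set along a ranked gene list.

    ranked_genes: gene names sorted from most to least associated
                  with the phenotype of interest.
    scores:       ranking metric for each gene, in the same order.
    weight:       exponent applied to |score| for genes in the set;
                  0 gives the unweighted (Kolmogorov-Smirnov-like) version.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_weights = [abs(s) ** weight if hit else 0.0
                   for s, hit in zip(scores, in_set)]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("need weighted hits and at least one gene outside the set")

    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        # Step up at genes in the set, step down at all other genes.
        running += w / total_hit_weight if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Made-up ranked list: positive scores are up-regulated genes.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(ranked, metric, {"g1", "g2"})
es_bottom = enrichment_score(ranked, metric, {"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gets a score close to 1, a set concentrated at the bottom gets a score close to -1, and a set scattered along the list stays closer to 0.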
Reproducible research
All the analyses you conduct should aim for reproducibility. This Nature collection of papers gives a broad overview of the issue of reproducibility, and of how to fix it. A first step to try and limit the reproducibility crisis in science - and one where (bio)statisticians have a key role - is statistical and computational reproducibility.
To help, several tools exist. Git and GitHub are powerful tools and many good tutorials exist (see here for example). Good coding practices are also necessary. Saving your software in packages is a part of it, and this book is a good reference on how to build them in an efficient fashion. I also recommend using tools like R Markdown or knitr to combine text, math, code and figures in one coherent place when writing papers, reports or even for day-to-day analyses.
While code is now commonly saved and made public for reproducibility purposes, data is unfortunately not always available. Raw files can be made accessible through NCBI. Pre-processed data (a count matrix for example) can also be made public using a data R package, such as the experiment packages that can be found in Bioconductor.