Single-cell RNA-sequencing (scRNA-seq) is a very potent biological tool used for many applications. A far from exhaustive list would include identifying new cell types, finding differentially expressed (DE) genes, and discovering lineages among cells. To get an overview of the principle, see here.
However, the usual framework might seem a little daunting for beginners and, while many well-crafted tutorials exists, they all share the same idea: use a biological dataset as an example. Here, we want to use a dataset that would be more understandable to a broader public to explain the usual steps in scRNA-seq analysis. Only some basics on data analysis are needed to understand this tutorial. Knowledge of R helps to understand the code but is not necessary to follow along.
Our dataset is the voting record of the French delegates of the \(14^{th}\) legislature, going from May 2012 to May 2017. Delegates are characterized by their names and surnames, their department and circonscription, their political group. For each of the 644 occasions of votes (which can be bills, amendments, choosing the prime minister, …), we also have the voting behavior of each delegate (voted yes, no, abstain or did not took part in the vote) and the reason of the vote. We are interested in investigating the link between voting behavior and political affiliations.
The usual scRNA-seq data is a matrix of \(J\) genes by \(n\) biological cells. Each cell of the matrix represents the number of copies of messenger RNA of a given gene in a given biological cell.This number serves as a proxy for the number of proteins from this mRNA and therefore represent how active a gene is in a given cell. We also eventually have some information about cells.
Here, we instead have a \(J\) bills by \(n\) delegates matrix. Each cell represent the vote of the delegate for that low (\(-1\) if against, \(1\) is for, \(0\) if not voting or abstained). Remarks: We use bill as a generic term to cover amendments, confidence votes, actual bills, ….
## SingleCellExperiment clusterExperiment knitr
## TRUE TRUE TRUE
## kableExtra dplyr tidyr
## TRUE TRUE TRUE
## purrr ggplot2 readr
## TRUE TRUE TRUE
## stringr RColorBrewer matrixStats
## TRUE TRUE TRUE
# Create the phenodata matrix
voting_record <- read_csv("data/voting_record.csv")
MetaData <- voting_record %>% select(circo, dept, name, surname, identifiant,
iden, group1, group2, group3, NbVote,
NbYes, NbNo, NbAbst)
voting_record <- voting_record %>% select(-one_of(colnames(MetaData))) %>%
t(.)
# We count abstaining and not voting as the same thing for simplification
letters_to_num <- function(vote) {
if (length(vote) > 1) {
return(sapply(vote, letters_to_num))
}
if (is.na(vote)) return(NA)
if (vote == "For") return(1)
if (vote == "Against") return(-1)
if (vote == "Abstain") return(0)
if (vote == "NotVoting") return(0)
}
voting_record <- apply(voting_record, 2, letters_to_num)
# Build the SingleCellExperiment Object
bills <- read_csv("data/votes.csv")
colnames(voting_record) <- rownames(MetaData) <- MetaData$identifiant
rownames(voting_record) <- rownames(bills) <- bills$Number
Sce <- SingleCellExperiment(assays = list(counts = voting_record),
colData = MetaData,
rowData = bills)
In scRNA-seq settings, we want to filter the cells with too few reads in total, or with too many reads of poor quality. Here, we instead filter the delegates with too few voting possibilities. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, or replaced a delegate in the previous cases.
We therefore filter delegates whose number of voting possibilities is more than 2 SD below the mean (so equal or less to 40).
# Count how many votes happen during the delegate mandate
Sce$Nb_Votes <- colSums(!is.na(counts(Sce)))
ggplot(colData(Sce) %>% as.data.frame(),
aes(x = Nb_Votes, fill = Nb_Votes > 40)) +
stat_bin(breaks = seq(0, max(Sce$Nb_Votes), 20)) +
labs(x = "Number of bills") +
guides(fill = guide_legend(title = "Could have voted\nat least \n40 times")) +
scale_fill_manual(values = brewer.pal(3, "Set1")[c(1, 3)],
breaks = c(TRUE, FALSE)) +
theme_classic()
# Filter
Sce <- Sce[, Sce$Nb_Votes > 40]
We then simplify the data by replacing NAs by 0 (as if they were abstentions). We can then check whether some delegates rarely voted. Again, we remove those whose actual number of votes is more than 2 SD below the mean number of votes (so delegates who actually voted less than 38 votes).
# Remplace NAs by zeros
counts <- counts(Sce)
counts[is.na(counts)] <- 0
assays(Sce)$counts <- counts
Sce$Nb_Votings <- colSums(counts(Sce) != 0)
ggplot(colData(Sce) %>% as.data.frame(),
aes(x = Nb_Votings, fill = Nb_Votings > 38)) +
stat_bin(breaks = seq(0, max(Sce$Nb_Votings), 19)) +
labs(x = "Number of votes") +
guides(fill = guide_legend(title = "Actually voted\nmore than\n38 times")) +
scale_fill_manual(values = brewer.pal(3, "Set1")[c(1, 3)],
breaks = c(TRUE, FALSE)) +
theme_classic()
Among those, we of course find the president of the parliament, who does not take part in the votes and therefore never voted.
sum(counts(Sce)[, Sce$name == "Bartolone"])
## [1] 0
# Filter
Sce <- Sce[, Sce$Nb_Votings > 38]
In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes helps speed up calculations by orders of magnitude. Secondly, we are usually interested in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of “at least \(i\) counts in \(j\) cells” are usually quite efficient. The aim is too be quite broad. We want to filter as many genes as possible while retaining most of the reads. In usual settings, you would only keep ~20% of the genes at most but that would still capture more than 90% of the reads.
In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where many delegates did not vote. Those are probably non-political technical votes. We therefore use the filter rules “at least \(j\) delegates cast a vote on the bill”.
# Compute the number of votes and number of votes casted as a function of j
Nb_Voters <- rowSums(counts(Sce) != 0)
voting_behavior <- data.frame(n_votes_cast = rep(0, nrow(Sce)),
n_votes = rep(0, nrow(Sce)),
j = 1:nrow(Sce)
)
for (j in 1:nrow(Sce)) {
Nb_Voters <- Nb_Voters[Nb_Voters >= j]
voting_behavior$n_votes[j] <- length(Nb_Voters)
voting_behavior$n_votes_cast[j] <- sum(Nb_Voters)
}
voting_behavior <- voting_behavior %>%
mutate(n_votes_cast = n_votes_cast / max(n_votes_cast),
n_votes = n_votes / max(n_votes))
j_opt <- which.min(
(voting_behavior$n_votes)^2 + (1 - voting_behavior$n_votes_cast)^2
)
ggplot(voting_behavior, aes(x = n_votes, y = n_votes_cast, col = j)) +
geom_point() + theme_classic() +
labs(x = "% of bills kept",
y = "% of caste votes kept",
col = "Cutoff value") +
geom_label(label = "Final\nCutoff",
data = voting_behavior %>% filter(j == j_opt),
nudge_x = -.15, nudge_y = .1,
col = "black") +
geom_curve(data = voting_behavior %>% filter(j == j_opt), curvature = 0,
aes(x = n_votes - .1, xend = n_votes,
y = n_votes_cast + .1, yend = n_votes_cast),
col = "black", size = .9,
arrow = arrow(length = unit(0.03, "npc"))) +
geom_point(col = "red", data = voting_behavior %>% filter(j == j_opt))
We can see a clear inflexion point in the curve, that we pick as the final cutoff value. This is the closest point to the optimal place, which is the top left corner. Our final filtering rule for bills (genes) is therefore “At least 171 delegates cast at vote.”
# Filter
Sce <- Sce[rowSums(counts(Sce) != 0) >= j_opt, ]
In scRNA-seq settings, after filtering comes a normalization step. scRNA-sequencing has a lot of technical issues and this technical noise can be stronger than any biological signal. Normalization tries to adjust for this. Here, we have no technical noise so we do not need to adjust for it.
This is also the time to adjust for dropouts. Indeed, because of technical difficulties, some genes may seem non-expressed (i.e their count is zero) while they in fact are. Zero-inflation methods try to adjust for that by categorizing the zeros into real and dropouts and, in the case of dropouts, try to input the missing values.
To better understand dropouts, let us go back to our example. In our voting record matrix, zeros could be either abstention (i.e real zeros) and not being able to votes (i.e dropouts, we do not know what the delegate would have done if he could have voted). Adjusting for dropouts means imputing which zeros are true zeros (true abstentions) and then, for dropouts, estimating the real value. Since the techniques that would be used to correct for “dropouts” in our political example would be different to biostatistical techniques used in scRNA-seq settings, we will not cover it here.
In most scRNA-seq settings, we have no information about the biological type. We therefore want to cluster the cells and then use marker genes to identify known-cell types. Alternatively, we can use clustering to discover new cell types and identify marker genes for those new types.
In our example, let us imagine for a moment that we do not know the political group to which each delegate belong. Can we recover those groups? This will be done in two steps: discover stable clusters of delegates (cells) then identify bills (genes) that help distinguish those clusters.
We use a broad clustering framework based on the ClusterExperiment package, that test many clustering algorithms and use all the results to find stable clusters. As a result, some cells (delegates) are left unclustered.
# Run all clustering methods
ce <- clusterMany(x = Sce, k = seq(5, 50, 5),
clusterFunction = c("pam", "hierarchicalK"),
reduceMethod = c("PCA"),
ks = 2:15, minSizes = 5,
run = TRUE, transFun = function(x){x})
# Find stable clusters
ce <- makeConsensus(ce, minSize = 10, proportion = 0.7,
clusterLabel = "combineMany"
)
# Merge clusters
ce <- makeDendrogram(ce)
ce <- mergeClusters(ce, mergeMethod = "adjP", plotInfo = "adjP", cutoff = 0.3,
clusterLabel = "Clusters", plot = F, DEMethod = "limma")
After running the algorithm, we find 6 clusters that we can visualize on the first 2 principal components.
# Plot clusters
pca <- prcomp(t(counts(Sce)))
pc1 <- round(100 * (pca$sdev[1])^2/sum(pca$sdev^2), digits = 1)
pc2 <- round(100 * (pca$sdev[2])^2/sum(pca$sdev^2), digits = 1)
Clusters <- data.frame(clusters = primaryCluster(ce),
x = pca$x[, 1], y = pca$x[, 2],
group = Sce$group1
)
ggplot(Clusters,
aes(x = x, y = y, col = factor(clusters), alpha = clusters != -1)) +
geom_point(size = 2) +
labs(col = "Clusters", x = paste0("PC1 [", pc1, "% of the total variation]"),
y = paste0("PC2 [", pc2, "% of the total variation]")) +
guides(alpha = F) +
scale_alpha_manual(values = c(0.5, .8), breaks = c(T, F)) +
# Color from RColor Brewer, modified to map usual political colors
scale_color_manual(values = c("#66C2A5", "#8DA0CB", "#FFD92F", "#E5C494",
"#A6D854", "#E78AC3", "#FC8D62"),
breaks = c(-1, 1:6),
labels = c("Unclustered", paste("Cluster", 1:6))) +
theme_classic() +
theme(legend.title = element_text(hjust = 0.7))
We can see how those clusters compare with the actual political groups
ggplot(Clusters, aes(x = x, y = y,
col = factor(group, levels = c("Independent", "LR", "RRPD", "UDI",
"Greens", "SER","GDR")))) +
geom_point(size = 2, alpha = 0.7) +
labs(col = "Clusters", x = paste0("PC1 [", pc1, "% of variation]"),
y = paste0("PC2 [", pc2, "% of variation]")) +
scale_color_manual(values = c("#66C2A5", "#8DA0CB", "#FFD92F", "#E5C494",
"#A6D854", "#E78AC3", "#FC8D62")) +
theme_classic() +
theme(legend.title = element_text(hjust = 0.7))
We can see that we correctly identified 5 main groups: SER (left, governing party), as well as the two main opposition parties, LR (right) and UDI (center-right). The far left party GDR is also correctly identify. However, we cannot identify the RRPD (center-left), which was part of the governing coalition, nor the greens (which can be identified if we choose a smaller minimum cluster size but we then start to see to many clusters).
We can moreover see that we partitioned the governing coalition in three. This situation has an analog in biological settings. We cluster our data in an unsupervised manner even though we know some rough biological types. Then we can compare our clusters and the types. If, like it does here, we have some good matches, we can be confident in our clustering techniques. Any new partition we discover might therefore be interesting and can be investigated.
We can now look at the votes and ask: which bills set a given group apart from the others? We look at the political labels that we know and find the bills that separate them, to find those that create a lot of political differentiation.
In scRNA-seq settings, we would want to identify marker genes for known-biological types, or find DE genes between treatment and control, or many other possible scenario.
In this example, we will ask: What bills distinguish each group from the governing coalition (grouping SER, RRPD and Greens)?
groups <- Sce$group1
groups[groups %in% c("SER", "Greens", "RRPD")] <- "Gov"
groups <- as.numeric(factor(groups, levels = c("LR", "UDI", "GDR",
"Independent", "Gov")))
# We add 1 to all counts since Limma blocks negative counts
DE_Bills <- getBestFeatures(counts(Sce), groups, contrastType = "Pairs",
DEMethod = "limma") %>%
filter(str_detect(Contrast, "Cl05")) %>%
group_by(Contrast) %>%
arrange(P.Value) %>%
top_n(-5, P.Value) %>%
select(Contrast, IndexInOriginal) %>%
mutate(IndexInOriginal = as.numeric(IndexInOriginal))
contrasts <- DE_Bills %>%
mutate(Title = rowData(Sce)[IndexInOriginal, "Title"]) %>%
select(-IndexInOriginal) %>%
group_by(Contrast) %>%
mutate(groupid = row_number()) %>%
spread(Contrast, Title) %>%
select(-groupid)
We can see that most laws that distinguish between groups are related to social security, budget, education, environment and institutions.
LR (right) | UDI (center-right) | GDR (far-left) | Independent |
---|---|---|---|
le projet de loi organique relatif au procureur de la République financier (lecture définitive). | l’ensemble de la première partie du projet de loi de finances pour 2014. | l’ensemble de la première partie du projet de loi de finances pour 2014. | l’ensemble de la première partie du projet de loi de finances pour 2014. |
l’ensemble de la première partie du projet de loi de finances pour 2014. | l’ensemble du projet de loi de financement de la sécurité sociale pour 2014. | l’ensemble du projet de loi de financement de la sécurité sociale pour 2014. | l’ensemble du projet de loi d’orientation et de programmation pour la refondation de l’école de la République. |
l’ensemble du projet de loi de financement de la sécurité sociale pour 2014. | l’ensemble du projet de loi d’orientation et de programmation pour la refondation de l’école de la République. | l’ensemble du projet de loi de financement de la sécurité sociale pour 2014 (nouvelle lecture). | l’ensemble du projet de loi de financement de la sécurité sociale pour 2014 (nouvelle lecture). |
l’ensemble du projet de loi d’avenir pour l’agriculture, l’alimentation et la forêt. | la première partie du projet de loi de finances pour 2013. | le projet de loi de financement de la sécurité sociale pour 2014 (lecture définitive). | la première partie du projet de loi de finances pour 2013. |
la motion de censure déposée en application de l’article 49, alinéa 2 de la Constitution par M. Christian Jacob et 144 membres de l’Assemblée. | l’ensemble du projet de loi de financement de la sécurité sociale pour 2014 (nouvelle lecture). | le projet de loi de programmation des finances publiques pour les années 2012 à 2017. | l’ensemble du projet de loi d’avenir pour l’agriculture, l’alimentation et la forêt. |
We have seen that the governing coalition cluster in 3 distinct groups. Let us see if we can identify them and find which votes separate them. We assume that the main cluster 25 is the usual governing group and we want to see if we can characterize the smaller clusters.
In scRNA-seq settings, this would be equivalent to looking at clusters and use known-marker genes, gene ontology or pathway analysis to identify them.
We look at the 5 tops genes that differentiate clusters 4 and 5 and we look at the mean vote. For all those votes, nearly everyone in cluster 4 abstained while most of those in cluster 5 voted for. This is because members of cluster 4 either where in the government at some point, or replaced a delegate that was called into the government. Therefore, they “abstained” for many votes (because they were not members of parliament at the time).
contrasts <- getBestFeatures(ce, contrastType = "Pairs",
DEMethod = "limma") %>%
filter(Contrast == "Cl04-Cl05") %>%
arrange(P.Value) %>%
top_n(-5, P.Value) %>%
select(Contrast, IndexInOriginal) %>%
mutate(IndexInOriginal = as.numeric(IndexInOriginal)) %>%
mutate(Title = rowData(Sce)[IndexInOriginal, "Title"])
votes <- counts(Sce)[contrasts$IndexInOriginal, ] %>%
as.data.frame() %>%
mutate(Vote_Number = Sce$Number) %>%
as.matrix(.)
clusters <- primaryCluster(ce)
results <- data.frame(mean4 = rowMeans(votes[,clusters == 4]),
sd4 = rowSds(votes[,clusters == 4]),
mean5 = rowMeans(votes[,clusters == 5]),
sd5 = rowSds(votes[,clusters == 5])
) %>%
map_df(function(x) round(x, digits = 2)) %>%
mutate(cluster4 = paste(mean4, " (", sd4, ")"),
cluster5 = paste(mean5, " (", sd5, ")")) %>%
select(cluster4, cluster5)
contrasts <- data.frame(contrasts$Title, results)
colnames(contrasts) <- c("Title",
cell_spec("Cluster 4\nMean (sd)", color = "#A6D854"),
cell_spec("Cluster 5\nMean (sd)", color = "#E78AC3"))
kable(contrasts, escape = F, align = "lcc") %>%
kable_styling(full_width = T) %>%
column_spec(1, width = "70em") %>%
column_spec(2, width = "20em") %>%
column_spec(3, width = "20em")
Title |
Cluster 4 Mean (sd) |
Cluster 5 Mean (sd) |
---|---|---|
l’ensemble du projet de loi prorogeant l’application de la loi n° 55-385 du 3 avril 1955 relative à l’état d’urgence et renforçant l’efficacité de ses dispositions (première lecture). | -0.04 ( 0.21 ) | 0.96 ( 0.19 ) |
l’ensemble du projet de loi de finances rectificative pour 2015 (première lecture). | 0 ( 0 ) | 0.95 ( 0.22 ) |
l’ensemble du projet de loi pour une République numérique (première lecture). | 0.04 ( 0.21 ) | 0.95 ( 0.21 ) |
le projet de loi de finances rectificative pour 2014 (1ère lecture). | 0.09 ( 0.29 ) | 0.97 ( 0.16 ) |
l’ensemble du projet de loi de finances pour 2016 (première lecture). | -0.04 ( 0.21 ) | 0.93 ( 0.25 ) |
By now, you should therefore have the tools to do explore scRNA-seq data on your own. Many other tutorials, either general or focused on one package, exist. A good place to start and find them, or to get access to some datasets, is the website of the bioconductor project. I especially recommend the book of the Bioconductor Workshop.
This tutorial only skimmed the analysis of the dataset and there are many other options to explore: looking at voting records across times, finding which bills drew the most votes, clustering bills, … The script used to scrap the data can be found on the Github page and can be adapted for further enquiries.
Special thanks to Vincent Viers for the initial inspiration of this project. The code used for scraping the data is based on the blog post https://freakonometrics.hypotheses.org/50973.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 stringr_1.4.0
## [3] readr_1.3.1 ggplot2_3.1.1
## [5] purrr_0.3.2 tidyr_0.8.3
## [7] dplyr_0.8.1 kableExtra_1.1.0
## [9] knitr_1.23 clusterExperiment_2.4.4
## [11] bigmemory_4.5.33 SingleCellExperiment_1.6.0
## [13] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
## [15] BiocParallel_1.18.0 matrixStats_0.54.0
## [17] Biobase_2.44.0 GenomicRanges_1.36.0
## [19] GenomeInfoDb_1.20.0 IRanges_2.18.1
## [21] S4Vectors_0.22.0 BiocGenerics_0.30.0
## [23] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] bigmemory.sri_0.1.3 colorspace_1.4-1 XVector_0.24.0
## [4] base64enc_0.1-3 rstudioapi_0.10 gsl_1.9-10.3
## [7] bit64_0.9-7 AnnotationDbi_1.46.0 RSpectra_0.14-0
## [10] mvtnorm_1.0-10 xml2_1.2.0 codetools_0.2-16
## [13] splines_3.6.0 doParallel_1.0.14 ade4_1.7-13
## [16] locfdr_1.1-8 phylobase_0.8.6 gridBase_0.4-7
## [19] annotate_1.62.0 cluster_2.0.9 kernlab_0.9-27
## [22] stabledist_0.7-1 HDF5Array_1.12.1 copula_0.999-19.1
## [25] BiocManager_1.30.4 compiler_3.6.0 httr_1.4.0
## [28] assertthat_0.2.1 Matrix_1.2-17 lazyeval_0.2.2
## [31] limma_3.40.2 htmltools_0.3.6 prettyunits_1.0.2
## [34] tools_3.6.0 gtable_0.3.0 glue_1.3.1
## [37] GenomeInfoDbData_1.2.1 reshape2_1.4.3 Rcpp_1.0.1
## [40] softImpute_1.4 zinbwave_1.6.0 NMF_0.21.0
## [43] ape_5.3 nlme_3.1-140 iterators_1.0.10
## [46] xfun_0.7 rvest_0.3.4 rngtools_1.3.1.1
## [49] XML_3.98-1.20 edgeR_3.26.4 zlibbioc_1.30.0
## [52] MASS_7.3-51.4 scales_1.0.0 hms_0.4.2
## [55] rhdf5_2.28.0 yaml_2.2.0 memoise_1.1.0
## [58] pkgmaker_0.27 stringi_1.4.3 RSQLite_2.1.1
## [61] genefilter_1.66.0 pcaPP_1.9-73 foreach_1.4.4
## [64] bibtex_0.4.2 rlang_0.3.4 pkgconfig_2.0.2
## [67] bitops_1.0-6 rncl_0.8.3 evaluate_0.14
## [70] lattice_0.20-38 Rhdf5lib_1.6.0 labeling_0.3
## [73] bit_1.1-14 tidyselect_0.2.5 plyr_1.8.4
## [76] magrittr_1.5 bookdown_0.11 R6_2.4.0
## [79] ADGofTest_0.3 DBI_1.0.0 pillar_1.4.1
## [82] withr_2.1.2 survival_2.44-1.1 RCurl_1.95-4.12
## [85] tibble_2.1.3 pspline_1.0-18 crayon_1.3.4
## [88] uuid_0.1-2 howmany_0.3-1 rmarkdown_1.13
## [91] progress_1.2.2 RNeXML_2.3.0 locfit_1.5-9.1
## [94] grid_3.6.0 blob_1.1.1 webshot_0.5.1
## [97] digest_0.6.19 xtable_1.8-4 numDeriv_2016.8-1.1
## [100] munsell_0.5.0 glmnet_2.0-18 viridisLite_0.3.0
## [103] registry_0.5-1