Finding All Significant Closed Connected Subgraphs

Abstract

In many applications, covariates describing the data are structured according to a known graph $\mathcal{G}$. Subgraphs of $\mathcal{G}$ can then serve to design new covariates, yielding an enriched representation of the data. Testing the association of these new covariates to an outcome of interest can provide more insight on critical biological processes. However, the number of subgraphs is often exponential in the number of original covariates. Therefore, a method testing all possible subgraphs would have very low power, due to multiple testing corrections and could quickly become computationally intractable. The concept of testable hypothesis has been used to simultaneously address both issues in similar contexts. Here, we introduce a method leveraging this concept to test all closed connected subgraphs, i.e., those which are not included in a larger one leading to the exact same covariate. We propose a novel enumeration scheme for these objects which fully exploits the pruning opportunity offered by testability, leading to drastic improvements in speed. We illustrate this improvement on both real and simulated datasets. This paves the way for numerous applications in biomedicine, especially for genome-wide association studies in bacterial genomes.

Publication
Machine Learning in Computational Biology (MLCB)
Hector Roux de Bézieux
Hector Roux de Bézieux
Ph.D Student in Biostatistics

Biostatistics Ph.D Student with strong interest in anything ‘omics related.

Related