Title: CERENKOV3: Clustering and molecular network-derived features improve computational prediction of functional noncoding SNPs.
Publication Type: Journal Article
Year of Publication: 2020
Authors: Yao, Y, Ramsey, SA
Journal: Pac Symp Biocomput
Date Published: 2020
Keywords: Algorithms; Cluster Analysis; Computational Biology; Genome-Wide Association Study; Humans; Machine Learning; Models, Genetic; Polymorphism, Single Nucleotide; RNA, Untranslated

Identification of causal noncoding single nucleotide polymorphisms (SNPs) is important for maximizing the knowledge dividend from human genome-wide association studies (GWAS). Recently, diverse machine learning-based methods have been used for functional SNP identification; however, this task remains a fundamental challenge in computational biology. We report CERENKOV3, a machine learning pipeline that leverages clustering-derived and molecular network-derived features to improve prediction accuracy of regulatory SNPs (rSNPs) in the context of post-GWAS analysis. The clustering-derived feature, locus size (number of SNPs in the locus), derives from our locus partitioning procedure and represents the sizes of clusters based on SNP locations. We generated two molecular network-derived features from representation learning on a network representing SNP-gene and gene-gene relations. Based on empirical studies using a ground-truth SNP dataset, CERENKOV3 significantly improves rSNP recognition performance in AUPRC, AUROC, and AVGRANK (a locus-wise rank-based measure of classification accuracy we previously proposed).
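
To make the locus size feature concrete, here is a minimal Python sketch. The abstract does not specify the locus partitioning procedure, so the rule used here is only an illustrative assumption: SNPs on the same chromosome are grouped into one locus while consecutive positions lie within a fixed gap. The function name `locus_sizes`, the `max_gap` parameter, and its default value are hypothetical and are not taken from CERENKOV3.

```python
from collections import defaultdict


def locus_sizes(snps, max_gap=10_000):
    """Assign each SNP the size of the locus (cluster) it falls in.

    snps    : iterable of (snp_id, chromosome, position) tuples
    max_gap : hypothetical distance threshold in base pairs; a gap larger
              than this between neighbouring SNPs starts a new locus.
              This gap rule is an assumption, not the published procedure.
    Returns a dict mapping snp_id -> number of SNPs in its locus.
    """
    # Group SNPs by chromosome so that loci never span chromosomes.
    by_chrom = defaultdict(list)
    for snp_id, chrom, pos in snps:
        by_chrom[chrom].append((pos, snp_id))

    sizes = {}
    for items in by_chrom.values():
        items.sort()  # order SNPs by genomic position
        locus = []
        prev_pos = None
        for pos, snp_id in items:
            # Close the current locus when the gap to the previous SNP is too large.
            if prev_pos is not None and pos - prev_pos > max_gap:
                for member in locus:
                    sizes[member] = len(locus)
                locus = []
            locus.append(snp_id)
            prev_pos = pos
        # Flush the last locus on this chromosome.
        for member in locus:
            sizes[member] = len(locus)
    return sizes


example_snps = [
    ("rs1", "chr1", 100),
    ("rs2", "chr1", 5_000),
    ("rs3", "chr1", 50_000),
    ("rs4", "chr2", 100),
]
example_sizes = locus_sizes(example_snps)
```

In this toy input, `rs1` and `rs2` fall within the assumed gap and share a locus of size 2, while `rs3` and `rs4` each form a locus of size 1. The resulting value would then be attached to each SNP as one additional feature column alongside the other predictors.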

Alternate Journal: Pac Symp Biocomput
PubMed ID: 31797625
PubMed Central ID: PMC6897322
Grant List: OT2 TR002520 / TR / NCATS NIH HHS / United States