AI2Health: AI and Computational Biology for Health
The AI2Health research cluster is geared towards fundamental and interdisciplinary research in AI and computational biology for human health. Numerous scientific advances have emerged in recent years that are specific to the application of AI to human health. The goal of AI2Health is to leverage this momentum and develop AI methods and tools to make advances in essential problems in human health through three research areas in Computational Biology: (i) Systems and Integrative Biology, (ii) Structural and Functional Biology, and (iii) Metagenomics and Microbiome Biology.
Cluster Members
- Lead PI: Todd Treangen (Computer Science, Bioengineering, Rice University)
- Eric Chi (Statistics, Rice University)
- Santiago Segarra (Electrical & Computer Engineering, Rice University)
- Vicky Yao (Computer Science, Rice University)
Collaborators
- Lydia Kavraki (Computer Science, Mechanical Engineering, Bioengineering, Electrical & Computer Engineering, Rice University)
- Luay Nakhleh (William and Stephanie Sick Dean, George R. Brown School of Engineering; Computer Science, BioSciences, Rice University)
- Fritz Sedlazeck (Computer Science, Rice University; Human Genome Sequencing Center, Baylor College of Medicine)
Selected Publications
AI2Health faculty highlighted in bold.
- Ali Azizpour, Advait Balaji, Todd J. Treangen, and Santiago Segarra. “Graph-based self-supervised learning for repeat detection in metagenomic assembly.” Genome Research (2024): gr-279136.
- Advait Balaji, Bryce Kille, Anthony D. Kappell, Gene D. Godbold, Madeline Diep, RA Leo Elworth, Zhiqin Qian, Dreycey Albin, Dan Nasko, Nidhi Shah, Mihai Pop, Santiago Segarra, Krista Ternus, and Todd J. Treangen. “SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning.” Genome Biology 23, no. 1 (2022): 133.
- Gal Mishne, Eric Chi, and Ronald Coifman. “Co-manifold learning with missing data.” In International Conference on Machine Learning, pp. 4605-4614. PMLR, 2019.
- Felix Quintana, Todd J. Treangen, and Lydia Kavraki. “Leveraging Large Language Models for Predicting Microbial Virulence from Protein Structure and Sequence.” In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 1-6. 2023.
- Nick Sapoval, Amirali Aghazadeh, Michael Nute, Dinler Antunes, Advait Balaji, Rich Baraniuk, CJ Barberan, Ruth Dannenfelser, Chen Dun, Mohammadamin Edrisi, RA Leo Elworth, Bryce Kille, Anastasios Kyrillidis, Luay Nakhleh, Cameron Wolfe, Zhi Yan, Vicky Yao, and Todd J. Treangen. “Current progress and open challenges for applying deep learning across the biosciences.” Nature Communications 13, no. 1 (2022): 1728. doi: 10.1038/s41467-022-29268-7. PMID: 35365602; PMCID: PMC8976012.
- Mohammadamin Edrisi, Huw A. Ogilvie, Meng Li, and Luay Nakhleh. “MoTERNN: Classifying the Mode of Cancer Evolution Using Recursive Neural Networks.” In RECOMB International Workshop on Comparative Genomics, pp. 232-247. Cham: Springer Nature Switzerland, 2023.
- Romanos Fasoulis, Georgios Paliouras, and Lydia E. Kavraki. “Graph representation learning for structural proteomics.” Emerging Topics in Life Sciences 5, no. 6 (2021): 789-802.
- Romanos Fasoulis, Mauricio M. Rigo, Gregory Lizée, Dinler A. Antunes, and Lydia E. Kavraki. “APE-Gen2.0: Expanding Rapid Class I Peptide–Major Histocompatibility Complex Modeling to Post-Translational Modifications and Noncanonical Peptide Geometries.” Journal of Chemical Information and Modeling 64, no. 5 (2024): 1730-1750.
- Lechuan Li, Ruth Dannenfelser, Yu Zhu, Nathaniel Hejduk, Santiago Segarra, and Vicky Yao. “Joint embedding of biological networks for cross-species functional alignment.” Bioinformatics 39, no. 9 (2023): btad529.
- Lechuan Li, Ruth Dannenfelser, Charlie Cruz, and Vicky Yao. “ANDES: a novel best-match approach for enhancing gene set analysis in embedding spaces.” bioRxiv (2023): 2023-11.
- Ruibang Luo, Fritz J. Sedlazeck, Tak-Wah Lam, and Michael C. Schatz. “A multi-task convolutional deep neural network for variant calling in single molecule sequencing.” Nature Communications 10, no. 1 (2019): 998.
Research Areas
1) AI for Systems and Integrative Biology
Systems biology was among the earliest fields to adopt computational advances and machine learning, with a heavy focus on Bayesian methods. However, as high-throughput biological assays become more specific and continue to grow in both number and measurement type, new methodological challenges arise.
Incorporating biologically-inspired intuition into AI model formulation is key to building generalizable methods. One example of such approaches is building biologically meaningful embedding spaces, an AI/ML technique representing high-dimensional data in lower-dimensional spaces while capturing complex nonlinear relationships and intrinsic structures in the original data. Instead of using problem-agnostic embeddings, AI2Health core members Yao and Segarra have developed biologically motivated embedding methods to enable joint modeling of protein interaction networks from different organisms.
Moreover, we observed that gene set comparisons are routine in biology (e.g., gene set enrichment analysis that compares annotated genes with a collection of new genes), yet even in research areas where embeddings are used routinely, such as natural language processing, most efforts to compare sets rely on simple averages. Noticing this gap led to the development of a new, effective general-purpose set comparison method for embeddings that shows promise for broader non-biological applications. These examples highlight our vision for leveraging biological insights to innovate AI methods, which can in turn enhance fundamental research in AI.
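To make the contrast between averaging and set-aware comparison concrete, the following is a minimal sketch rather than the published method. It compares a centroid-based score with a simple best-match score on toy 2-D vectors. The function names, the toy "gene" vectors, and the symmetric averaging of best matches are illustrative assumptions.

```python
import numpy as np


def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a (m x d) and rows of b (n x d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def mean_embedding_similarity(a, b):
    """Baseline: cosine similarity between the centroids of the two sets."""
    centroid_a = a.mean(axis=0, keepdims=True)
    centroid_b = b.mean(axis=0, keepdims=True)
    return cosine_matrix(centroid_a, centroid_b)[0, 0]


def best_match_similarity(a, b):
    """Illustrative best-match score (an assumption, not the published method).

    Each element is paired with its most similar element in the other set,
    and the best-match similarities are averaged in both directions.
    """
    sim = cosine_matrix(a, b)
    a_to_b = sim.max(axis=1).mean()  # best match in b for every element of a
    b_to_a = sim.max(axis=0).mean()  # best match in a for every element of b
    return 0.5 * (a_to_b + b_to_a)


# Hypothetical toy embeddings: set_a holds two orthogonal vectors,
# set_c holds one vector pointing between them.
set_a = np.array([[1.0, 0.0], [0.0, 1.0]])
set_c = np.array([[1.0, 1.0]])

centroid_score = mean_embedding_similarity(set_a, set_c)
best_match_score = best_match_similarity(set_a, set_c)
print(centroid_score, best_match_score)
```

In this toy case the centroid of `set_a` points in the same direction as the single vector in `set_c`, so the averaging baseline reports the sets as identical, while the best-match score stays lower because no individual element of `set_a` matches it exactly.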
2) AI for Structural and Functional Biology
AlphaFold has unveiled the enormous potential of AI applied to the problem of protein folding and structure prediction, making paradigm-shifting progress on a 50-year-old grand challenge in computational biology. Despite this major success, there are two main limitations.
Building on the success of AlphaFold in protein structure prediction, we are expanding our focus to develop novel machine learning techniques in Functional Biology, particularly for assigning functions to protein-coding genes. Highlighting this area, Segarra and Treangen have developed an ensemble machine learning method to predict microbial pathogenic functions, which we plan to enhance by incorporating protein structure data and improving the handling of poorly annotated genes.
Computational pathogen screening must contend with complex host interactions, virulence factor dynamics, and community-level dynamics. Cluster members Kavraki and Treangen are now exploring an LLM-inspired model that leverages protein sequences and structures to predict functions, specifically targeting virulence factors. This model integrates evolutionary features derived from the DistilProtBert language model with protein structures in a graph convolutional network, with the aim of improving how protein functions that contribute to disease are understood and predicted.
3) AI for Metagenomics and Microbiome Biology
The microbiome refers to the collection of microbes (bacteria, viruses, fungi) that occupy a specific ecological niche (human gut, skin, air filters, etc.). Given the established relevance of the human microbiome to human health, there is a recent push towards applying ML to the human host microbiome (in particular, the gut microbiome) to learn signatures of microbiome health and disease states.
Our motivating example on repeat detection highlights the untapped potential of learning discriminative graph features through graph neural networks (GNNs). Unlike predefined features, GNNs generate these characteristics through trainable iterative computations, making them adaptive to specific data samples. This novel approach has shown success in fields such as wireless networks, material discovery, and molecular design, yet its application in metagenomics is still emerging. A primary challenge in genomic data analysis is that most of the data is unlabeled, particularly in distinguishing between repeat and non-repeat sequences.
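As a rough illustration of how a GNN derives node features through iterative neighborhood aggregation, the sketch below applies one graph convolution step with the widely used symmetric normalization to a toy graph. It is not the cluster's published model: the four-node path graph, the two node features (standing in for quantities such as coverage and length), and the random weights are all hypothetical, and in practice the weights would be learned during training.

```python
import numpy as np


def normalized_adjacency(adj):
    """Add self-loops, then apply symmetric degree normalization D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj_norm, features, weights):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    return np.maximum(adj_norm @ features @ weights, 0.0)


# Hypothetical toy graph: four nodes connected in a path 0-1-2-3.
adj = np.array(
    [
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
)

# Hypothetical per-node input features (two values per node).
features = np.array(
    [
        [0.9, 0.2],
        [0.8, 0.1],
        [0.1, 0.7],
        [0.2, 0.9],
    ]
)

# Random weights stand in for parameters that training would learn.
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 3))

adj_norm = normalized_adjacency(adj)
embeddings = gcn_layer(adj_norm, features, weights)
print(embeddings.shape)
```

Stacking several such layers lets each node's representation depend on progressively larger neighborhoods, which is what makes the resulting features adaptive to the graph at hand rather than predefined.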
Modern machine learning techniques seek to exploit this unlabeled data through self-supervised learning, which starts from initially noisy labels that are refined through subsequent training iterations and fine-tuning. Recently, we presented the first use of graph-based self-supervised learning for repeat detection in metagenomics, which illustrates the potential benefits of exploring this avenue further.
Long-Term Goal
The AI2Health research cluster aims to tackle pressing health issues of our time. Toward this goal, AI2Health will leverage the research expertise of its core and affiliate members and collaborations with clinicians and scientists at the Texas Medical Center. Our research cluster will focus on transformative, health-outcome-inspired AI research in predicting, diagnosing, and treating health issues. Examples of specific goals include (i) improved cancer screening for early cancer detection and treatment, (ii) early warning systems for pathogen outbreak tracking and mitigation, and (iii) improved vaccine and drug design.
Updates will be posted to the following web page, managed by the cluster: https://treangenlab.github.io/ai2health/