Research Publications

Publication statistics

28 Total Publications

68 Unique Authors

All Peer-Reviewed Publications (newest first)

Clinical Knowledge Representation in Data Science

Nguyen TA, Su W, Rajagopalan A, Abdurezak N, Hewryk OSI, & Romano JD LA CA

Annual Review of Biomedical Data Science (accepted; in-press)(2026)

Abstract

The vast potential of observational healthcare data in biomedical discovery remains largely unrealized because clinical records are fragmented, unstructured, and generated for patient care rather than research. Clinical knowledge representation (KR) helps to bridge this gap by encoding information in standardized, computable formats that preserve meaning and context. This review examines KR across the clinical data lifecycle, from its generation in healthcare settings to its transformation for secondary use and its eventual application in data science. We highlight foundational components such as standardized terminologies, ontologies, and common data models that enable data harmonization and interoperability. We further discuss how these structured representations support multimodal data integration and the development of more accurate, interpretable AI models. Adopting a semantic-first approach to KR is essential for transforming fragmented clinical data into reusable, trustworthy knowledge that advances data-driven discovery and improves patient care.

Towards symbolic regression for interpretable clinical decision scores

Aldeia GSI, Romano JD, de Franca FO, Herman DS, & La Cava WG

Philosophical Transactions A (accepted; in-press)(2026)

Abstract

Medical decision-making makes frequent use of algorithms that combine risk equations with rules, providing clear and standardized treatment pathways. Symbolic regression (SR) traditionally limits its search space to continuous function forms and their parameters, making it difficult to model this decision-making. However, due to its ability to derive data-driven, interpretable models, SR holds promise for developing data-driven clinical risk scores. To that end we introduce Brush, an SR algorithm that combines decision-tree-like splitting algorithms with non-linear constant optimization, allowing for seamless integration of rule-based logic into symbolic regression and classification models. Brush achieves Pareto-optimal performance on SRBench, and was applied to recapitulate two widely used clinical scoring systems, achieving high accuracy and interpretable models. Compared to decision trees, random forests, and other SR methods, Brush achieves comparable or superior predictive performance while producing simpler models.

Learnable Protein Representations in Computational Biology for Predicting Drug-Target Affinity

Kumar R, Romano JD, & Ritchie MD

Journal of Cheminformatics(2026)

DOI | PubMed

Abstract

In this review, we discuss the various different types of learnable protein representations that have been used in computational biology, with a particular focus on representations that have been used in the paradigm of predicting drug-target affinity. We explore this from multiple perspectives: the source of protein information used, the training paradigms used in generating and applying such representations, and the types of (deep-learning-based) encoding or embedding methods that have been used to generate and operate on such representations. We focus on drug-target affinity due to its particular relevance and utility in the field of drug development and assessment, and we make suggestions for how drug-target affinity prediction methods development can be further improved by examining the current literature from the aforementioned perspectives. This survey thus serves as a valuable resource for researchers seeking to develop methods for predicting drug-target affinity by exploring how protein information has been used and could be used in effective ways to improve such predictions.

DRIVE-KG: Enhancing variant-phenotype association discovery in understudied complex diseases using heterogeneous knowledge graphs

Rajagopalan A, Nguyen TA, Guare LA, Garao Rico AL, Venkatesh R, Caruth L, Regeneron Genetics Center, Penn Medicine BioBank, Verma A, Ritchie MD, Hall MA, Setia-Verma S, & Romano JD LA CA

2026 Pacific Symposium on Biocomputing, 2026, 830-848(2026)

Abstract

Multi-omics data are instrumental in obtaining a comprehensive picture of complex biological systems. This is particularly useful for women's health conditions, such as endometriosis which has been historically understudied despite having a high prevalence (around 10% of women of reproductive age). Subsequently, endometriosis has limited genetic characterization: current genome-wide association studies explain only 11% of its 47% total estimated heritability. Graph representations provide an intuitive and meaningful way to relate concepts across diverse data sources and address fundamental sparsity and dimensionality challenges with multi-omics data analysis. Here we present DRIVE-KG (Disease Risk Inference and Variant Exploration-Knowledge Graph), which uses a heterogeneous graph representation to integrate biological data from multi-omics datasets: dbSNP, NCBI Human Gene, Omics Pred, GTEx, and Open Targets. We drew directly from the knowledge captured in these data, using nodes to represent genes, single nucleotide polymorphisms, proteins, and phenotypes, and edges to represent relationships between these concepts. We trained two models using DRIVE-KG: a link prediction model to suggest associations between SNPs and two pilot phenotypes (endometriosis and obesity), and a graph convolutional network (GCN) to classify patient-level endometriosis status. We conducted the patient-level classification using data from 1,441 Penn Medicine BioBank participants with gold standard chart-reviewed endometriosis status. The link prediction model uncovered 66 high-confidence (score ≥ 0.95) previously unreported SNP-endometriosis associations. Many of these variants were linked to obesity/body mass index traits (24.2%), lipid metabolism (6%), and depressive disorders (4.5%), showing agreement with emerging hypotheses about endometriosis etiology. 
In contrast, 11% of the 149 high confidence, candidate SNP-obesity associations (score ≥ 0.9888) were in LD with known obesity associations. The GCN to classify patient endometriosis status had an AUPRC of 0.738 compared to 0.679 for a genetic risk score. Despite this moderate improvement, we found that the GCN learned meaningful stratification of underlying adenomyosis signal and severe grades of endometriosis. We have demonstrated that heterogeneous integration of multi-omics data is valuable for diverse downstream tasks, including discovery and clinical prediction, particularly for understudied diseases where traditional genomic approaches are insufficient.

CASTER-DTA: Equivariant Graph Neural Networks for Predicting Drug-Target Affinity

Kumar R, Romano JD, & Ritchie MD

Briefings in Bioinformatics, 26, bbaf554(2025)

DOI | PubMed | PMC

Abstract

Accurately determining the binding affinity of a ligand with a protein is important for drug design, development, and screening. With the advent of accessible protein structure prediction methods such as AlphaFold, predicted protein 3D structures are readily available; however, scalable methods for predicting binding affinity currently do not take full advantage of 3D protein information. Here, we present CASTER-DTA (Cross-Attention with Structural Target Equivariant Representations for Drug–Target Affinity), which uses an equivariant graph neural network (GNN) to learn more robust protein representations alongside a standard GNN to learn molecular representations to predict DTA. We augment these representations by incorporating an attention-based mechanism between protein residues and drug atoms to improve interpretability. We show that CASTER-DTA represents a state-of-the-art improvement on multiple benchmarks for predicting DTA, and that it generates novel insights for several related tasks. We then apply CASTER-DTA to create a large resource of the binding affinities of every drug approved by the U.S. Food and Drug Administration (FDA) against every protein in the human proteome and make these predictions freely available for download. We also make available a web server for researchers to apply a pretrained CASTER-DTA model for predicting binding affinities between arbitrary proteins and drugs.

Enhancing Molecular Representation Learning through the Combination of 3D and 2D Graph Machine Learning

Pan IT & Romano JD LA CA

The 39th Annual AAAI Conference on Artificial Intelligence, 39, 29464-29465(2025)

DOI

Abstract

Molecular machine learning has broad applications across multiple domains such as drug development, environmental toxicology, and materials science. Various pre-trained frameworks using self-supervised representation learning have emerged to tackle the difficulty of obtaining large molecular datasets useful for training high-performing molecular machine learning models. In this study, we explore a novel representation learning framework trained using both 2D and 3D molecular data. Specifically, we use a 3D invariant graph neural network to learn how to capture 3D atomic information and then pass these atomic representations into a regular 2D graph neural network, which can leverage molecular topology. Results from experiments demonstrate that the representations produced by our method using both 3D and 2D molecular information lead to strong performance in downstream tasks.

Network-based analyses of multiomics data in biomedicine

Kumar R, Romano JD, & Ritchie MD

BioData Mining, 18(2025)

DOI | PubMed | PMC

Abstract

Network representations of data are designed to encode relationships between concepts as sets of edges between nodes. Human biology is inherently complex and is represented by data that often exists in a hierarchical nature. One canonical example is the relationship that exists within and between various -omics datasets, including genomics, transcriptomics, and proteomics, among others. Encoding such data in a network-based or graph-based representation allows the explicit incorporation of such relationships into various biomedical big data tasks, including (but not limited to) disease subtyping, interaction prediction, biomarker identification, and patient classification. This review will present various existing approaches in using network representations and analysis of data in multiomics in the framework of deep learning and machine learning approaches, subdivided into supervised and unsupervised approaches, to identify benefits and drawbacks of various approaches as well as the possible next steps for the field.

Session Introduction: AI and Machine Learning in Clinical Medicine: Generative and Interactive Systems at the Human-Machine Interface

Hardasht FN, Kim D, Romano JD, Tison G, Daneshjou R, & Chen JH

Pacific Symposium on Biocomputing, 30, 33-39(2025)

DOI | PubMed

Abstract

Artificial Intelligence (AI) technologies are increasingly capable of processing complex and multilayered datasets. Innovations in generative AI and deep learning have notably enhanced the extraction of insights from unstructured texts, images, and structured data alike. These breakthroughs in AI technology have spurred a wave of research in the medical field, leading to the creation of a variety of tools aimed at improving clinical decision-making, patient monitoring, image analysis, and emergency response systems. However, thorough research is essential to fully understand the broader impact and potential consequences of deploying AI within the healthcare sector.

The Alzheimer's Knowledge Base: A Knowledge Graph for Alzheimer's Disease Research

Romano JD, Truong V, Kumar R, Venkatesan M, Graham BE, Hao Y, Matsumoto N, Li X, Wang Z, Ritchie MD, Shen L, & Moore JH FA CA

Journal of Medical Internet Research, 26, e46777(2024)

DOI | PubMed | PDF

Abstract

Background: As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture relationships central to the disease’s etiology and response to drugs. Objective: We designed the Alzheimer’s Knowledge Base (AlzKB) to alleviate this need by providing a comprehensive knowledge representation of AD etiology and candidate therapeutics. Methods: We designed the AlzKB as a large, heterogeneous graph knowledge base assembled using 22 diverse external data sources describing biological and pharmaceutical entities at different levels of organization (eg, chemicals, genes, anatomy, and diseases). AlzKB uses a Web Ontology Language 2 ontology to enforce semantic consistency and allow for ontological inference. We provide a public version of AlzKB and allow users to run and modify local versions of the knowledge base. Results: AlzKB is freely available on the web and currently contains 118,902 entities with 1,309,527 relationships between those entities. To demonstrate its value, we used graph data science and machine learning to (1) propose new therapeutic targets based on similarities of AD to Parkinson disease and (2) repurpose existing drugs that may treat AD. For each use case, AlzKB recovers known therapeutic associations while proposing biologically plausible new ones. Conclusions: AlzKB is a new, publicly available knowledge resource that enables researchers to discover complex translational associations for AD drug discovery. Through 2 use cases, we show that it is a valuable tool for proposing novel therapeutic hypotheses based on public biomedical knowledge.

Centralized and Federated Models for the Analysis of Clinical Data

Li R, Romano JD, Chen Y, & Moore JH

Annual Review of Biomedical Data Science, 7, 179-199(2024)

DOI | PubMed | PMC

Abstract

The progress of precision medicine research hinges on the gathering and analysis of extensive and diverse clinical datasets. With the continued expansion of modalities, scales, and sources of clinical datasets, it becomes imperative to devise methods for aggregating information from these varied sources to achieve a comprehensive understanding of diseases. In this review, we describe two important approaches for the analysis of diverse clinical datasets, namely the centralized model and federated model. We compare and contrast the strengths and weaknesses inherent in each model and present recent progress in methodologies and their associated challenges. Finally, we present an outlook on the opportunities that both models hold for the future analysis of clinical data.

Knowledge Graph Aids Comprehensive Explanation of Drug and Chemical Toxicity

Hao Y, Romano JD, & Moore JH

CPT: Pharmacometrics & Systems Pharmacology, 12, 1072-1079(2023)

DOI | PubMed | PMC

Abstract

In computational toxicology, prediction of complex endpoints has always been challenging, as they often involve multiple distinct mechanisms. State-of-the-art models are either limited by low accuracy, or lack of interpretability due to their black-box nature. Here, we introduce AIDTox, an interpretable deep learning model which incorporates curated knowledge of chemical-gene connections, gene-pathway annotations, and pathway hierarchy. AIDTox accurately predicts cytotoxicity outcomes in HepG2 and HEK293 cells. It also provides comprehensive explanations of cytotoxicity covering multiple aspects of drug activity, including target interaction, metabolism, and elimination. In summary, AIDTox provides a computational framework for unveiling cellular mechanisms for complex toxicity endpoints.

Exploring genetic influences on adverse outcome pathways using heuristic simulation and graph data science

Romano JD, Mei L, Senn J, Moore JH, & Mortensen HM FA

Computational Toxicology, 25, 100261(2023)

DOI | PubMed | PMC | PDF

Abstract

Adverse outcome pathways provide a powerful tool for understanding the biological signaling cascades that lead to disease outcomes following toxicity. The framework outlines downstream responses known as key events, culminating in a clinically significant adverse outcome as a final result of the toxic exposure. Here we use the AOP framework combined with artificial intelligence methods to gain novel insights into genetic mechanisms that underlie toxicity-mediated adverse health outcomes. Specifically, we focus on liver cancer as a case study with diverse underlying mechanisms that are clinically significant. Our approach uses two complementary AI techniques: Generative modeling via automated machine learning and genetic algorithms, and graph machine learning. We used data from the US Environmental Protection Agency's Adverse Outcome Pathway Database (AOP-DB; aopdb.epa.gov) and the UK Biobank's genetic data repository. We use the AOP-DB to extract disease-specific AOPs and build graph neural networks used in our final analyses. We use the UK Biobank to retrieve real-world genotype and phenotype data, where genotypes are based on single nucleotide polymorphism data extracted from the AOP-DB, and phenotypes are case/control cohorts for the disease of interest (liver cancer) corresponding to those adverse outcome pathways. We also use propensity score matching to appropriately sample based on important covariates (demographics, comorbidities, and social deprivation indices) and to balance the case and control populations in our machine learning training/testing datasets. Finally, we describe a novel putative risk factor for liver cancer that depends on genetic variation in both the aryl-hydrocarbon receptor (AHR) and ATP binding cassette subfamily B member 11 (ABCB11) genes.

Discovering venom-derived drug candidates using differential gene expression

Romano JD, Li H, Napolitano T, Realubit R, Karan C, Holford M, & Tatonetti NP FA

Toxins, 15, 451(2023)

DOI | PubMed | PMC | PDF

Abstract

Venoms are a diverse and complex group of natural toxins that have been adapted to treat many types of human disease, but rigorous computational approaches for discovering new therapeutic activities are scarce. We have designed and validated a new platform, named VenomSeq, to systematically identify putative associations between venoms and drugs/diseases via high-throughput transcriptomics and perturbational differential gene expression analysis. In this study, we describe the architecture of VenomSeq and its evaluation using the crude venoms from 25 diverse animal species and 9 purified teretoxin peptides. By integrating comparisons to public repositories of differential expression, associations between regulatory networks and disease, and existing knowledge of venom activity, we provide a number of new therapeutic hypotheses linking venoms to human diseases supported by multiple layers of preliminary evidence.

Improving QSAR Modeling for Predictive Toxicology using Publicly Aggregated Semantic Graph Data and Graph Neural Networks

Romano JD, Hao Y, & Moore JH FA

Pacific Symposium on Biocomputing, 27, 187-198(2022)

DOI | PubMed | PMC

Abstract

Quantitative Structure-Activity Relationship (QSAR) modeling is a common computational technique for predicting chemical toxicity, but a lack of new methodological innovations has impeded QSAR performance on many tasks. We show that contemporary QSAR modeling for predictive toxicology can be substantially improved by incorporating semantic graph data aggregated from open-access public databases, and analyzing those data in the context of graph neural networks (GNNs). Furthermore, we introspect the GNNs to demonstrate how they can lead to more interpretable applications of QSAR, and use ablation analysis to explore the contribution of different data elements to the final models' performance.

PMLB v1.0: An open-source dataset collection for benchmarking machine learning methods

Romano JD, Le TT, La Cava W, Gregg JT, Goldberg DJ, Chakraborty P, Ray NL, Himmelstein D, Fu W, & Moore JH FA

Bioinformatics, 38, 878-880(2022)

PMC

Abstract

Motivation: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. Results: This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community. Availability and implementation: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.

Automating predictive toxicology using ComptoxAI

Romano JD, Hao Y, Moore JH, & Penning T FA

Chemical Research in Toxicology, 35, 1370-1382(2022)

DOI | PubMed | PMC

Abstract

ComptoxAI is a new data infrastructure for computational and artificial intelligence research in predictive toxicology. Here, we describe and showcase ComptoxAI’s graph-structured knowledge base in the context of three real-world use-cases, demonstrating that it can rapidly answer complex questions about toxicology that are infeasible using previous technologies and data resources. These use-cases each demonstrate a tool for information retrieval from the knowledge base being used to solve a specific task: The “shortest path” module is used to identify mechanistic links between perfluorooctanoic acid (PFOA) exposure and nonalcoholic fatty liver disease; the “expand network” module identifies communities that are linked to dioxin toxicity; and the quantitative structure–activity relationship (QSAR) dataset generator predicts pregnane X receptor agonism in a set of 4,021 pesticide ingredients. The contents of ComptoxAI’s source data are rigorously aggregated from a diverse array of public third-party databases, and ComptoxAI is designed as a free, public, and open-source toolkit to enable diverse classes of users including biomedical researchers, public health and regulatory officials, and the general public to predict toxicology of unknowns and modes of action.

Knowledge-guided deep learning models of drug toxicity improve interpretation

Hao Y, Romano JD, & Moore JH

Patterns, 3(2022)

DOI | PubMed | PMC

Abstract

In drug development, a major reason for attrition is the lack of understanding of cellular mechanisms governing drug toxicity. The black-box nature of conventional classification models has limited their utility in identifying toxicity pathways. Here we developed DTox (deep learning for toxicology), an interpretation framework for knowledge-guided neural networks, which can predict compound response to toxicity assays and infer toxicity pathways of individual compounds. We demonstrate that DTox can achieve the same level of predictive performance as conventional models with a significant improvement in interpretability. Using DTox, we were able to rediscover mechanisms of transcription activation by three nuclear receptors, recapitulate cellular activities induced by aromatase inhibitors and pregnane X receptor (PXR) agonists, and differentiate distinctive mechanisms leading to HepG2 cytotoxicity. Virtual screening by DTox revealed that compounds with predicted cytotoxicity are at higher risk for clinical hepatic phenotypes. In summary, DTox provides a framework for deciphering cellular mechanisms of toxicity in silico.

The promise of automated machine learning for the genetic analysis of complex traits

Manduchi E, Romano JD, & Moore JH

Human Genetics, 141, 1529-1544(2022)

DOI | PubMed | PMC

Abstract

The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning methods. Unfortunately, selecting the right machine learning algorithm and tuning its hyperparameters can be daunting for experts and non-experts alike. The goal of automated machine learning (AutoML) is to let a computer algorithm identify the right algorithms and hyperparameters thus taking the guesswork out of the optimization process. We review the promises and challenges of AutoML for the genetic analysis of complex traits and give an overview of several approaches and some example applications to omics data. It is our hope that this review will motivate studies to develop and evaluate novel AutoML methods and software in the genetics and genomics space. The promise of AutoML is to enable anyone, regardless of training or expertise, to apply machine learning as part of their genetic analysis strategy.

Omics Methods in Toxins Research-A Toolkit to Drive the Future of Scientific Inquiry

Romano JD FA LA CA

Toxins, 14, 761(2022)

DOI | PubMed | PMC

TPOT-NN: Augmenting tree-based automated machine learning with neural network estimators

Romano JD, Le TT, Fu W, & Moore JH FA

Genetic Programming and Evolvable Machines, 22, 207-227(2021)

DOI

Abstract

Automated machine learning (AutoML) and artificial neural networks (ANNs) have revolutionized the field of artificial intelligence by yielding incredibly high-performing models to solve a myriad of inductive learning tasks. In spite of their successes, little guidance exists on when to use one versus the other. Furthermore, relatively few tools exist that allow the integration of both AutoML and ANNs in the same analysis to yield results combining both of their strengths. Here, we present TPOT-NN—a new extension to the tree-based AutoML software TPOT—and use it to explore the behavior of automated machine learning augmented with neural network estimators (AutoML+NN), particularly when compared to non-NN AutoML in the context of simple binary classification on a number of public benchmark datasets. Our observations suggest that TPOT-NN is an effective tool that achieves greater classification accuracy than standard tree-based AutoML on some datasets, with no loss in accuracy on others. We also provide preliminary guidelines for performing AutoML+NN analyses, and recommend possible future directions for AutoML+NN methods research, especially in the context of TPOT.

Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

Manduchi E, Fu W, Romano JD, Ruberto S, & Moore JH

BMC Bioinformatics, 21, 430(2020)

DOI | PubMed | PMC

Abstract

Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis. Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj. Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field. Keywords: AutoML; Covariate adjustment; Feature importance; Genetic programming; Pathways.

Ten simple rules for writing a paper about scientific software

Romano JD & Moore JH FA CA

PLoS Computational Biology, 16, e1008390(2020)

DOI | PubMed | PMC

Abstract

Papers describing software are an important part of computational fields of scientific research. These “software papers” are unique in a number of ways, and they require special consideration to improve their impact on the scientific community and their efficacy at conveying important information. Here, we discuss 10 specific rules for writing software papers, covering some of the different scenarios and publication types that might be encountered, and important questions from which all computational researchers would benefit by asking along the way.

A Decade of Translational Bioinformatics: A Retrospective Analysis of "Year-in-Review" Presentations

Romano JD, Bernauer M, McGrath SP, Nagar SD, & Freimuth RR FA

AMIA Joint Summits on Translational Science Proceedings, 2019, 335-344(2019)

PubMed | PMC

Abstract

For the past 11 years, the year-in-review (YIR) keynote presentation at the AMIA Informatics summit has been a perennial highlight. We hypothesized that the presented material from these keynotes could be used to assess both the recent trajectory of topics in informatics, especially translational bioinformatics (TBI), as well as the scientific merit of the crowd-sourced process used to nominate, review, and select the papers presented at the YIR. We compare YIR articles to a background set of non-YIR articles from informatics journals using structured metadata and qualitative thematic analysis, paying specific attention to trends and popularity over time. These trends were inspected both internally (comparing the YIR sessions to each other) and externally (comparing them to the overall content of scientific literature for the same time period). In doing so, we identified some unexpected patterns that suggest important opportunities for TBI research in the future.

Informatics and Computational Methods in Natural Product Drug Discovery: A Review and Perspectives

Romano JD & Tatonetti NP FA

Frontiers in Genetics, 10, 368(2019)

DOI | PubMed | PMC

Abstract

The discovery of new pharmaceutical drugs is one of the preeminent tasks (scientifically, economically, and socially) in biomedical research. Advances in informatics and computational biology have increased productivity at many stages of the drug discovery pipeline. Nevertheless, drug discovery has slowed, largely due to the reliance on small molecules as the primary source of novel hypotheses. Natural products (such as plant metabolites, animal toxins, and immunological components) comprise a vast and diverse source of bioactive compounds, some of which are supported by thousands of years of traditional medicine, and are largely disjoint from the set of small molecules used commonly for discovery. However, natural products possess unique characteristics that distinguish them from traditional small molecule drug candidates, requiring new methods and approaches for assessing their therapeutic potential. In this review, we investigate a number of state-of-the-art techniques in bioinformatics, cheminformatics, and knowledge engineering for data-driven drug discovery from natural products. We focus on methods that aim to bridge the gap between traditional small-molecule drug candidates and different classes of natural products. We also explore the current informatics knowledge gaps and other barriers that need to be overcome to fully leverage these compounds for drug discovery. Finally, we conclude with a "road map" of research priorities that seeks to realize this goal.

Using a Novel Ontology to Inform the Discovery of Therapeutic Peptides from Animal Venoms

Romano JD & Tatonetti NP FA

AMIA Joint Summits on Translational Science Proceedings, 2016, 209-218(2016)

PubMed | PMC

Abstract

Venoms and venom-derived compounds constitute a rich and largely unexplored source of potentially therapeutic compounds. To facilitate biomedical research, it is necessary to design a robust informatics infrastructure that will allow semantic computation of venom concepts in a standardized, consistent manner. We have designed an ontology of venom-related concepts - named Venom Ontology - that reuses an existing public data source: UniProt's Tox-Prot database. In addition to describing the ontology and its construction, we have performed three separate case studies demonstrating its utility: (1) An exploration of venom peptide similarity networks within specific genera; (2) A broad overview of the distribution of available data among common taxonomic groups spanning the known tree of life; and (3) An analysis of the distribution of venom complexity across those same taxonomic groups. Venom Ontology is publicly available on BioPortal at http://bioportal.bioontology.org/ontologies/CU-VO.

Adapting simultaneous analysis phylogenomic techniques to study complex disease gene relationships

Romano JD, Tharp WG, & Sarkar IN FA

Journal of Biomedical Informatics, 54, 10-38(2015)

DOI | PubMed

Abstract

The characterization of complex diseases remains a great challenge for biomedical researchers due to the myriad interactions of genetic and environmental factors. Network medicine approaches strive to accommodate these factors holistically. Phylogenomic techniques that can leverage available genomic data may provide an evolutionary perspective that may elucidate knowledge for gene networks of complex diseases and provide another source of information for network medicine approaches. Here, an automated method is presented that leverages publicly available genomic data and phylogenomic techniques, resulting in a gene network. The potential of the approach is demonstrated based on a case study of nine genes associated with Alzheimer Disease, a complex neurodegenerative syndrome. The developed technique, an update to a previously described Perl script called "ASAP," is implemented as a suite of Ruby scripts entitled "ASAP2." It first compiles a list of sequence-similarity based orthologues using PSI-BLAST and a recursive NCBI BLAST+ search strategy, then constructs maximum parsimony phylogenetic trees for each set of nucleotide and protein sequences, and calculates phylogenetic metrics (Incongruence Length Difference between orthologue sets, partitioned Bremer support values, combined branch scores, and Robinson-Foulds distance) to provide an empirical assessment of evolutionary conservation within a given genetic network. In addition to the individual phylogenetic metrics, ASAP2 provides results in a way that can be used to generate a gene network that represents evolutionary similarity based on topological similarity (the Robinson-Foulds distance).
The results of this study demonstrate the potential for using phylogenomic approaches that enable the study of multiple genes simultaneously to provide insights about potential gene relationships that can be studied within a network medicine framework that may not have been apparent using traditional, single-gene methods. Furthermore, the results provide an initial integrated evolutionary history of an Alzheimer Disease gene network and identify potentially important co-evolutionary clustering that may warrant further investigation.

VenomKB, a new knowledge base for facilitating the validation of putative venom therapies

Romano JD & Tatonetti NP FA

Scientific Data, 2, 150065(2015)

PubMed | PMC

Abstract

Animal venoms have been used for therapeutic purposes since the dawn of recorded history. Only a small fraction, however, have been tested for pharmaceutical utility. Modern computational methods enable the systematic exploration of novel therapeutic uses for venom compounds. Unfortunately, there is currently no comprehensive resource describing the clinical effects of venoms to support this computational analysis. We present VenomKB, a new publicly accessible knowledge base and website that aims to act as a repository for emerging and putative venom therapies. Presently, it consists of three database tables: (1) Manually curated records of putative venom therapies supported by scientific literature, (2) automatically parsed MEDLINE articles describing compounds that may be venom derived, and their effects on the human body, and (3) automatically retrieved records from the new Semantic Medline resource that describe the effects of venom compounds on mammalian anatomy. Data from VenomKB may be selectively retrieved in a variety of popular data formats, are open-source, and will be continually updated as venom therapies become better understood.

Systems biology approaches for identifying adverse drug reactions and elucidating their underlying biological mechanisms

Boland MR, Jacunski A, Lorberbaum T, Romano JD, Moskovitch R, & Tatonetti NP

Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 8, 104-122(2015)

DOI | PubMed | PMC

Abstract

Small molecules are indispensable to modern medical therapy. However, their use may lead to unintended, negative medical outcomes commonly referred to as adverse drug reactions (ADRs). These effects vary widely in mechanism, severity, and populations affected, making ADR prediction and identification important public health concerns. Current methods rely on clinical trials and postmarket surveillance programs to find novel ADRs; however, clinical trials are limited by small sample size, whereas postmarket surveillance methods may be biased and inherently leave patients at risk until sufficient clinical evidence has been gathered. Systems pharmacology, an emerging interdisciplinary field combining network and chemical biology, provides important tools to uncover and understand ADRs and may mitigate the drawbacks of traditional methods. In particular, network analysis allows researchers to integrate heterogeneous data sources and quantify the interactions between biological and chemical entities. Recent work in this area has combined chemical, biological, and large-scale observational health data to predict ADRs in both individual patients and global populations. In this review, we explore the rapid expansion of systems pharmacology in the study of ADRs. We enumerate the existing methods and strategies and illustrate progress in the field with a model framework that incorporates crucial data elements, such as diet and comorbidities, known to modulate ADR risk. Using this framework, we highlight avenues of research that may currently be underexplored, representing opportunities for future work.