Gene-Disease Association Prediction Task Overview

Definition: Many diseases are driven by gene aberrations. A gene-disease association (GDA) quantifies the relation between a gene and a disease. GDAs are usually assembled into a network, which lets us probe gene-disease mechanisms while taking multiple gene and disease factors into account. The task is to predict the association between any gene and disease from both a biochemical modeling perspective and a network edge classification perspective.
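
To make the network view concrete, here is a minimal sketch that stores scored gene-disease pairs as a bipartite graph using plain Python dictionaries. The gene and disease identifiers, the scores, and the helper name `build_gda_network` are made up for illustration and are not taken from DisGeNET.

```python
from collections import defaultdict

# Toy (gene, disease, score) triples; identifiers and scores are illustrative only.
toy_pairs = [
    ("GENE_A", "DISEASE_1", 0.9),
    ("GENE_A", "DISEASE_2", 0.3),
    ("GENE_B", "DISEASE_1", 0.6),
]


def build_gda_network(pairs):
    """Return two adjacency maps: gene -> {disease: score} and disease -> {gene: score}."""
    gene_to_disease = defaultdict(dict)
    disease_to_gene = defaultdict(dict)
    for gene, disease, score in pairs:
        gene_to_disease[gene][disease] = score
        disease_to_gene[disease][gene] = score
    return dict(gene_to_disease), dict(disease_to_gene)


gene_to_disease, disease_to_gene = build_gda_network(toy_pairs)

# Genes linked to DISEASE_1, i.e. the neighbours of that disease node.
genes_for_disease_1 = sorted(disease_to_gene["DISEASE_1"])
```

Each scored pair is one weighted edge, so predicting the association for a missing pair amounts to predicting an absent edge in this graph.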

Impact: A high association between a gene and a disease could hint at a potential therapeutic target for the disease. Accurately filling in the vastly incomplete GDA network with machine learning could therefore open up numerous therapeutic opportunities.

Generalization: Accurately predicting associations for unseen gene-disease pairs.

Product: Any therapeutics.

Pipeline: Basic biomedical research, target discovery.


Dataset Description: DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases. DisGeNET integrates data from expert-curated repositories, GWAS catalogues, animal models, and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. TDC uses the curated subset from UNIPROT, CGI, ClinGen, Genomics England, CTD (human subset), PsyGeNET, and Orphanet. TDC maps each disease ID to its disease definition through MedGen and maps each GeneID to its UniProt amino acid sequence.

Task Description: Regression. Given the disease description and the amino acid sequence of the gene, predict their association.
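
As a rough illustration of how the two inputs could be turned into numeric features for a regression model, the sketch below combines the amino acid composition of the sequence with word counts from the disease description. The function names, the toy vocabulary, and the example sequence and description are hypothetical; this is one simple featurization among many and is not the method used by TDC or DisGeNET.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def description_token_counts(description, vocabulary):
    """Count how often each vocabulary word appears in the disease description."""
    tokens = Counter(description.lower().split())
    return [float(tokens[word]) for word in vocabulary]


def featurize_pair(sequence, description, vocabulary):
    """Concatenate gene features and disease features into one vector."""
    return amino_acid_composition(sequence) + description_token_counts(
        description, vocabulary
    )


# Hypothetical inputs, for illustration only.
toy_vocabulary = ["disorder", "inherited", "muscle"]
features = featurize_pair("MKTAYIAK", "An inherited muscle disorder", toy_vocabulary)
```

The resulting vector can be passed to any standard regressor, with the association value of the pair as the target.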

Dataset Statistics: 52,476 gene-disease pairs, 7,399 genes, 7,095 diseases

Dataset Split: Random Split
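
As a rough illustration of what a random split does, the sketch below shuffles pair indices with a fixed seed and cuts them into training, validation, and test sets. The 70/10/20 fractions, the seed, and the function name `random_split` are assumptions chosen for this example; this is not TDC's own implementation.

```python
import random


def random_split(n_items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle item indices and partition them into train/valid/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(round(frac[0] * n_items))
    n_valid = int(round(frac[1] * n_items))
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],  # whatever remains
    }


toy_split = random_split(10)
```

Because pairs are assigned at random, a gene or disease in the test set may also appear in the training set; only the specific pair is unseen.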

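# Load the DisGeNET gene-disease association dataset from TDC
from tdc.multi_pred import GDA
data = GDA(name='DisGeNET')
# Split the data into train, validation, and test sets
split = data.get_split()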


[1] Piñero, Janet, et al. “The DisGeNET knowledge platform for disease genomics: 2019 update.” Nucleic acids research 48.D1 (2020): D845-D855.

[2] Halavi, Maryam, et al. “MedGen.” The NCBI Handbook [Internet]. 2nd edition. National Center for Biotechnology Information (US), 2018.

Dataset License: CC BY-NC-SA 4.0.