Overview of Dataset Splits

The data split function divides a dataset into training, validation, and test sets so that machine learning practitioners can train, tune, and evaluate their models. It is called directly on a data loader object and takes three main arguments:

  • method: the splitting scheme. TDC provides several splitting schemes to reflect realistic evaluation settings (described in the sections below). The default is a random split.
  • seed: the random seed.
  • frac: the fractions of the train/validation/test sets. The default is [0.7, 0.1, 0.2].

Because the default TDC data format is a pandas DataFrame, the function returns a dictionary with the keys 'train', 'valid', and 'test', each mapped to the DataFrame for that set.

# X, Y, and Z are placeholders for the task module, the data loader class, and the dataset name.
from tdc.X import Y
data = Y(name = Z)
split = data.get_split(method = 'random', seed = 42, frac = [0.7, 0.1, 0.2])
# split: {'train': train dataframe, 'valid': valid dataframe, 'test': test dataframe}

Important note: the function described here is a generic data split function whose splitting scheme you can adjust to suit your research needs. In benchmark mode, the split is fixed to a specific method, seed, and set of fractions. To obtain that split, use the BenchmarkGroup class described on the Leaderboard page.

The available split schemes are described below:

Random Split

Description: The default split method. It randomly divides the data into train, validation, and test sets.

from tdc.single_pred import ADME
data = ADME(name = 'Caco2_Wang')
split = data.get_split(method = 'random')
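
To make the roles of seed and frac concrete, here is a minimal sketch of a random split written with pandas only. It is not TDC's implementation; the helper name random_split, the toy DataFrame, and its column names are assumptions made for illustration.

```python
import pandas as pd


def random_split(df, seed=42, frac=(0.7, 0.1, 0.2)):
    """Shuffle the rows, then cut them into train/valid/test by fraction.

    Hypothetical helper for illustration; TDC's get_split may differ in detail.
    """
    train_frac, valid_frac, test_frac = frac
    if abs(train_frac + valid_frac + test_frac - 1.0) > 1e-8:
        raise ValueError("frac must sum to 1")

    # The seed makes the shuffle, and therefore the split, reproducible.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(round(train_frac * len(shuffled)))
    n_valid = int(round(valid_frac * len(shuffled)))

    return {
        "train": shuffled.iloc[:n_train],
        "valid": shuffled.iloc[n_train:n_train + n_valid],
        "test": shuffled.iloc[n_train + n_valid:],
    }


# Toy data standing in for a TDC dataset (column names are illustrative).
toy = pd.DataFrame({"Drug_ID": range(20), "Y": [i * 0.5 for i in range(20)]})
split = random_split(toy, seed=42, frac=(0.7, 0.1, 0.2))
sizes = {name: len(part) for name, part in split.items()}
```

Calling the helper twice with the same seed yields the same three DataFrames, which is the property the seed argument is meant to provide.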

Scaffold Split

Description: Scaffold split partitions molecules by their scaffold so that the train, validation, and test sets are structurally more distinct from one another. It is more challenging than a random split.

Note: Scaffold split applies only to single-instance drug-related tasks (ADME, Tox, HTS). It also requires RDKit to be installed.

from tdc.single_pred import ADME
data = ADME(name = 'Caco2_Wang')
split = data.get_split(method = 'scaffold')
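
The idea behind a scaffold split can be sketched without RDKit if the scaffold of each molecule is assumed to be already available as a string. The example below assigns whole scaffold groups to a set, largest groups first, so that no scaffold is shared between sets. This is one common way to do it and may not match TDC's implementation; the helper name group_split and the 'scaffold' column are assumptions, and in practice the scaffold strings would typically be computed with RDKit.

```python
import pandas as pd


def group_split(df, group_col, frac=(0.7, 0.1, 0.2)):
    """Assign whole groups to train/valid/test, largest groups first.

    Hypothetical helper for illustration. A group goes to train while it
    still fits the train budget, then to valid, otherwise to test.
    """
    n = len(df)
    train_budget = frac[0] * n + 1e-9  # small tolerance for float rounding
    valid_budget = frac[1] * n + 1e-9

    # Each entry is the row index of all molecules sharing one scaffold.
    groups = sorted(df.groupby(group_col).groups.values(), key=len, reverse=True)

    train, valid, test = [], [], []
    for idx in groups:
        if len(train) + len(idx) <= train_budget:
            train.extend(idx)
        elif len(valid) + len(idx) <= valid_budget:
            valid.extend(idx)
        else:
            test.extend(idx)

    return {"train": df.loc[train], "valid": df.loc[valid], "test": df.loc[test]}


# Toy data: the 'scaffold' column is assumed to be precomputed.
toy = pd.DataFrame({
    "Drug_ID": range(10),
    "scaffold": ["c1ccccc1"] * 4 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2 + ["c1ccoc1"],
})
split = group_split(toy, "scaffold")
scaffolds = {name: set(part["scaffold"]) for name, part in split.items()}
```

Because groups are never divided, the realized set sizes only approximate frac; how close they get depends on the distribution of scaffold group sizes.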

Cold-Start Split

Description: Cold-start split is for multi-instance prediction problems such as DTI, GDA, DrugRes, and MTI, which involve two entity types. It first splits the entities of one type into train/valid/test sets and then places every pair associated with an entity into that entity's set to form the final splits. To use it, set column_name to the entity you want to split on. For example, to do a cold drug split on a DTI task:

from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
split = data.get_split(method = 'cold_split', column_name = 'Drug')
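
The two steps described above (split the entities, then move their pairs) can be sketched in pandas as follows. This is an illustration under stated assumptions rather than TDC's code: the helper name cold_split and the toy column names 'Drug_ID' and 'Target_ID' are hypothetical.

```python
import pandas as pd


def cold_split(df, column_name, seed=42, frac=(0.7, 0.1, 0.2)):
    """Split on the unique values of one entity column.

    Hypothetical helper for illustration: every row follows its entity,
    so no entity appears in more than one set.
    """
    # Step 1: shuffle the unique entities and cut them by fraction.
    entities = (
        pd.Series(df[column_name].unique())
        .sample(frac=1.0, random_state=seed)
        .tolist()
    )
    n_train = int(round(frac[0] * len(entities)))
    n_valid = int(round(frac[1] * len(entities)))
    train_e = set(entities[:n_train])
    valid_e = set(entities[n_train:n_train + n_valid])
    test_e = set(entities[n_train + n_valid:])

    # Step 2: move all pairs associated with each entity into its set.
    col = df[column_name]
    return {
        "train": df[col.isin(train_e)],
        "valid": df[col.isin(valid_e)],
        "test": df[col.isin(test_e)],
    }


# Toy drug-target pairs: 10 drugs, each paired with 3 targets.
toy = pd.DataFrame(
    [(f"D{d}", f"T{t}", 0.1 * (d + t)) for d in range(10) for t in range(3)],
    columns=["Drug_ID", "Target_ID", "Y"],
)
split = cold_split(toy, column_name="Drug_ID")
drugs = {name: set(part["Drug_ID"]) for name, part in split.items()}
```

Note that the fractions apply to entities, not rows, so the row counts in each set match frac only when every entity has about the same number of pairs. Targets can still appear in several sets here, since only the chosen column is held out.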

Combination Split

Description: Combination split is for drug combination datasets. It splits on drug combinations so that the training, validation, and test sets contain distinct sets of combinations.

Note: Combination split only applies to drug combination tasks such as DrugSyn.

from tdc.multi_pred import DrugSyn
data = DrugSyn(name = 'DrugComb')
split = data.get_split(method = 'combination')
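
As a rough illustration of splitting on combinations, the sketch below builds a key for each drug pair and assigns every row to the set its key belongs to. It is not TDC's implementation. The helper name combination_split and the column names are hypothetical, and treating (A, B) and (B, A) as the same combination is an assumption of this sketch.

```python
import itertools
import random

import pandas as pd


def combination_split(df, col_a, col_b, seed=42, frac=(0.7, 0.1, 0.2)):
    """Split rows so that each drug combination lands in exactly one set.

    Hypothetical helper for illustration. Pairs are treated as unordered.
    """
    # One order-independent key per row.
    keys = [tuple(sorted(pair)) for pair in zip(df[col_a], df[col_b])]

    # Shuffle the unique combinations reproducibly and cut them by fraction.
    unique_keys = sorted(set(keys))
    random.Random(seed).shuffle(unique_keys)
    n_train = int(round(frac[0] * len(unique_keys)))
    n_valid = int(round(frac[1] * len(unique_keys)))

    assignment = {}
    for i, key in enumerate(unique_keys):
        if i < n_train:
            assignment[key] = "train"
        elif i < n_train + n_valid:
            assignment[key] = "valid"
        else:
            assignment[key] = "test"

    labels = pd.Series([assignment[k] for k in keys], index=df.index)
    return {name: df[labels == name] for name in ("train", "valid", "test")}


# Toy data: every pair of 5 drugs, measured in two cell lines.
# The second row of each pair lists the drugs in reverse order on purpose.
rows = []
for a, b in itertools.combinations(["A", "B", "C", "D", "E"], 2):
    rows.append((a, b, "C1"))
    rows.append((b, a, "C2"))
toy = pd.DataFrame(rows, columns=["Drug1_ID", "Drug2_ID", "Cell_Line_ID"])

split = combination_split(toy, "Drug1_ID", "Drug2_ID")
combos = {
    name: {tuple(sorted(p)) for p in zip(part["Drug1_ID"], part["Drug2_ID"])}
    for name, part in split.items()
}
```

Individual drugs can still appear in more than one set under this scheme; only the combinations are kept distinct.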