Case Study: Kinase Cancer Inhibitor Dataset & Performance

Background

When selecting a dataset for a case study in biotech, it was important to find something unique to the domain. Ideally, we wanted a task with many sparse features, a hard-to-discern signal-to-noise ratio, and a small number of examples. These are the key characteristics of this dataset of cancer inhibitor protein interactions.

How useful is the Ensemble ‘Dark Matter’ algorithm, and can it be used on data of this format? What kinds of downstream model improvements or detriments are observed when applying Ensemble to the training pipeline? What bias/variance tradeoff is observed in the downstream models before and after using Ensemble? Keep these questions in mind as we explore the results.

Dataset

Small molecules play a non-trivial role in cancer chemotherapy. Here [the dataset owner will] focus on inhibitors of 8 protein kinases (name: abbreviation):

  1. Cyclin-dependent kinase 2: cdk2
  2. Epidermal growth factor receptor erbB1: egfr_erbB1
  3. Glycogen synthase kinase-3 beta: gsk3b
  4. Hepatocyte growth factor receptor: hgfr
  5. MAP kinase p38 alpha: map_k_p38a
  6. Tyrosine-protein kinase LCK: tpk_lck
  7. Tyrosine-protein kinase SRC: tpk_src
  8. Vascular endothelial growth factor receptor 2: vegfr2


For each protein kinase, several thousand inhibitors were collected from the ChEMBL database. Molecules with an IC50 below 10 µM are typically considered inhibitors; the rest are labeled non-inhibitors.
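As a minimal sketch of how that labeling rule could be applied, assuming a hypothetical DataFrame with an ic50_um column holding measured IC50 values in µM:

import pandas as pd

# Hypothetical raw records: one row per molecule with its measured IC50 in µM.
df = pd.DataFrame({"molecule_id": ["m1", "m2", "m3"],
                   "ic50_um": [0.5, 42.0, 8.1]})

# Molecules below the 10 µM threshold are labeled inhibitors (1), the rest non-inhibitors (0).
df["is_inhibitor"] = (df["ic50_um"] < 10).astype(int)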

Based on those labeled molecules, the task is to build a model that makes the right predictions.

Additionally, more than 70,000 small molecules were pulled from the PubChem database; these can be screened to find potential inhibitors. Note that the majority of these molecules are non-inhibitors.
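A rough sketch of that screening step, with small random stand-ins for the real feature matrices (any fitted classifier from the experiments below could take the place of the logistic regression):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for the real data: a labeled training set and an unlabeled screening pool.
X_labeled = rng.integers(0, 2, size=(200, 50)).astype(float)
y_labeled = rng.integers(0, 2, size=200)
X_screen = rng.integers(0, 2, size=(1000, 50)).astype(float)

# Fit on the labeled molecules, then rank the screening pool by inhibitor probability.
clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
proba = clf.predict_proba(X_screen)[:, 1]
top_candidates = np.argsort(proba)[::-1][:100]   # indices of the 100 most likely inhibitors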

Dataset Info

  • 1890 examples
  • 6117 sparse features
  • After split
    • training set shape: (1790, 6117)
    • test set shape: (100, 6117)
This holdout set represents roughly 5% of the total dataset's size and was found by others experimenting with this data to be a preferred number of holdout examples. Future work on this dataset will include other split sizes; a sketch of this split appears after the dataset checks below.
  • Total target counts:
All:
1      1271
0      619

Train:
1      1199
0      591

Test:
1      72
0      28

The imbalance in the target values is around 2:1 (1271 inhibitors to 619 non-inhibitors), not significant enough to require dataset balancing techniques like over/under-sampling.

  • Sparsity:
(X_train.sum(axis=0) / X_train.shape[0]).mean()
>> 0.25362096474428353

A quick check shows us that, on average, each feature in this dataset is

  • 1 for about 25% of training examples (≈454 of 1790)
  • 0 for the remaining 75% (≈1336 of 1790)
  • Nulls:
pd.DataFrame(X_train).isnull().sum().sum()
>> 0

This dataset does not have NaN/null values. If it did, we would need to impute the missing values in a meaningful way, as Dark Matter does not handle them at this time.
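A minimal sketch of the split and checks above, assuming X and y are loaded as NumPy arrays. The original write-up does not state the split method; a plain random split is shown here, which is consistent with the slightly different class ratios seen in train vs. test.

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shape as the real dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1890, 6117)).astype(float)
y = rng.integers(0, 2, size=1890)

# Hold out 100 examples (~5% of 1890); random_state is an arbitrary choice here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, random_state=42)

print(X_train.shape)                                     # (1790, 6117)
print((X_train.sum(axis=0) / X_train.shape[0]).mean())   # per-feature fraction of 1s
print(np.isnan(X_train).sum())                           # 0: no missing values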

Metrics comparison

  • Compares baseline metrics (no Ensemble AI used) to ensemble metrics (using Ensemble AI to preprocess input features)
  • Original feature count: 6117
  • Ensemble embedding size: 50
  • Ensemble training iteration count: 10,000 epochs

This configuration, when using Ensemble, functions as both feature enhancement and dimensionality reduction for very large, sparse vectors.

Ex:
Baseline modeling pipeline
→   <6117 features>   →   model   →   prediction

Ensemble modeling pipeline
→   <6117 features>   →   Ensemble   →   <50 features>   →   model   →   prediction
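A minimal sketch of the two pipeline shapes, reusing X_train/X_test from the split sketch above. The proprietary Ensemble step cannot be reproduced here, so TruncatedSVD stands in purely to show where a 6117 → 50 embedding slots into the pipeline:

from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Baseline pipeline: fit the downstream model directly on the 6117 raw features.
baseline_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_preds = baseline_model.predict(X_test)

# Ensemble pipeline shape: a 6117 -> 50 embedding step, then the same downstream model.
# TruncatedSVD is only a placeholder for the Ensemble 'Dark Matter' embedder.
embedder = TruncatedSVD(n_components=50, random_state=42).fit(X_train)
Z_train, Z_test = embedder.transform(X_train), embedder.transform(X_test)

ensemble_model = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
ensemble_preds = ensemble_model.predict(Z_test)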

Metric Slope Plots

The following plots show the performance change before and after using Ensemble for a variety of metrics:
→  Accuracy, Precision, Recall, AUC, F1, and Specificity.

The downstream models tested include a collection of both linear and non-linear classification algorithms:
→  Logistic Regression, Random Forest, XGBoost, CatBoost, Support Vector Machine Classification, and Multi-Layer Perceptron.
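As a sketch of how these six metrics could be computed for any one fitted downstream model (scikit-learn naming; specificity has no dedicated scorer, so it is derived from the confusion matrix):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Evaluate one fitted downstream model on the holdout set.
preds = ensemble_model.predict(Z_test)
proba = ensemble_model.predict_proba(Z_test)[:, 1]

tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
metrics = {
    "accuracy":    accuracy_score(y_test, preds),
    "precision":   precision_score(y_test, preds),
    "recall":      recall_score(y_test, preds),
    "auc":         roc_auc_score(y_test, proba),
    "f1":          f1_score(y_test, preds),
    "specificity": tn / (tn + fp),   # true negative rate
}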

Accuracy typically improves, with many downstream models converging to similar accuracy after Ensemble preprocessing. In future work, we will monitor these metrics as we increase the holdout set size.

Precision remains steady across the different models. This means that, with Ensemble, almost any downstream model can achieve good precision at this holdout set size.

Recall stays consistent across all models. This means that the rate of missed inhibitors (false negatives) is comparable no matter which downstream model is trained on the enhanced features from the Ensemble algorithm.

The ROC AUC shrinks for some boosting algorithms, but grows for the other linear and non-linear algorithms used as the downstream model. In future work, testing other classification thresholds could reveal better operating points for each downstream model (ROC AUC itself is threshold-independent, so it summarizes performance across all thresholds).
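A small sketch of such a threshold sweep (selecting by F1 is our assumption here; in practice the threshold should be tuned on a validation split rather than the final holdout):

import numpy as np
from sklearn.metrics import f1_score

# Sweep decision thresholds over predicted probabilities and keep the best by F1.
proba = ensemble_model.predict_proba(Z_test)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_test, (proba >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]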

The F1 score stays steady across models, indicating that Ensemble maintains a strong balance of precision and recall regardless of the downstream model.

Looking at the True Negative Rate (Specificity) shows that some models predict the majority class (1s) too often; sometimes they collapse into a single-value predictor (Random Forest, for example). Using Ensemble surfaces variance in the data, even with few data points and class imbalance.
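To see why specificity exposes this failure mode while accuracy and recall hide it, consider a degenerate model that always predicts the majority class on this 72/28 holdout split:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Labels mirroring the holdout distribution: 72 inhibitors (1), 28 non-inhibitors (0).
y_holdout = np.array([1] * 72 + [0] * 28)
always_one = np.ones_like(y_holdout)   # a single-value "predictor"

print(accuracy_score(y_holdout, always_one))   # 0.72: looks passable
print(recall_score(y_holdout, always_one))     # 1.00: looks great
tn, fp, fn, tp = confusion_matrix(y_holdout, always_one).ravel()
print(tn / (tn + fp))                          # 0.00 specificity: the tell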

Conclusion

This case study shows the benefit of using the Ensemble ‘Dark Matter’ algorithm to generate embeddings in a domain with sparse and limited data. The algorithm highlights points of variance that are otherwise lost on both common and uncommon families of machine learning algorithms.

Random Forest struggled to predict effectively as a baseline, and would likely be discarded after such a test in another experiment. However, when given the opportunity to understand the dataset’s variance at a deeper level, Random Forest and the other linear and non-linear algorithms are all able to converge on similarly high-performing models.

Ready for better model performance?

Get in touch to learn more and book a demo. 
