INDUSTRIAL ENGINEERING AND OPERATIONS MANAGEMENT PHD THESIS DEFENSE BY AYYÜCE BEGÜM BEKTAŞ



Title: Efficient Machine Learning Models for Cancer Biology

Speaker: Ayyüce Begüm Bektaş 

Time: August 5, 2022, 14:00

Thesis Committee Members:

Assoc. Prof. Mehmet Gönen (Advisor, Koç University)

Prof. Ceyda Oğuz (Koç University)

Prof. Füsun Can (Koç University)

Prof. Mehmet Güray Güler (Yıldız Technical University)

Assoc. Prof. Arzucan Özgür (Boğaziçi University)

Abstract:

In the recent past, a variety of multiple kernel learning algorithms have been proposed in the machine learning literature. A kernel corresponds to a measure of similarity between data instances, while multiple kernels correspond to multiple different measures of similarity. Learning with multiple kernels, in brief, performs learning while integrating different inputs originating from different feature representations. This thesis contains three main extensions to the original multiple kernel learning framework, together with their applications to cancer data sets.
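
As a rough illustration of the idea of combining several similarity measures, the sketch below builds one Gaussian kernel per feature subset and merges them with a weighted sum. The data, the two feature subsets, the kernel widths, and the fixed weights are all hypothetical; in multiple kernel learning the weights are learned rather than fixed, and the thesis's own formulations may differ.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Pairwise Gaussian similarity between the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices (weights are fixed here, not learned)."""
    return sum(w * K for w, K in zip(weights, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))  # 6 samples, 10 features (synthetic)

# Hypothetical feature subsets standing in for two pathways/gene sets.
feature_subsets = [np.arange(0, 4), np.arange(4, 10)]

# One kernel (one similarity measure) per feature subset.
kernels = [gaussian_kernel(X[:, idx], sigma=np.sqrt(len(idx)))
           for idx in feature_subsets]

# Assumed weights; a weight of zero would drop a subset entirely.
K = combine_kernels(kernels, weights=[0.7, 0.3])
```

With non-negative weights, the combined matrix is again a valid kernel, so it can be passed to any kernel-based learner in place of a single kernel.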

Identification of the molecular mechanisms that determine tumor progression in cancer patients is a prerequisite for developing new disease treatment guidelines. Even though the predictive performance of current machine learning models is promising, extracting significant and meaningful knowledge from the data during the learning process is a difficult task, considering the high-dimensional and highly correlated nature of genomic data sets. Thus, there is a need for models that not only predict tumor volume from patients' gene expression data but also use prior information from pathways/gene sets during the learning process to distinguish the molecular mechanisms that play a crucial role in tumor progression and disease prognosis.

In this thesis, we present a novel machine learning algorithm, PrognosiT, that combines optimization and kernel learning. Instead of first choosing several pathways/gene sets from a candidate set and then training a model on this previously chosen subset of features, our proposed algorithm accomplishes both tasks together. We tested our algorithm on thyroid carcinoma patients using gene expression profiles and cancer-specific pathways/gene sets. The predictive performance of our novel multiple kernel learning algorithm was comparable to or even better than that of random forest (RF) and support vector regression (SVR). It is also notable that, to predict tumor volume, PrognosiT used fewer than one-tenth of the gene expression features that the RF and SVR algorithms used. We demonstrated that, during the learning process, our algorithm managed to extract relevant and meaningful pathway/gene set information related to the studied cancer type, which provides insights into its progression and aggressiveness. We also compared the expression of the genes selected by our algorithm in tumor and normal tissues, and we then discussed the up- and down-regulated genes among them, which could be beneficial for determining new biomarkers.

The thesis also provides a novel multiple approximate kernel learning framework, namely MAKL, that is fast, scalable, and interpretable. Data set sizes in computational biology have increased drastically with the help of improved data collection tools and the growing size of patient cohorts. Previous kernel-based machine learning algorithms proposed for increased interpretability started to fail with large sample sizes, owing to their lack of scalability. To overcome this problem, we proposed MAKL, a fast and efficient multiple kernel learning algorithm intended particularly for large-scale data, which integrates kernel approximation and group Lasso formulations into a conjoint model. Our method extracts significant and meaningful information from the genomic data while conjointly learning a model for out-of-sample prediction. It scales with increasing sample size by approximating distinct kernel matrices instead of calculating them. To test MAKL, we ran experiments on three cancer data sets (created using multiple cancer cohort data sets from The Cancer Genome Atlas (TCGA) consortium and a melanoma single-cell data set) and showed that MAKL is capable of outperforming the baseline algorithm, extreme gradient boosting, while using only a small fraction of the input features. We also reported the selection frequencies of the low-dimensional approximation matrices associated with feature subsets (i.e., pathways/gene sets), which helps to assess their relevance for the given classification task. Our fast and interpretable MKL algorithm, which produces sparse solutions, is promising for computational biology applications considering its scalability and the highly correlated structure of genomic data sets, and it can be used to discover new biomarkers and new therapeutic guidelines.
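
The combination of kernel approximation and group Lasso described above can be sketched as follows. This is not the MAKL implementation: it assumes random Fourier features as the approximation, a squared loss on +1/-1 labels for brevity, and a plain proximal gradient solver. The synthetic data, the three gene sets, and all function names are hypothetical.

```python
import numpy as np

def random_fourier_features(X, n_components, sigma, rng):
    """Map X so that Z @ Z.T approximates a Gaussian kernel of width sigma."""
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

def group_soft_threshold(w, threshold):
    """Shrink a whole weight group toward zero; zero it if its norm is small."""
    norm = np.linalg.norm(w)
    if norm <= threshold:
        return np.zeros_like(w)
    return (1.0 - threshold / norm) * w

def fit_group_lasso(Z_blocks, y, lam, n_iter=500):
    """Proximal gradient for (1/2n)||y - Zw||^2 + lam * sum_g ||w_g||_2."""
    Z = np.hstack(Z_blocks)
    n = Z.shape[0]
    bounds = np.cumsum([0] + [B.shape[1] for B in Z_blocks])
    step = n / np.linalg.norm(Z, 2) ** 2  # 1 / Lipschitz constant of the loss
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        v = w - step * (Z.T @ (Z @ w - y)) / n  # gradient step on the loss
        for s, e in zip(bounds[:-1], bounds[1:]):
            v[s:e] = group_soft_threshold(v[s:e], step * lam)
        w = v
    return [w[s:e] for s, e in zip(bounds[:-1], bounds[1:])]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))    # synthetic expression matrix
y = np.sign(X[:, 0] + X[:, 1])    # labels depend on the first gene set only

# Hypothetical gene sets: three disjoint groups of four features each.
gene_sets = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

# One low-dimensional approximation matrix per gene set, never an n-by-n kernel.
Z_blocks = [random_fourier_features(X[:, idx], 50, 2.0, rng) for idx in gene_sets]

# Smallest penalty at which every group is zeroed out.
lam_max = max(np.linalg.norm(B.T @ y) for B in Z_blocks) / len(y)

weights = fit_group_lasso(Z_blocks, y, lam=0.5 * lam_max)
selected = [g for g, w in enumerate(weights) if np.linalg.norm(w) > 0]
print("selected gene sets:", selected)
```

Because the penalty acts on each block of weights as a unit, a gene set is either kept or dropped as a whole, which is what makes the selected sets readable as pathway-level findings.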

As another contribution, this thesis provides a novel multiple approximate kernel clustering framework. Kernel-based clustering algorithms are essential to genomic data analysis since they can detect nonlinear relationships within the data while offering interpretability through one or more prior information sources. With this motivation, we designed a scalable multiple approximate kernel k-means clustering framework that is compatible with large-scale data sets and combines kernel approximation and the k-means clustering approach in the same model. To test our algorithm, we combined information from multiple cancer cohorts provided by the TCGA consortium. Our algorithm extracts the parts of the prior information that are relevant to the clustering task while maximizing the silhouette score to improve the clustering results. To evaluate the findings of our clustering framework, we performed k-means clustering on the full data set as a baseline method. The silhouette score resulting from the baseline experiment was 35.8% lower than the one resulting from our algorithm, which uses information from only four gene sets instead of all 19 814 genes. The results of our proposed algorithm are supported by the existing literature, and our approach promises to be an easy and efficient way to provide sparse and interpretable results while integrating prior information as approximation matrices into the k-means algorithm in a novel way, which may help to discover new cancer subtypes and insights related to novel cancer treatment options.
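
A minimal sketch of the ingredients named above (kernel approximation, k-means, and the silhouette score as the selection criterion) is given below. It is a simplified stand-in, not the thesis's algorithm: it assumes random Fourier features as the approximation, scores each hypothetical gene set separately instead of learning a combination, and uses synthetic data in which only one gene set carries cluster structure.

```python
import numpy as np

def random_fourier_features(X, n_components, sigma, rng):
    """Map X so that Z @ Z.T approximates a Gaussian kernel of width sigma."""
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

def assign(Z, centers):
    """Index of the nearest center for every row of Z."""
    d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(Z, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = assign(Z, centers)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return assign(Z, centers)

def silhouette_score(Z, labels):
    """Mean silhouette over samples, using Euclidean distances."""
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return 0.0
    scores = np.zeros(len(Z))
    for i in range(len(Z)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton clusters get a silhouette of 0
        a = D[i, same].sum() / (same.sum() - 1)  # mean distance within cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        if max(a, b) > 0:
            scores[i] = (b - a) / max(a, b)
    return scores.mean()

rng = np.random.default_rng(0)
signs = np.repeat([-1.0, 1.0], 30)
X_a = 3.0 * signs[:, None] + 0.5 * rng.normal(size=(60, 4))  # two clear groups
X_b = rng.normal(size=(60, 8))                               # pure noise
X = np.hstack([X_a, X_b])

# Hypothetical gene sets: only "set_A" carries cluster structure.
gene_sets = {"set_A": np.arange(0, 4), "set_B": np.arange(4, 12)}

scores = {}
for name, idx in gene_sets.items():
    Z = random_fourier_features(X[:, idx], 300, np.sqrt(len(idx)), rng)
    scores[name] = silhouette_score(Z, kmeans(Z, k=2))

best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Running k-means on the approximation matrix keeps the cost linear in the number of samples for the clustering step, since no full kernel matrix is formed; the pairwise distance matrix in the silhouette helper is kept only for clarity and would itself need to be subsampled on large data.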