dc.description.abstract |
Owing to the rapid growth of biomedical data, there is a huge demand for meaningful analytics on diseases. Hence, diverse Machine Learning (ML) and Bioinformatics (BI) techniques are being applied in various areas of disease informatics; their performance, however, can still be improved, and their results are sometimes difficult to interpret. This thesis advances the knowledge and state of the art in disease informatics not only by providing new methods with superior performance but also by attempting to interpret the results in their actual biomedical context.
Respiratory diseases are a key focus of this thesis. We first study possible links between maternal Human Immunodeficiency Virus (HIV) and impaired lung function (in infants later in life), together with the mechanisms behind them (Aim 1). As this avenue has so far been relatively little explored with ML and BI techniques, this thesis attempts to close this gap. On another dimension, recorded speech (e.g., voice recordings made on smartphones) could conveniently be used to assess asthma as part of regular (self-)monitoring. This is another focus of this thesis, as there is considerable room for improvement (Aim 2). This thesis also deals with some non-respiratory diseases and other health-related issues (Aim 3), thereby covering a rather large part of the disease informatics spectrum, and throughout it gives due importance to the issue of interpretability.
This thesis has designed, developed and validated a computational framework to handle the multiple prediction and association identification problems mentioned above, combining ML and BI techniques with an emphasis on feature selection and interpretation. In particular, computational models have been developed using differential gene expression (DGE) analysis techniques on the gene expression of Umbilical Cord Blood (UCB) to identify the significant genes for maternal HIV. Subsequently, Weighted Gene Co-Expression Network Analysis (WGCNA) has been applied to identify clusters of co-expressed genes and to obtain the highly co-expressed (i.e., hub) genes from these clusters for maternal HIV. These hub genes have been explored for the association of maternal HIV with offspring lung function using statistical/ML models (e.g., linear regression). Multiple testing corrections have been applied when identifying significant genes for maternal HIV. To interpret the biological relevance of the features (i.e., differentially expressed genes, DEGs) for maternal HIV, Fast Gene Set Enrichment Analysis (fGSEA) and REACTOME pathway analysis have been performed.
Another computational approach has been developed to extract acoustic features relating to lung function from audio files of breath and speech. Three separate predictive models with different goals have then been built on the important acoustic features: prediction of lung function as a classification task, prediction of lung function as a regression task, and classification of the severity of lung function. Similarly, appropriate ML algorithms have been applied to the selected features to train and develop several models to predict disease and/or mortality. In particular, for the latter, several ML-based models have been evaluated in over 3000 settings on six Intensive Care Unit (ICU) patient databases and, for the former, a generalized framework has been developed for disease diagnostics focusing on 10 publicly available benchmark disease datasets.
This thesis has reported several interesting results along different dimensions. On the one hand, it has presented significant advances in computational approaches, as shown by different evaluation metrics; on the other hand, it has reported some important and significant associations and interpretations that have advanced the relevant knowledge base and are expected to spark further research endeavours. Notably, for the association identification task in Aim 1, SRXN1 has been found to be significantly differentially expressed for maternal HIV. fGSEA has identified enrichment of 243 GO terms and 24 KEGG pathways, the majority of which are related to immune function or inflammation. WGCNA has identified two clusters, with hub genes enriched in immune system related pathways. Hub genes have been associated with offspring lung function.
For predicting lung function from the sound files (in Aim 2), a total of 23 acoustic features have been extracted from the voice recordings. Using these, the predictive model based on Random Forest has classified normal versus abnormal lung function with 84.54% accuracy, 84% F1-score, and 0.88 AUROC. The corresponding regression model for predicting the FEV1% value has registered an MAE of 8.96 with an R² value of 0.44. The third model, for predicting the severity of abnormal lung function, has achieved an accuracy of 73.2%. As for Aim 3, highly accurate Random Forest based predictors for 10 different diseases (with accuracies ranging from 69.35% to 98.5%) have been presented, and for survivability prediction of ICU patients a Support Vector Machine (SVM) based classifier with a high weighted average F1 score (Fwa) of 82.6% has been developed. All our results have been investigated thoroughly to interpret them from the biological and/or medical point of view.
For Aim 1, the results of the computational models (e.g., DGE, WGCNA, fGSEA, linear regression) show a potential path for the association of maternal HIV exposure with lung function of the offspring, mediated by altered gene expression in utero. As for Aim 2, the developed predictive models are capable of predicting lung function from the voice recordings with good accuracy by suitably extracting and identifying the necessary acoustic features. Finally, the feature ranking based approach for medical data classification and/or predicting ICU mortality has been shown to outperform existing methods. Overall, this thesis has developed and validated a computational framework for the multiple prediction and association identification problems mentioned above in disease informatics, combining appropriate ML and BI techniques with an emphasis on feature selection and interpretation in a heterogeneous feature space. The proposed methodologies, techniques, curated datasets, biological observations, and insightful discussions are believed to have advanced the knowledge base and the current state of the art. |
en_US |