Variable Selection and Dimension Reduction for Structured Large Datasets
Candidate: Matthew Sutton
When: Monday, 10th December 2018 11:00 AM-12:00 PM
Where: GP-D Block, Level 3, Room 307
- Associate Professor James McGree
- Professor Benoit Liquet
- Associate Professor Dale Nyholt
Recent advances in biomedical technology have given rise to high-dimensional datasets that offer the potential for great insights into complex biological relationships. However, standard statistical methods can struggle with large datasets in which many variables are irrelevant or the data have some underlying structure. This dissertation aims to develop novel statistical methods for variable selection and dimension reduction in the analysis of biological datasets.
In this research, we consider large datasets that are common in biomedical research. Typically, these datasets contain observations of genes (genomics), proteins (proteomics) and metabolites (metabolomics), and are collectively known as ‘Omics’ datasets (Schneider and Orchard, 2011). Variable selection and dimension reduction methods play a pivotal role in contemporary statistics and in scientific discovery for Omics datasets. Identifying a reduced dataset through either approach can improve interpretation, ease computation and lead to more stable inferences. Moreover, these methods can be greatly enhanced by incorporating structure that is known beforehand. For example, genes with similar biological functions usually work together and can be detected as a group.
We begin by proposing Bayesian variable selection methods, incorporating both grouping structure in the predictors and correlation structure in the multivariate responses in the modelling. Our methods utilise spike and slab priors to yield solutions which are sparse at either a group level or both a group and individual feature level. The proposed methodology is illustrated on a genetic dataset to identify biological markers across chromosomes that explain the joint variability in multiple tissues.
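To illustrate how a spike and slab prior produces sparsity at the group level, the following is a minimal sketch of a single draw from a generic group spike and slab prior. It is not the thesis's model: the function name `sample_group_spike_slab`, the parameters `inclusion_prob` and `slab_sd`, and the choice of a Gaussian slab are all assumptions made for illustration.

```python
import numpy as np

def sample_group_spike_slab(group_sizes, inclusion_prob=0.5, slab_sd=1.0, seed=None):
    """Draw one coefficient vector from a generic group spike and slab prior.

    Hypothetical illustration only: each group is either switched off
    entirely (the "spike", a point mass at zero) or drawn from a Gaussian
    "slab". The Gaussian slab and all parameter names are assumptions.
    """
    rng = np.random.default_rng(seed)
    blocks = []
    for size in group_sizes:
        # One Bernoulli indicator per group decides spike versus slab.
        included = rng.random() < inclusion_prob
        if included:
            blocks.append(rng.normal(0.0, slab_sd, size=size))
        else:
            blocks.append(np.zeros(size))
    return np.concatenate(blocks)

# Example: three groups of predictors (e.g. three hypothetical gene sets).
beta = sample_group_spike_slab([3, 2, 4], inclusion_prob=0.5, seed=0)
```

Every draw has whole groups that are exactly zero or entirely non-zero. Sparsity at both the group and the individual feature level would need a second layer of indicators within each included group, which this sketch omits.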
Based on the Partial Least Squares (PLS) family of dimension reduction methods, we propose a novel sparse PLS method. Our method performs simultaneous dimension reduction and variable selection for datasets with multiple grouping levels. We apply this method to identify important relationships between genomic expression and cytokine data from a human immunodeficiency virus vaccine trial. The proposed approach incorporates group and subgroup structure corresponding to grouping of genetic markers (e.g. gene sets) and temporal effects. Our third contribution investigates the objective function and fitting process for PLS methods which incorporate variable selection. We note theoretical limitations in the standard method for fitting the model and propose a sparse subspace constrained PLS approach.
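As a rough illustration of how dimension reduction and variable selection can be combined in PLS, the sketch below computes a single sparse weight vector by alternating soft-thresholded updates on the cross-product matrix of X and Y. This is a generic lasso-penalised sparse PLS step, not the group and subgroup method or the subspace constrained approach proposed in the thesis. The names `soft_threshold` and `sparse_pls_direction` and the penalty `lam` are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry of v towards zero by lam, setting small entries to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_direction(X, Y, lam, n_iter=100, tol=1e-8):
    """Return sparse X-weights u and Y-weights v for one PLS component.

    Hypothetical sketch: alternates a soft-thresholded update of u with a
    plain update of v on M = X^T Y, starting from the leading singular
    vectors of M. X and Y are assumed to be column-centred 2-D arrays.
    """
    M = X.T @ Y                                   # p x q cross-product matrix
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]                         # dense starting values
    for _ in range(n_iter):
        u_new = soft_threshold(M @ v, lam)        # variable selection step
        norm_u = np.linalg.norm(u_new)
        if norm_u == 0.0:                         # penalty removed every variable
            return u_new, v
        u_new = u_new / norm_u
        v_new = M.T @ u_new
        v_new = v_new / np.linalg.norm(v_new)
        converged = np.linalg.norm(u_new - u) < tol
        u, v = u_new, v_new
        if converged:
            break
    return u, v
```

Variables whose entry in `u` is exactly zero drop out of the component, so the latent score `X @ u` depends only on the selected variables. A group-structured version would replace the elementwise soft-thresholding with a penalty acting on blocks of `u`.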
The final contribution of this thesis is motivated by the joint analysis of multiple cancer datasets. In particular, we develop novel variable selection methods where the information from multiple independent datasets is combined to improve the analysis. The combined analysis is treated as a large logistic regression problem with a novel grouping structure.