Variable Selection for Structured Large Datasets
Student: Matthew Sutton
When: Friday, 17 March 2017 11:00 AM-12:00 PM
Where: GP-S Block, Level 3, Room 307
- Prof Kerrie Mengersen (Principal)
- Dr Benoit Liquet
- Kerrie Mengersen (Chair)
- Benoit Liquet
- Ian Turner
- Dale Nyholt
Over the last century, new technologies have brought about the study of large datasets in multiple disciplines, in both research and industry. The emergence of these large datasets has resulted in a fundamental change to traditional statistics. In traditional statistics, domain experts worked for years to provide small, clean datasets for statisticians to analyse. The techniques developed in this setting struggle to perform well on large datasets, where a large number of variables (features) may be irrelevant or there may be a large number of outlying observations. In this research we consider large datasets that are common in biomedical research. Typically these datasets contain observations of genes (genomics), mRNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics), and collectively they are known as ‘Omics’ datasets (Schneider and Orchard, 2011).
Variable selection plays a pivotal role in contemporary statistics and in scientific discovery for Omics datasets. Variable selection methods assume that an unknown subset of the predictors exhibits the strongest effects on the underlying system. Identifying this set of variables improves interpretation, reduces computational issues and leads to more stable inferences. While many recent approaches to variable selection have been very successful, the majority of the literature focuses on models with a univariate response.
This research will develop novel methods for analysing large multivariate datasets with additional structure, including underlying group structure, repeated measures, and known strong correlations. These methods will be based on variable selection techniques developed in both the frequentist and Bayesian paradigms.