Definition of Scalable Models
In terms of statistical analysis, a model is scalable when it can be applied to subsets of a dataset that is too large to analyse on one machine, and still produce results about the whole dataset. There are many types of scalable models; they include the parallel application of methods on subsets followed by an aggregation step, as well as more sophisticated approaches such as the use of “training sets” to train a classifier, and models which are able to process chunks of data sequentially or in real time.
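The simplest pattern described above, applying a method to each subset and then aggregating, can be sketched as follows. This is a minimal illustration rather than a description of any particular group method: the function names `summarise` and `combine` and the toy data are assumptions made for the example. Each chunk is reduced to a few sufficient statistics, so only those summaries (not the raw data) need to be brought together.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Sufficient statistics for the mean and variance of one chunk."""
    n: int
    total: float
    total_sq: float


def summarise(chunk):
    # Hypothetical per-chunk step: could run on a separate machine.
    return Summary(len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(summaries):
    # Aggregation step: pool the chunk summaries into whole-dataset results.
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean ** 2) / (n - 1)  # sample variance
    return mean, variance


# Toy data standing in for a dataset too large for one machine.
data = [float(i) for i in range(1, 11)]
chunks = [data[0:3], data[3:7], data[7:10]]

mean, variance = combine([summarise(c) for c in chunks])
```

For the mean and variance the aggregated result matches the full-data calculation exactly; for most models the aggregation step is only approximate, which is where the more sophisticated approaches come in.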
The Bayesian approach is already based on the concept of combining sources of information, and thus much of our work is already scalable by definition. For example, prior distributions can incorporate results from analyses on previous chunks of data, and commonly used Markov chain Monte Carlo (MCMC) samplers can also be adapted to scale. Hierarchical models can be used to summarize different levels of abstraction for later aggregation. Methods which reduce the number of dimensions of a problem are also naturally scalable, in the sense that they reduce a potentially large amount of information to what is needed to solve a specific problem.
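As a small illustration of a prior carrying forward the results from previous chunks, the sketch below uses a conjugate Beta-Binomial model, chosen here only because the update has a closed form. The function name `update_beta`, the uniform Beta(1, 1) starting prior and the toy binary data are assumptions made for the example.

```python
def update_beta(alpha, beta, chunk):
    """Update a Beta(alpha, beta) prior with one chunk of 0/1 outcomes."""
    successes = sum(chunk)
    failures = len(chunk) - successes
    return alpha + successes, beta + failures


# Toy binary data arriving in three chunks (illustrative values only).
chunks = [[1, 0, 1, 1], [0, 0, 1], [1, 1, 1, 0, 1]]

# Start from a uniform Beta(1, 1) prior; after each chunk the posterior
# becomes the prior for the next chunk.
alpha, beta = 1.0, 1.0
for chunk in chunks:
    alpha, beta = update_beta(alpha, beta, chunk)

posterior_mean = alpha / (alpha + beta)

# Analysing all of the data in one pass gives the same posterior.
all_data = [x for chunk in chunks for x in chunk]
batch_alpha, batch_beta = update_beta(1.0, 1.0, all_data)
```

In this conjugate case the chunk-by-chunk posterior is identical to the one obtained from the full dataset, so no chunk ever needs to be revisited. Non-conjugate models generally need an approximation to the previous posterior before it can be reused as a prior.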
The goals of Big Data Analysis can differ from those of classical analysis, and this has influenced the design of scalable models. With large volumes of information, fast approximations are more desirable than slow exact results; essentially, a collection of rough approximations can become very accurate when the amount of data is very large. This has led to the field being very forgiving of relatively simplistic methods (such as K-means and naive Bayes) which are easily understood and implemented by programmers.
Who works in Scalable Models
Our group is experienced with such methods as well as their more complicated counterparts, and is working to improve the range and quality of models currently available for Big Data Analysis. Some of the applications listed below pertain to the skills used for Big Data.
- Zoé van Havre: Analysis of mixture and hidden Markov models. MCMC sampling, including modelling intricate posteriors using parallel tempering.
- Julie Vercelloni: MCMC Hierarchical models, non-linear models.
- Nick Tierney: CART models, missing data.
- Aleysha Thomas: Data exploration skills, finding sub-groups given a large number of potential covariates, multi-source data.
- Paul Wu: Dynamic Bayesian Networks, combining data on different time scales.
- Jannah Baker: Subgroup analysis, including subsampling, stratification, cross-validation and sensitivity analysis using MCMC.
- Earl Duncan: MCMC methods adaptable to Consensus MCMC. Tessera for cases where datasets are too large for MCMC.
What Expertise do we have?
We are experienced in classification and regression trees (CART), Bayesian additive regression trees (BART), machine learning, regression analysis and network analysis.
This document links to all other “Capabilities”, most importantly Spatio-temporal modelling.
Zoé van Havre: firstname.lastname@example.org