Gaussian processes (GPs) are the method of choice for probabilistic nonlinear regression: Their non-parametric nature allows for flexible modelling without specifying low-level assumptions (e.g., the degree of a polynomial) in advance. Inference can be performed in a principled way simply by applying Bayes’ theorem. GPs have had substantial impact in various research areas, including geostatistics, optimization, data visualization, robotics and reinforcement learning, spatio-temporal modelling, and active learning. A strength of the GP is that it is a fairly reliable black-box function approximator, i.e., it produces reasonable predictions without manual parameter tuning.
A practical limitation of the GP is its computational demand: Training scales as O(N^3) and predicting as O(N^2), where N is the size of the training data set.
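To make the scaling concrete, here is a minimal sketch of exact GP regression with an RBF kernel using NumPy. The function name and hyperparameter defaults are illustrative, not from the project's code; the point is that the Cholesky factorization of the N x N kernel matrix is the O(N^3) training bottleneck.

```python
import numpy as np

def gp_predict(X, y, Xs, lengthscale=1.0, signal_var=1.0, noise_var=0.01):
    """Exact GP regression with an RBF kernel (illustrative sketch).

    The Cholesky factorization of the N x N kernel matrix costs O(N^3);
    each subsequent prediction costs O(N^2). This is the bottleneck that
    motivates distributed GP models.
    """
    def k(A, B):
        # Squared Euclidean distances, then the RBF kernel.
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

    K = k(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                            # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # O(N^2)
    Ks = k(Xs, X)
    mean = Ks @ alpha                                    # O(N^2) per test batch
    v = np.linalg.solve(L, Ks.T)
    var = signal_var - np.sum(v**2, 0) + noise_var
    return mean, var
```

Doubling N multiplies the Cholesky cost by eight, which is why exact GPs become impractical beyond roughly 10^4 data points on a single machine.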
In this project, we scale GPs to arbitrarily large data sets using parallelization and distributed computing. In particular, we exploit a conceptually straightforward but effective distributed GP model: We introduce the robust Bayesian Committee Machine (rBCM), a practical and scalable product-of-experts model for large-scale distributed GP regression. This model addresses shortcomings of other hierarchical models by appropriately incorporating the GP prior when combining predictions. Furthermore, it parallelizes computation by distributing it among independent computational units. A recursive, closed-form recombination of these independent computations yields a model that is both computationally and memory efficient. Training and predicting are independent of the computational graph. Thus, our model can be used on heterogeneous computing infrastructures, ranging from laptops to large clusters: Training with a million data points takes less than 30 minutes on a laptop, and with more computing power, training with more than 10^7 data points can be done in a few hours.
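The closed-form recombination can be sketched as follows for a single test input, following the product-of-experts formulation with differential-entropy weights; the function name is illustrative, and the weighting choice is one common option from the distributed GP literature. Each expert contributes a precision-weighted vote, and a prior-correction term keeps the combined model consistent with the GP prior when experts are uninformative.

```python
import numpy as np

def rbcm_combine(mus, vars_, prior_var):
    """Combine independent GP expert predictions at one test input
    with an rBCM-style product-of-experts rule (illustrative sketch).

    mus, vars_   : per-expert predictive means and variances
    prior_var    : GP prior variance at the test input, e.g. k(x*, x*)
    """
    mus = np.asarray(mus, dtype=float)
    vars_ = np.asarray(vars_, dtype=float)
    # Differential-entropy weights: experts whose predictive variance
    # matches the prior carry (almost) no information and get weight ~0.
    beta = 0.5 * (np.log(prior_var) - np.log(vars_))
    # Combined precision: weighted expert precisions plus a correction
    # term that falls back to the prior precision.
    prec = np.sum(beta / vars_) + (1.0 - np.sum(beta)) / prior_var
    var = 1.0 / prec
    mu = var * np.sum(beta * mus / vars_)
    return mu, var
```

Because the rule is associative in the expert summaries, experts can be recombined recursively in any tree, which is what makes the predictions independent of the computational graph.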
Compared to state-of-the-art sparse GP approximations, our model performs very well, trains quickly, requires little memory, and avoids high-dimensional optimization problems.
The model is designed to be independent of the computational graph, i.e., given the same number of GP experts, the model produces the same results independent of the tree architecture. This allows the model to be adapted to the computing infrastructure at hand, ranging from laptops and individual workstations to large clusters.
Jun Wei Ng