Refresher on Bayesian and Frequentist Concepts
Frequentists: From the Neyman/Pearson/Wald setup. An orthodox view that sampling is infinite and decision rules can be sharp.
Bayesians: From Bayes/Laplace/de Finetti tradition. Unknown quantities are treated probabilistically and the state of the world can always be updated.
Likelihoodists: From Fisher. Single-sample inference based on maximizing the likelihood function and relying on the Birnbaum (1962) Theorem. In effect Bayesians, but they don't know it.
Frequentist: Data are a repeatable random sample; underlying parameters remain constant during this repeatable process; parameters are fixed.
Bayesian: Data are observed from the realized sample; Parameters are unknown and described probabilistically; Data are fixed.
Three General Steps for Bayesian Modeling:
I. Specify a probability model for unknown parameter values that includes some prior knowledge about the parameters if available.
II. Update knowledge about the unknown parameters by conditioning this probability model on observed data.
III. Evaluate the fit of the model to the data and the sensitivity of conclusions to the assumptions.
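As a concrete illustration of these three steps, here is a minimal sketch using a hypothetical Beta-Binomial model (the data, prior hyperparameters, and numbers are invented for illustration): the prior encodes step I, conditioning on the observed count encodes step II, and comparing conclusions across priors gives a crude check for step III.

```python
import numpy as np
from scipy import stats

# Step I: prior knowledge about an unknown success probability theta,
# encoded as a Beta(2, 2) prior (a hypothetical choice, mildly favouring 0.5).
a_prior, b_prior = 2.0, 2.0

# Observed data (hypothetical): 7 successes in 10 trials.
n_trials, n_successes = 10, 7

# Step II: condition on the data.  For a Binomial likelihood the Beta prior is
# conjugate, so the posterior is Beta(a + successes, b + failures).
a_post = a_prior + n_successes
b_post = b_prior + (n_trials - n_successes)
posterior = stats.beta(a_post, b_post)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))

# Step III: a crude sensitivity check -- rerun with different priors and
# see how much the conclusions move.
for a0, b0 in [(1.0, 1.0), (5.0, 5.0)]:
    alt = stats.beta(a0 + n_successes, b0 + (n_trials - n_successes))
    print(f"prior Beta({a0},{b0}) -> posterior mean {alt.mean():.3f}")
```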
The History of Bayesian Statistics–Milestones:
Reverend Thomas Bayes (1702-1761). Pierre Simon Laplace. Pearson (Karl), Fisher, Neyman and Pearson (Egon), Wald. Jeffreys, de Finetti, Good, Savage, Lindley, Zellner. A world divided (mainly over practicality). The revolution: Gelfand and Smith (1990). Today. . .
Differences Between Bayesians and Frequentists:
Bayesian: View the world probabilistically, rather than as a set of fixed phenomena that are either known or unknown. Prior information abounds and it is important and helpful to use it. Very careful about stipulating assumptions and are willing to defend them. Every statistical model ever created in the history of the human race is subjective; we are willing to admit it.
Frequentist: Parameters of interest are fixed and unchanging under realistic circumstances. No information prior to the model specification. Statistical results assume that data were from a controlled experiment. Nothing is more important than repeatability, no matter what we pay for it.
Bring what is needed to solve the problem!
Frequentist: an evaluative paradigm; repeatability can be important.
Bayesian: a modeling paradigm; inference can be appropriate.
Bayesian Modelling: An Information Revolution?
We are in an era of abundant data; we need tools for modelling, searching, visualising, and understanding large data sets.
+ Society: the web, social networks, mobile networks, government, digital archives.
+ Science: large-scale scientific experiments, biomedical data, climate data, scientific literature.
+ Business: e-commerce, electronic trading, advertising, personalisation.
A model describes data that one could observe from a system. If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.
Bayes Rule: P(hypothesis|data) = P(data|hypothesis) P(hypothesis) / P(data).
Machine Learning seeks to learn models of data: define a space of possible models; learn the parameters and structure of the models from data; make predictions and decisions
Canonical Machine Learning Problems: Linear Regression; Polynomial Regression; Clustering with Gaussian Mixtures (Density Estimation).
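As one concrete instance of the first canonical problem, here is a minimal sketch of Bayesian linear regression with a conjugate Gaussian prior and a known noise variance; all numbers, variable names, and the data-generating setup are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x - 1 plus Gaussian noise.
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([-1.0, 2.0]) + rng.normal(0, 0.3, 50)

noise_var = 0.3**2          # assumed known for this sketch
prior_var = 10.0            # zero-mean Gaussian prior on the weights

# Posterior over weights is Gaussian:
# Sigma = (X'X / noise_var + I / prior_var)^-1,  mu = Sigma X'y / noise_var
Sigma = np.linalg.inv(X.T @ X / noise_var + np.eye(2) / prior_var)
mu = Sigma @ X.T @ y / noise_var
print("posterior mean weights:", mu)

# Predictive mean and variance at a new input x* = 0.5.
x_star = np.array([1.0, 0.5])
pred_mean = x_star @ mu
pred_var = x_star @ Sigma @ x_star + noise_var
print("predictive mean:", pred_mean, "predictive std:", np.sqrt(pred_var))
```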
Bayesian Machine Learning - Everything follows from two simple rules:
Sum rule: P(x) = Σ_y P(x, y). Product rule: P(x, y) = P(x) P(y|x).
Learning: P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m), where D is the data, θ the model parameters, and m the model class.
Prediction: P(x|D, m) = ∫ P(x|θ, D, m) P(θ|D, m) dθ.
Model Comparison: P(m|D) = P(D|m) P(m) / P(D); P(D|m) = ∫ P(D|θ, m) P(θ|m) dθ.
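A minimal numerical sketch of these rules, assuming a hypothetical coin-flip data set and two invented models (m1: a uniform prior on the coin's bias; m2: an exactly fair coin), with learning, prediction, and model comparison done on a grid:

```python
import numpy as np

# Hypothetical data: a particular sequence of 12 coin flips with 9 heads.
n, k = 12, 9

theta = np.linspace(1e-3, 1 - 1e-3, 1000)   # grid over the coin bias
dtheta = theta[1] - theta[0]
lik = theta**k * (1 - theta)**(n - k)       # P(D|theta) for the observed sequence

# Model m1: uniform prior on theta.
prior = np.ones_like(theta)
evidence_m1 = np.sum(lik * prior) * dtheta  # P(D|m1) = ∫ P(D|theta) P(theta|m1) dtheta
post = lik * prior / evidence_m1            # learning: P(theta|D, m1) by Bayes rule

# Prediction: P(next flip is heads | D, m1) = ∫ theta P(theta|D, m1) dtheta
print("P(heads | D, m1) =", np.sum(theta * post) * dtheta)

# Model m2: the coin is exactly fair, so P(D|m2) has no free parameters.
evidence_m2 = 0.5**n

# Model comparison via the ratio of marginal likelihoods (Bayes factor).
print("Bayes factor P(D|m1) / P(D|m2) =", evidence_m1 / evidence_m2)
```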
Consider a robot. In order to behave intelligently, the robot should be able to represent beliefs about propositions in the world:
“my charging station is at location (x,y,z)”
“my rangefinder is malfunctioning”
“that stormtrooper is hostile”
We want to represent the strength of these beliefs numerically in the brain of the robot, and we want to know what mathematical rules we should use to manipulate those beliefs. Let’s use b(x) to represent the strength of belief in (plausibility of) proposition x.
Consistency:
+ If a conclusion can be reasoned in several ways, then each way should lead to the same answer.
+ The robot must always take into account all relevant evidence.
+ Equivalent states of knowledge are represented by equivalent plausibility assignments.
Consequence: Belief functions (e.g. b(x), b(x|y), b(x, y)) must satisfy the rules of probability theory, including sum rule, product rule and therefore Bayes rule.
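A tiny numerical check of these rules on a hypothetical discrete joint belief table (the numbers are invented):

```python
import numpy as np

# Hypothetical joint beliefs b(x, y) over two binary propositions
# (rows: x in {0, 1}; columns: y in {0, 1}).
b_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Sum rule: b(x) = sum_y b(x, y)
b_x = b_xy.sum(axis=1)
b_y = b_xy.sum(axis=0)

# Product rule: b(x, y) = b(x) b(y|x)
b_y_given_x = b_xy / b_x[:, None]
assert np.allclose(b_x[:, None] * b_y_given_x, b_xy)

# Bayes rule: b(x|y) = b(y|x) b(x) / b(y)
b_x_given_y = (b_y_given_x * b_x[:, None]) / b_y
assert np.allclose(b_x_given_y.sum(axis=0), 1.0)
print("b(x|y):\n", b_x_given_y)
```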
Asymptotic Certainty:
Assume that data set D = (x₁, ..., x_n), consisting of n data points, was generated from some true θ*; then, under some regularity conditions, as long as p(θ*) > 0:
lim_{n→∞} p(θ|D) = δ(θ − θ*).
In the unrealizable case, where the data were generated from some distribution p*(x) which cannot be modelled by any θ, the posterior will converge to the value θ̂ which minimizes KL(p*(x) ‖ p(x|θ)):
θ̂ = argmin_θ ∫ p*(x) log [p*(x) / p(x|θ)] dx.
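A minimal simulation of this concentration effect, assuming a Bernoulli model with a Beta prior and a hypothetical true parameter θ* = 0.3 (all choices are for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta_true = 0.3                      # hypothetical true parameter

for n in [10, 100, 1000, 10000]:
    x = rng.binomial(1, theta_true, size=n)
    k = x.sum()
    # Beta(1, 1) prior; the posterior is Beta(1 + k, 1 + n - k).
    posterior = stats.beta(1 + k, 1 + n - k)
    print(f"n={n:6d}  posterior mean={posterior.mean():.3f}  "
          f"posterior std={posterior.std():.4f}")
# The posterior standard deviation shrinks towards 0: p(θ|D) -> δ(θ - θ*).
```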
Asymptotic Consensus:
Consider two Bayesians with different priors, p₁(θ) and p₂(θ), who observe the same data D. Assume both Bayesians agree on the set of possible and impossible values of θ: {θ : p₁(θ) > 0} = {θ : p₂(θ) > 0}. Then, in the limit n → ∞, the posteriors p₁(θ|D) and p₂(θ|D) will converge (in the uniform distance between distributions, ρ(P₁, P₂) = sup_E |P₁(E) − P₂(E)|).
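A small simulation of this consensus, assuming two hypothetical Beta priors with the same support on (0, 1) and comparing their posteriors as n grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta_true = 0.7
grid = np.linspace(1e-4, 1 - 1e-4, 2000)
dgrid = grid[1] - grid[0]

for n in [10, 100, 1000, 10000]:
    k = rng.binomial(n, theta_true)
    # Two Bayesians with very different Beta priors but the same support (0, 1).
    p1 = stats.beta(1 + k, 1 + n - k).pdf(grid)       # prior Beta(1, 1)
    p2 = stats.beta(20 + k, 2 + n - k).pdf(grid)      # prior Beta(20, 2)
    # sup_E |P1(E) - P2(E)| is the total variation distance,
    # which equals half the L1 distance between the densities.
    tv = 0.5 * np.sum(np.abs(p1 - p2)) * dgrid
    print(f"n={n:6d}  distance between posteriors = {tv:.4f}")
```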
On Choosing Priors:
+ Objective Priors: non-informative priors that attempt to capture ignorance and have good frequentist properties.
+ Subjective Priors: priors should capture our beliefs as well as possible. They are subjective but not arbitrary.
+ Hierarchical Priors: multiple levels of priors, e.g. p(θ) = ∫ p(θ|α) p(α) dα, with p(α) = ∫ p(α|β) p(β) dβ, and so on.
+ Empirical Priors: learn some of the prior's parameters from the data, e.g. Empirical Bayes (Type II maximum likelihood): α̂ = argmax_α p(D|α) = argmax_α ∫ p(D|θ) p(θ|α) dθ. Robust: overcomes some limitations of mis-specification of the prior.
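A minimal sketch of the Empirical Bayes idea for a Beta-Bernoulli model, assuming a hypothetical symmetric Beta(α, α) prior shared across several invented groups, with the hyperparameter chosen by maximizing the marginal likelihood on a grid:

```python
import numpy as np
from scipy.special import betaln

# Hypothetical data: success counts for several related groups.
counts = np.array([3, 7, 5, 6, 2])       # successes
trials = np.array([10, 10, 10, 10, 10])  # trials per group

def log_marginal(alpha):
    """log p(D|alpha) for a shared symmetric Beta(alpha, alpha) prior,
    integrating out each group's success probability analytically.
    (The binomial coefficients are omitted: they do not depend on alpha.)"""
    return np.sum(betaln(alpha + counts, alpha + trials - counts)
                  - betaln(alpha, alpha))

alphas = np.linspace(0.1, 20, 200)
alpha_hat = alphas[np.argmax([log_marginal(a) for a in alphas])]
print("Empirical Bayes alpha:", alpha_hat)

# Plug the learned hyperparameter back in to get each group's posterior mean.
post_means = (alpha_hat + counts) / (2 * alpha_hat + trials)
print("posterior means:", np.round(post_means, 3))
```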
Approximation Methods for Posteriors and Marginal Likelihoods: Laplace approximation; Bayesian Information Criterion; Variational Approximations; Expectation Propagation (EP); Markov chain Monte Carlo methods (MCMC); Exact Sampling......
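As one example from this list, here is a minimal sketch of the Laplace approximation: a Gaussian centred at the posterior mode, with variance given by the negative inverse curvature of the log posterior at that mode. The Beta posterior used here is a hypothetical stand-in.

```python
import numpy as np
from scipy import stats

# Hypothetical posterior: Beta(8, 4), e.g. from 7 successes and 3 failures
# under a Beta(1, 1) prior.
a, b = 8.0, 4.0

# Laplace approximation: Gaussian at the mode of log p(theta|D).
mode = (a - 1) / (a + b - 2)
# Second derivative of the log density at the mode:
curvature = -(a - 1) / mode**2 - (b - 1) / (1 - mode)**2
laplace = stats.norm(mode, np.sqrt(-1.0 / curvature))

exact = stats.beta(a, b)
print("theta   exact pdf   Laplace pdf")
for t in np.linspace(0.1, 0.9, 5):
    print(f"{t:.2f}   {exact.pdf(t):9.3f}   {laplace.pdf(t):10.3f}")
```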
The Variational Bayesian EM algorithm has been used to approximate Bayesian learning in a wide range of models such as:
• mixtures of Gaussians and mixtures of factor analysers
• hidden Markov models
• state-space models (linear dynamical systems)
• independent components analysis (ICA)
• discrete graphical models...
The main advantage is that it can be used to do model selection automatically, and it does not suffer from overfitting to the same extent as maximum likelihood (ML) methods do.
Infinite mixture models: p(x) = Σ_{k=1}^∞ π_k p_k(x).
Start from a finite mixture model with K components and take the limit as the number of components K → ∞. This leaves infinitely many parameters, which are integrated out using one of the following (a small prior-sampling sketch follows the list):
– MCMC sampling (Escobar & West 1995; Neal 2000; Rasmussen 2000)
– expectation propagation (EP; Minka and Ghahramani, 2003)
– variational methods (Blei and Jordan, 2005)
– Bayesian hierarchical clustering (Heller and Ghahramani, 2005)
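A minimal sketch of the prior over partitions that such infinite (Dirichlet process) mixtures induce, drawing cluster assignments from the Chinese restaurant process; the concentration parameter and number of points are arbitrary choices for illustration.

```python
import numpy as np

def crp_assignments(n_points, alpha, rng):
    """Draw cluster assignments from a Chinese restaurant process prior."""
    assignments = [0]                     # first point starts the first cluster
    counts = [1]
    for _ in range(1, n_points):
        # Existing cluster k is chosen with probability counts[k] / (i + alpha),
        # a new cluster with probability alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)              # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

rng = np.random.default_rng(3)
assignments, counts = crp_assignments(200, alpha=2.0, rng=rng)
print("number of clusters used:", len(counts))
print("cluster sizes:", counts)
```

Although the number of potential clusters is unbounded, only a modest number (on the order of α log n) are actually used for any finite data set, which is what keeps inference in these models manageable.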
Myths and misconceptions about Bayesian methods:
+ Bayesian methods make assumptions where other methods don’t: All methods make assumptions! Otherwise it’s impossible to predict. Bayesian methods are transparent in their assumptions whereas other methods are often opaque.
+ If you don’t have the right prior you won’t do well. Certainly a poor model will predict poorly but there is no such thing as the right prior! Your model (both prior and likelihood) should capture a reasonable range of possibilities. When in doubt you can choose vague priors (cf nonparametrics).
+ Maximum A Posteriori (MAP) is a Bayesian method. MAP is similar to regularization and offers no particular Bayesian advantages. The key ingredient in Bayesian methods is to average over your uncertain variables and parameters, rather than to optimize (see the short sketch after this list).
+ Bayesian methods don’t have theoretical guarantees. One can often apply frequentist style generalization error bounds to Bayesian methods (e.g. PAC-Bayes). Moreover, it is often possible to prove convergence, consistency and rates for Bayesian methods.
+ Bayesian methods are generative. You can use Bayesian approaches for both generative and discriminative learning (e.g. Gaussian process classification). With the right inference methods (variational, MCMC) it is possible to scale to very large datasets, but it’s true that averaging is often more expensive than optimization.
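A small sketch of the difference between optimizing (MAP) and averaging (the posterior predictive), using a hypothetical sparse Beta-Bernoulli example where the two answers visibly disagree:

```python
from scipy import stats

# Hypothetical data: 2 successes in 2 trials, Beta(1, 1) prior.
a, b = 1 + 2, 1 + 0            # posterior is Beta(3, 1)

# MAP / optimization: plug in the posterior mode and predict with it.
theta_map = (a - 1) / (a + b - 2)          # mode of Beta(3, 1) = 1.0
print("MAP predictive P(success) =", theta_map)

# Bayesian averaging: the posterior predictive integrates over theta,
# giving the posterior mean for a Bernoulli likelihood.
print("Posterior predictive P(success) =", stats.beta(a, b).mean())  # 0.75
```

The MAP plug-in is certain the next trial will succeed, while averaging over the posterior retains the residual uncertainty.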
Frequentist theory tends to focus on the sampling properties of estimators, i.e. what would have happened had we observed other data sets from our model. It also looks at the minimax performance of methods, i.e. the worst-case performance if the environment is adversarial. Frequentist methods often optimize some penalized cost function.
Bayesian methods focus on expected loss under the posterior. Bayesian methods generally do not make use of optimization, except at the point at which decisions are to be made.
Cons and pros of Bayesian methods: Bayesian machine learning treats learning as a probabilistic inference problem. Bayesian methods work well when the models are flexible enough to capture the relevant properties of the data. The closed-world assumption: one needs to consider all possible hypotheses for the data before observing the data. Performance is often good in practice. The use of approximations weakens the coherence argument.