by Kirill Dubovikov

Statistical Inference Showdown: The Frequentists VS The Bayesians

Inference

Statistical Inference is a very important topic that powers modern Machine Learning and Deep Learning algorithms. This article will help you to familiarize yourself with the concepts and mathematics that make up inference.

Imagine we want to trick some friends with an unfair coin. We have 10 coins and want to judge whether any one of them is unfair — meaning it will come up as heads more often than tails, or vice versa.

So we take each coin, toss it a bunch of times — say 100 — and record the results. The thing is, we now have a subset of measurements from the true distribution (a sample) for each coin. We’ve considered the condition of our thumbs and concluded that collecting more data would be very tedious.

It is uncommon to know the parameters of the true distribution. Frequently, we want to infer the true population parameters from the sample.

So now we want to estimate the probability of a coin landing on Heads. We are interested in the sample mean.

By now you’ve likely thought, “Just count the number of heads and divide by the total number of attempts already!” Yep, this is the way to find an unfair coin, but how could we come up with this formula if we didn’t know it in the first place?

Frequentist Inference

Recall that coin tosses are best modeled with Bernoulli distribution, so we are sure that it represents our data well. Probability Mass Function (PMF) for Bernoulli distribution looks like this:
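    f(x; p) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}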

x is a random variable that represents an observation of a coin toss (assume 1 for Heads and 0 for Tails) and p is a parameter — the probability of Heads. We will refer to all possible parameters as θ from here onward. This function represents how probable each value of x is according to the distribution law we have chosen.

When x is equal to 1 we get f(1; p) = p, and when it is zero, f(0; p) = 1 - p. Thus, the Bernoulli distribution answers the question ‘How probable is it that we get heads with a coin that lands on heads with probability p?’. Actually, it is one of the simplest examples of a discrete probability distribution.

So, we are interested in determining the parameter p from the data. A frequentist statistician will probably suggest using a Maximum Likelihood Estimation (MLE) procedure. This method takes the approach of maximizing the likelihood of the parameters given the dataset D:
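    \theta_{MLE} = \arg\max_{\theta} P(D \mid \theta)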

This means that the likelihood is defined as the probability of the data given the parameters of the model. To maximize this probability, we will need to find parameters that help our model match the data as closely as possible. Doesn’t it look like learning? Maximum Likelihood is one of the methods that make supervised learning work.

Now let’s assume all observations we make are independent. This means that the joint probability in the expression above may be simplified to a product by the basic rules of probability:
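    P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)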

Now comes the main part: how do we maximize the likelihood function? We call on calculus for help, differentiate the likelihood function with respect to the model parameters θ, set the derivative to 0 and solve the equation. There is a neat trick that makes differentiation much easier most of the time — logarithms do not change a function’s extrema (minima and maxima).

Maximum Likelihood Estimation is of immense importance in almost every machine learning algorithm. It is one of the most popular ways to formulate the process of learning mathematically.

And now let’s apply what we’ve learned and play with our coins. We’ve done n independent Bernoulli trials to evaluate the fairness of our coin. Thus, all the probabilities can be multiplied and the likelihood function will look like this:
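    L(p; D) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i}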

Taking the derivative of the expression above won’t be nice. So, we need to find the log-likelihood:
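    \log L(p; D) = \sum_{i=1}^{n} x_i \log p + \sum_{i=1}^{n} (1 - x_i) \log(1 - p)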

That looks easier. Moving on to differentiation:
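    \frac{d}{dp} \log L(p; D) = \frac{d}{dp} \sum_{i=1}^{n} x_i \log p + \frac{d}{dp} \sum_{i=1}^{n} (1 - x_i) \log(1 - p)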

Here we split the derivative using the standard rule d(f + g) = df + dg. Next, we move the constants out and differentiate the logarithms:
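    \frac{d}{dp} \log L(p; D) = \sum_{i=1}^{n} x_i \frac{d}{dp} \log p + \sum_{i=1}^{n} (1 - x_i) \frac{d}{dp} \log(1 - p) = \frac{\sum_{i=1}^{n} x_i}{p} - \frac{\sum_{i=1}^{n} (1 - x_i)}{1 - p}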

The last step might seem funny because of the sign flip. The cause is that log(1 - p) is actually a composition of two functions, and we must use the chain rule here:
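    \frac{d}{dp} \log(1 - p) = \frac{1}{1 - p} \cdot \frac{d}{dp}(1 - p) = -\frac{1}{1 - p}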

Voilà, we are done differentiating the log-likelihood! Now we are close to finding the maximum likelihood statistic for the mean of the Bernoulli distribution. The last step is to solve the equation:
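    \frac{\sum_{i=1}^{n} x_i}{p} - \frac{\sum_{i=1}^{n} (1 - x_i)}{1 - p} = 0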

Multiplying everything by p(1 - p) and expanding the parentheses, we get:
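    (1 - p) \sum_{i=1}^{n} x_i - p \sum_{i=1}^{n} (1 - x_i) = 0 \quad \Longrightarrow \quad \sum_{i=1}^{n} x_i - p \sum_{i=1}^{n} x_i - pn + p \sum_{i=1}^{n} x_i = 0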

Canceling out the terms and rearranging:
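    \sum_{i=1}^{n} x_i - pn = 0 \quad \Longrightarrow \quad \hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i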

So, here is the derivation of our intuitive formula. You may now play with the Bernoulli distribution and its MLE estimate of the mean yourself; a small simulation sketch follows below.
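
Below is a minimal Python sketch of that experiment. It is only an illustration: NumPy is assumed, and the coin's true probability of heads and the random seed are made-up values rather than anything from the original article. It simulates the tosses and computes the MLE we just derived.

    import numpy as np

    rng = np.random.default_rng(seed=42)

    true_p = 0.7      # assumed (hidden) probability of heads for our unfair coin
    n_tosses = 100    # toss the coin 100 times, as in the story above

    # 1 = heads, 0 = tails: each toss is an independent Bernoulli trial
    tosses = rng.binomial(n=1, p=true_p, size=n_tosses)

    # The MLE we just derived: the sample mean, i.e. heads / total tosses
    p_hat = tosses.sum() / n_tosses
    print(f"true p = {true_p}, MLE estimate = {p_hat:.3f}")

Increasing n_tosses makes the estimate concentrate more tightly around the true probability.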

Congratulations on your new awesome skill of Maximum Likelihood Estimation! Or just for refreshing your existing knowledge.

Bayesian Inference

Recall that there exists another approach to probability. Bayesian statistics has its own way to do probabilistic inference. We want to find the probability distribution of the parameters θ given the sample: P(θ | D). But how can we infer this probability? Bayes’ theorem comes to the rescue:
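    P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{P(D)}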

P(θ) is called the prior distribution and incorporates our beliefs about what the parameters could be before we have seen any data. The ability to state prior beliefs is one of the main differences between maximum likelihood and Bayesian inference. However, this is also the main point of criticism of the Bayesian approach. How do we state the prior distribution if we do not know anything about the problem of interest? What if we choose a bad prior?

P(D | θ) is the likelihood; we have encountered it in Maximum Likelihood Estimation.

P(D) is called the evidence or marginal likelihood.

P(D) is also called the normalization constant, since it makes sure that the results we get are a valid probability distribution. If we rewrite P(D) as
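    P(D) = \sum_{\theta} P(D \mid \theta) \, P(\theta)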

We will see that it is similar to the numerator in Bayes’ theorem, but the summation goes over all possible parameters θ. This way we get two things:

1. The output is always a valid probability distribution in the domain of [0, 1].

2. Major difficulties arise when we try to compute P(D), since this requires integrating or summing over all possible parameters. This is impossible in most real-world problems.

But does the marginal likelihood P(D) make all things Bayesian impractical? The answer is: not quite. Most of the time, we will use one of two options to get rid of this problem.

The first one is to somehow approximate P(D). This can be achieved by using various sampling methods like Importance Sampling or Gibbs Sampling, or a technique called Variational Inference (which is a cool name, by the way).

The second is to get it out of the equation completely. Let’s explore this approach in more detail. What if we concentrate on finding the single most probable parameter combination (that is, the best possible one)? This procedure is called Maximum A Posteriori estimation (MAP).
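
    \theta_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta) \, P(\theta)}{P(D)}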

The equation above means that we want to find the θ for which the expression inside arg max takes its maximum value — the argument of a maximum. The main thing to notice here is that P(D) is independent of the parameters and may be excluded from the arg max:
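    \theta_{MAP} = \arg\max_{\theta} P(D \mid \theta) \, P(\theta)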

In other words, P(D) will always be constant with respect to the model parameters, so its derivative with respect to them is equal to 0.

This fact is so widely used that it is common to see Bayes Theorem written in this form:
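    P(\theta \mid D) \propto P(D \mid \theta) \, P(\theta)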

The weird incomplete infinity sign ∝ in the expression above means “proportional to” or “equal up to a constant”.

Thus, we have removed the most computationally heavy part of MAP. This makes sense, since we basically discarded the full probability distribution over all possible parameter values and just skimmed off the single most probable one.

A link between MLE and MAP

And now consider what happens when we assume the prior to be uniform (a constant probability).
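
    P(\theta) = C \quad \Longrightarrow \quad \theta_{MAP} = \arg\max_{\theta} P(D \mid \theta) \, C = \arg\max_{\theta} P(D \mid \theta) = \theta_{MLE}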

We have moved the constant C out of the arg max, since it does not affect the result, just as with the evidence. It certainly looks a lot like the Maximum Likelihood estimate! In the end, the mathematical gap between frequentist and Bayesian inference is not that large.
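
As a quick numerical check of this link, here is a small Python sketch. It is only an illustration: NumPy is assumed, and the simulated data and the grid of candidate p values are choices made for this example rather than anything from the original article. It evaluates the unnormalized log-posterior on a grid under a uniform prior and confirms that its arg max matches the MLE.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Illustrative data: 100 tosses of a coin with a made-up true p of 0.7
    tosses = rng.binomial(n=1, p=0.7, size=100)
    heads, n = tosses.sum(), tosses.size

    # A discrete grid of candidate parameter values (our theta)
    p_grid = np.linspace(0.001, 0.999, 999)

    # Log-likelihood of the observed data for every candidate p
    log_likelihood = heads * np.log(p_grid) + (n - heads) * np.log(1 - p_grid)

    # Uniform prior: log P(theta) is the same constant everywhere,
    # so adding it cannot change where the maximum sits
    log_prior = np.zeros_like(p_grid)
    log_unnormalized_posterior = log_likelihood + log_prior

    p_map = p_grid[np.argmax(log_unnormalized_posterior)]
    p_mle = heads / n
    print(f"MAP with a uniform prior = {p_map:.3f}, MLE = {p_mle:.3f}")

Swapping the flat log_prior for a non-constant one is exactly what moves the MAP estimate away from the MLE.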

We can also build the bridge from the other side and view maximum likelihood estimation through Bayesian glasses. Specifically, it can be shown that Bayesian priors have close connections with regularization terms. But that topic deserves another post (see this SO question and the ESLR book for more details).

Conclusion

Those differences may seem subtle at first, but they give rise to two schools of statistics. The Frequentist and Bayesian approaches differ not only in mathematical treatment but in their philosophical views on fundamental concepts in stats.

If you put on a Bayesian hat, you view the unknowns as probability distributions and the data as non-random, fixed observations. You incorporate prior beliefs to make inferences about the events you observe.

As a Frequentist, you believe that there is a single true value for the unknowns that we seek, and it’s the data that is random and incomplete. A Frequentist randomly samples data from an unknown population and makes inferences about the true values of the unknown parameters using this sample.

In the end, the Bayesian and Frequentist approaches have their own strengths and weaknesses. Each has the tools to solve almost any problem the other can. Like different programming languages, they should be considered tools of equal strength that may be a better fit for one problem and fall short on another. Use them both, use them wisely, and do not fall into the fury of a holy war between the two camps of statisticians!

Translated from: /news/statistical-inference-showdown-the-frequentists-vs-the-bayesians-4c1c986f25de/
