Probability Basics

Dec 3, 2024

Probability And Distribution

Probability And Random Variables

In ML, we often avoid explicitly referring to the probability space and instead refer to probabilities of the quantities of interest, which take values in a set denoted by $\mathcal{T}$. We refer to $\mathcal{T}$ as the target space and to elements of $\mathcal{T}$ as states.
We introduce a function $X: \Omega \rightarrow \mathcal{T}$ that takes an element of $\Omega$ (an outcome) and returns a particular quantity of interest $x$, a value in $\mathcal{T}$. For example, in the case of tossing two coins and counting the number of heads, we can define a random variable $X$ that maps outcomes to the number of heads: $X(hh) = 2, \dots$.
For any subset $S \subseteq \mathcal{T}$, we associate a probability $P_X(S) \in [0, 1]$ with the event that the random variable $X$ takes a value in $S$.
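
To make the mapping concrete, here is a minimal Python sketch for the two-coin example (the enumeration and function names are illustrative, not from the original text): it enumerates the sample space $\Omega$, defines $X$ as the number of heads, and evaluates $P_X(S)$ for a couple of subsets $S \subseteq \mathcal{T}$.

```python
from itertools import product
from fractions import Fraction

# Sample space for two fair coin tosses: Omega = {hh, ht, th, tt}.
omega = list(product("ht", repeat=2))

# Random variable X: Omega -> T, counting the number of heads.
def X(outcome):
    return outcome.count("h")

# Each outcome of a fair coin pair has probability 1/4.
p_outcome = Fraction(1, len(omega))

# P_X(S): probability that X takes a value in the subset S of T.
def P_X(S):
    return sum(p_outcome for w in omega if X(w) in S)

print(P_X({2}))      # 1/4, i.e., P(X = 2)
print(P_X({0, 1}))   # 3/4, i.e., P(X <= 1)
```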

Statistics

Difference between probability theory and statistics

| Probability | Statistics |
| --- | --- |
| The underlying uncertainty is captured by random variables | We observe that something has happened and try to figure out the underlying process that explains the observations |
| Known (or just assumed) model $\rightarrow$ predicted result | Known data $\rightarrow$ predicted model |

Discrete And Continuous Probabilities

Discrete Probabilities

Here the target space $\mathcal{T}$ is discrete. Consider the probability distribution of two (or more) random variables. The target space of the joint probability is the Cartesian product of the target spaces of the individual random variables. We define the joint probability as the probability of both values occurring jointly:

$$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$$

where $n_{ij}$ is the number of events with state $x_i$ and $y_j$, and $N$ is the total number of events. The joint probability is the probability of the intersection of the two events.

In machine learning, we use discrete probability distributions to model categorical variables, i.e., variables that take a finite set of unordered values.

The function that describes the (joint) probability of discrete variable(s) is called the probability mass function (pmf); it assigns a probability to each value a discrete random variable can take.
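
As a small illustration of a joint pmf built from counts (the counts below are made up), each entry of the table is $n_{ij}/N$:

```python
import numpy as np

# Hypothetical counts n_ij for two discrete variables X (rows) and Y (columns).
counts = np.array([[10,  5],
                   [20, 15]])
N = counts.sum()

# Joint pmf: P(X = x_i, Y = y_j) = n_ij / N.
joint = counts / N
print(joint)         # entries are the joint probabilities
print(joint.sum())   # 1.0
```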

Continuous Probabilities

Definition 6.1 (Probability Density Function, pdf). A function $f: \mathbb{R}^D \rightarrow \mathbb{R}$ is called a probability density function (pdf) if

  1. $\forall \mathbf{x} \in \mathbb{R}^D$: $f(\mathbf{x}) \ge 0$
  2. Its integral exists and
$$\int_{\mathbb{R}^D} f(\mathbf{x})\,d\mathbf{x} = 1$$

In contrast to discrete random variables, the probability of a continuous random variable $X$ taking any particular value is zero, i.e., $P(X = x) = 0$. Instead, we work with the cumulative distribution function, defined next.

Definition 6.2 (Cumulative Distribution Function, cdf). A cumulative distribution function (cdf) of a multivariate real-valued random variable $X$ with states $\mathbf{x} \in \mathbb{R}^D$ is given by

$$F_X(\mathbf{x}) = P(X_1 \le x_1, \dots, X_D \le x_D)$$

where $X = [X_1, \dots, X_D]^\top$ and $\mathbf{x} = [x_1, \dots, x_D]^\top$.

The cdf can also be expressed as the integral of the probability density function $f(\mathbf{x})$:

$$F_X(\mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_D} f(z_1, \dots, z_D)\,dz_1 \cdots dz_D$$

where the $z_i$ are dummy integration variables.
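
As a quick numerical sanity check in one dimension (the standard normal pdf is chosen only as an example), the cdf can be approximated by integrating the pdf up to $x$:

```python
import numpy as np

# Standard normal pdf (D = 1), used here purely as an example density.
def pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def cdf(x, lower=-10.0, num=100_001):
    # Approximate F_X(x) = integral of f(z) dz from -infinity to x
    # with a simple Riemann sum, truncating the lower limit at `lower`.
    z = np.linspace(lower, x, num)
    dz = z[1] - z[0]
    return float(np.sum(pdf(z)) * dz)

print(cdf(0.0))    # about 0.5
print(cdf(1.96))   # about 0.975
```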

Sum Rule / Product Rule / Bayes' Theorem

The Sum Rule

If $p(x, y)$ is a joint distribution, then $p(x)$ and $p(y)$ are the corresponding marginal distributions, and $p(y|x)$ is the conditional distribution of $y$ given $x$.

The sum rule states that,

$$p(x) = \begin{cases} \sum_{y \in \mathcal{Y}} p(x, y) & \text{if } Y \text{ is discrete} \\ \int_{\mathcal{Y}} p(x, y)\,dy & \text{if } Y \text{ is continuous} \end{cases}$$

The sum rule is also known as the marginalization property. In general, when the joint distribution contains more than two variables, the sum rule can be applied to any subset of the variables. More concretely, if $\mathbf{x} = [x_1, x_2, \dots, x_n]^\top$, we obtain the marginal

$$p(x_i) = \int p(x_1, x_2, \dots, x_n)\,d\mathbf{x}_{-i}$$

where $\mathbf{x}_{-i}$ denotes all variables except $x_i$.
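
For discrete variables the integral becomes a sum over the unwanted axes. A minimal numpy sketch (the joint table is invented):

```python
import numpy as np

# Hypothetical joint pmf p(x, y) on a 2 x 3 grid of states.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# Sum rule / marginalization: sum out the variable we do not care about.
p_x = joint.sum(axis=1)   # marginal p(x)
p_y = joint.sum(axis=0)   # marginal p(y)

print(p_x, p_x.sum())     # [0.4 0.6] 1.0
print(p_y, p_y.sum())     # [0.35 0.35 0.3] 1.0
```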

The Product Rule

The product rule relates the joint distribution to the conditional distribution:

$$p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})$$

The product rule states that every joint distribution of two random variables can be factorized into two other distributions: the marginal distribution of the first random variable $\mathbf{x}$ and the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$.
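
Continuing the same invented joint table, the factorization can be checked numerically by forming $p(\mathbf{y}|\mathbf{x}) = p(\mathbf{x}, \mathbf{y})/p(\mathbf{x})$ and multiplying back:

```python
import numpy as np

# Hypothetical joint pmf p(x, y); rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

p_x = joint.sum(axis=1, keepdims=True)   # marginal p(x)
p_y_given_x = joint / p_x                # conditional p(y | x)

# Product rule: p(x, y) = p(y | x) p(x).
reconstructed = p_y_given_x * p_x
print(np.allclose(reconstructed, joint))  # True
```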

Bayes’ Theorem

$$\underbrace{p(\mathbf{x}|\mathbf{y})}_{\text{posterior}} = \frac{\overbrace{p(\mathbf{y}|\mathbf{x})}^{\text{likelihood}}\;\overbrace{p(\mathbf{x})}^{\text{prior}}}{\underbrace{p(\mathbf{y})}_{\text{evidence}}}$$

$p(\mathbf{x})$ is the prior, which encapsulates our subjective prior knowledge of the unobserved (latent) variable $\mathbf{x}$ before observing any data. We can choose any prior that makes sense, but it is critical that its pdf/pmf is non-zero for all plausible values of $\mathbf{x}$, even rare ones. The prior is a probability distribution that does not depend on the observed data or on other random variables.

The likelihood $p(\mathbf{y}|\mathbf{x})$ (related to, but distinct from, the likelihood function) describes how $\mathbf{x}$ and $\mathbf{y}$ are related. In the case of discrete probability distributions, it is the probability of the data $\mathbf{y}$ given the latent variable $\mathbf{x}$.

The posterior $p(\mathbf{x}|\mathbf{y})$ is the quantity of interest in Bayesian statistics: it expresses what we know about $\mathbf{x}$ after observing $\mathbf{y}$, which is exactly what we care about.
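
A tiny numeric sketch of Bayes' theorem (all numbers are invented for illustration): a binary latent variable $x$ and a single binary observation $y = 1$.

```python
import numpy as np

# Invented prior and likelihood for a binary latent variable x.
prior = np.array([0.8, 0.2])         # p(x = 0), p(x = 1)
likelihood = np.array([0.1, 0.7])    # p(y = 1 | x = 0), p(y = 1 | x = 1)

# Evidence p(y = 1) via the sum rule, then Bayes' theorem.
evidence = np.sum(likelihood * prior)        # 0.22
posterior = likelihood * prior / evidence    # p(x | y = 1)

print(posterior)         # [0.3636... 0.6363...]
print(posterior.sum())   # 1.0
```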

Summary Statistics And Independence

A statistic of a random variable is a deterministic function of that variable.

Summary statistics provide a useful view of how a random variable behaves and characterize the distribution.

Means And (Co)Variances

These are also called the population mean and covariance, as they refer to the true statistics of the population.

Definition 6.3 (Expected Value). The expected value of a function $g: \mathbb{R} \rightarrow \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is

$$\mathbb{E}[g(X)] = \int_{\mathcal{X}} g(x)\,p(x)\,dx\,.$$

Correspondingly, the expected value of a function $g(x)$ of a univariate discrete random variable $X \sim p(x)$ is

$$\mathbb{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x)\,p(x)\,,$$

where $\mathcal{X}$ is the target space of $X$.

The expected value of a random vector $\mathbf{X} = [X_1, X_2, \dots, X_D]^\top$ is

$$\mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_D] \end{bmatrix} \in \mathbb{R}^D\,.$$

Definition 6.4 (Mean). The mean is a special case of the expected value, obtained by choosing $g$ to be the identity function. The mean of a random variable $X$ with states $\mathbf{x} \in \mathbb{R}^D$ is defined as

$$\mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_D] \end{bmatrix} \in \mathbb{R}^D\,,$$

where

$$\mathbb{E}[X_d] = \begin{cases} \sum_{x_d \in \mathcal{X}_d} x_d\,p(x_d) & \text{if } X \text{ is discrete} \\ \int_{\mathcal{X}_d} x_d\,f(x_d)\,dx_d & \text{if } X \text{ is continuous} \end{cases}$$

Other terminologies used to summarize a distribution include the median and the mode.

Properties of expected value

  1. The expected value is a linear operator (a numeric sanity check follows below). Given $f(\mathbf{x}) = ag(\mathbf{x}) + bh(\mathbf{x})$, where $a, b \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^D$:
$$
\begin{aligned}
\mathbb{E}_{\mathbf{X}}[f(\mathbf{x})] &= \int_{\mathcal{X}} f(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} \\
&= \int_{\mathcal{X}} [ag(\mathbf{x}) + bh(\mathbf{x})]\,p(\mathbf{x})\,d\mathbf{x} \\
&= a\int_{\mathcal{X}} g(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} + b\int_{\mathcal{X}} h(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} \\
&= a\,\mathbb{E}_{\mathbf{X}}[g(\mathbf{x})] + b\,\mathbb{E}_{\mathbf{X}}[h(\mathbf{x})]
\end{aligned}
$$
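
A quick Monte Carlo sanity check of this linearity (the distribution and the functions $g, h$ are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)         # samples of X (arbitrary choice)

a, b = 2.0, -3.0
g = np.sin(x)                          # g(x), arbitrary
h = x ** 2                             # h(x), arbitrary

lhs = np.mean(a * g + b * h)           # estimate of E[a g(X) + b h(X)]
rhs = a * np.mean(g) + b * np.mean(h)  # a E[g(X)] + b E[h(X)]
print(np.isclose(lhs, rhs))            # True
```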

Definition 6.5 (Covariance (Univariate)). The covariance between two univariate random variables $X, Y \in \mathbb{R}$ is

$$
\begin{aligned}
\operatorname{Cov}_{X,Y}[X, Y] &= \mathbb{E}_{X,Y}[(X - \mathbb{E}_X[X])(Y - \mathbb{E}_Y[Y])] \\
&= \mathbb{E}_{X,Y}[XY] - \mathbb{E}_X[X]\,\mathbb{E}_Y[Y]
\end{aligned}
$$

The covariance of a variable with itself is its variance, denoted $\mathbb{V}_X[X] = \operatorname{Cov}[X, X]$. The square root of the variance is the standard deviation, denoted $\sigma(X) = \sqrt{\mathbb{V}_X[X]}$.
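
A small numpy check of these relations on simulated data (the data-generating choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # correlated with x by construction

var_x = np.cov(x, x, ddof=0)[0, 1]       # Cov[X, X] ...
print(np.isclose(var_x, np.var(x)))      # ... equals V[X]
print(np.isclose(np.sqrt(var_x), np.std(x)))   # sigma(X) = sqrt(V[X])
print(np.cov(x, y, ddof=0)[0, 1])        # empirical Cov[X, Y], roughly 0.5 here
```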

Definition 6.6 (Covariance (Multivariate)). For multivariate random variables $X$ and $Y$ with states $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^E$, the covariance is

$$\operatorname{Cov}_{X,Y}[\mathbf{X}, \mathbf{Y}] = \mathbb{E}[\mathbf{X}\mathbf{Y}^\top] - \mathbb{E}[\mathbf{X}]\,\mathbb{E}[\mathbf{Y}]^\top = \operatorname{Cov}_{Y,X}[\mathbf{Y}, \mathbf{X}]^\top \in \mathbb{R}^{D \times E}\,.$$

Definition 6.7 (Variance (Multivariate)). The variance of $\mathbf{X}$ with states $\mathbf{x} \in \mathbb{R}^D$ and mean vector $\mathbf{\mu} \in \mathbb{R}^D$ is

$$
\begin{aligned}
\mathbb{V}_X[\mathbf{X}] &= \operatorname{Cov}_X[\mathbf{X}, \mathbf{X}] \\
&= \mathbb{E}_X[(\mathbf{X} - \mathbf{\mu})(\mathbf{X} - \mathbf{\mu})^\top] = \mathbb{E}_X[\mathbf{X}\mathbf{X}^\top] - \mathbb{E}_X[\mathbf{X}]\,\mathbb{E}_X[\mathbf{X}]^\top \\
&= \begin{bmatrix}
\operatorname{Cov}[X_1, X_1] & \operatorname{Cov}[X_1, X_2] & \dots & \operatorname{Cov}[X_1, X_D] \\
\operatorname{Cov}[X_2, X_1] & \operatorname{Cov}[X_2, X_2] & \dots & \operatorname{Cov}[X_2, X_D] \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{Cov}[X_D, X_1] & \operatorname{Cov}[X_D, X_2] & \dots & \operatorname{Cov}[X_D, X_D]
\end{bmatrix}\,.
\end{aligned}
$$

This $D \times D$ matrix is the covariance matrix. It is symmetric and positive semidefinite, and it describes the spread of the data. Its diagonal contains the variances of the marginal distributions.

Definition 6.8 (Correlation). The normalized covariance is the correlation:

$$\operatorname{corr}[X, Y] = \frac{\operatorname{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\mathbb{V}[Y]}} \in [-1, 1]\,.$$

The correlation matrix is the covariance matrix of the standardized random variables $X/\sigma(X)$ and $Y/\sigma(Y)$:

$$
\begin{aligned}
\operatorname{corr}[X, Y] &= \frac{\operatorname{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\mathbb{V}[Y]}} \\
&= \frac{\operatorname{Cov}(X, Y)}{\sigma(X)\sigma(Y)} \\
&= \frac{\mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y)}{\sigma(X)\sigma(Y)} \\
&= \mathbb{E}\left(\frac{X}{\sigma(X)}\frac{Y}{\sigma(Y)}\right) - \mathbb{E}\left(\frac{X}{\sigma(X)}\right)\mathbb{E}\left(\frac{Y}{\sigma(Y)}\right) \\
&= \operatorname{Cov}\left[\frac{X}{\sigma(X)}, \frac{Y}{\sigma(Y)}\right]
\end{aligned}
$$
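
A numpy sketch of this identity on simulated data (the data-generating choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)
y = 0.3 * x + rng.normal(size=200_000)

corr = np.corrcoef(x, y)[0, 1]                    # corr[X, Y]
x_std = (x - x.mean()) / x.std()                  # standardized X
y_std = (y - y.mean()) / y.std()                  # standardized Y
cov_of_std = np.cov(x_std, y_std, ddof=0)[0, 1]   # covariance of standardized vars

print(np.isclose(corr, cov_of_std))               # True
```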

Empirical Means And Covariances

Definition 6.9 (Empirical Mean and Covariance). Given data, we estimate the mean (Definition 6.4) as the empirical mean (or sample mean):

$$\bar{\mathbf{x}} := \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$$

where $\mathbf{x}_n \in \mathbb{R}^D$. The empirical covariance is

$$\Sigma := \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top$$

Note that the empirical covariance above is a biased estimate. The unbiased version, sometimes called the corrected covariance, uses the factor $N - 1$ in the denominator instead of $N$:

$$\Sigma := \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top$$
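
In numpy the two versions correspond to the ddof argument of np.cov; a quick sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))                  # N = 500 samples in R^3

x_bar = X.mean(axis=0)                         # empirical mean
diff = X - x_bar
sigma_biased = diff.T @ diff / len(X)          # divide by N
sigma_unbiased = diff.T @ diff / (len(X) - 1)  # divide by N - 1

# np.cov expects variables in rows; ddof=0 is biased, ddof=1 is unbiased.
print(np.allclose(sigma_biased, np.cov(X.T, ddof=0)))     # True
print(np.allclose(sigma_unbiased, np.cov(X.T, ddof=1)))   # True
```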

Three Expressions For The Variance

For a random variable $X$, the variance has three equivalent expressions. The first is the definition:

$$\mathbb{V}_X[X] := \mathbb{E}_X[(X - \mu)^2]$$

Estimating this empirically requires a two-pass algorithm: one pass through the data to compute the empirical mean $\hat{\mu}$, and a second pass to accumulate the squared deviations. Rearranging the definition gives the raw-score formula:

$$\mathbb{V}_X[X] = \mathbb{E}_X[X^2] - (\mathbb{E}_X[X])^2\,.$$

This is the mean of the square minus the square of the mean.

A third expression uses pairwise differences:

$$\frac{1}{N^2}\sum_{i,j=1}^{N}(x_i - x_j)^2 = 2\left[\frac{1}{N}\sum_{i=1}^{N}x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)^2\right]\,.$$

This is twice the raw-score expression: the average of the $N^2$ pairwise squared differences equals twice the average squared deviation from the mean.
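
A small numpy check that the three expressions agree on a sample (the sample itself is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=2_000)

# 1) Two-pass: first the mean, then the squared deviations.
mu_hat = x.mean()
var_two_pass = np.mean((x - mu_hat) ** 2)

# 2) Raw-score: mean of the square minus the square of the mean.
var_raw_score = np.mean(x ** 2) - x.mean() ** 2

# 3) Mean pairwise squared difference, which equals twice the variance.
pairwise = np.mean((x[:, None] - x[None, :]) ** 2)

print(np.isclose(var_two_pass, var_raw_score))   # True
print(np.isclose(pairwise, 2 * var_two_pass))    # True
```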

Sums And Transformation Of Random Variables

For random vectors $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^D$:

$$
\begin{aligned}
\mathbb{E}[\mathbf{X} + \mathbf{Y}] &= \mathbb{E}[\mathbf{X}] + \mathbb{E}[\mathbf{Y}] \\
\mathbb{E}[\mathbf{X} - \mathbf{Y}] &= \mathbb{E}[\mathbf{X}] - \mathbb{E}[\mathbf{Y}] \\
\mathbb{V}[\mathbf{X} + \mathbf{Y}] &= \mathbb{V}[\mathbf{X}] + \mathbb{V}[\mathbf{Y}] + \operatorname{Cov}[\mathbf{X}, \mathbf{Y}] + \operatorname{Cov}[\mathbf{Y}, \mathbf{X}] \\
\mathbb{V}[\mathbf{X} - \mathbf{Y}] &= \mathbb{V}[\mathbf{X}] + \mathbb{V}[\mathbf{Y}] - \operatorname{Cov}[\mathbf{X}, \mathbf{Y}] - \operatorname{Cov}[\mathbf{Y}, \mathbf{X}] \\
\mathbb{V}[\mathbf{C}] &= 0\,,
\end{aligned}
$$

where $\mathbf{C} \in \mathbb{R}^D$ is a constant vector.

For an affine transformation $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$ of a random vector $\mathbf{X}$ with mean $\mathbf{\mu}$ and covariance $\Sigma$:

$$
\begin{aligned}
\mathbb{E}_Y[\mathbf{Y}] &= \mathbf{A}\mathbf{\mu} + \mathbf{b}\,, \\
\mathbb{V}_Y[\mathbf{Y}] &= \mathbf{A}\Sigma\mathbf{A}^\top\,,
\end{aligned}
$$

and

$$\operatorname{Cov}[\mathbf{X}, \mathbf{Y}] = \Sigma\mathbf{A}^\top\,.$$
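
A Monte Carlo sketch of the affine-transformation rules (all parameters are invented; with a finite sample the equalities hold only approximately):

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=1_000_000)

# Affine transformation Y = A X + b.
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])
Y = X @ A.T + b

print(np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.05))            # E[Y] = A mu + b
print(np.allclose(np.cov(Y.T, ddof=0), A @ Sigma @ A.T, atol=0.05))  # V[Y] = A Sigma A^T
```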

Definition 6.10 (Independence). $X, Y$ are statistically independent if and only if

$$p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})\,p(\mathbf{y})\,.$$

Intuitively, independence means that knowing $y$ provides no additional information about $x$ (and vice versa). If $X, Y$ are independent, then:

  1. $p(y|x) = p(y)$
  2. $p(x|y) = p(x)$
  3. $\mathbb{V}[X + Y] = \mathbb{V}[X] + \mathbb{V}[Y]$
  4. $\operatorname{Cov}[X, Y] = 0$

The converse of the last point does not hold: zero covariance does not imply independence.
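
A classic counterexample, sketched numerically: with $X$ standard normal and $Y = X^2$, the covariance is (close to) zero, yet $Y$ is a deterministic function of $X$ and therefore not independent of it.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=1_000_000)
y = x ** 2                      # fully determined by x, hence not independent

# Cov[X, Y] = E[X^3] - E[X] E[X^2] = 0 for a symmetric zero-mean X.
print(np.cov(x, y, ddof=0)[0, 1])   # close to 0
```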

In machine learning, problems are often modeled with independent and identically distributed (i.i.d.) random variables.

Definition 6.11 (Conditional Independence). $X$ and $Y$ are conditionally independent given $Z$ if and only if

$$p(x, y\,|\,z) = p(x|z)\,p(y|z)$$

for all $z \in \mathcal{Z}$. Using the product rule, the left-hand side can also be written as

$$p(x, y\,|\,z) = p(x|y, z)\,p(y|z)$$

Comparing the two expressions shows that conditional independence is equivalent to $p(x|y, z) = p(x|z)$: once $z$ is known, $y$ carries no further information about $x$.

Inner Products Of Random Variables

Random variables can be viewed as vectors in a vector space; for zero-mean random variables, the covariance defines an inner product:

$$\langle X, Y \rangle := \operatorname{Cov}[X, Y]$$

The length of $X$ is

$$\|X\| = \sqrt{\operatorname{Cov}[X, X]} = \sqrt{\mathbb{V}[X]} = \sigma(X)$$

The angle $\theta$ between $X$ and $Y$ satisfies

$$\cos\theta = \frac{\langle X, Y \rangle}{\|X\|\,\|Y\|} = \frac{\operatorname{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\mathbb{V}[Y]}}\,,$$

which is the correlation (Definition 6.8). Thus, correlation is the cosine of the angle between two random variables in this geometric view.

Gaussian Distribution

a.k.a. Normal distribution.

For a univariate random variable, the Gaussian distribution has a density that is given by

$$p(x\,|\,\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The multivariate Gaussian distribution is characterized by a mean vector $\mathbf{\mu}$ and a covariance matrix $\mathbf{\Sigma}$, and is defined as

$$p(\mathbf{x}\,|\,\mathbf{\mu}, \mathbf{\Sigma}) = (2\pi)^{-\frac{D}{2}}\,|\mathbf{\Sigma}|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})\right)$$

where $\mathbf{x} \in \mathbb{R}^D$. We write $X \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})$. The standard normal distribution is the special case of the multivariate normal with zero mean and identity covariance, $\mathbf{\mu} = \mathbf{0}$ and $\mathbf{\Sigma} = \mathbf{I}$.

As a side note, I prefer to write the multivariate Gaussian distribution as

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{\frac{D}{2}}\,|\mathbf{\Sigma}|^{\frac{1}{2}}}\,e^{-\frac{(\mathbf{x}-\mathbf{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})}{2}}$$

A major advantage of the Gaussian distribution is that explicit variable transformations are often not needed. Since the Gaussian distribution is fully specified by its mean and covariance, we can often obtain the transformed distribution by applying the transformation directly to the mean and covariance of the random variable.
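
A minimal sketch of why this is convenient (parameters invented for illustration): if $X \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})$ and $Y = \mathbf{A}X + \mathbf{b}$, then $Y \sim \mathcal{N}(\mathbf{A}\mathbf{\mu} + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top)$, so the transformed distribution is obtained directly from the transformed mean and covariance, with no change-of-variables computation on the density.

```python
import numpy as np

# Invented parameters of X ~ N(mu, Sigma) and an affine map Y = A X + b.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([-1.0, 4.0])

# Parameters of the transformed Gaussian Y ~ N(A mu + b, A Sigma A^T).
mu_y = A @ mu + b
Sigma_y = A @ Sigma @ A.T
print(mu_y)      # mean of Y
print(Sigma_y)   # covariance of Y
```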