Probability Basics

Dec 3, 2024

Probability And Distribution

Probability And Random Variables

In ML, we often avoid explicitly referring to the probability space and instead refer to probabilities of the quantities of interest, which take values in a set denoted by $\mathcal{T}$. We refer to $\mathcal{T}$ as the target space and to elements of $\mathcal{T}$ as states.
We introduce a function $X: \Omega \rightarrow \mathcal{T}$ that takes an element of $\Omega$ (an outcome) and returns a particular quantity of interest $x$, a value in $\mathcal{T}$. For example, in the case of tossing two coins and counting the number of heads, we can define a random variable $X$ that maps outcomes to the number of heads: $X(hh) = 2, \dots$.
For any subset $S \subseteq \mathcal{T}$, we associate a probability $P_X(S) \in [0, 1]$ with the event that the random variable $X$ takes a value in $S$.
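
To make the mapping concrete, here is a minimal Python sketch for the two-coin example (the enumeration and function names are illustrative, not from the original text): it enumerates the sample space $\Omega$, defines $X$ as the number of heads, and evaluates $P_X(S)$ for a couple of subsets $S \subseteq \mathcal{T}$.

```python
from itertools import product
from fractions import Fraction

# Sample space for two fair coin tosses: Omega = {hh, ht, th, tt}.
omega = list(product("ht", repeat=2))

# Random variable X: Omega -> T, counting the number of heads.
def X(outcome):
    return outcome.count("h")

# Each outcome of a fair coin pair has probability 1/4.
p_outcome = Fraction(1, len(omega))

# P_X(S): probability that X takes a value in the subset S of T.
def P_X(S):
    return sum(p_outcome for w in omega if X(w) in S)

print(P_X({2}))      # 1/4, i.e., P(X = 2)
print(P_X({0, 1}))   # 3/4, i.e., P(X <= 1)
```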

Statistics

Difference between probability theory and statistics

| Probability | Statistics |
| --- | --- |
| The underlying uncertainty is captured by random variables | We observe that something has happened and try to figure out the underlying process that explains the observations |
| Known (or just assumed) model $\rightarrow$ predicted result | Known data $\rightarrow$ predicted model |

Discrete And Continuous Probabilities

Discrete Probabilities

Here the target space $\mathcal{T}$ is discrete. Consider the probability distribution of two (or more) random variables. The target space of the joint probability is the Cartesian product of the target spaces of the individual random variables. We define the joint probability as the probability of both values occurring jointly:

$$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$$

where $n_{ij}$ is the number of events with state $x_i$ and $y_j$, and $N$ is the total number of events. The joint probability is the probability of the intersection of the two events.

In machine learning, we use discrete probability distributions to model categorical variables, i.e., variables that take a finite set of unordered values.

The function that describes the (joint) probability of discrete variable(s) is called the probability mass function (pmf); it assigns a probability to each value a discrete random variable can take.
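
As a small illustration of a joint pmf built from counts (the counts below are made up), each entry of the table is $n_{ij}/N$:

```python
import numpy as np

# Hypothetical counts n_ij for two discrete variables X (rows) and Y (columns).
counts = np.array([[10,  5],
                   [20, 15]])
N = counts.sum()

# Joint pmf: P(X = x_i, Y = y_j) = n_ij / N.
joint = counts / N
print(joint)         # entries are the joint probabilities
print(joint.sum())   # 1.0
```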

Continuous Probabilities

Definition 6.1 (Probability Density Function, pdf). A function $f: \mathbb{R}^D \rightarrow \mathbb{R}$ is called a probability density function (pdf) if

  1. $\forall \mathbf{x} \in \mathbb{R}^D$: $f(\mathbf{x}) \ge 0$
  2. Its integral exists and
$$\int_{\mathbb{R}^D} f(\mathbf{x})\,d\mathbf{x} = 1$$

In contrast to discrete random variables, the probability of a continuous random variable $X$ taking any particular value is zero, i.e., $P(X = x) = 0$. Instead, we work with the cumulative distribution function, defined next.

Definition 6.2 (Cumulative Distribution Function, cdf). A cumulative distribution function (cdf) of a multivariate real-valued random variable $X$ with states $\mathbf{x} \in \mathbb{R}^D$ is given by

$$F_X(\mathbf{x}) = P(X_1 \le x_1, \dots, X_D \le x_D)$$

where $X = [X_1, \dots, X_D]^\top$ and $\mathbf{x} = [x_1, \dots, x_D]^\top$.

The cdf can also be expressed as the integral of the probability density function $f(\mathbf{x})$:

$$F_X(\mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_D} f(z_1, \dots, z_D)\,dz_1 \cdots dz_D$$

where the $z_i$ are dummy integration variables.
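
As a quick numerical sanity check in one dimension (the standard normal pdf is chosen only as an example), the cdf can be approximated by integrating the pdf up to $x$:

```python
import numpy as np

# Standard normal pdf (D = 1), used here purely as an example density.
def pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def cdf(x, lower=-10.0, num=100_001):
    # Approximate F_X(x) = integral of f(z) dz from -infinity to x
    # with a simple Riemann sum, truncating the lower limit at `lower`.
    z = np.linspace(lower, x, num)
    dz = z[1] - z[0]
    return float(np.sum(pdf(z)) * dz)

print(cdf(0.0))    # about 0.5
print(cdf(1.96))   # about 0.975
```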

Sum Rule / Product Rule / Bayes' Theorem

The Sum Rule

If $p(x, y)$ is a joint distribution, then $p(x)$ and $p(y)$ are the corresponding marginal distributions, and $p(y|x)$ is the conditional distribution of $y$ given $x$.

The sum rule states that,

$$p(x) = \begin{cases} \sum_{y \in \mathcal{Y}} p(x, y) & \text{if } Y \text{ is discrete} \\ \int_{\mathcal{Y}} p(x, y)\,dy & \text{if } Y \text{ is continuous} \end{cases}$$

The sum rule is also known as the marginalization property. In general, when the joint distribution contains more than two variables, the sum rule can be applied to any subset of the variables. More concretely, if $\mathbf{x} = [x_1, x_2, \dots, x_n]^\top$, we obtain the marginal

$$p(x_i) = \int p(x_1, x_2, \dots, x_n)\,d\mathbf{x}_{-i}$$

where $\mathbf{x}_{-i}$ denotes all variables except $x_i$.
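
For discrete variables the integral becomes a sum over the unwanted axes. A minimal numpy sketch (the joint table is invented):

```python
import numpy as np

# Hypothetical joint pmf p(x, y) on a 2 x 3 grid of states.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# Sum rule / marginalization: sum out the variable we do not care about.
p_x = joint.sum(axis=1)   # marginal p(x)
p_y = joint.sum(axis=0)   # marginal p(y)

print(p_x, p_x.sum())     # [0.4 0.6] 1.0
print(p_y, p_y.sum())     # [0.35 0.35 0.3] 1.0
```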

The Product Rule

The product rule relates the joint distribution to the conditional distribution:

$$p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})$$

The product rule states that every joint distribution of two random variables can be factorized into two other distributions: the marginal distribution of the first random variable $\mathbf{x}$ and the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$.
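
Continuing the same invented joint table, the factorization can be checked numerically by forming $p(\mathbf{y}|\mathbf{x}) = p(\mathbf{x}, \mathbf{y})/p(\mathbf{x})$ and multiplying back:

```python
import numpy as np

# Hypothetical joint pmf p(x, y); rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

p_x = joint.sum(axis=1, keepdims=True)   # marginal p(x)
p_y_given_x = joint / p_x                # conditional p(y | x)

# Product rule: p(x, y) = p(y | x) p(x).
reconstructed = p_y_given_x * p_x
print(np.allclose(reconstructed, joint))  # True
```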

Bayes’ Theorem

$$\underbrace{p(\mathbf{x}|\mathbf{y})}_{\text{posterior}} = \frac{\overbrace{p(\mathbf{y}|\mathbf{x})}^{\text{likelihood}}\;\overbrace{p(\mathbf{x})}^{\text{prior}}}{\underbrace{p(\mathbf{y})}_{\text{evidence}}}$$

$p(\mathbf{x})$ is the prior, which encapsulates our subjective prior knowledge of the unobserved (latent) variable $\mathbf{x}$ before observing any data. We can choose any prior that makes sense, but it is critical that its pdf/pmf is non-zero for all plausible values of $\mathbf{x}$, even rare ones. The prior is a probability distribution that does not depend on the observed data or on other random variables.

The likelihood $p(\mathbf{y}|\mathbf{x})$ (related to, but distinct from, the likelihood function) describes how $\mathbf{x}$ and $\mathbf{y}$ are related. In the case of discrete probability distributions, it is the probability of the data $\mathbf{y}$ given the latent variable $\mathbf{x}$.

The posterior $p(\mathbf{x}|\mathbf{y})$ is the quantity of interest in Bayesian statistics: it expresses what we know about $\mathbf{x}$ after observing $\mathbf{y}$, which is exactly what we care about.
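
A tiny numeric sketch of Bayes' theorem (all numbers are invented for illustration): a binary latent variable $x$ and a single binary observation $y = 1$.

```python
import numpy as np

# Invented prior and likelihood for a binary latent variable x.
prior = np.array([0.8, 0.2])         # p(x = 0), p(x = 1)
likelihood = np.array([0.1, 0.7])    # p(y = 1 | x = 0), p(y = 1 | x = 1)

# Evidence p(y = 1) via the sum rule, then Bayes' theorem.
evidence = np.sum(likelihood * prior)        # 0.22
posterior = likelihood * prior / evidence    # p(x | y = 1)

print(posterior)         # [0.3636... 0.6363...]
print(posterior.sum())   # 1.0
```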

Summary Statistics And Independence

A statistic of a random variable is a deterministic function of that variable.

Summary statistics provide a useful view of how a random variable behaves and characterize the distribution.

Means And (Co)Variances

These are also called the population mean and covariance, as they refer to the true statistics of the population.

Definition 6.3 (Expected Value). The expected value of a function $g: \mathbb{R} \rightarrow \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is

$$\mathbb{E}[g(X)] = \int_{\mathcal{X}} g(x)\,p(x)\,dx\,.$$

Correspondingly, the expected value of a function $g(x)$ of a univariate discrete random variable $X \sim p(x)$ is

$$\mathbb{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x)\,p(x)\,,$$

where $\mathcal{X}$ is the target space of $X$.

The expected value of a random vector $\mathbf{X} = [X_1, X_2, \dots, X_D]^\top$ is

$$\mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_D] \end{bmatrix} \in \mathbb{R}^D\,.$$

Definition 6.4 (Mean). The mean is a special case of the expected value, obtained by choosing $g$ to be the identity function. The mean of a random variable $X$ with states $\mathbf{x} \in \mathbb{R}^D$ is defined as

$$\mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_D] \end{bmatrix} \in \mathbb{R}^D\,,$$

where

$$\mathbb{E}[X_d] = \begin{cases} \sum_{x_d \in \mathcal{X}_d} x_d\,p(x_d) & \text{if } X \text{ is discrete} \\ \int_{\mathcal{X}_d} x_d\,f(x_d)\,dx_d & \text{if } X \text{ is continuous} \end{cases}$$

Other terminologies used to summarize a distribution include the median and the mode.

Properties of expected value

  1. The expected value is a linear operator (a numeric sanity check follows below). Given $f(\mathbf{x}) = ag(\mathbf{x}) + bh(\mathbf{x})$, where $a, b \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^D$:
$$
\begin{aligned}
\mathbb{E}_{\mathbf{X}}[f(\mathbf{x})] &= \int_{\mathcal{X}} f(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} \\
&= \int_{\mathcal{X}} [ag(\mathbf{x}) + bh(\mathbf{x})]\,p(\mathbf{x})\,d\mathbf{x} \\
&= a\int_{\mathcal{X}} g(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} + b\int_{\mathcal{X}} h(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} \\
&= a\,\mathbb{E}_{\mathbf{X}}[g(\mathbf{x})] + b\,\mathbb{E}_{\mathbf{X}}[h(\mathbf{x})]
\end{aligned}
$$
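
A quick Monte Carlo sanity check of this linearity (the distribution and the functions $g, h$ are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)         # samples of X (arbitrary choice)

a, b = 2.0, -3.0
g = np.sin(x)                          # g(x), arbitrary
h = x ** 2                             # h(x), arbitrary

lhs = np.mean(a * g + b * h)           # estimate of E[a g(X) + b h(X)]
rhs = a * np.mean(g) + b * np.mean(h)  # a E[g(X)] + b E[h(X)]
print(np.isclose(lhs, rhs))            # True
```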

Definition 6.5 (Covariance (Univariate)). The covariance between two univariate random variables $X, Y \in \mathbb{R}$ is

$$
\begin{aligned}
\operatorname{Cov}_{X,Y}[X, Y] &= \mathbb{E}_{X,Y}[(X - \mathbb{E}_X[X])(Y - \mathbb{E}_Y[Y])] \\
&= \mathbb{E}_{X,Y}[XY] - \mathbb{E}_X[X]\,\mathbb{E}_Y[Y]
\end{aligned}
$$

The covariance of a variable with itself is its variance, denoted $\mathbb{V}_X[X] = \operatorname{Cov}[X, X]$. The square root of the variance is the standard deviation, denoted $\sigma(X) = \sqrt{\mathbb{V}_X[X]}$.
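
A small numpy check of these relations on simulated data (the data-generating choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # correlated with x by construction

var_x = np.cov(x, x, ddof=0)[0, 1]       # Cov[X, X] ...
print(np.isclose(var_x, np.var(x)))      # ... equals V[X]
print(np.isclose(np.sqrt(var_x), np.std(x)))   # sigma(X) = sqrt(V[X])
print(np.cov(x, y, ddof=0)[0, 1])        # empirical Cov[X, Y], roughly 0.5 here
```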

Definition 6.6 (Covariance (Multivariate)). For multivariate random variables $X$ and $Y$ with states $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^E$, the covariance is

$$\operatorname{Cov}_{X,Y}[\mathbf{X}, \mathbf{Y}] = \mathbb{E}[\mathbf{X}\mathbf{Y}^\top] - \mathbb{E}[\mathbf{X}]\,\mathbb{E}[\mathbf{Y}]^\top = \operatorname{Cov}_{Y,X}[\mathbf{Y}, \mathbf{X}]^\top \in \mathbb{R}^{D \times E}\,.$$

Definition 6.7 (Variance (Multivariate)). The variance of $\mathbf{X}$ with states $\mathbf{x} \in \mathbb{R}^D$ and mean vector $\mathbf{\mu} \in \mathbb{R}^D$ is

$$
\begin{aligned}
\mathbb{V}_X[\mathbf{X}] &= \operatorname{Cov}_X[\mathbf{X}, \mathbf{X}] \\
&= \mathbb{E}_X[(\mathbf{X} - \mathbf{\mu})(\mathbf{X} - \mathbf{\mu})^\top] = \mathbb{E}_X[\mathbf{X}\mathbf{X}^\top] - \mathbb{E}_X[\mathbf{X}]\,\mathbb{E}_X[\mathbf{X}]^\top \\
&= \begin{bmatrix}
\operatorname{Cov}[X_1, X_1] & \operatorname{Cov}[X_1, X_2] & \dots & \operatorname{Cov}[X_1, X_D] \\
\operatorname{Cov}[X_2, X_1] & \operatorname{Cov}[X_2, X_2] & \dots & \operatorname{Cov}[X_2, X_D] \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{Cov}[X_D, X_1] & \operatorname{Cov}[X_D, X_2] & \dots & \operatorname{Cov}[X_D, X_D]
\end{bmatrix}\,.
\end{aligned}
$$

This $D \times D$ matrix is the covariance matrix. It is symmetric and positive semidefinite, and it describes the spread of the data. Its diagonal contains the variances of the marginal distributions.

Definition 6.8 (Correlation). The normalized covariance is the correlation:

$$\operatorname{corr}[X, Y] = \frac{\operatorname{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\mathbb{V}[Y]}} \in [-1, 1]\,.$$

The correlation matrix is the covariance matrix of the standardized random variables $X/\sigma(X)$ and $Y/\sigma(Y)$:

$$
\begin{aligned}
\operatorname{corr}[X, Y] &= \frac{\operatorname{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\mathbb{V}[Y]}} \\
&= \frac{\operatorname{Cov}(X, Y)}{\sigma(X)\sigma(Y)} \\
&= \frac{\mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y)}{\sigma(X)\sigma(Y)} \\
&= \mathbb{E}\left(\frac{X}{\sigma(X)}\frac{Y}{\sigma(Y)}\right) - \mathbb{E}\left(\frac{X}{\sigma(X)}\right)\mathbb{E}\left(\frac{Y}{\sigma(Y)}\right) \\
&= \operatorname{Cov}\left[\frac{X}{\sigma(X)}, \frac{Y}{\sigma(Y)}\right]
\end{aligned}
$$
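
A numpy sketch of this identity on simulated data (the data-generating choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)
y = 0.3 * x + rng.normal(size=200_000)

corr = np.corrcoef(x, y)[0, 1]                    # corr[X, Y]
x_std = (x - x.mean()) / x.std()                  # standardized X
y_std = (y - y.mean()) / y.std()                  # standardized Y
cov_of_std = np.cov(x_std, y_std, ddof=0)[0, 1]   # covariance of standardized vars

print(np.isclose(corr, cov_of_std))               # True
```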

Empirical Means And Covariances

Definition 6.9 (Empirical Mean and Covariance). Given data, we estimate the mean (Definition 6.4) as the empirical mean (or sample mean):

$$\bar{\mathbf{x}} := \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$$

where $\mathbf{x}_n \in \mathbb{R}^D$. The empirical covariance is

$$\Sigma := \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top$$

Note that the empirical covariance above is a biased estimate. The unbiased version, sometimes called the corrected covariance, uses the factor $N - 1$ in the denominator instead of $N$:

$$\Sigma := \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top$$
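
In numpy the two versions correspond to the ddof argument of np.cov; a quick sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))                  # N = 500 samples in R^3

x_bar = X.mean(axis=0)                         # empirical mean
diff = X - x_bar
sigma_biased = diff.T @ diff / len(X)          # divide by N
sigma_unbiased = diff.T @ diff / (len(X) - 1)  # divide by N - 1

# np.cov expects variables in rows; ddof=0 is biased, ddof=1 is unbiased.
print(np.allclose(sigma_biased, np.cov(X.T, ddof=0)))     # True
print(np.allclose(sigma_unbiased, np.cov(X.T, ddof=1)))   # True
```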

Three Expressions For The Variance

For a random variable $X$, the variance has three equivalent expressions. The first is the definition:

$$\mathbb{V}_X[X] := \mathbb{E}_X[(X - \mu)^2]$$

Estimating this empirically requires a two-pass algorithm: one pass through the data to compute the empirical mean $\hat{\mu}$, and a second pass to accumulate the squared deviations. Rearranging the definition gives the raw-score formula:

$$\mathbb{V}_X[X] = \mathbb{E}_X[X^2] - (\mathbb{E}_X[X])^2\,.$$

This is the mean of the square minus the square of the mean.

A third expression uses pairwise differences:

$$\frac{1}{N^2}\sum_{i,j=1}^{N}(x_i - x_j)^2 = 2\left[\frac{1}{N}\sum_{i=1}^{N}x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)^2\right]\,.$$

This is twice the raw-score expression: the average of the $N^2$ pairwise squared differences equals twice the average squared deviation from the mean.
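
A small numpy check that the three expressions agree on a sample (the sample itself is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=2_000)

# 1) Two-pass: first the mean, then the squared deviations.
mu_hat = x.mean()
var_two_pass = np.mean((x - mu_hat) ** 2)

# 2) Raw-score: mean of the square minus the square of the mean.
var_raw_score = np.mean(x ** 2) - x.mean() ** 2

# 3) Mean pairwise squared difference, which equals twice the variance.
pairwise = np.mean((x[:, None] - x[None, :]) ** 2)

print(np.isclose(var_two_pass, var_raw_score))   # True
print(np.isclose(pairwise, 2 * var_two_pass))    # True
```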

Sums And Transformation Of Random Variables

For random vectors $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^D$:

$$
\begin{aligned}
\mathbb{E}[\mathbf{X} + \mathbf{Y}] &= \mathbb{E}[\mathbf{X}] + \mathbb{E}[\mathbf{Y}] \\
\mathbb{E}[\mathbf{X} - \mathbf{Y}] &= \mathbb{E}[\mathbf{X}] - \mathbb{E}[\mathbf{Y}] \\
\mathbb{V}[\mathbf{X} + \mathbf{Y}] &= \mathbb{V}[\mathbf{X}] + \mathbb{V}[\mathbf{Y}] + \operatorname{Cov}[\mathbf{X}, \mathbf{Y}] + \operatorname{Cov}[\mathbf{Y}, \mathbf{X}] \\
\mathbb{V}[\mathbf{X} - \mathbf{Y}] &= \mathbb{V}[\mathbf{X}] + \mathbb{V}[\mathbf{Y}] - \operatorname{Cov}[\mathbf{X}, \mathbf{Y}] - \operatorname{Cov}[\mathbf{Y}, \mathbf{X}] \\
\mathbb{V}[\mathbf{C}] &= 0\,,
\end{aligned}
$$

where $\mathbf{C} \in \mathbb{R}^D$ is a constant vector.

For an affine transformation $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$ of a random vector $\mathbf{X}$ with mean $\mathbf{\mu}$ and covariance $\Sigma$:

$$
\begin{aligned}
\mathbb{E}_Y[\mathbf{Y}] &= \mathbf{A}\mathbf{\mu} + \mathbf{b}\,, \\
\mathbb{V}_Y[\mathbf{Y}] &= \mathbf{A}\Sigma\mathbf{A}^\top\,,
\end{aligned}
$$

and

$$\operatorname{Cov}[\mathbf{X}, \mathbf{Y}] = \Sigma\mathbf{A}^\top\,.$$
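
A Monte Carlo sketch of the affine-transformation rules (all parameters are invented; with a finite sample the equalities hold only approximately):

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=1_000_000)

# Affine transformation Y = A X + b.
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])
Y = X @ A.T + b

print(np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.05))            # E[Y] = A mu + b
print(np.allclose(np.cov(Y.T, ddof=0), A @ Sigma @ A.T, atol=0.05))  # V[Y] = A Sigma A^T
```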

Definition 6.10 (Independence). $X, Y$ are statistically independent if and only if

$$p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})\,p(\mathbf{y})\,.$$

Intuitively, independence means that knowing $y$ provides no additional information about $x$ (and vice versa). If $X, Y$ are independent, then:

  1. $p(y|x) = p(y)$
  2. $p(x|y) = p(x)$
  3. $\mathbb{V}[X + Y] = \mathbb{V}[X] + \mathbb{V}[Y]$
  4. $\operatorname{Cov}[X, Y] = 0$

The converse of the last point does not hold: zero covariance does not imply independence.
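
A classic counterexample, sketched numerically: with $X$ standard normal and $Y = X^2$, the covariance is (close to) zero, yet $Y$ is a deterministic function of $X$ and therefore not independent of it.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=1_000_000)
y = x ** 2                      # fully determined by x, hence not independent

# Cov[X, Y] = E[X^3] - E[X] E[X^2] = 0 for a symmetric zero-mean X.
print(np.cov(x, y, ddof=0)[0, 1])   # close to 0
```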

In machine learning, problems are often modeled with independent and identically distributed (i.i.d.) random variables.

Definition 6.11 (Conditional Independence). $X$ and $Y$ are conditionally independent given $Z$ if and only if

$$p(x, y\,|\,z) = p(x|z)\,p(y|z)$$

for all $z \in \mathcal{Z}$. Using the product rule, the left-hand side can also be written as

$$p(x, y\,|\,z) = p(x|y, z)\,p(y|z)$$

Comparing the two expressions shows that conditional independence is equivalent to $p(x|y, z) = p(x|z)$: once $z$ is known, $y$ carries no further information about $x$.

Inner Products Of Random Variables

Random variables can be viewed as vectors in a vector space; for zero-mean random variables, the covariance defines an inner product:

$$\langle X, Y \rangle := \operatorname{Cov}[X, Y]$$

The length of $X$ is

$$\|X\| = \sqrt{\operatorname{Cov}[X, X]} = \sqrt{\mathbb{V}[X]} = \sigma(X)$$

The angle $\theta$ between $X$ and $Y$ satisfies

$$\cos\theta = \frac{\langle X, Y \rangle}{\|X\|\,\|Y\|} = \frac{\operatorname{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\mathbb{V}[Y]}}\,,$$

which is the correlation (Definition 6.8). Thus, correlation is the cosine of the angle between two random variables in this geometric view.

Gaussian Distribution

a.k.a. Normal distribution.

For a univariate random variable, the Gaussian distribution has a density that is given by

$$p(x\,|\,\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The multivariate Gaussian distribution is characterized by a mean vector $\mathbf{\mu}$ and a covariance matrix $\mathbf{\Sigma}$, and is defined as

$$p(\mathbf{x}\,|\,\mathbf{\mu}, \mathbf{\Sigma}) = (2\pi)^{-\frac{D}{2}}\,|\mathbf{\Sigma}|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})\right)$$

where $\mathbf{x} \in \mathbb{R}^D$. We write $X \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})$. The standard normal distribution is the special case of the multivariate normal with zero mean and identity covariance, $\mathbf{\mu} = \mathbf{0}$ and $\mathbf{\Sigma} = \mathbf{I}$.

As a side note, I prefer to write the multivariate Gaussian distribution as

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{\frac{D}{2}}\,|\mathbf{\Sigma}|^{\frac{1}{2}}}\,e^{-\frac{(\mathbf{x}-\mathbf{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})}{2}}$$

A major advantage of the Gaussian distribution is that explicit variable transformations are often not needed. Since the Gaussian distribution is fully specified by its mean and covariance, we can often obtain the transformed distribution by applying the transformation directly to the mean and covariance of the random variable.
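
A minimal sketch of why this is convenient (parameters invented for illustration): if $X \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})$ and $Y = \mathbf{A}X + \mathbf{b}$, then $Y \sim \mathcal{N}(\mathbf{A}\mathbf{\mu} + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top)$, so the transformed distribution is obtained directly from the transformed mean and covariance, with no change-of-variables computation on the density.

```python
import numpy as np

# Invented parameters of X ~ N(mu, Sigma) and an affine map Y = A X + b.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([-1.0, 4.0])

# Parameters of the transformed Gaussian Y ~ N(A mu + b, A Sigma A^T).
mu_y = A @ mu + b
Sigma_y = A @ Sigma @ A.T
print(mu_y)      # mean of Y
print(Sigma_y)   # covariance of Y
```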