A probability space allows us to quantify the idea of probability. Often, however, we do not work directly with the probability space. Instead, we work with random variables, which transfer the probability to a more convenient (often numerical) space. A probability space consists of three components:
The sample space $\Omega$ is the set of all possible outcomes of the experiment.
The event space $\mathcal{A}$ is the space of potential results of the experiment. It is obtained by considering the collection of subsets of $\Omega$, and for discrete probability distributions $\mathcal{A}$ is often the power set of $\Omega$. The event space is a σ-algebra.
The probability $P$. With each event $A \in \mathcal{A}$, we associate a number $P(A)$ that measures the degree of belief that the event will occur. $P(A)$ is called the probability of event $A$.
In ML, we often avoid explicitly referring to the probability space and instead refer to probabilities of quantities of interest, which take values in a set we denote by $\mathcal{T}$. We refer to $\mathcal{T}$ as the target space and to elements of $\mathcal{T}$ as states. We introduce a function $X: \Omega \to \mathcal{T}$ that takes an element of $\Omega$ (an outcome) and returns a particular quantity of interest $x$, a value in $\mathcal{T}$. This function is the random variable. For example, when tossing two coins and counting the number of heads, we can define a random variable $X$ that maps outcomes to the number of heads: $X(hh) = 2$, …. For any subset $S \subseteq \mathcal{T}$, we associate a probability $P_X(S) \in [0, 1]$ with the event that the random variable $X$ takes a value in $S$.
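The two-coin example can be made concrete with a small sketch. It assumes fair, independent coins (the notes do not fix the coin probabilities), and the names `omega`, `P`, `X`, and `P_X` are illustrative choices rather than anything prescribed by the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing two coins: outcomes such as ('h', 't').
omega = list(product("ht", repeat=2))

# Probability of each outcome, ASSUMING fair and independent coins.
P = {outcome: Fraction(1, len(omega)) for outcome in omega}

def X(outcome):
    """Random variable: maps an outcome to its number of heads."""
    return outcome.count("h")

def P_X(S):
    """Probability that X takes a value in the subset S of the target space."""
    return sum(P[w] for w in omega if X(w) in S)

# Target space of X and the induced distribution over its states.
target_space = {X(w) for w in omega}
pmf = {x: P_X({x}) for x in sorted(target_space)}
```

Here `P_X({1, 2})` is the probability of seeing at least one head, computed entirely from the probabilities of the underlying outcomes.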
Statistics
Difference between probability theory and statistics
| Probability | Statistics |
| --- | --- |
| Underlying uncertainty is captured by random variables | We observe that something has happened and try to figure out the underlying process that explains the observations |
| Known (or just assumed) model → predicted result | Known data → predicted model |
Discrete And Continuous Probabilities
Discrete Probabilities
Here the target space $\mathcal{T}$ is discrete. Consider the probability distribution of two (or more) random variables. The target space of the joint probability is the Cartesian product of the target spaces of the individual random variables. We define the joint probability as the probability of both values occurring jointly:
$$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N},$$
where $n_{ij}$ is the number of events with state $x_i$ and $y_j$, and $N$ is the total number of events. The joint probability is the probability of the intersection of both events.
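As a minimal sketch of the counting definition of the joint probability, the snippet below tallies pairs of states and divides by the total number of events. The observations in `events` are hypothetical values made up for illustration.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical observations of (x, y) pairs; the values are illustrative only.
events = [("sunny", "walk"), ("sunny", "walk"), ("sunny", "bus"),
          ("rainy", "bus"), ("rainy", "bus"), ("rainy", "walk")]

N = len(events)
counts = Counter(events)  # n_ij: number of events with state (x_i, y_j)

# Joint probability P(X = x_i, Y = y_j) = n_ij / N.
joint = {pair: Fraction(n, N) for pair, n in counts.items()}
```

The keys of `joint` range over the Cartesian product of the two target spaces (restricted to pairs that were actually observed), and the values sum to one.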
In machine learning, we use discrete probability distributions to model categorical variables, i.e., variables that take a finite set of unordered values.
The function that describes the (joint) probability of discrete variable(s) is called the probability mass function (pmf). It measures the probability of each value a discrete random variable can take.
Continuous Probabilities
Definition 6.1 (Probability Density Function, pdf). A function $f: \mathbb{R}^D \to \mathbb{R}$ is called a probability density function (pdf) if
$\forall x \in \mathbb{R}^D: f(x) \ge 0$
Its integral exists and
$$\int_{\mathbb{R}^D} f(x)\, dx = 1$$
In contrast to discrete random variables, the probability of a continuous random variable X taking a particular value P(X=x) is zero. The following is the definition of cdf.
Definition 6.2 (Cumulative Distribution Function, cdf). A cumulative distribution function (cdf) of a multivariate real-valued random variable $X$ with states $x \in \mathbb{R}^D$ is given by
$$F_X(x) = P(X_1 \le x_1, \ldots, X_D \le x_D),$$
where $X = [X_1, \ldots, X_D]^\top$ and $x = [x_1, \ldots, x_D]^\top$.
The cdf can also be expressed as the integral of the probability density function $f(x)$:
$$F_X(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_D} f(z_1, \ldots, z_D)\, dz_1 \cdots dz_D,$$
where the $z_i$ are dummy variables of integration.
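The pdf and cdf definitions can be checked numerically in one dimension. The sketch below assumes an example density $f(x) = 2x$ on $[0, 1]$ (zero elsewhere), chosen only for illustration, and uses a simple midpoint rule. The helper names `integrate` and `cdf` are hypothetical.

```python
def f(x):
    """Example density on R (an assumption for illustration): f(x) = 2x on [0, 1]."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(func, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return sum(func(a + (k + 0.5) * h) for k in range(steps)) * h

def cdf(x, lower=-1.0):
    """F_X(x) = P(X <= x); f is zero below `lower`, so integrating from there suffices."""
    return integrate(f, lower, x)

total = integrate(f, -1.0, 2.0)      # approximately 1, as a pdf requires
point_mass = integrate(f, 0.5, 0.5)  # zero-width interval: P(X = 0.5) is 0
```

For this density the exact cdf on $[0, 1]$ is $x^2$, so `cdf(0.5)` should be close to 0.25, while the probability of any single point is zero.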
Sum Rule / Product Rule / Bayes' Theorem
The Sum Rule
If p(x,y) is a joint distribution, then p(x) and p(y) are the corresponding marginal distributions, and p(y∣x) is the conditional distribution of y given x.
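The sum rule states that
$$p(x) = \begin{cases} \displaystyle\sum_{y \in \mathcal{Y}} p(x, y) & \text{if } y \text{ is discrete} \\[2ex] \displaystyle\int_{\mathcal{Y}} p(x, y)\, dy & \text{if } y \text{ is continuous} \end{cases}$$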
The sum rule is also known as the marginalization property. In general, when the joint distribution contains more than two variables, the sum rule can be applied to any subset of variables. More concretely, if $x = [x_1, x_2, \ldots, x_n]^\top$, we obtain the marginal
$$p(x_i) = \int p(x_1, x_2, \ldots, x_n)\, dx_{-i},$$
where $x_{-i}$ means all variables except $x_i$.
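For discrete variables, marginalization is just a sum over the states of the variables being removed. The following is a minimal sketch with a hypothetical joint pmf. The table values and the helper `marginal` are illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint pmf p(x, y) over x in {0, 1} and y in {'a', 'b', 'c'}.
joint = {
    (0, "a"): Fraction(1, 10), (0, "b"): Fraction(2, 10), (0, "c"): Fraction(1, 10),
    (1, "a"): Fraction(3, 10), (1, "b"): Fraction(1, 10), (1, "c"): Fraction(2, 10),
}

def marginal(joint, axis):
    """Sum the joint pmf over every variable except the one at position `axis`."""
    result = defaultdict(Fraction)  # Fraction() is zero
    for states, prob in joint.items():
        result[states[axis]] += prob
    return dict(result)

p_x = marginal(joint, axis=0)  # sums over y
p_y = marginal(joint, axis=1)  # sums over x
```

Because `marginal` indexes into the tuple of states, the same function works unchanged for joint tables over more than two variables.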
The Product Rule
The product rule relates the joint distribution to the conditional distribution:
p(x,y)=p(y∣x)p(x)
The product rule can be interpreted as the fact that every joint distribution of two variables can be factorized into two other distributions. The two factors are the marginal distribution of the first random variable x and the conditional distribution of y given x.
Bayes' Theorem
Applying the product rule in both directions gives $p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$. Solving for $p(x \mid y)$ yields Bayes' theorem:
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$
Here $p(y)$ is the marginal likelihood (evidence), obtained from the sum rule. The remaining terms are described in turn.
p(x) is the prior, which encapsulates our subjective prior knowledge of the unobserved (latent) variable x before observing any data. We can choose any prior that makes sense, but it is critical to ensure the pdf/pmf of the prior is non-zero for all plausible x, even if they are rare. The prior is a probability distribution that does not rely on observed data and is independent of other random variables.
The likelihood (different from likelihood function, but related) p(y∣x) describes how x and y are related. In the case of discrete probability distributions, it is the probability of the data y given the latent variable x.
The posterior $p(x \mid y)$ is the quantity of interest in Bayesian statistics. It expresses “what we know about x given y”, which is exactly what we care about.
Bayes’ theorem allows us to invert the relationship between x and y given by the likelihood. Therefore, Bayes’ theorem is sometimes called the probabilistic inverse.
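The roles of prior, likelihood, and posterior can be illustrated with a tiny discrete example. This is a sketch under stated assumptions: the latent variable is whether a coin is fair or biased, the observation is a single head, and all numbers are hypothetical.

```python
from fractions import Fraction

# Hypothetical prior p(x) over a latent binary state x.
prior = {"fair": Fraction(9, 10), "biased": Fraction(1, 10)}

# Hypothetical likelihood p(y = heads | x) under each value of x.
likelihood_heads = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}

def posterior(prior, likelihood):
    """Bayes' theorem: p(x | y) = p(y | x) p(x) / p(y)."""
    # Product rule: joint p(x, y) for the observed y.
    joint = {x: likelihood[x] * prior[x] for x in prior}
    # Sum rule: evidence p(y), obtained by marginalizing over x.
    evidence = sum(joint.values())
    return {x: joint[x] / evidence for x in joint}

post = posterior(prior, likelihood_heads)
```

With these numbers, observing one head moves the probability of the biased coin from 1/10 to 1/6. Note that the prior is non-zero for both states, which is what allows the data to update either of them.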
Summary Statistics And Independence
A statistic of a random variable is a deterministic function of that variable.
Summary statistics provide a useful view of how a random variable behaves and characterize the distribution.
Means And (Co)Variances
The mean and covariance defined here are also called the population mean and covariance, as they refer to the true statistics of the population.
Definition 6.3 (Expected Value). The expected value of a function $g: \mathbb{R} \to \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is
$$\mathbb{E}[g(X)] = \int_{\mathcal{X}} g(x)\, p(x)\, dx.$$
Correspondingly, the expected value of a function $g(x)$ of a univariate discrete random variable $X \sim p(x)$ is
$$\mathbb{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x)\, p(x),$$
where $\mathcal{X}$ is the target space of $X$.
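The expected value of a random vector $X = [X_1, X_2, \ldots, X_D]^\top$ is
$$\mathbb{E}[X] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_D] \end{bmatrix} \in \mathbb{R}^D.$$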
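Definition 6.4 (Mean). The mean is a special case of expected value, obtained by choosing $g$ as the identity function. The mean of a random variable $X$ with states $x \in \mathbb{R}^D$ is defined as
$$\mathbb{E}[X] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_D] \end{bmatrix} \in \mathbb{R}^D,$$
where, for $d = 1, \ldots, D$,
$$\mathbb{E}[X_d] = \begin{cases} \displaystyle\sum_{x_d \in \mathcal{X}_d} x_d\, p(x_d) & \text{if } X \text{ is discrete} \\[2ex] \displaystyle\int_{\mathcal{X}_d} x_d\, f(x_d)\, dx_d & \text{if } X \text{ is continuous} \end{cases}$$
and $\mathcal{X}_d$ is the target space of $X_d$.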
Other terminologies:
Median: the middle value once the values are sorted.
Mode: the most frequently occurring value.
For discrete distributions, the mode is the value with the highest frequency.
For continuous distributions, the mode is the peak of the density p(x). A distribution may have multiple modes, and finding all modes can be computationally challenging in high dimensions.
Properties of expected value
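Expected value is a linear operator. Given $f(x) = a\,g(x) + b\,h(x)$ where $a, b \in \mathbb{R}$ and $x \in \mathbb{R}^D$:
$$\mathbb{E}_X[f(x)] = a\,\mathbb{E}_X[g(x)] + b\,\mathbb{E}_X[h(x)]$$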
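To make Definition 6.3 and the linearity property concrete, the sketch below computes expectations under the pmf of a fair die (an illustrative choice) and checks that $\mathbb{E}[a\,g(X) + b\,h(X)] = a\,\mathbb{E}[g(X)] + b\,\mathbb{E}[h(X)]$. The helper name `expectation` and the constants `a`, `b` are assumptions for the example.

```python
from fractions import Fraction

# pmf of a fair six-sided die (an illustrative choice of p(x)).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g, pmf):
    """E[g(X)] = sum over x in the target space of g(x) p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x, pmf)               # g is the identity
second_moment = expectation(lambda x: x * x, pmf)  # g(x) = x^2

# Linearity: E[a g(X) + b h(X)] = a E[g(X)] + b E[h(X)].
a, b = 3, -2
lhs = expectation(lambda x: a * x + b * x * x, pmf)
rhs = a * mean + b * second_moment
```

Using exact fractions avoids rounding error, so `lhs == rhs` holds exactly rather than approximately.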
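Definition 6.5 (Covariance (Univariate)). The covariance between two univariate random variables $X, Y \in \mathbb{R}$ is the expected product of their deviations from their respective means:
$$\mathrm{Cov}[X, Y] := \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]$$
The covariance of a variable with itself is its variance, denoted $\mathbb{V}_X[X] = \mathrm{Cov}[X, X]$. The square root of the variance is the standard deviation, denoted $\sigma(X) = \sqrt{\mathbb{V}_X[X]}$.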
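Definition 6.6 (Covariance (Multivariate)). For multivariate random variables $X$ and $Y$ with states $x \in \mathbb{R}^D$ and $y \in \mathbb{R}^E$, the covariance is
$$\mathrm{Cov}[X, Y] = \mathbb{E}[XY^\top] - \mathbb{E}[X]\,\mathbb{E}[Y]^\top = \mathrm{Cov}[Y, X]^\top \in \mathbb{R}^{D \times E}.$$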
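Definition 6.7 (Variance (Multivariate)). The variance of $X$ with states $x \in \mathbb{R}^D$ and mean $\mu \in \mathbb{R}^D$ is
$$\mathbb{V}_X[X] = \mathrm{Cov}[X, X] = \mathbb{E}_X[(X - \mu)(X - \mu)^\top] = \mathbb{E}_X[XX^\top] - \mathbb{E}_X[X]\,\mathbb{E}_X[X]^\top$$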
This D×D matrix is the covariance matrix. It is symmetric, positive semidefinite, and describes the spread of the data. Its diagonal contains the variances of the marginal distributions.
Definition 6.8 (Correlation). The normalized covariance is the correlation:
$$\mathrm{corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}} \in [-1, 1].$$
The correlation matrix is the covariance matrix of the standardized variables $X/\sigma(X)$ and $Y/\sigma(Y)$.
Covariance (and correlation) indicate how two random variables are related.
Positive correlation: When X increases, Y tends to increase.
Negative correlation: When X increases, Y tends to decrease.
Empirical Means And Covariances
Definition 6.9 (Empirical Mean and Covariance). Given data, we estimate the mean (Definition 6.4) as the empirical mean (or sample mean):
$$\bar{x} := \frac{1}{N}\sum_{n=1}^{N} x_n,$$
where $x_n \in \mathbb{R}^D$. The empirical covariance is
$$\Sigma := \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top$$
Note that the empirical covariance presented above is a biased estimate. The unbiased one, which is sometimes called corrected, has the factor $N - 1$ in the denominator instead of $N$. The unbiased empirical covariance is
$$\Sigma := \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top$$
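The empirical mean and both covariance estimates can be sketched in plain Python. The data set and the function names `empirical_mean` and `empirical_covariance` are hypothetical and chosen only for illustration. In practice a numerical library would normally be used instead.

```python
def empirical_mean(data):
    """Component-wise sample mean of a list of D-dimensional points."""
    N, D = len(data), len(data[0])
    return [sum(x[d] for x in data) / N for d in range(D)]

def empirical_covariance(data, unbiased=False):
    """Sum of outer products (x_n - mean)(x_n - mean)^T, divided by N or N - 1."""
    N, D = len(data), len(data[0])
    mean = empirical_mean(data)
    denom = N - 1 if unbiased else N
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in data) / denom
             for j in range(D)] for i in range(D)]

# Hypothetical 2-D data set, chosen only for illustration.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [6.0, 4.0]]

x_bar = empirical_mean(data)
cov_biased = empirical_covariance(data)
cov_unbiased = empirical_covariance(data, unbiased=True)
```

The two estimates differ only by the factor $N/(N-1)$, so the difference matters mainly for small $N$. Both matrices are symmetric, with the per-dimension variances on the diagonal.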
Three Expressions For The Variance
For a random variable X, the variance has three equivalent expressions:
$$\mathbb{V}_X[X] := \mathbb{E}_X[(X - \mu)^2]$$
Empirically estimating this requires a two-pass algorithm: one pass to compute $\hat{\mu}$, then another to compute the variance. Rearranging gives the raw-score formula:
$$\mathbb{V}_X[X] = \mathbb{E}_X[X^2] - (\mathbb{E}_X[X])^2.$$
This is the mean of the square minus the square of the mean.
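The third expression uses the pairwise differences between all observations $x_1, \ldots, x_N$:
$$\frac{1}{N^2}\sum_{i,j=1}^{N}(x_i - x_j)^2 = 2\left[\frac{1}{N}\sum_{i=1}^{N}x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)^2\right]$$
This equals twice the raw-score expression. Thus, the average of the $N^2$ pairwise squared differences equals twice the average of the $N$ squared deviations from the mean.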
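The three forms can be compared numerically on a small sample. The observations `xs` below are hypothetical, and the variable names are illustrative. All three use the biased (divide by $N$) convention.

```python
# Hypothetical observations x_1, ..., x_N.
xs = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
N = len(xs)

# 1) Two-pass: compute the mean first, then the average squared deviation.
mu = sum(xs) / N
var_two_pass = sum((x - mu) ** 2 for x in xs) / N

# 2) Raw-score: mean of the squares minus the square of the mean (one pass).
var_raw = sum(x * x for x in xs) / N - (sum(xs) / N) ** 2

# 3) Pairwise: average of all N^2 squared differences is twice the variance.
pairwise = sum((xi - xj) ** 2 for xi in xs for xj in xs) / N ** 2
var_pairwise = pairwise / 2
```

The three values agree up to floating-point rounding. The raw-score form needs only one pass over the data but subtracts two potentially large numbers, so it can lose precision when the mean is large relative to the spread.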
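For an affine transformation $Y = AX + b$ of a random variable $X$ with mean $\mu$ and covariance $\Sigma$:
$$\begin{aligned} \mathbb{E}_Y[Y] &= A\mu + b, \\ \mathbb{V}_Y[Y] &= A\Sigma A^\top, \end{aligned}$$
and
$$\mathrm{Cov}[X, Y] = \Sigma A^\top$$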
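These affine-transformation identities also hold for empirical moments, which makes them easy to check on data. The sketch below uses small hand-written matrix helpers and a hypothetical data set. `A`, `b`, and `xs` are illustrative values, and the helper names are assumptions rather than any library's API.

```python
def mean(points):
    """Component-wise empirical mean."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def cross_cov(xs, ys):
    """Empirical Cov[X, Y] = (1/N) sum_n (x_n - x_bar)(y_n - y_bar)^T."""
    mx, my, n = mean(xs), mean(ys), len(xs)
    return [[sum((x[i] - mx[i]) * (y[j] - my[j]) for x, y in zip(xs, ys)) / n
             for j in range(len(my))] for i in range(len(mx))]

def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical transformation and data, chosen only for illustration.
A = [[2.0, 0.0], [1.0, 1.0]]
b = [1.0, -1.0]
xs = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [6.0, 4.0]]
ys = [[sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)] for x in xs]

mu, sigma = mean(xs), cross_cov(xs, xs)
mean_y_predicted = [sum(a * m for a, m in zip(row, mu)) + bi for row, bi in zip(A, b)]
cov_y_predicted = matmul(matmul(A, sigma), transpose(A))  # A Sigma A^T
cov_xy_predicted = matmul(sigma, transpose(A))            # Sigma A^T
```

Comparing `mean(ys)`, `cross_cov(ys, ys)`, and `cross_cov(xs, ys)` with the three predicted quantities shows that the offset `b` shifts the mean but drops out of every covariance.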
Definition 6.10 (Independence). X,Y are statistically independent iff
p(x,y)=p(x)p(y).
Independence means y provides no additional information about x. If X,Y are independent:
p(y∣x)=p(y)
p(x∣y)=p(x)
V[X+Y]=V[X]+V[Y]
V[X−Y]=V[X]+V[Y]
Cov[X,Y]=0
The converse of the last point does not hold: zero covariance does not imply independence.
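A standard counterexample makes this concrete: let $X$ be uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The sketch below (with illustrative names `p_x`, `joint`, `E`) shows that the covariance is zero even though $Y$ is completely determined by $X$.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2, so Y is fully determined by X.
p_x = {x: Fraction(1, 3) for x in (-1, 0, 1)}
joint = {(x, x * x): p for x, p in p_x.items()}

def E(g):
    """Expectation of g(x, y) under the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

# Cov[X, Y] = E[XY] - E[X] E[Y].
cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

# Marginal of Y via the sum rule.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p

# Independence would require p(x, y) = p(x) p(y) for every pair (x, y).
independent = all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x in p_x for y in p_y)
```

Covariance only captures linear dependence, and here the dependence between $X$ and $Y$ is purely quadratic, so `cov` is zero while `independent` is `False`.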
In machine learning, problems are often modeled with independent and identically distributed (i.i.d.) random variables.
Definition 6.11 (Conditional Independence). X and Y are conditionally independent given Z iff
p(x,y∣z)=p(x∣z)p(y∣z)
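for all $z \in \mathcal{Z}$. Using the product rule, the left-hand side can be expanded as
$$p(x, y \mid z) = p(x \mid y, z)\, p(y \mid z)$$
Comparing this with the definition gives $p(x \mid y, z) = p(x \mid z)$: once $z$ is known, $y$ provides no additional information about $x$.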
Inner Products Of Random Variables
Random variables can be viewed as vectors in a vector space. For zero-mean random variables $X$ and $Y$, define the inner product:
⟨X,Y⟩:=Cov[X,Y]
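The length of $X$ is
$$\|X\| = \sqrt{\mathrm{Cov}[X, X]} = \sqrt{\mathbb{V}[X]} = \sigma[X]$$
The angle $\theta$ between $X$ and $Y$ satisfies
$$\cos\theta = \frac{\langle X, Y \rangle}{\|X\|\,\|Y\|} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}},$$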
which is the correlation (Definition 6.8). Thus, correlation is the cosine of the angle between two random variables in this geometric view.
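The geometric view can be tried on data by treating centered samples as vectors. This is a minimal sketch with hypothetical paired observations. The helper `inner` uses the empirical covariance as the inner product.

```python
import math

# Hypothetical paired observations of X and Y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]
N = len(xs)

mx, my = sum(xs) / N, sum(ys) / N
# Centering makes the samples zero-mean, so the inner product is the covariance.
cx = [x - mx for x in xs]
cy = [y - my for y in ys]

def inner(u, v):
    """Empirical covariance of two centered samples, used as <X, Y>."""
    return sum(a * b for a, b in zip(u, v)) / N

length_x = math.sqrt(inner(cx, cx))  # standard deviation of X
length_y = math.sqrt(inner(cy, cy))  # standard deviation of Y
cos_theta = inner(cx, cy) / (length_x * length_y)  # correlation
theta = math.acos(cos_theta)  # angle between X and Y in radians
```

A correlation near 1 corresponds to a small angle (the centered vectors point in nearly the same direction), and zero correlation corresponds to orthogonal vectors.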
Gaussian Distribution
a.k.a. Normal distribution.
For a univariate random variable, the Gaussian distribution has a density that is given by
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{6.62}$$
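The multivariate Gaussian distribution is characterized by a mean vector $\mu$ and a covariance matrix $\Sigma$, and is defined as
$$p(x \mid \mu, \Sigma) = (2\pi)^{-\frac{D}{2}}\, |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right),$$
where $x \in \mathbb{R}^D$. We write $X \sim \mathcal{N}(\mu, \Sigma)$. The standard normal distribution is the special case of the multivariate normal distribution with zero mean and identity covariance, that is, $\mu = 0$ and $\Sigma = I$.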
Side note: I prefer to write the multivariate Gaussian density as
$$f(x) = \frac{1}{(2\pi)^{\frac{D}{2}}\, |\Sigma|^{\frac{1}{2}}}\, e^{-\frac{(x - \mu)^\top \Sigma^{-1} (x - \mu)}{2}}$$
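The two densities can be evaluated directly from their formulas. The sketch below implements the univariate case and, to stay dependency-free, the multivariate case only for $D = 2$ with a closed-form 2×2 inverse. The function names and the test point are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_pdf_2d(x, mu, Sigma):
    """Bivariate Gaussian density (D = 2), using the closed-form 2x2 inverse."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    D = 2
    return (2 * math.pi) ** (-D / 2) * det ** -0.5 * math.exp(-0.5 * quad)

# With a diagonal covariance the joint density factorizes into univariate ones.
x, mu = [0.3, -1.2], [0.0, 1.0]
joint = gaussian_pdf_2d(x, mu, [[2.0, 0.0], [0.0, 0.5]])
product = gaussian_pdf(x[0], mu[0], 2.0) * gaussian_pdf(x[1], mu[1], 0.5)
```

The agreement between `joint` and `product` reflects that, for a Gaussian, a diagonal covariance matrix means the components are independent. For general $D$, a linear-algebra library would be the natural choice for the determinant and the inverse.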
A major advantage of the Gaussian distribution is that variable transformations are often not needed. Since the Gaussian distribution is fully specified by its mean and covariance, we can often obtain the transformed distribution by applying the transformation to the mean and covariance of the random variable.