Although we often do not really care about the formalities of a framework in practice, from a theoretical point of view it is important to carefully treat all objects. For this reason, this appendix will (briefly) cover the main notions and results that will be used throughout the posts on probability theory and related subjects.


In the beginning of the previous century, people realized that probability theory could be formally founded in the same mathematical theory that was being used to formalize notions of area and volume: ‘measure theory’. The problem with assigning a volume to all subsets of a Euclidean space $\mathbb{R}^n$ was that, given the axioms of set theory, mathematicians could not consistently define such an operation. No matter how crazy this might sound, there is no way to consistently define a notion of volume for all subsets of Euclidean space, i.e. there is no way to ‘measure’ all these sets. A famous example is given by the Vitali sets.1

The relation between measuring sets and assigning probabilities should, after all, not come as a surprise. Consider, for example, the set $[n]:=\{1,\ldots,n\}$ of the first $n\in\mathbb{N}$ positive integers. What is the probability that a point, uniformly sampled from $[n]$, lies in a subset $S\subseteq[n]$? This is simply the size (or cardinality) of the subset divided by $n$:

\[\mathrm{Prob}(S) = \frac{|S|}{|[n]|} = \frac{|S|}{n}\,.\]

As such, calculating this probability is equivalent to determining the size of $S$, i.e. ‘measuring’ $S$. Up to the normalization factor $1/n$, this uniform probability distribution is what is also called the counting measure, since it simply counts the number of elements in a set (a formalization of the notion of measure will be introduced further below):

\[\mu_\text{count}(A) := |A|\,.\]
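The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names `counting_measure` and `uniform_probability` are made up for this example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def counting_measure(subset):
    """Counting measure: the number of elements of a finite set."""
    return len(set(subset))

def uniform_probability(subset, n):
    """Probability that a uniform draw from [n] = {1, ..., n} lies in `subset`."""
    universe = set(range(1, n + 1))
    subset = set(subset)
    if not subset <= universe:
        raise ValueError("subset must be contained in [n]")
    # Prob(S) = |S| / n, i.e. the normalized counting measure.
    return Fraction(counting_measure(subset), counting_measure(universe))

# The even numbers make up half of [6].
even_probability = uniform_probability({2, 4, 6}, 6)
```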

Sadly, in the case of infinite sets, the axiom of choice makes our lives slightly more miserable. For example, as Wikipedia so beautifully explains, the Banach–Tarski paradox shows that there is no way to consistently define a notion of volume in three dimensions unless one of the following five concessions is made:

  • The volume of a set changes when it is rotated.
  • The volume of the union of two disjoint sets is different from the sum of their individual volumes.
  • Some sets are deemed ‘nonmeasurable’, and we need to check whether a set is ‘measurable’ before being able to talk about its volume.
  • The axioms of ZFC (Zermelo–Fraenkel set theory with the axiom of choice) have to be altered.
  • The volume of $[0,1]^3$ is either 0 or $+\infty$.

In the case of measure theory, the third option is chosen, i.e. the whole procedure is simply turned around. Instead of starting from a measure and finding out that not all sets are measurable, we start with a collection of sets that we would like to be measurable and study all measures consistent with this collection. This leads to the following notion.

A $\sigma$-algebra on a set $\mathcal{X}$ is a collection $\Sigma\subseteq2^\mathcal{X}$ of subsets of $\mathcal{X}$ such that:
  1. Triviality: The empty set is measurable: $\emptyset\in\Sigma$.
  2. Complements: Complements of measurable sets are measurable: \[A\in\Sigma\implies A^c\in\Sigma\,.\]
  3. Countable unions: Countable unions of measurable sets are measurable: \[(A_n)_{n\in\mathbb{N}}\subseteq\Sigma\implies\bigcup_{n=1}^{+\infty}A_n\in\Sigma\,.\]
The elements of a $\sigma$-algebra are called measurable sets and the pair $(\mathcal{X},\Sigma)$ is called a measurable space. (If the choice of $\sigma$-algebra is irrelevant, we will simply write $\mathcal{X}$.)
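On a finite set, the three conditions can be checked mechanically. The following is a minimal sketch (the helper `is_sigma_algebra` is hypothetical, not part of any library). It assumes a finite universe, in which case countable unions reduce to finite unions, so that closure under pairwise unions suffices.

```python
from itertools import combinations

def is_sigma_algebra(universe, collection):
    """Check the sigma-algebra axioms for a collection of subsets of a finite set."""
    universe = frozenset(universe)
    sigma = {frozenset(s) for s in collection}
    # Every member has to be a subset of the universe.
    if any(not a <= universe for a in sigma):
        return False
    # 1. Triviality: the empty set is measurable.
    if frozenset() not in sigma:
        return False
    # 2. Complements of measurable sets are measurable.
    if any(universe - a not in sigma for a in sigma):
        return False
    # 3. Unions: on a finite set, pairwise unions are enough.
    return all(a | b in sigma for a, b in combinations(sigma, 2))

X = {1, 2, 3, 4}
good = is_sigma_algebra(X, [set(), {1, 2}, {3, 4}, X])
bad = is_sigma_algebra(X, [set(), {1}, X])  # the complement {2, 3, 4} is missing
```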

Before continuing the discussion about measure theory, let us first see why these conditions make sense. The triviality condition should be clear. If any set should be measurable, then let it at least be the empty one. It has no internal structure and it contains no elements, so measuring it should be trivial. The second condition is also quite straightforward. If we can measure $\mathcal{X}$ and a subset of it, then we should also be able to measure the complement. The fact that we can measure $\mathcal{X}$ itself is just a simple consequence of the first two conditions. For the third condition, we can argue that if every set $A_n$ can be measured and the results can be enumerated, we can iteratively combine the results to measure the union. However, what if someone asks how to measure intersections of measurable sets? The solution is given by a basic result from set theory that relates these operations. De Morgan’s laws state that

\[\mathcal{X}\backslash\bigcup_{n=1}^{+\infty}A_n = \bigcap_{n=1}^{+\infty}\mathcal{X}\backslash A_n\]

and

\[\mathcal{X}\backslash\bigcap_{n=1}^{+\infty}A_n = \bigcup_{n=1}^{+\infty}\mathcal{X}\backslash A_n\,.\]

Using one of these equalities, together with the second condition on $\sigma$-algebras, the intersection of measurable sets can be rewritten as a union of measurable sets and, hence, this intersection is itself measurable by virtue of the third condition above. Now that we have seen the definition of measurable sets, it might also be a good idea to consider some examples to get a feeling of what $\sigma$-algebras might entail.

Two trivial examples of $\sigma$-algebras can be defined on any set $\mathcal{X}$, no matter the size or structure. These are the trivial (or codiscrete) $\sigma$-algebra

\[\Sigma_\mathrm{codisc}(\mathcal{X}) := \{\emptyset,\mathcal{X}\}\]

and the discrete $\sigma$-algebra

\[\Sigma_\mathrm{disc}(\mathcal{X}) := 2^{\mathcal{X}}\,.\]

The first example can be interpreted as the situation where we are trying to measure the objects in a sealed box. The individual elements cannot be measured, only the box as a whole can be measured. The second example is the situation where we have perfect control over or knowledge about the whole set.

To obtain more interesting examples, we could pass to $\mathbb{R}^n$. However, just for fun (and because it is sometimes also of interest in practice), we can pass to an even more general setting, namely that of topological spaces. Formally introducing these objects would lead us too far astray, so let it suffice to say that these allow us to formalize what it means for a set to be ‘open’, ‘closed’, ‘compact’, etc. The definition of a topology is stated in terms of unions and intersections of sets and, hence, this structure always induces that of a measurable space.

Consider a topological space $\mathcal{X}$. The Borel $\sigma$-algebra on $\mathcal{X}$ is the smallest $\sigma$-algebra containing all open sets of $\mathcal{X}$.
  • The Borel algebra on $\mathbb{R}$ is generated by the open intervals $]a,b[$ for all $a,b\in\mathbb{R}$. This is the common choice on all Euclidean spaces $\mathbb{R}^n$ and is often implicitly assumed.
  • The trivial and discrete $\sigma$-algebras are induced by (and coincide with) the trivial and discrete topologies, respectively.

As usual, the prototypical example is the real line $\mathbb{R}$. In this setting, we know that every open set can be written as a (countable) union of open intervals. So, the Borel $\sigma$-algebra on $\mathbb{R}$ is the one ‘generated’ (in the sense of applying the operations defining a $\sigma$-algebra) by the open intervals. However, an open set is always the complement of a closed set2, so we also obtain that the Borel $\sigma$-algebra of $\mathbb{R}$ and, in fact, of any topological space also contains all closed sets.
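The idea of a ‘generated’ $\sigma$-algebra can be made concrete on a finite set, where the closure can actually be computed. The sketch below is a toy analogue only (the function `generate_sigma_algebra` is hypothetical, and nothing like this is feasible for the Borel sets of $\mathbb{R}$): it repeatedly adds complements and unions until nothing new appears.

```python
def generate_sigma_algebra(universe, generators):
    """Smallest sigma-algebra on a finite universe containing the generators."""
    universe = frozenset(universe)
    sigma = {frozenset(), universe} | {frozenset(g) for g in generators}
    while True:
        # Apply the defining operations once more.
        complements = {universe - a for a in sigma}
        unions = {a | b for a in sigma for b in sigma}
        new = complements | unions
        if new <= sigma:
            return sigma  # fixed point reached: closed under both operations
        sigma |= new

X = {1, 2, 3, 4}
from_one_set = generate_sigma_algebra(X, [{1}])
from_two_sets = generate_sigma_algebra(X, [{1}, {2}])
```

In this example, the generators $\{1\}$ and $\{2\}$ split the universe into the three blocks $\{1\}$, $\{2\}$ and $\{3,4\}$, so the generated $\sigma$-algebra consists of all $2^3$ unions of these blocks.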

Note
The trivial and discrete $\sigma$-algebras are actually the Borel $\sigma$-algebras of codiscrete and discrete topologies, respectively.

As a last example, we consider the multisets in a measurable space. For the construction of (transductive) conformal predictors, we need to have a measurable structure on multisets. (Casual readers might prefer to skip this example.) Consider a measurable space $(\mathcal{X},\Sigma)$. The set $\mathcal{X}^*$ can also be turned into a measurable space as follows. For every $n\in\mathbb{N}$, the product $\sigma$-algebra on $\mathcal{X}^n$ is defined to be the smallest $\sigma$-algebra such that all Cartesian products $\prod_{i=1}^nA_i$, where $A_i\in\Sigma$ for all $i\leq n$, are measurable.3 Given these measurable spaces $(\mathcal{X}^n,\Sigma_n)$, we then take the (countable) disjoint union. The $\sigma$-algebra $\Sigma_*$ on $\mathcal{X}^*$ is defined such that4

\[B\in\Sigma_*\iff B\cap\mathcal{X}^n\in\Sigma_n\quad\text{for all } n\in\mathbb{N}\]

for every subset $B\subseteq\mathcal{X}^*$.


Now that the notion of a measurable space has been introduced, it is time to move on. The first step as a true mathematician would be to ask which functions preserve the structure of a measurable space. However, to avoid having to immediately delve into the technicalities of set theory, it is better to first introduce the notion of a measure, which will serve as a motivation for further concepts.

Let $(\mathcal{X},\Sigma)$ be a measurable space. A measure on $(\mathcal{X},\Sigma)$ is a set function $\mu:\Sigma\rightarrow\overline{\mathbb{R}}$ satisfying the Kolmogorov axioms:
  1. Nonnegativity: $\mu(A)\geq0$ for all $A\in\Sigma$.
  2. Emptiness: $\mu(\emptyset)=0$.
  3. Countable additivity (or $\sigma$-additivity): If $(A_n)_{n\in\mathbb{N}}\subset\Sigma$ are disjoint, then \[\mu\left(\bigsqcup_{n=0}^{+\infty}A_n\right)=\sum_{n=0}^{+\infty}\mu(A_n)\,.\]
The triple $(\mathcal{X},\Sigma,\mu)$ is called a measure space. If $\mu(\mathcal{X})=1$, the measure is called a probability measure or (probability) distribution. A measure space for which $\mu(\mathcal{X})<+\infty$ is said to be finite. It should be clear that any finite measure space with $\mu(\mathcal{X})>0$ can be turned into a probability space by a suitable normalization. If a measure space is not finite, but $\mathcal{X}$ admits a countable cover by measurable sets of finite measure, it is said to be $\sigma$-finite. The set of all probability measures on a set $\mathcal{X}$ will be denoted by $\mathbb{P}(\mathcal{X})$.

Requiring nonnegativity is simply a matter of convenience. There exist generalizations to so-called signed measures, but for most practical purposes, especially those of probability theory, the nonnegative ones suffice. The motivation for the third condition is the same as the one for the definition of $\sigma$-algebras. We want to be able to measure complex sets by decomposing them into smaller parts. Together with nonnegativity, this condition also has an important consequence, namely that all measures are monotonic functions:

\[A\subseteq B\implies\mu(A)\leq\mu(B)\,.\]
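The argument behind this implication is short enough to spell out. For measurable sets $A\subseteq B$, the difference $B\backslash A=B\cap A^c$ is again measurable and $B$ is the disjoint union of $A$ and $B\backslash A$. Applying $\sigma$-additivity to the sequence $A, B\backslash A,\emptyset,\emptyset,\ldots$ (using $\mu(\emptyset)=0$) and then nonnegativity gives

\[\mu(B) = \mu\bigl(A\sqcup(B\backslash A)\bigr) = \mu(A)+\mu(B\backslash A)\geq\mu(A)\,.\]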

At last, we come to the emptiness condition. For ordinary set functions $\kappa:\Sigma\rightarrow\mathbb{R}$, the $\sigma$-additivity condition would allow us to perform the following deduction:

\[\kappa(\emptyset)=\kappa(\emptyset\cup\emptyset)=2\kappa(\emptyset)\implies\kappa(\emptyset)=0\,.\]

However, a measure is allowed to take on the value $+\infty$, which makes this argument invalid. The second condition is simply there to exclude the highly degenerate possibility where the measure is identically $+\infty$.

As in the previous subsection, before introducing even more exotic concepts, we first give some examples of (probability) measures. The first one is simply the volume or Lebesgue measure on $\mathbb{R}^n$. It formalizes the way we measure the volume of everyday objects. On intervals, it is defined as follows (the choice of open, half-open or closed intervals does not matter):

\[\lambda\bigl(]a,b]\bigr) := b-a\,.\]

To measure arbitrary Borel subsets of $\mathbb{R}$, we consider countable covers by such intervals and define the Lebesgue measure of a subset to be the infimum, over all such covers, of the total length of the intervals in the cover. Similarly, for higher-dimensional Euclidean spaces, we first define the Lebesgue measure of hyperrectangles $\mathbf{I}:=I_1\times I_2\times\cdots\times I_n$ as the product of the lengths of the intervals:

\[\lambda(\mathbf{I}) := \lambda(I_1)\cdots\lambda(I_n)\,,\]

and again define the Lebesgue measure of arbitrary Borel subsets as the infimum of the measures of covers by such hyperrectangles. This procedure can be generalized to arbitrary ($\sigma$-finite) measure spaces to obtain a (tensor) product of measure spaces. The formulas above have the interpretation of the length of an interval (or volume of a box). But what if instead of simply taking the values of the endpoints, we first apply a function $F:\mathbb{R}\rightarrow\mathbb{R}$? In this case, the expression becomes:

\[\lambda_F\bigl(]a,b]\bigr) := F(b)-F(a)\,.\]

To make this formula well-defined, we need $F$ to be right-continuous, i.e.

\[\lim_{\varepsilon\rightarrow0^+}F(x+\varepsilon) = F(x)\]

for all $x\in\mathbb{R}$. Moreover, to obtain a nonnegative and monotonic measure, we also need to require that $F$ is nondecreasing (the identity function, which induces the Lebesgue measure, shows that $F$ itself does not have to be nonnegative). But wait a minute! Do these equations not remind us of a well-known object from probability theory? Indeed, the measures induced by cumulative distribution functions (CDFs) are exactly of this form. In the context of measure theory, the measures $\lambda_F$ are called Lebesgue–Stieltjes measures.
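To make the formula $\lambda_F(]a,b])=F(b)-F(a)$ tangible, here is a small sketch with two example CDFs chosen for illustration (they do not appear in the text above): the exponential distribution with rate 1 and a unit point mass at 0. The function name `stieltjes_measure` is likewise made up.

```python
import math

def stieltjes_measure(F, a, b):
    """Measure F(b) - F(a) assigned to the half-open interval ]a, b]."""
    if a > b:
        raise ValueError("need a <= b")
    return F(b) - F(a)

def exponential_cdf(x):
    """CDF of the exponential distribution with rate 1 (continuous)."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def point_mass_cdf(x):
    """Right-continuous CDF of a unit point mass at 0 (a single jump)."""
    return 1.0 if x >= 0 else 0.0

mass_unit_interval = stieltjes_measure(exponential_cdf, 0.0, 1.0)
# Right-continuity puts the jump at 0 inside ]-1, 0] and outside ]0, 1].
mass_left = stieltjes_measure(point_mass_cdf, -1.0, 0.0)
mass_right = stieltjes_measure(point_mass_cdf, 0.0, 1.0)
```

The second example shows why right-continuity matters for half-open intervals $]a,b]$: the whole mass of the jump is attributed to the interval that contains the jump point as its right endpoint.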

Note
Functions that are right-continuous and that admit all left limits are also said to be càdlàg (abbreviation of the French expression "continue à droite, limite à gauche"). It can be shown that nondecreasing càdlàg functions $F:\mathbb{R}\rightarrow[0,1]$ satisfying $$\lim_{x\rightarrow-\infty}F(x)=0\qquad\text{and}\qquad\lim_{x\rightarrow\infty}F(x)=1$$ are exactly the cumulative distribution functions on $\mathbb{R}$.

The fact that CDFs only need to be right-continuous allows them to have jump discontinuities, i.e. isolated points where the value of the function suddenly changes in a discontinuous way. The measure of a singleton can be shown to be determined by exactly that jump:

\[\lambda_F(\{x_0\}) = F(x_0) - \lim_{\varepsilon\rightarrow0^+}F(x_0-\varepsilon)\,.\]

Singletons with nonzero measure are examples of atoms, sets with nonzero measure for which every proper subset has vanishing measure. (By classical arguments from calculus, it can be shown that all atoms of a Lebesgue–Stieltjes measure are necessarily singletons.) The Lebesgue measure does not have any atoms since it is induced by the identity function. More generally, for continuous CDFs, all singletons are null sets, i.e. sets with measure zero. Null sets are in some sense the subsets that we can forget about when talking about a property in probabilistic terms. If a property holds everywhere, except for some null set, it is said to hold almost everywhere (a.e.) in measure theory or almost surely (a.s.) in probability theory.

Now that the notion of an atom has been introduced, it is time to move to another class of examples, namely the discrete probability measures. Discrete probability measures are atomic in the sense that any measurable set of nonzero measure contains an atom. More specifically, discrete measures are of the form

\[\mu = \sum_{i=1}^{+\infty}\lambda_i\delta_{x_i}\,,\]

where all $\lambda_i\in\mathbb{R}^+$ and

\[\delta_{x_i}(A) := \mathbb{1}_A(x_i) = \begin{cases} 1&\text{if } x_i\in A,\\ 0&\text{if } x_i\not\in A \end{cases}\]

are so-called Dirac measures. Common examples of such measures are the Poisson and binomial distributions. By the normalization property of probability measures, every distribution on the (discrete) finite space $[k]$ can be represented as an element of the $(k-1)$-simplex

\[\Delta^{k-1} := \left\{x\in\mathbb{R}^k\,\middle\vert\,x^i\geq0\text{ for all } i\leq k\text{ and }\sum_{i=1}^kx^i=1\right\}\]

in the following way:

\[\mu = \sum_{i=1}^kx^i\delta_i\,.\]
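A discrete measure on a finite set is easy to model directly as a weighted sum of Dirac measures. The sketch below is one possible encoding (the names `dirac` and `discrete_measure` are illustrative), with a fair die as the example distribution.

```python
from fractions import Fraction

def dirac(x):
    """Dirac measure at x: 1 for sets containing x, 0 otherwise."""
    return lambda subset: 1 if x in subset else 0

def discrete_measure(weights):
    """Measure sum_i w_i * delta_{x_i}, given as a dict {x_i: w_i}."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be nonnegative")
    def mu(subset):
        return sum(w * dirac(x)(subset) for x, w in weights.items())
    return mu

# A fair die: the barycentre (1/6, ..., 1/6) of the 5-simplex.
die = discrete_measure({i: Fraction(1, 6) for i in range(1, 7)})
prob_even = die({2, 4, 6})
total_mass = die({1, 2, 3, 4, 5, 6})
```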

As promised, there is one last important notion to be introduced, namely the functions between measurable spaces. To this end, consider two measurable spaces $(\mathcal{X},\Sigma_{\mathcal{X}})$ and $(\mathcal{Y},\Sigma_{\mathcal{Y}})$. Given a function $f:\mathcal{X}\rightarrow\mathcal{Y}$, it might be tempting to induce a $\sigma$-algebra on $\mathcal{Y}$ generated by the images $\{f(A)\mid A\in\Sigma_{\mathcal{X}}\}$. However, a set-theoretic problem arises at this point. To make this construction work, functions should preserve the operations defining a $\sigma$-algebra and, although empty sets and unions are preserved, complements are not: in general, $f(A\backslash B)\neq f(A)\backslash f(B)$. This has as a consequence that we cannot ‘pull back’ measures from $\mathcal{Y}$ to $\mathcal{X}$ by the would-be definition

\[f^*\mu(A) := \mu\bigl(f(A)\bigr)\,.\]

Luckily, there is another possibility. The image might not preserve all required operations, but the preimage does, i.e. $f^*\Sigma_{\mathcal{Y}}:=\{f^{-1}(A)\mid A\in\Sigma_{\mathcal{Y}}\}$ is a $\sigma$-algebra, where the preimage is defined as follows:

\[f^{-1}(A) := \{x\in\mathcal{X}\mid f(x)\in A\}\,.\]
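A tiny finite example makes the asymmetry between images and preimages visible. The sets and the function below are chosen purely for illustration: two points of the domain are mapped to the same value, which is exactly what breaks complements under images.

```python
def image(f, subset):
    """Image f(A) of a subset A of the domain."""
    return {f(x) for x in subset}

def preimage(f, domain, subset):
    """Preimage f^{-1}(B) = {x in domain | f(x) in B}."""
    return {x for x in domain if f(x) in subset}

X = {1, 2, 3}
Y = {"a", "b"}
f = {1: "a", 2: "a", 3: "b"}.get  # 1 and 2 are both sent to "a"

# Images do not commute with complements: f(X \ A) differs from Y \ f(A).
A = {1}
image_breaks_complements = image(f, X - A) != Y - image(f, A)

# Preimages do: f^{-1}(Y \ B) equals X \ f^{-1}(B).
B = {"a"}
preimage_keeps_complements = preimage(f, X, Y - B) == X - preimage(f, X, B)
```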

So, instead of pushing forward measurable sets and pulling back measures, we should work the other way around. This leads to the following definition.

Consider a function $$f:(\mathcal{X},\Sigma_{\mathcal{X}})\rightarrow(\mathcal{Y},\Sigma_{\mathcal{Y}})$$ between measurable spaces. This function is itself said to be measurable if and only if the pullback $\sigma$-algebra $f^*\Sigma_{\mathcal{Y}}$ is a sub-$\sigma$-algebra of $\Sigma_{\mathcal{X}}$, i.e. if $$f^{-1}(A)\in\Sigma_{\mathcal{X}}\qquad\text{for all}\qquad A\in\Sigma_{\mathcal{Y}}\,.$$

Equipped with the measurable functions, we can now also transport measures between spaces.

The pushforward of a measure $\mu$ on $\mathcal{X}$ along a measurable function $f:\mathcal{X}\rightarrow\mathcal{Y}$ is defined by $$f_\ast\mu(A) := \mu\bigl(f^{-1}(A)\bigr)\,.$$ This is well defined, since by definition of measurability, $f^{-1}(A)$ is measurable in $\mathcal{X}$.

For completeness’ sake, it is also worth mentioning that what is called a random variable in probability theory is simply a measurable function from a probability space into an arbitrary measurable space (e.g. $\mathbb{R}$ in the case of univariate regression). The distribution of a random variable $X:(\Omega,\Sigma_\Omega,P)\rightarrow(\mathcal{X},\Sigma_{\mathcal{X}})$ is then defined as the pushforward of $P$ along this random variable:

\[P_X := X_\ast P\,.\]
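For discrete measures, the pushforward simply moves each point mass along the function and adds up masses that land on the same point. A minimal sketch, assuming measures are stored as dictionaries of point masses (an encoding chosen here for convenience):

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(mu, f):
    """Pushforward of a discrete measure {point: mass} along f."""
    result = defaultdict(Fraction)  # missing keys start at mass 0
    for x, mass in mu.items():
        # f_* mu({y}) = mu(f^{-1}({y})): collect all mass mapped to y.
        result[f(x)] += mass
    return dict(result)

# P: a fair die on Omega = {1, ..., 6}; X: the parity of the outcome.
P = {i: Fraction(1, 6) for i in range(1, 7)}
P_X = pushforward(P, lambda omega: omega % 2)
```

Here `P_X` is the distribution of the random variable ‘parity’, a fair coin on $\{0,1\}$.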

One of the most powerful benefits of measure theory is that it allows for a solid theory of integration that can even be applied to situations where the ordinary Riemann integral breaks down. We will not delve too deep into this subject, but some core ideas and notions are of importance to us. The idea behind Lebesgue integration is to approximate functions by so-called simple functions (this is the approach introduced by Daniell).

A function $f:\mathcal{X}\rightarrow\mathbb{R}^+$ of the form $$f(x) = \sum_{i=1}^na_i\mathbb{1}_{A_i}(x)$$ for positive numbers $a_1,\ldots,a_n\in\mathbb{R}^+$ and disjoint (measurable) subsets $A_1,\ldots,A_n\in\Sigma_{\mathcal{X}}$ is called a simple function. The Lebesgue integral of the simple function $f$ with respect to a measure $\mu$ on $(\mathcal{X},\Sigma_{\mathcal{X}})$ is defined as $$\int_{\mathcal{X}}f\, d\mu := \sum_{i=1}^na_i\mu(A_i)\,.$$
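The defining sum is straightforward to evaluate once the measure of each set $A_i$ is known. In the sketch below, the sets are restricted to half-open intervals $]l,r]$ encoded as pairs `(l, r)`; this restriction, like the function names, is an assumption made to keep the example short.

```python
def lebesgue(interval):
    """Lebesgue measure of the interval ]left, right]."""
    left, right = interval
    return right - left

def dirac_at(x):
    """Dirac measure at x, evaluated on intervals ]left, right]."""
    def delta(interval):
        left, right = interval
        return 1 if left < x <= right else 0
    return delta

def integrate_simple(terms, mu):
    """Integral of sum_i a_i * 1_{A_i}, given as [(a_i, A_i), ...], w.r.t. mu."""
    return sum(a * mu(A) for a, A in terms)

# f = 2 on ]0, 1] and 5 on ]1, 3].
f = [(2, (0, 1)), (5, (1, 3))]
integral_lebesgue = integrate_simple(f, lebesgue)   # 2*1 + 5*2
integral_dirac = integrate_simple(f, dirac_at(2))   # picks out f(2)
```

The same function `integrate_simple` works for both measures, which reflects the fact that the definition of the integral only depends on the values $\mu(A_i)$.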

To define the integral for general measurable functions, we should show that any nonnegative measurable function can be approximated (pointwise) by a sequence of simple functions and define the integral as the supremum of the integrals of all simple functions bounded from above by it.

A measurable function $f:\mathcal{X}\rightarrow\mathbb{R}$ is said to be (Lebesgue-)integrable with respect to a measure $\mu$ on $\mathcal{X}$ if both $$\int_{\mathcal{X}}f^+\,d\mu<+\infty$$ and $$\int_{\mathcal{X}}f^-\,d\mu<+\infty$$ hold, where $f^+:=\max(f,0)$ and $f^-:=\max(0,-f)$. If $f$ is integrable, its (Lebesgue-)integral is defined as $$\int_{\mathcal{X}}f\,d\mu := \int_{\mathcal{X}}f^+\,d\mu - \int_{\mathcal{X}}f^-\,d\mu\,.$$ If only one of the two conditions holds, $f$ is said to be quasiintegrable.

Note
A crucial difference exists with the Riemann case: here, these conditions imply absolute integrability, i.e. $$\int_{\mathcal{X}}|f|\,d\mu<+\infty\,.$$ With Lebesgue integrals, positive and negative infinity cannot cancel each other out (not even in a limiting sense), so measurable functions are integrable if and only if they are absolutely integrable. However, on a bounded interval, every Riemann-integrable function is also (Lebesgue-)integrable and the integrals coincide. Moreover, if the improper Riemann integral of a nonnegative function exists, it equals the Lebesgue integral of that function.5

We give a simple example of a situation where Riemann integration does not suffice. A theorem by Lebesgue says that a bounded function on a compact interval is Riemann-integrable exactly when its set of discontinuities is a null set, i.e. when it is almost everywhere continuous. The reason for this is that the Lebesgue measure is nonatomic: $\lambda(\{x\})=0$ for all $x\in\mathbb{R}$. We can change the value of a function at any countable collection of points without changing the value of its integral. However, many measures, such as the discrete probability distributions, are atomic, the simplest one being the Dirac measure as introduced above. The integral with respect to this measure is particularly straightforward to calculate:

\[\int_{\mathbb{R}}f\,d\delta_x = f(x)\,.\]

This is one of the formulas that a physics student learns to accept without proper formal reasoning and where physicists have found all kinds of informal arguments and approximation methods because most refuse to accept any other integral besides the Riemann integral.


On many occasions, especially in probability theory, measures are given by summing a sequence or integrating a function. For example, the standard normal distribution admits such a probability density function (PDF):

\[\Phi(x) = \int_{-\infty}^x\frac{1}{\sqrt{2\pi}}\exp\left(-t^2/2\right)\,d t\,.\]

This situation is an example of a more general concept.

A measure $\nu$ on a measurable space $(\mathcal{X},\Sigma)$ is said to be absolutely continuous with respect to a measure $\mu$ on $(\mathcal{X},\Sigma)$ if $$\mu(A)=0\implies\nu(A)=0$$ for all $A\in\Sigma$. This is often denoted by $\mu\gg\nu$.

The following very important result, the Radon–Nikodym theorem, states that absolutely continuous measures admit density functions. If $\mu$ and $\nu$ are ($\sigma$-)finite measures on a measurable space $(\mathcal{X},\Sigma)$ such that $\nu$ is absolutely continuous with respect to $\mu$, there exists a $\mu$-a.e. unique measurable function $f:\mathcal{X}\rightarrow[0,+\infty[$ such that

\[\nu(A) = \int_Af\,d\mu\]

for all $A\in\Sigma$. The function $f$ is called the Radon–Nikodym derivative and is sometimes denoted by $\frac{d\nu}{d\mu}$ in analogy to the ordinary derivative from calculus. This generalized notion of density function also allows us to treat probability mass functions (PMFs) on equal footing with their continuous counterparts. PMFs are simply the Radon–Nikodym derivatives of discrete probability measures with respect to the counting measure (on, for example, $\mathbb{Z}$), while PDFs are the derivatives with respect to the Lebesgue measure.
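For the counting measure, the integral $\int_Af\,d\mu$ reduces to a sum over the elements of $A$, so the Radon–Nikodym statement becomes the familiar ‘sum the PMF over the event’. A sketch with the binomial distribution (the function names and parameter values are illustrative choices):

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """PMF of Binomial(n, p): the density w.r.t. the counting measure."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def nu(A, n, p):
    """nu(A) = integral of the PMF over A w.r.t. the counting measure."""
    # Outside {0, ..., n} the density vanishes, so those points are skipped.
    return sum(binomial_pmf(k, n, p) for k in A if 0 <= k <= n)

half = Fraction(1, 2)
at_most_one_head = nu({0, 1}, 2, half)  # two fair coin flips
total = nu(range(-5, 10), 2, half)      # any set covering {0, 1, 2}
```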

Now, let $f:(\mathcal{X},\Sigma_{\mathcal{X}})\rightarrow(\mathcal{Y},\Sigma_{\mathcal{Y}})$ be a measurable function and consider a measure $\mu$ on $\mathcal{X}$. The following equality holds for all integrable functions $g:\mathcal{Y}\rightarrow\mathbb{R}$:

\[\int_{f^{-1}(\mathcal{Y})}(g\circ f)\,d\mu=\int_{\mathcal{Y}}g\,d(f_\ast\mu)\,.\]

For absolutely continuous measures on $\mathbb{R}$ with Radon–Nikodym derivative (density) $f_X:\mathbb{R}\rightarrow\mathbb{R}$, this change-of-variables formula implies

\[f_{g_*X}(y) = f_X\left(g^{-1}(y)\right)\left|\frac{dg^{-1}}{dy}(y)\right|=\frac{f_X\left(g^{-1}(y)\right)}{\left|g'\bigl(g^{-1}(y)\bigr)\right|}\]

when $g:\mathbb{R}\rightarrow\mathbb{R}$ is differentiable and invertible, and

\[f_{g_*X}(y) = \sum_{x\in g^{-1}(y)}\frac{f_X(x)}{\left|g'(x)\right|}\]

in general. Recall for example the Dirac measure. Under a pushforward along the function $f:\mathbb{R}\rightarrow\mathbb{R}$, it transforms as

\[\delta\bigl(f(x)\bigr) = \sum_{y\in f^{-1}(0)}\frac{\delta(x-y)}{\left|f'(y)\right|}\,.\]
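As a sanity check of the two density formulas, consider a standard normal random variable $X$, with density $f_X(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-x^2/2\right)$, and two transformations chosen here for illustration. For the invertible map $g(x)=e^x$, we have $g^{-1}(y)=\ln y$ and $g'\bigl(g^{-1}(y)\bigr)=y$, so for $y>0$:

\[f_{g_*X}(y) = \frac{f_X(\ln y)}{y} = \frac{1}{y\sqrt{2\pi}}\exp\left(-\frac{(\ln y)^2}{2}\right)\,,\]

which is the density of the log-normal distribution. For the non-invertible map $g(x)=x^2$, every $y>0$ has the two preimages $\pm\sqrt{y}$ with $\left|g'(\pm\sqrt{y})\right|=2\sqrt{y}$, so the general formula gives

\[f_{g_*X}(y) = \frac{f_X(\sqrt{y})+f_X(-\sqrt{y})}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}}\exp\left(-\frac{y}{2}\right)\,,\]

which is the density of the chi-squared distribution with one degree of freedom.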

Another consequence of the change-of-variables formula is the infamous law of the unconscious statistician. Let $g:\mathcal{X}\rightarrow\mathbb{R}$ be an integrable function and $X$ a random variable on $\mathcal{X}$. Then the expectation of $g(X)$ is given by

\[\mathrm{E}\left[g(X)\right] = \int_{\mathcal{X}}g\,dP_X\]
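In the discrete case, this law says that the expectation of $g(X)$ can be computed without first working out the distribution of $g(X)$. The sketch below compares both routes for a fair die and the (arbitrarily chosen, non-injective) function $g(x)=(x-3)^2$.

```python
from fractions import Fraction

def expectation_lotus(P_X, g):
    """E[g(X)] as the integral of g against the distribution P_X of X."""
    return sum(g(x) * p for x, p in P_X.items())

def expectation_via_pushforward(P_X, g):
    """E[g(X)] as the mean of the pushforward distribution of g(X)."""
    P_gX = {}
    for x, p in P_X.items():
        P_gX[g(x)] = P_gX.get(g(x), 0) + p  # merge points with equal image
    return sum(y * q for y, q in P_gX.items())

P_X = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die

def g(x):
    return (x - 3) ** 2

direct = expectation_lotus(P_X, g)
via_distribution = expectation_via_pushforward(P_X, g)
```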

Without going into too much detail, some things have to be said about the notion of conditional probabilities.

Let $(\Omega,\Sigma,P)$ be a probability space. The conditional probability with respect to an event $A\in\Sigma$ is defined as follows: $$P(B\mid A) := \frac{P(A\cap B)}{P(A)}\,.$$ Note that this formula is only well defined when $A$ is not $P$-null. A (partial) solution exists when it is possible to find sequences $(A_n)_{n\in\mathbb{N}}$ of measurable sets of strictly positive probability converging to $A$. In this case, we could define $$P(B\mid A) := \lim_{n\rightarrow\infty}P(B\mid A_n)\,.$$ However, it can be shown that this does not lead to a well-defined probability distribution, since the resulting value will depend on the choice of sequence (cf. the Borel–Kolmogorov paradox).

Whenever conditional probabilities exist, this definition immediately implies one of the most famous theorems in probability theory, Bayes’ theorem:

\[P(A\mid B)P(B) = P(B\mid A)P(A)\]
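On a finite probability space, both sides of this identity can be computed exactly. The example below uses a fair die with two events picked for illustration; `P` and `conditional` are hypothetical helper names.

```python
from fractions import Fraction

omega = set(range(1, 7))  # outcomes of a fair die

def P(event):
    """Uniform probability of an event (a subset of omega)."""
    return Fraction(len(event & omega), len(omega))

def conditional(B, A):
    """P(B | A) = P(A and B) / P(A), only defined when P(A) > 0."""
    if P(A) == 0:
        raise ZeroDivisionError("conditioning on a null event")
    return P(A & B) / P(A)

A = {2, 4, 6}  # the outcome is even
B = {4, 5, 6}  # the outcome is at least 4

lhs = conditional(A, B) * P(B)  # P(A | B) P(B)
rhs = conditional(B, A) * P(A)  # P(B | A) P(A)
```

Both sides equal $P(A\cap B)$, which is all that the theorem asserts.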

One way to express conditional probabilities, which closely follows our intuition, is by modelling them as parametrized probability distributions.

Consider two measurable spaces $(\mathcal{X},\Sigma_\mathcal{X})$ and $(\mathcal{Y},\Sigma_\mathcal{Y})$. A Markov kernel $\mathcal{X}\rightarrow\mathcal{Y}$ is a function $f:\Sigma_\mathcal{Y}\times\mathcal{X}\rightarrow[0,1]$ such that:
  1. For every $A\in\Sigma_\mathcal{Y}$: $x\mapsto f(A\mid x)$ is measurable.
  2. For every $x\in\mathcal{X}$: $A\mapsto f(A\mid x)$ is a probability measure.
More concisely, a Markov kernel is a measurable function $\mathcal{X}\rightarrow\mathbb{P}(\mathcal{Y})$. If the second condition is relaxed to only requiring $f(\cdot\mid x)$ to be a measure, the notion of a transition kernel is obtained.
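When both spaces are finite, a Markov kernel is nothing but a table with one probability distribution per point $x$. The sketch below uses made-up numbers (a fair and a loaded coin) and hypothetical helper names. It checks the second condition of the definition and integrates the kernel against a distribution on $\mathcal{X}$ to obtain a distribution on $\mathcal{Y}$.

```python
from fractions import Fraction

# kernel[x][y] = f({y} | x): one distribution on Y = {"H", "T"} per x.
kernel = {
    "fair":   {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "loaded": {"H": Fraction(3, 4), "T": Fraction(1, 4)},
}

def is_markov_kernel(kernel):
    """Check that every f(. | x) is a probability measure."""
    return all(
        all(p >= 0 for p in row.values()) and sum(row.values()) == 1
        for row in kernel.values()
    )

def apply_kernel(mu, kernel):
    """Distribution on Y with mass sum_x mu({x}) * f({y} | x) at y."""
    out = {}
    for x, mass in mu.items():
        for y, p in kernel[x].items():
            out[y] = out.get(y, 0) + mass * p
    return out

prior = {"fair": Fraction(1, 2), "loaded": Fraction(1, 2)}
marginal = apply_kernel(prior, kernel)
```

The measurability condition on $x\mapsto f(A\mid x)$ is automatic here, since every function on a finite set with the discrete $\sigma$-algebra is measurable.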

Let $(X_n)_{n\in\mathbb{N}}$ be a sequence of i.i.d. random variables with expectation $\mu$.6 The law of large numbers then states that

\[\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^nX_i=\mu\qquad\text{a.s.}\]

Consider a sequence of i.i.d. random variables $(X_n)_{n\in\mathbb{N}}$ with distribution $P\in\mathbb{P}(\mathcal{X})$ and, for any $n\in\mathbb{N}$ and $x\in\mathcal{X}$, consider the random variable $\mathbb{1}_{]-\infty,x]}(X_n)$. Since

\[\mathrm{E}\left[\mathbb{1}_{]-\infty,x]}(X_n)\right] = P(X_n\leq x)\,,\]

the law of large numbers implies that

\[\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{]-\infty,x]}(X_i)=P(X\leq x)\qquad\text{a.s.}\]

The average on the left-hand side is the empirical distribution function $\widehat{F}$. Hence, by the law of large numbers, the empirical distribution function converges to the true distribution function almost surely. In a similar way, for every $P$-integrable function $f:\mathcal{X}\rightarrow\mathbb{R}$, the average over a large number of i.i.d. draws converges to the expectation value almost surely:

\[\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^nf(X_i)=\mathrm{E}_P[f]\qquad\text{a.s.}\]

This is the idea behind Monte Carlo methods!
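A minimal simulation illustrates both limits. The setup is an assumption made for this sketch: the draws are uniform on $[0,1]$, for which the true CDF at $x$ is $x$ and $\mathrm{E}[X^2]=1/3$, and the generator is seeded only to make the run reproducible.

```python
import random

def empirical_cdf(samples, x):
    """Fraction of samples that are <= x."""
    return sum(1 for s in samples if s <= x) / len(samples)

def monte_carlo_mean(f, samples):
    """Average of f over the samples, an estimate of E[f(X)]."""
    return sum(f(s) for s in samples) / len(samples)

rng = random.Random(0)  # fixed seed for reproducibility
samples = [rng.random() for _ in range(100_000)]  # i.i.d. uniform on [0, 1)

ecdf_half = empirical_cdf(samples, 0.5)                      # close to 0.5
second_moment = monte_carlo_mean(lambda s: s * s, samples)   # close to 1/3
```

The law of large numbers only guarantees convergence as $n\rightarrow\infty$. For a finite sample, the estimates fluctuate around the true values, typically on the order of $1/\sqrt{n}$.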







  1. The existence of nonmeasurable sets crucially depends on the axiom of choice, one of the most important, yet controversial axioms of set theory. As a consequence, they do not exist in constructive mathematics.

  2. This is how closed sets are defined in general topological spaces. 

  3. This is the smallest $\sigma$-algebra for which the projections $\pi_i:\mathcal{X}^n\rightarrow\mathcal{X}$ are measurable (see further on). 

  4. This is the largest $\sigma$-algebra for which the inclusions $\iota_n:\mathcal{X}^n\hookrightarrow\mathcal{X}^*$ are measurable. 

  5. The nonnegativity condition is crucial in this statement. There exist improper Riemann integrals for which there exists no Lebesgue counterpart. 

  6. This is actually the strong law of large numbers, also known as Kolmogorov’s law. The weak law (or Khinchin’s law) states that this convergence holds in probability.