C Probability

C.1 Measure theory

Definition C.1 (Sigma algebra) Let \(\Omega\) be a set, and \(\Sigma\) be a subset of the power set of \(\Omega\) (or equivalently a collection of subsets of \(\Omega\)). Then \(\Sigma\) is a \(\sigma\)-algebra if:

  • \(\emptyset\in \Sigma\),
  • \(\Sigma\) is closed under countable union,
  • \(\forall S \in \Sigma, \overline{S} \in \Sigma\).

Observe that, thanks to de Morgan’s laws, we can equivalently define the \(\sigma\)-algebra to be closed under countable intersection. It is common to conflate \(\Sigma\) and \((\Omega, \Sigma)\), and call both a \(\sigma\)-algebra.
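
For intuition, here is a minimal Python sketch (our own, not from the text) that verifies the axioms by brute force on a finite set; on a finite universe, closure under countable unions reduces to closure under pairwise unions:

```python
def is_sigma_algebra(omega: frozenset, sigma: set) -> bool:
    """Brute-force check of the sigma-algebra axioms on a finite universe."""
    if frozenset() not in sigma:                               # contains the empty set
        return False
    if any(omega - s not in sigma for s in sigma):             # closed under complement
        return False
    if any(a | b not in sigma for a in sigma for b in sigma):  # closed under union
        return False
    return True

omega = frozenset({1, 2, 3, 4})
# The sigma-algebra generated by the partition {{1, 2}, {3, 4}}.
sigma = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}
print(is_sigma_algebra(omega, sigma))                                  # True
print(is_sigma_algebra(omega, {frozenset(), omega, frozenset({1})}))   # False: misses {2, 3, 4}
```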

Definition C.2 (Measurable space) Let \(\Omega\) be a set, and \(\Sigma\) a \(\sigma\)-algebra. The tuple \((\Omega, \Sigma)\) is a measurable space (or Borel space).

Definition C.3 (Measurable function) Let \((\Omega, \Sigma)\) and \((Y, T)\) be two measurable spaces. A function \(f : \Omega \to Y\) is said to be measurable if for every \(E \in T\): \[f^{-1}(E):=\{x \in \Omega \,|\, f(x) \in E \} \in \Sigma\]

A measurable function is a function between the underlying sets of two measurable spaces that preserves the structure of the spaces: the preimage of any measurable set is measurable. This is in direct analogy to the definition that a continuous function between topological spaces preserves the topological structure: the preimage of any open set is open.

Definition C.4 (Continuous function) Let \((X, \mathbb{X})\) and \((Y, \mathbb{Y})\) be two topological spaces. A function \(f : X \to Y\) is said to be continuous if the inverse image of every open subset of \(Y\) is open in \(X\). In other words, if \(V \in \mathbb{Y}\), then its inverse image \(f^{-1}(V) \in \mathbb{X}\).

Definition C.5 (Measure space) The tuple \((\Omega, \Sigma, \mu)\) is a measure space if:

  • \((\Omega, \Sigma)\) is a measurable space.
  • \(\mu\) is a measure on \((\Omega, \Sigma)\):
    • \(\mu : \Sigma \to \mathbb{R} \cup \{-\infty, +\infty\}\),
    • non-negativity: \(\mu(E) \geq 0\) for all \(E \in \Sigma\),
    • null empty set: \(\mu(\emptyset) = 0\),
    • countable additivity (or \(\sigma\)-additivity): for all countable collections \(\{E_k \}_{k=1}^\infty\) of pairwise disjoint sets in \(\Sigma\), \[\mu \left(\bigcup_{k=1}^\infty E_k\right) = \sum_{k=1}^\infty \mu(E_k)\]

Definition C.6 (Probability space) The tuple \((\Omega, \Sigma, \mathbb{P})\) is a probability space if:

  • \((\Omega, \Sigma)\) is a measurable space (\(\Omega\) is the set of outcomes of the experiment, and \(\Sigma\) is the set of events),
  • \(\mathbb{P}\) is a measure on \((\Omega, \Sigma)\):
    • \(\mathbb{P} : \Sigma \to [0,1]\),
    • null empty set: \(\mathbb{P}(\emptyset) = 0\),
    • \(\mathbb{P}\) is countably additive,
  • \(\mathbb{P}(\Omega)=1\).

I.e. a probability space is a measure space where the measure of \(\Omega\) is \(1\).

Definition C.7 (Complete probability space) A probability space \((\Omega, \Sigma, \mathbb{P})\) is complete if for every \(B \in \Sigma\) with \(\mathbb{P}(B)=0\) and every \(A \subset B\), we have \(A \in \Sigma\).

Definition C.8 (Equivalence between probability measures) Let \((\Omega, \Sigma, \mathbb{P})\) and \((\Omega, \Sigma, \mathbb{Q})\) be two probability spaces with the same \(\Omega\) and \(\Sigma\). We say that \(\mathbb{P}\) and \(\mathbb{Q}\) are equivalent if and only if for every \(A \in \Sigma\), \(\mathbb{P}(A)=0 \Leftrightarrow \mathbb{Q}(A)=0\).

It basically means that the two measures agree on which events are possible and which are impossible (even if it is pretty strange to call them equivalent).

Definition C.9 (Random variable) A (real-valued) random variable on a probability space \((\Omega, \Sigma, \mathbb{P})\) is a measurable function \(X: \Omega \to \mathbb{R}\).

Remember that, for a list of numbers \(x_1, \dots, x_n\) (the sketch after this list checks these identities numerically),

  • The mode can be defined as \(\arg\min_x \sum_i |x_i - x|^0\) (with the convention \(0^0 := 0\))
  • The median can be defined as \(\arg\min_x \sum_i |x_i - x|^1\)
  • The mean can be defined as \(\arg\min_x \sum_i |x_i - x|^2\).
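
Below is a small Python sketch of these identities (our own toy check; the data and the candidate grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=1001).astype(float)

def loss(c, p):
    # Convention for p = 0: |x_i - c|^0 counts the points different from c
    # (i.e. we set 0^0 := 0), so minimizing it picks the most frequent value.
    return np.sum(x != c) if p == 0 else np.sum(np.abs(x - c) ** p)

# Candidate minimizers: the data points plus a fine grid between them.
candidates = np.concatenate([x, np.linspace(x.min(), x.max(), 5001)])
for p in (0, 1, 2):
    best = min(candidates, key=lambda c: loss(c, p))
    print(f"p={p}: argmin = {best:.3f}")

print("median =", np.median(x), " mean =", round(np.mean(x), 3))
```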

C.1.0.1 Union bound

The union bound is used to show that the probability of the union of a finite or countable set of events (i.e. the probability that at least one of them happens) is less than or equal to the sum of the probabilities of the events.

Theorem C.1 (Union bound) For all events \(A_1, \dots, A_n \in \Sigma\): \[P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)\]

Exercise C.1 Consider Erdős–Rényi graphs \(G(n,p)\): graphs with \(n\) nodes where each pair of nodes is connected, independently, with probability \(p\). We define \(B_n\) as the event that a graph \(G(n,p)\) has at least one isolated node. Show that \(P(B_n) \leq n(1-p)^{n-1}\).

Proof. Let \(A_i\), for \(i\in[n]\), be the event that node \(i\) is isolated. From the definition of \(G(n,p)\), its probability is \((1-p)^{n-1}\): node \(i\) must fail to be connected to each of the other \(n-1\) nodes, and each potential edge is absent with probability \(1-p\). From this, applying the union bound directly, we obtain an upper bound on the probability that there is at least one isolated node in the graph: \[P(B_n) = P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i) = n(1-p)^{n-1}\]
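
The bound is easy to compare against simulation; the following Python sketch (our own, with arbitrarily chosen \(n=12\) and \(p=0.3\)) samples \(G(n,p)\) adjacency matrices directly:

```python
import numpy as np

rng = np.random.default_rng(42)

def has_isolated_node(n, p):
    # Sample a symmetric adjacency matrix of G(n, p).
    upper = np.triu(rng.random((n, n)) < p, k=1)
    adj = upper | upper.T
    return bool((adj.sum(axis=1) == 0).any())

n, p, trials = 12, 0.3, 20_000
estimate = np.mean([has_isolated_node(n, p) for _ in range(trials)])
bound = n * (1 - p) ** (n - 1)
print(f"Monte Carlo P(B_n) ~= {estimate:.4f}, union bound = {bound:.4f}")
```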

Exercise C.2 Suppose we run a randomized algorithm 4 times, where each run has success probability \(1-\delta\). Can you bound the probability that we never fail using the union bound?

Proof. Let \(f_i\) be the event that the algorithm fails at run \(i\). We know that the probability of \(f_i\) is \(\delta\) for all \(i \in [4]\). Thanks to the union bound we can bound the probability that we fail at least once: \(P(\bigcup_{i=1}^4 f_i ) \leq \sum_{i=1}^4 \delta = 4\delta\). It follows that the probability of 4 successes in a row is lower bounded by \(1-4\delta\).

Note that we could have also bypassed the union bound and computed this quantity analytically: the probability of getting 4 successes in a row is \((1-\delta)^4\), which we can expand with the binomial theorem A.2.
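
A one-line numerical comparison of the two quantities (with an arbitrarily chosen \(\delta = 0.05\)) shows how close the union bound is to the exact value:

```python
delta = 0.05                      # arbitrary failure probability per run
union_bound = 1 - 4 * delta       # lower bound on P(4 successes) from the union bound
exact = (1 - delta) ** 4          # exact probability of 4 successes
print(union_bound, exact)         # 0.8 0.81450625
```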

Definition C.10 (Variance) \[\begin{align} \operatorname{Var}(X) &= \operatorname{E}\left[(X - \operatorname{E}[X])^2\right] \\[4pt] &= \operatorname{E}\left[X^2 - 2X\operatorname{E}[X] + \operatorname{E}[X]^2\right] \\[4pt] &= \operatorname{E}\left[X^2\right] - 2\operatorname{E}[X]\operatorname{E}[X] + \operatorname{E}[X]^2 \\[4pt] &= \operatorname{E}\left[X^2 \right] - \operatorname{E}[X]^2 \end{align}\]
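
The identity in the last line is easy to verify empirically; a quick numpy sketch of our own, on an arbitrarily chosen exponential distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)   # Var(X) = scale^2 = 4

lhs = np.mean((x - x.mean()) ** 2)             # E[(X - E[X])^2]
rhs = np.mean(x ** 2) - x.mean() ** 2          # E[X^2] - E[X]^2
print(lhs, rhs)                                # both approximately 4
```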

Exercise C.3 How can we express the variance as an expectation value over quantum states? What quantum algorithm might we run to estimate the variance of an observable \(M\)? \[\langle\psi|M^2|\psi\rangle - (\langle\psi|M|\psi\rangle)^2 \] Discuss.

Definition C.11 (Exponential Family (Murphy 2012)) A probability density function or probability mass function \(p(v|\nu)\) for \(v = (v_1, \cdots, v_m) \in \mathcal{V}^m\), where \(\mathcal{V} \subseteq \mathbb{R}\), parameterized by \(\nu \in \mathbb{R}^p\), is said to be in the exponential family if it can be written as: \[p(v|\nu) := h(v)\exp \{ o(\nu)^TT(v) - A(\nu) \}\] where:

  • \(\nu \in \mathbb{R}^p\) is called the parameter of the family,
  • \(o(\nu)\) is a function of \(\nu\) (which often is just the identity function),
  • \(T(v)\) is the vector of sufficient statistics: a function that holds all the information the data \(v\) holds with respect to the unknown parameters,
  • \(A(\nu)\) is the cumulant generating function, or log-partition function, which acts as a normalization factor,
  • \(h(v) > 0\) is the base measure, which acts as a non-informative prior and is de facto a scaling constant.
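
For concreteness, the Bernoulli distribution with parameter \(\nu \in (0,1)\) is a member of the family: \[p(v|\nu) = \nu^v(1-\nu)^{1-v} = \exp\left\{ v \log\frac{\nu}{1-\nu} + \log(1-\nu) \right\},\] which matches the template with \(h(v) = 1\), \(T(v) = v\), \(o(\nu) = \log\frac{\nu}{1-\nu}\) (the log-odds), and \(A(\nu) = -\log(1-\nu)\).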

C.1.0.2 Bias-variance tradeoff

Here is a nice reference to understand the bias-variance tradeoff.

C.1.1 Boosting probabilities with the “median lemma” (or powering lemma)

In this section we discuss the following result, widely known in computer science. It is used not only in writing algorithms, but also in complexity theory.

Lemma C.1 (Powering lemma (Jerrum, Valiant, and Vazirani 1986)) Let \(\mathcal{A}\) be a quantum or classical algorithm which aims to estimate some quantity \(\mu\), and whose output \(\widetilde{\mu}\) satisfies \(|\mu-\widetilde{\mu} |\leq \epsilon\) except with probability \(\gamma\), for some fixed \(\gamma < 1/2\). Then, for any \(\delta > 0\), it suffices to repeat \(\mathcal{A}\) \(O(\log (1/\delta))\) times and take the median to obtain an estimate which is accurate to within \(\epsilon\) with probability at least \(1-\delta\).
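
To see the lemma in action, here is a toy Python simulation of our own (the values of \(\mu\), \(\epsilon\), \(\gamma\), and the "bad outcome" are arbitrary): the failure probability of the median decays exponentially with the number of repetitions.

```python
import numpy as np

rng = np.random.default_rng(7)

def noisy_estimate(mu=1.0, eps=0.1, gamma=0.25):
    """Toy estimator: within eps of mu, except with probability gamma."""
    if rng.random() < gamma:
        return mu + 10.0                    # arbitrary bad outcome
    return mu + rng.uniform(-eps, eps)

mu, eps, trials = 1.0, 0.1, 20_000
for reps in (1, 5, 15, 31):
    fails = sum(
        abs(np.median([noisy_estimate() for _ in range(reps)]) - mu) > eps
        for _ in range(trials)
    )
    print(f"reps={reps:2d}  empirical failure probability = {fails / trials:.4f}")
```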

C.2 Markov chains

Useful resources: here, here.

Definition C.12 (Markov chain (Serfozo 2009)) Let \((X_t)_{t \in I}\) be a stochastic process defined over a probability space \((\Omega, \Sigma, \mathbb{P})\), for a countable set \(I\), where the \(X_t\) are random variables on a set \(\mathcal{S}\) (called the state space). Then \((X_t)_{t \in I}\) is a Markov chain if, for any \(j \in \mathcal{S}\) and \(t \geq 0\), it holds that \[\mathbb{P}[X_{t+1} = j | X_0, X_1, \dots X_t] = \mathbb{P}[X_{t+1} =j | X_t]\] and for all \(j,i \in \mathcal{S}\), it holds that \[\mathbb{P}[X_{t+1} = j | X_t = i] = p_{ij},\] where \(p_{ij}\) is the transition probability for the Markov chain to go from state \(i\) to state \(j\).

Less formally, a Markov chain is a stochastic process with the Markov property, for which we can just use a matrix \(P\) to identify its transition probability. Most of the time, we will discretize the state space \(\mathcal{S}\), so we can label elements of \(\mathcal{S}\) with integers \(i \in [|\mathcal{S}|]\). This fact will allow us to conflate the (push-forward) measure \(\mathcal{P}\) on \(\mathcal{S}\) and the matrix \(P\).

A state \(j\) is said to be accessible from \(i\) (written \(i \to j\)) if \(P_{ij}^t > 0\) for some \(t\), where \(P^t\) is the \(t\)-th power of the transition matrix \(P\). Two states \(i,j\) are said to communicate if they are mutually accessible; communication is an equivalence relation between states (relatively simple to prove), and its equivalence classes are called communication classes.

Definition C.13 (Irreducible Markov chain) A Markov chain \((X_t)_{t \in I}\) is irreducible if and only if

  • for all \(i,j \in \mathcal{S}\) there exists some integer \(t \in I\) such that \(p^t_{ij} > 0\), i.e. such that \(P[X_t =j| X_0 = i] > 0\),
  • there is only one communication class.

The previous conditions are equivalent.

In terms of random walks, irreducibility means the following: if the graph is undirected, the graph has only one connected component (i.e. it is connected); if the graph is directed, the graph is strongly connected.
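
Under the discretization above, irreducibility can be checked directly from the transition matrix; below is a minimal numpy sketch of our own (the helper name and example chains are ours) that tests mutual accessibility via matrix powers:

```python
import numpy as np

def is_irreducible(P):
    """Test whether every state is accessible from every other, i.e. whether
    sum_{t=0}^{n-1} P^t has all entries positive (paths never need more
    than n - 1 steps on n states)."""
    n = P.shape[0]
    reach = np.eye(n, dtype=bool)
    power = np.eye(n)
    for _ in range(n - 1):
        power = power @ P
        reach |= power > 0
    return bool(reach.all())

# A 3-state directed cycle: a single communication class.
P_irred = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])
# Two absorbing states: more than one communication class.
P_red = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
print(is_irreducible(P_irred), is_irreducible(P_red))  # True False
```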

C.3 Distributions

This is a beautiful guide that shows you how to draw samples from a probability distribution.

C.4 Concentration inequalities

Take a look at this and this. Also recall that the union bound was presented in the section dedicated to probability theory, i.e. Theorem C.1.

C.4.1 Markov inequality

The Markov inequality gives an upper bound on the probability that a non-negative function of a random variable is greater than or equal to a positive constant. Especially in analysis, people refer to it as Chebyshev’s inequality (sometimes calling it the first Chebyshev inequality, while referring to the “usual” Chebyshev’s inequality as the second Chebyshev inequality or Bienaymé–Chebyshev inequality).

Theorem C.2 (Markov inequality) For any non-negative random variable \(X\) and \(a > 0\), we have that:

  • \(Pr(X \geq a) \leq \frac{E[X]}{a}\)
  • \(Pr(X \geq aE[X]) \leq \frac{1}{a}\)

Proof. Observe that: \[E[X] = P(X < a) \cdot E[X|X<a] + P(X \geq a) \cdot E[X|X \geq a]\] As both terms are non-negative (using the non-negativity hypothesis), we have that \[E[X] \geq P(X \geq a) \cdot E[X|X \geq a] \]

Now it is easy to observe that \(E[X|X \geq a]\) is at least \(a\), and by rearranging we obtain that: \[ \frac{E[X]}{a} \geq P(X \geq a) \]

The second statement of the theorem follows by substitution, i.e. setting \(b=aE[X]\) and using the previous statement on \(Pr(X \geq b)\).
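
A quick empirical check of the inequality (our own sketch, on an arbitrarily chosen non-negative distribution with \(E[X]=1\)):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=200_000)   # non-negative, E[X] = 1

for a in (2.0, 4.0, 8.0):
    print(f"P(X >= {a}) = {np.mean(x >= a):.5f} <= E[X]/a = {1/a:.5f}")
```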

A very useful corollary of Markov inequality is the following.

Theorem C.3 (Corollary of Markov inequality) Let \(f\) be a monotone increasing, non-negative function on a space \(I\), and let \(Y\) be a random variable taking values in \(I\). Then, for any \(b \in I\) with \(f(b) > 0\): \[ P(Y \geq b) \leq \frac{E[f(Y)]}{f(b)} \]

C.4.2 Chebyshev inequality

This inequality tells us that the probability of finding the random variable \(X\) far from its mean \(\mathbb{E}[X]\) is bounded by the variance of \(X\).

Theorem C.4 (Chebyshev inequality) Let \(X\) be a random variable with finite mean \(\mu\) and variance \(\sigma^2\). For \(\epsilon > 0\):

\[Pr[|X - \mathbb{E}[X]| \geq \epsilon]\leq \frac{\sigma^2}{\epsilon^2}\]

Moreover, if \(k = \epsilon /\sigma\) we can replace \(\epsilon\) with \(k\sigma\) and obtain:

\[Pr[|X - \mathbb{E}[X]| \geq k\sigma] \leq \frac{1}{k^2}\]

Proof. Observe that \((X-\mu)^2\) is a non-negative random variable. Therefore, we can apply Markov inequality to get: \[P(|X-\mu| \geq \epsilon) = P((X-\mu)^2 \geq \epsilon^2) \leq \frac{E[(X-\mu)^2]}{\epsilon^2} = \frac{Var(X)}{\epsilon^2}\]

It is very useful to see what happens when we define a new random variable \(Y\) as the sample mean of \(n\) independent and identically distributed (i.i.d.) random variables \(X_1 \dots X_n\): \(Y= \frac{1}{n}\sum_i^n X_i\). The expected value of \(Y\) is the same as the expected value of each \(X_i\), but the variance is now linearly smaller in the number of samples:

\[E[Y] = \frac{1}{n} \sum_i^n E[X_i] = \mathbb{E}[X_i] \text{ for any } i\] \[Var[Y] = \frac{1}{n^2} \sum_i^n \text{Var}[X_i] = \frac{Var[X_i]}{n} \text{ for any } i\]

This allows us to obtain the following bound:

Theorem C.5 (Chebyshev inequality for sample mean) Let \(Y= \frac{1}{n}\sum_i^n X_i\), where the \(X_i\) are i.i.d. with variance \(\sigma^2\). Then,

\[Pr[|Y - E[Y]| \geq \epsilon]\leq \frac{\sigma^2}{n\epsilon^2}\]
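
The \(1/n\) decay is visible in simulation; a small numpy sketch of our own (standard normal samples, arbitrary \(\epsilon = 0.2\)):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, eps, trials = 1.0, 0.2, 10_000

for n in (10, 100, 1000):
    y = rng.normal(0.0, 1.0, size=(trials, n)).mean(axis=1)   # sample means
    empirical = np.mean(np.abs(y) >= eps)
    bound = sigma2 / (n * eps**2)                             # vacuous when > 1
    print(f"n={n:4d}  P(|Y - E[Y]| >= eps) = {empirical:.4f}  bound = {bound:.3f}")
```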

C.4.3 Weak Law of large numbers

Theorem C.6 (Weak Law of large numbers) Let \(X_1, X_2, \dots, X_n\) be i.i.d random variables with a finite expected value \(\mathbb{E}[X_i]=\mu < \infty\), and let \(\overline{X}\) be the average \(\frac{1}{n}\sum_i^n X_i\). Then, for any \(\epsilon > 0\), we have that:

\[\lim_{n\to +\infty} P\left( |\overline{X} - \mu | \geq \epsilon \right) = 0\]

Proof. We know that \(E[\overline{X}] = \mu\) and \(Var(\overline{X}) = \frac{\sigma^2}{n}\) (this proof additionally assumes the \(X_i\) have finite variance \(\sigma^2\); the theorem holds without this assumption). By Chebyshev inequality for the sample mean (Theorem C.5):

\[P(|\overline{X}-\mu| > \epsilon) \leq \frac{Var(\overline{X})}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}\]

Trivially, \(\lim_{n \to \infty} \frac{\sigma^2}{n\epsilon^2} = 0\), concluding the proof.

C.4.4 Strong Law of Large Numbers

Theorem C.7 ((Strong) Law of large numbers) Let \(X_1,X_2,X_3,\dots,X_n\) be i.i.d random variables with mean \(\mu\). Let \(\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\) be the sample mean. Then, \(\overline{X}_n\) converges almost surely to \(\mu\): \[P(\lim_{n \to \infty}\overline{X}_n = \mu) = 1\]

Remark. The SLLN implies WLLN but not vice-versa.

C.4.5 Chernoff bound

From here and here, here, and here

We focus on a restricted class of random variables, i.e. the case when our random variable is obtained as the sum of other independent random variables. The central limit theorem says that, as \(n \to \infty\), the value \(\frac{X-\mu}{\sigma}\) approaches the standard normal distribution \(N(0,1)\). However, it does not give any information on the rate of convergence.

Theorem C.8 (Chernoff bound) Let \(X=\sum_i^n X_i\) where \(X_i =1\) with probability \(p_i\) and \(X_i=0\) with probability \((1-p_i)\), and all \(X_i\) are independent. Let \(\mu=E[X] = \sum_i^n p_i\). Then:

  • Upper tail: \(P(X \geq (1+\delta)\mu) \leq e^{-\frac{\delta^2}{2+\delta}\mu}\) for all \(\delta > 0\)
  • Lower tail: \(P(X \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}\) for all \(0 \leq \delta \leq 1\)

You can find a nice proof here.

Theorem C.9 (Chernoff bound) Suppose \(X_1, \dots, X_t\) are independent random variables taking values in \(\{0,1\}\), and let \(M_t= (X_1 + \dots + X_t)/t\) denote their average value, with \(\mu = \mathbb{E}[M_t]\). Then for any \(0 < \epsilon < 1\),

  • (Multiplicative) \(Pr[M_t - \mu \leq -\epsilon \mu] \leq e^{-\frac{t\mu\epsilon^2}{2}}\) and \(Pr[M_t - \mu \geq \epsilon \mu] \leq e^{-\frac{t\mu\epsilon^2}{3}}\)
  • (Additive) \(Pr[M_t - \mu \leq -\epsilon ] \leq e^{-2t\epsilon^2}\) and \(Pr[M_t - \mu \geq \epsilon ] \leq e^{-2t\epsilon^2}\)

Remark. Trick: if our random variables are not between \(0\) and \(1\), we can rescale them as \(Y_i=X_i/\max_i(X_i)\).
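
The additive form is easy to check against simulation; a small numpy sketch of our own (the values of \(p\), \(t\), and \(\epsilon\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
p, t, eps, trials = 0.5, 200, 0.1, 50_000

means = rng.binomial(t, p, size=trials) / t        # average of t Bernoulli(p) draws
empirical = np.mean(means - p >= eps)
bound = np.exp(-2 * t * eps**2)
print(f"P(M_t - mu >= eps) = {empirical:.5f} <= exp(-2 t eps^2) = {bound:.5f}")
```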

C.4.6 Hoeffding inequality

Lemma C.2 (Hoeffding inequality) Let \(X_1,\ldots,X_k\) be independent random variables bounded by the interval \([a, b]\). Define the empirical mean of these variables by \(\overline{X} = \frac{1}{k} (X_1+\cdots+X_k)\), then \[Pr(|\overline{X} - \mathbb{E}[\overline{X}]|\leq \epsilon) \geq 1-2 \exp\left(-\frac{2k\epsilon^2}{(b-a)^2} \right).\] Consequently, if \(k\geq \frac{(b-a)^2}{2\epsilon^2}\log(2/\eta)\), then \(\overline{X}\) provides an \(\epsilon\)-approximation of \(\mathbb{E}[\overline{X}]\) with probability at least \(1-\eta\).
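
In practice one often inverts the bound to pick the number of samples; here is a one-function sketch of our own (the helper name is ours) for variables in \([a,b]\):

```python
import numpy as np

def hoeffding_samples(eps, eta, a=0.0, b=1.0):
    """Samples needed so the empirical mean is eps-close to the true mean
    with probability at least 1 - eta, by the Hoeffding bound above."""
    return int(np.ceil((b - a) ** 2 * np.log(2 / eta) / (2 * eps ** 2)))

print(hoeffding_samples(eps=0.01, eta=0.05))   # 18445
```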

Exercise C.4 Suppose the number of red lights Alex encounters on his way to work each day is on average \(4.8\) (according to historical trips to work). Alex will be late if he encounters \(8\) or more red lights. Let \(X\) be the number of red lights he gets on a given day.

  1. Give a bound for \(P(X \geq 8)\) using Markov’s inequality.
  2. Give a bound for \(P(X \geq 8)\) using Chebyshev’s inequality, if we also assume \(Var(X) = 2.88\).
  3. Give a bound for \(P (X \geq 8)\) using the Chernoff bound. Assume that \(X \sim Bin(12, 0.4)\), that is, there are 12 traffic lights, and each is independently red with probability \(0.4\).
  4. Compute \(P(X \geq 8)\) exactly using the assumption from the previous part.
  5. Compare the three bounds and their assumptions.

Proof. We apply all the inequalities learned in this section and discuss them at the end.

  1. Since \(X\) is nonnegative and we know its expectation, we can apply Markov’s inequality: \[P(X \geq 8) \leq \frac{\mathbb{E}[X]}{8} = \frac{4.8}{8} = 0.6\]

  2. Since we know the variance of \(X\), we can apply Chebyshev’s inequality after some manipulation. We have to do this to match the form required: \[P(X \geq 8) \leq P(X \geq 8) + P(X \leq 1.6) = P(|X-4.8| \geq 3.2)\] The reason we chose \(X \leq 1.6\) is so it looks like \(P (|X − \mu| \geq \alpha)\). Now, applying Chebyshev’s gives: \[\leq \frac{Var(X)}{3.2^2} = \frac{2.88}{3.2^2} = 0.28125\]

  3. Actually, \(X \sim Bin(12, 0.4)\) also has \(E[X] = np = 4.8\) and \(Var(X) = np(1 − p) = 2.88\) (what a coincidence). The Chernoff bound requires something of the form \(P(X \geq (1 + \delta)\mu)\), so we first need to solve for \(\delta\): \((1 + \delta)\cdot 4.8 = 8\), so that \(\delta = 2/3\). Now, \[P(X \geq 8) = P\left(X \geq \left(1+\frac{2}{3}\right)\cdot 4.8\right) \leq \exp\left(-\frac{(2/3)^2 \cdot 4.8}{3}\right) \approx 0.4911\]

  4. The exact probability can be found by summing the binomial probability mass function: \[P(X \geq 8) = \sum_{k=8}^{12} {12\choose k} 0.4^k \, 0.6^{12-k} \approx 0.0573\]

  5. Usually the bounds are tighter as we move down the list Markov, Chebyshev, Chernoff. But in this case Chebyshev’s gave us the tightest bound, even after being weakened by including some additional \(P(X \leq 1.6)\). Chernoff bounds will typically be better for farther tails - 8 isn’t considered too far from the mean \(4.8\). It’s also important to note that we found out more information progressively - we can’t blindly apply all these inequalities every time. We need to make sure the conditions are satisfied.

Remarkably, note that even our best bound of \(0.28125\) was \(5-6\) times larger than the true probability of \(0.0573\).
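
All four numbers from this exercise can be reproduced in a few lines (a sketch of our own, using only Python's standard library):

```python
from math import comb, exp

mu, var, n, p, a = 4.8, 2.88, 12, 0.4, 8

markov = mu / a                                   # Markov: E[X]/a
chebyshev = var / (a - mu) ** 2                   # Chebyshev: |X - mu| >= 3.2
delta = a / mu - 1                                # Chernoff: (1 + delta) * mu = 8
chernoff = exp(-delta**2 * mu / 3)
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, n + 1))

print(markov, chebyshev, round(chernoff, 4), round(exact, 4))
# 0.6 0.28125 0.4911 0.0573
```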

G References

Jerrum, Mark R., Leslie G. Valiant, and Vijay V. Vazirani. 1986. “Random Generation of Combinatorial Structures from a Uniform Distribution.” Theoretical Computer Science 43: 169–88.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.

Serfozo, Richard. 2009. Basics of Applied Stochastic Processes. Springer Science & Business Media.