# C Probability

## C.1 Measure theory

**Definition C.1 (Sigma algebra) **Let \(\Omega\) be a set, and \(\Sigma\) be a subset of the power set of \(\Omega\) (or equivalently a collection of subsets of \(X\)). Then \(\Sigma\) is a \(\sigma\)-algebra if:

- \(\emptyset\in \Sigma\),
- \(\Sigma\) is closed under countable union,
- \(\forall S \in \Sigma, \overline{S} \in \Sigma\).

Observe that thanks to de Morgan’s law, we can equivalently define the sigma algebra to be closed under countable intersection. Oftentimes, it’s common to conflate \(\Sigma\) and \((\Omega, \Sigma)\), and call both \(\sigma\)-algebra.

**Definition C.2 (Measurable space)**Let \(\Omega\) be a set, and \(\Sigma\) a \(\sigma\)-algebra. The tuple \((\Omega, \Sigma)\) is a measurable space (or Borel space).

**Definition C.3 (Measurable function)**Let \((\Omega, \Sigma)\), and \((Y, T)\) two different measurable space. A function \(f : \Omega \mapsto Y\) is said to be measurable if for every \(E \in T\): \[f^{-1}(E):=\{x \in \Omega | f(x) \in E \} \in \Sigma\]

A measurable function is a function between the underlying sets of two measurable spaces that preserves the structure of the spaces: the preimage of any measurable set is measurable. This is in direct analogy to the definition that a continuous function between topological spaces preserves the topological structure: the preimage of any open set is open.

**Definition C.4 (Continious function)**Let \((X, \mathbb{X}), (Y, \mathbb{Y})\) two topological spaces. Let \(f\) be a function between two topological spaces \(f : X \mapsto Y\) is said to be continious if the inverse image of every open subset of \(Y\) is open in \(X\). In other words, if \(V \in \mathbb{Y}\), then its inverse image \(f^{-1}(V) \in \mathbb{X}\)

**Definition C.5 (Measure space) **The tuple \((\Omega, \Sigma, \mathbb{P})\) is a measure space if:

- \((\Omega, \Sigma)\) is a measurable space.
- \(\mu(E)\) is a measure on \((\Omega, \Sigma)\):
- \(\mu : \Sigma \mapsto \mathbb{R}+\{-\infty, +\infty\}\)
- non-negativity: \(\mu(E) \geq 0 \forall E \in \Sigma\)
- Null empty set \(\mu(\emptyset )= 0\)
- Coutable additivity (or \(\sigma\)-additivity): for all countable collections \(\{E_k \}_{k=1}^\infty\) of pariwise disjoint sets in \(\Sigma\), \[\mu \left(\cup_{k=1}^\infty E_k\right) = \sum_{k=1}^\infty \mu(E_k)\]

**Definition C.6 (Probability space) **The tuple \((\Omega, \Sigma, \mathbb{P})\) is a probability space if:

- \((\Omega, \Sigma)\) is a \(\sigma\)-algebra (\(\Omega\) is the set of
*outcomes*of the experiment, and \(\Sigma\) is the set of*events*) - \(\mathbb{P}\) is a measurable function:
- \(\mathbb{P} : \Sigma \mapsto [0,1]\)
- Null empty set.
- \(\mu\) is countably additive.

- \(\mathbb{P}(\Omega)=1\)

I.e. a probability space is a measure space where the measurable function on \(\Omega\) is \(1\).

**Definition C.7 (Complete probability space)**For \(B \subset \Sigma\) s.t. \(\mathbb{P}(B)=0\), a \((\Omega, \Sigma, \mathbb{P})\) probability space is complete if \(\forall A \subset B\), \(A \in \Sigma\).

**Definition C.8 (Equivalence between probability measures)**Let \((\Omega, \Sigma, \mathbb{P}), (\Omega, \Sigma, \mathbb{Q})\) two probability space with the same \(\Omega\) and \(\Sigma\). We say that \(\mathbb{P}\) and \(\mathbb{Q}\) are equivalent iif for every \(A \in \Sigma\), \(\mathbb{P}(A)=0 \Leftrightarrow \mathbb{Q}(A)=0\).

It basically means that the two measures agree on the possible and impossible events (even if it is pretty strange to call them equivalent).

**Definition C.9 (Random variable)**A (real-valued) random variable on a probability space \((\Omega, \Sigma, \mathbb{P})\) is a measurable function \(X: \Omega \mapsto \mathbb{R}\).

#### C.1.0.1 Union bound

The union bound is used to show that the probability of union fo some finte or countable set of events is less than some value.

**Theorem C.1 (Union bound)**\(\forall\) events \(A_1 \dots A_n \in \Sigma\): \[P(\cup_{i=1}^n A_i) \leq \sum_{i=1}^n P(A_i)\]

**Exercise C.1**In Erdős–Rényi graphs \(G(n,p)\), (that is, a graph with \(n\) nodes with probability \(p\) that each of the two nodes are connected). We define the event \(B_n\) as the event where a graph \(G(n,p)\) has at least one isolated node. Show that \(P(B_n) \leq n(1-p)^{n-1}\).

*Proof.*Let \(A_i, i\in[n]\) the event that node \(i\) is isoldated. Its probability, from the definition of \(G(n,p)\) is \((1-p)^{n-1}\), because there might be an edge with probability \(p\) with other \(n-1\) nods. From this, applying directly the union bonund we obtain an upper bound on the probability that there is at least one isoldated node is in the graph: \[P(B_n) = P(\cup_{i=1}^n A_i) \leq \sum_i P(A_i) \leq nP(A_i) = n(1-p)^{n-1}\]

**Exercise C.2**Suppose we run 4 times a randomized algorithm, with success probability \(1-\delta\). Can you bound the probability that we never fail using the union bound?

*Proof. * Let \(f_i\) the event that we fail running our algorithm at time \(i\).
We know that the failure probability \(f_i\) is \(\delta\) for all \(i \in [4]\). Thanks to the union bound we can bound the probability that we fail at least once:
\(P(\cup_i^k f_i ) \leq \sum_i^4 \delta = 4\delta\).
It follows that the have 4 success in a row is *lower* bounded by \(1-4\delta\).

**Definition C.10 (Variance)**\[\begin{align} \operatorname{Var}(X) &= \operatorname{E}\left[(X - \operatorname{E}[X])^2\right] \\[4pt] &= \operatorname{E}\left[X^2 - 2X\operatorname{E}[X] + \operatorname{E}[X]^2\right] \\[4pt] &= \operatorname{E}\left[X^2\right] - 2\operatorname{E}[X]\operatorname{E}[X] + \operatorname{E}[X]^2 \\[4pt] &= \operatorname{E}\left[X^2 \right] - \operatorname{E}[X]^2 \end{align}\]

**Exercise C.3**How can we express the variance as expectation of quantum states? What quantum algorithm might we run to estimate the variance of a random variable \(M\)? \[\langle\psi|M^2|\psi\rangle - (\langle\psi|M|\psi\rangle)^2 \] Discuss.

**Definition C.11 (Exponential Family (Murphy 2012)) **A probability density function or probability mass function \(p(v|\nu)\) for \(v = (v_1, \cdots, v_m) \in \mathcal{V}^m\), where \(\mathcal{V} \subseteq \mathbb{R}\), is a \(\sigma\)-algebra over a set \(X\) \(\nu \in \mathbb{R}^p\) is said to be in the exponential family if it can be written as:
\[p(v|\nu) := h(v)\exp \{ o(\nu)^TT(v) - A(\nu) \}\]
where:

- \(\nu \in \mathbb{R}^p\) is called the parameter of the family,
- \(o(\nu)\) is a function of \(\nu\) (which often is just the identity function),
- \(T(v)\) is the vector of sufficient statistics: a function that holds all the information the data \(v\) holds with respect to the unknown parameters,
- \(A(\nu)\) is the cumulant generating function, or log-partition function, which acts as a normalization factor,
- \(h(v) > 0\) is the which is a non-informative prior and de-facto is scaling constant.

#### C.1.0.2 Bias-variance tradeoff

Here is a nice reference to understand the bias-variance tradeoff

### C.1.1 Boosting probabilities with “median lemma” (or powering lemma )

In this section we discuss the following, widely known result in CS. It’s used not only in writing algorithms, but also in complexity theory.

**Lemma C.1 (Powering lemma (Jerrum, Valiant, and Vazirani 1986))**Let \(\mathcal{A}\) be a quantum or classical algorithm which aims to estimate some quantity \(\mu\), and whose output \(\widetilde{\mu}\) satisfies \(|\mu-\widetilde{\mu} |\leq \epsilon\) except with probability \(\gamma\), for some fixed \(\gamma \leq 1/2\). Then, for any \(\gamma > 0\) it suffices to repeat \(\mathcal{A}\) \(O(\log 1/\delta)\) times and take the median to obtain an estimate which is accurate to within \(\epsilon\) with probability at least \(1-\delta\).

## C.2 Distributions

This is a beautiful guide that shows you how to draw samples from a probability distribution.

## C.3 Concentration inequalities

### C.3.1 Markov inequality

The Markov inequality is an *upper bound for the probability that a non-negative function of a random variable*, that is greater than or equal to a positive constant. Especially in analysis, people refer to it as Chebyshev’s inequality (sometimes, calling it the *first* Chebyshev inequality, while referring to the “usual” Chebyshev’s inequality as the second Chebyshev inequality or Bienaym{'e}’s inequality).

**Theorem C.2 (Markov inequality) **For all non-negative random variable, and \(a > 0\), we have that:

- \(Pr(X \geq a) \leq \frac{E[X]}{a}\)
- \(Pr(X \geq aE[X]) \leq \frac{1}{a}\)

*Proof. * Observe that :
\[E[X] = P(X < a) \cdot E[X|X<a] + P(X > a) \cdot E[X|X>a]\]
As both of these expected values are bigger than zero, (using the nonnegativity hypothesis) we have that
\[E[X] \geq P(X > a) \dot E[X|X>a] \]

Now is easy to observe that \(E[X|X>a]\) is at least \(a\), and by rearranging we obtain that: \[ \frac{E[X]}{a} \geq P(X > a) \]

The second statement of the theorem follows from substitution, i.e. setting \(b=aE[X]\) and using the previous statement.**Theorem C.3 (Corollary of Markov inequality)**Let \(f\) be a monotone increasing (or noll) function on a space \(I\), and define the random variable on \(Y\). \[ P(Y \geq b) \leq \frac{E[f(Y)]}{f(b)} \]

### C.3.2 Chebyshev inequality

This inequality tells us about the probability of finding the random variable \(X\) away from the mean \(E[X]\) is bounded by the variance of \(X\).

**Theorem C.4 (Chebyshev inequality) **Let \(X\) be a random variable with finite mean and variance, and \(\epsilon > 0\). Then:
\[Pr[|X - E[X]| \geq \epsilon]\leq \frac{\sigma^2}{\epsilon^2}\]

Moreover, if \(k = \epsilon /\sigma\) we can replace \(\epsilon\) with \(k\sigma\) and obtain:

\[Pr[|X - E[X]| \geq k\sigma] \leq \frac{1}{k^2}\]

It is very useful to see what happen when we define a new random variable \(Y\) as the sample mean of \(X_1 \dots X_n\) other random variables (iid) indipendent and identically distributed: \(Y= \frac{1}{n}\sum_i^n X_i\). The expected value of \(Y\) is the same as the expected value of \(X\), but the variance is now linearly smaller in the number of samples:

\[E[Y] = \frac{1}{n} \sum_i^n E[X]\] \[Var[Y] = \frac{1}{n^2} \sum_i^n \text{Var}[X]\]

This allows us to obtain the following bound:

**Theorem C.5 (Chebyshev inequality for sample mean) **Let \(Y= \frac{1}{n}\sum_i^n X_i\). Then,

### C.3.3 (Weak) Law of large numbers

**Theorem C.6 ((Weak) Law of large numbers) **Let \(X_1, X_2, \dots, X_n\) be i.i.d random variables with a finite expected value \(X[X_i]=\mu \leq \infty\). Then, for any \(\epsilon > 0\), we have that:

\[\lim_{n\to +\infty} P\left( |\overline{X} - \mu | \geq \epsilon \right) = 0\]

### C.3.4 Chernoff bound

**Theorem C.7 (Chernoff bound) **Let \(X=\sum_i^n X_i\) where \(X_i =1\) with probability \(p_i\) and \(X_i=0\) with probability \((1-p_i)\), and all \(X_i\) are independent. Let \(\mu=E[X] = \sum_i^n p_i\). Then:

*Upper tail*: \(P(X \geq (1+\delta)\mu) \leq e^-\frac{\delta^2}{2+\delta}\mu\) for all \(\delta > 0\)*Lower tail*: \(P(X \leq (1-\delta)\mu) \leq e^{\mu\delta^2/2}\) for all \(0 \leq \delta \leq 1\)

**Theorem C.8 (Chernoff bound) **Suppose \(X_1, \dots, X_t\) are independent random variables taking values in \(\{0,1\}\). Let \(M_t= (X_1 + \dots X_t)/t\) denote their average value. Then for any \(0 < \epsilon < 1\),

- (Multiplicative) \(Pr[M_t - \mu \leq -\epsilon \mu] \leq exp^{-\frac{t\mu\epsilon^2}{2}}\) and \(Pr[M_t - \mu \geq \epsilon \mu] \leq exp^{-\frac{t\mu\epsilon^2}{3}}\)
- (Additive) \(Pr[M_t - \mu \leq -\epsilon ] \leq exp^{-2t\epsilon^2}\) and \(Pr[M_t - \mu \geq \epsilon ] \leq exp^{-2t\epsilon^2}\)

### C.3.5 Hoeffding inequality

**Lemma C.2 (Hoeffding inequality)**Let \(X_1,\ldots,X_k\) be independent random variables bounded by the interval \([a, b]\). Define the empirical mean of these variables by \(\overline{X} = \frac{1}{k} (X_1+\cdots+X_k)\), then \[ Pr(|\overline{X} - \mathbb{E}[X]|\leq \epsilon) \geq 1-2 \exp\left(-\frac{2k\epsilon^2}{b-a} \right). \] Consequently, if \(k\geq (b-a)\epsilon^{-2}\log(2/\eta)\), then \(\overline{X}\) provides an \(\epsilon\)-approximation of \(\mathbb{E}[X]\) with probability at least \(1-\eta\).