C Probability
C.1 Measure theory
Definition C.1 (Sigma algebra) Let \(\Omega\) be a set, and \(\Sigma\) be a subset of the power set of \(\Omega\) (or equivalently, a collection of subsets of \(\Omega\)). Then \(\Sigma\) is a \(\sigma\)-algebra if:
- \(\emptyset\in \Sigma\),
- \(\Sigma\) is closed under countable union,
- \(\forall S \in \Sigma, \overline{S} \in \Sigma\).
Observe that, thanks to De Morgan's laws, we can equivalently define a \(\sigma\)-algebra to be closed under countable intersection. It is common to conflate \(\Sigma\) and the pair \((\Omega, \Sigma)\), and to call both a \(\sigma\)-algebra.
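For a finite ground set, the countable-union axiom reduces to closure under pairwise unions, so the definition can be checked mechanically. A minimal sketch in Python (the function name `is_sigma_algebra` and the example collections are our own):

```python
from itertools import combinations

def is_sigma_algebra(omega, sigma):
    """Check the sigma-algebra axioms for a finite collection of
    frozensets `sigma` over the finite ground set `omega`."""
    omega = frozenset(omega)
    sigma = set(sigma)
    if frozenset() not in sigma:                    # contains the empty set
        return False
    if any(omega - s not in sigma for s in sigma):  # closed under complement
        return False
    # for finite collections, countable union reduces to pairwise union
    return all(a | b in sigma for a, b in combinations(sigma, 2))

omega = {1, 2, 3, 4}
trivial = {frozenset(), frozenset(omega)}                  # smallest sigma-algebra
split   = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(omega)}
broken  = {frozenset(), frozenset({1}), frozenset(omega)}  # misses the complement {2,3,4}
print(is_sigma_algebra(omega, trivial), is_sigma_algebra(omega, split),
      is_sigma_algebra(omega, broken))  # True True False
```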
A measurable function is a function between the underlying sets of two measurable spaces that preserves the structure of the spaces: the preimage of any measurable set is measurable. This is in direct analogy to the definition that a continuous function between topological spaces preserves the topological structure: the preimage of any open set is open.
Definition C.5 (Measure space) The tuple \((\Omega, \Sigma, \mu)\) is a measure space if:
- \((\Omega, \Sigma)\) is a measurable space.
- \(\mu\) is a measure on \((\Omega, \Sigma)\):
- \(\mu : \Sigma \mapsto \mathbb{R} \cup \{+\infty\}\)
- non-negativity: \(\mu(E) \geq 0\) for all \(E \in \Sigma\)
- null empty set: \(\mu(\emptyset) = 0\)
- countable additivity (or \(\sigma\)-additivity): for all countable collections \(\{E_k \}_{k=1}^\infty\) of pairwise disjoint sets in \(\Sigma\), \[\mu \left(\bigcup_{k=1}^\infty E_k\right) = \sum_{k=1}^\infty \mu(E_k)\]
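As a concrete sanity check, the counting measure \(\mu(E) = |E|\) on a finite set satisfies all three axioms. A small sketch (our own example, taking the power set as the \(\sigma\)-algebra):

```python
omega = {1, 2, 3, 4, 5, 6}

def mu(E):
    """Counting measure: the measure of a set is its cardinality."""
    return len(E)

# null empty set
assert mu(set()) == 0
# non-negativity on a few events
assert all(mu(E) >= 0 for E in [set(), {1}, {2, 4, 6}])
# additivity over a finite collection of pairwise disjoint sets
parts = [{1, 2}, {3}, {4, 5, 6}]
union = set().union(*parts)
print(mu(union), sum(mu(E) for E in parts))  # 6 6
```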
Definition C.6 (Probability space) The tuple \((\Omega, \Sigma, \mathbb{P})\) is a probability space if:
- \((\Omega, \Sigma)\) is a measurable space (\(\Omega\) is the set of outcomes of the experiment, and \(\Sigma\) is the set of events)
- \(\mathbb{P}\) is a measure on \((\Omega, \Sigma)\):
- \(\mathbb{P} : \Sigma \mapsto [0,1]\)
- null empty set: \(\mathbb{P}(\emptyset) = 0\)
- \(\mathbb{P}\) is countably additive
- \(\mathbb{P}(\Omega)=1\)
I.e. a probability space is a measure space where the measure of the whole set \(\Omega\) is \(1\).
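The simplest example is a fair six-sided die: \(\Omega = \{1,\dots,6\}\), \(\Sigma\) the power set, and \(\mathbb{P}(E) = |E|/6\). A quick check of the normalization and additivity conditions (our own example):

```python
from fractions import Fraction

omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(E):
    """Uniform probability measure on a fair die: P(E) = |E| / |Omega|."""
    return Fraction(len(E), len(omega))

even = {2, 4, 6}
odd = {1, 3, 5}
print(prob(omega))             # P(Omega) = 1
print(prob(even) + prob(odd))  # disjoint events add up: 1/2 + 1/2 = 1
print(prob(set()))             # null empty set: 0
```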
Two measures are said to be equivalent if they agree on the possible and impossible events, i.e. they have the same null sets (even if it is pretty strange to call them equivalent).
C.1.0.1 Union bound
The union bound is used to show that the probability of the union of some finite or countable set of events is at most the sum of their probabilities.
Proof. Let \(f_i\) be the event that we fail running our algorithm at time \(i\). We know that the failure probability \(P(f_i)\) is \(\delta\) for all \(i \in [4]\). Thanks to the union bound we can bound the probability that we fail at least once: \(P(\cup_{i=1}^4 f_i ) \leq \sum_{i=1}^4 \delta = 4\delta\). It follows that the probability of having 4 successes in a row is lower bounded by \(1-4\delta\).
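For concreteness, with \(\delta = 0.01\) the union bound guarantees success probability at least \(1 - 4\delta = 0.96\). A quick comparison with the exact computation (a sketch, assuming the four failures are independent):

```python
delta = 0.01
k = 4

union_bound = 1 - k * delta  # guaranteed success probability: 1 - 4*delta
exact = (1 - delta) ** k     # exact success probability, assuming independent failures

print(union_bound)           # 0.96
print(exact)                 # ≈ 0.9606
assert union_bound <= exact  # the union bound is the more pessimistic estimate
```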
Note that we could also have bypassed the union bound and computed this quantity analytically: if the failures are independent, the probability of getting 4 successes in a row is \((1-\delta)^4\), which we can expand with the binomial theorem A.2.

Definition C.11 (Exponential Family (Murphy 2012)) A probability density function or probability mass function \(p(v|\nu)\) for \(v = (v_1, \cdots, v_m) \in \mathcal{V}^m\), where \(\mathcal{V} \subseteq \mathbb{R}\), parameterized by \(\nu \in \mathbb{R}^p\), is said to be in the exponential family if it can be written as: \[p(v|\nu) := h(v)\exp \{ o(\nu)^TT(v) - A(\nu) \}\] where:
- \(\nu \in \mathbb{R}^p\) is called the parameter of the family,
- \(o(\nu)\) is a function of \(\nu\) (which often is just the identity function),
- \(T(v)\) is the vector of sufficient statistics: a function that holds all the information the data \(v\) holds with respect to the unknown parameters,
- \(A(\nu)\) is the cumulant generating function, or log-partition function, which acts as a normalization factor,
- \(h(v) > 0\) is a non-informative prior, which de facto acts as a scaling constant.
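For example, the Bernoulli distribution \(p(v|\nu) = \nu^v(1-\nu)^{1-v}\) fits this template with \(h(v)=1\), \(T(v)=v\), \(o(\nu)=\log\frac{\nu}{1-\nu}\), and \(A(\nu)=-\log(1-\nu)\). A small numeric check (our own illustration, not from the text):

```python
import math

def bernoulli_pmf(v, nu):
    """Standard form of the Bernoulli pmf."""
    return nu**v * (1 - nu)**(1 - v)

def bernoulli_expfam(v, nu):
    """Exponential-family form: h(v) * exp(o(nu) * T(v) - A(nu))."""
    h = 1.0
    o = math.log(nu / (1 - nu))  # natural parameter (log-odds)
    T = v                        # sufficient statistic
    A = -math.log(1 - nu)        # log-partition function
    return h * math.exp(o * T - A)

for v in (0, 1):
    for nu in (0.2, 0.5, 0.9):
        assert abs(bernoulli_pmf(v, nu) - bernoulli_expfam(v, nu)) < 1e-12
print("both forms agree")
```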
C.1.0.2 Bias-variance tradeoff
Here is a nice reference to understand the bias-variance tradeoff.
C.1.1 Boosting probabilities with “median lemma” (or powering lemma)
In this section we discuss the following widely known result in computer science. It is used not only in the analysis of algorithms, but also in complexity theory.
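The idea behind the lemma: if an algorithm returns a good answer with probability strictly greater than \(1/2\), running it \(k\) times independently and taking the median of the outputs drives the failure probability down exponentially in \(k\). A hedged simulation (the toy estimator and its success probability \(p=0.7\) are our own choices):

```python
import random
from statistics import median

random.seed(0)

TRUE_ANSWER = 42.0

def unreliable_estimate():
    """Toy estimator: returns the true answer with probability 0.7,
    and a badly wrong value otherwise."""
    return TRUE_ANSWER if random.random() < 0.7 else random.uniform(0, 1000)

def boosted_estimate(k=15):
    """Powering lemma: take the median of k independent runs."""
    return median(unreliable_estimate() for _ in range(k))

trials = 2000
single_ok = sum(unreliable_estimate() == TRUE_ANSWER for _ in range(trials)) / trials
boosted_ok = sum(boosted_estimate() == TRUE_ANSWER for _ in range(trials)) / trials
print(single_ok, boosted_ok)
assert boosted_ok > single_ok  # the median of repeated runs succeeds far more often
```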
C.2 Distributions
This is a beautiful guide that shows you how to draw samples from a probability distribution.
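One standard recipe from such guides is inverse transform sampling: draw \(u \sim \text{Uniform}(0,1)\) and return \(F^{-1}(u)\), where \(F\) is the target CDF. A sketch for the exponential distribution, whose inverse CDF is \(-\ln(1-u)/\lambda\) (our own example):

```python
import math
import random

random.seed(1)

def sample_exponential(lam):
    """Inverse transform sampling for Exp(lam):
    F(x) = 1 - exp(-lam*x), so F^{-1}(u) = -ln(1-u)/lam."""
    u = random.random()
    return -math.log(1 - u) / lam

lam = 2.0
n = 100_000
samples = [sample_exponential(lam) for _ in range(n)]
mean = sum(samples) / n
print(mean)  # should be close to 1/lam = 0.5
assert abs(mean - 1 / lam) < 0.02
```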
C.3 Concentration inequalities
C.3.1 Markov inequality
The Markov inequality gives an upper bound on the probability that a non-negative function of a random variable is greater than or equal to a positive constant. Especially in analysis, people refer to it as Chebyshev’s inequality (sometimes calling it the first Chebyshev inequality, while referring to the “usual” Chebyshev’s inequality as the second Chebyshev inequality or the Bienaymé inequality).
Theorem C.2 (Markov inequality) For any non-negative random variable \(X\) and \(a > 0\), we have that:
- \(Pr(X \geq a) \leq \frac{E[X]}{a}\)
- \(Pr(X \geq aE[X]) \leq \frac{1}{a}\)
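Before the proof, a fully deterministic sanity check on a small discrete distribution (our own toy example, computed with exact rationals):

```python
from fractions import Fraction

# toy non-negative random variable: value -> probability
dist = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}
assert sum(dist.values()) == 1

EX = sum(x * p for x, p in dist.items())  # E[X] = 1/4 + 1 = 5/4

def tail(a):
    """P(X >= a)."""
    return sum(p for x, p in dist.items() if x >= a)

for a in (1, 2, 3, 4):
    assert tail(a) <= EX / Fraction(a)    # Markov: P(X >= a) <= E[X]/a
print("Markov inequality holds on this example")
```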
Proof. Observe that: \[E[X] = P(X < a) \cdot E[X|X<a] + P(X \geq a) \cdot E[X|X \geq a]\] As both of these conditional expectations are non-negative (using the non-negativity hypothesis), we have that \[E[X] \geq P(X \geq a) \cdot E[X|X \geq a] \]
Now it is easy to observe that \(E[X|X \geq a]\) is at least \(a\), and by rearranging we obtain that: \[ \frac{E[X]}{a} \geq P(X \geq a) \]
The second statement of the theorem follows from substitution, i.e. replacing \(a\) with \(aE[X]\) in the previous statement.

C.3.2 Chebyshev inequality
This inequality tells us that the probability of finding the random variable \(X\) at least \(\epsilon\) away from its mean \(E[X]\) is bounded by the variance of \(X\) divided by \(\epsilon^2\).
Theorem C.4 (Chebyshev inequality) Let \(X\) be a random variable with finite mean and finite variance \(\sigma^2\), and let \(\epsilon > 0\). Then: \[Pr[|X - E[X]| \geq \epsilon]\leq \frac{\sigma^2}{\epsilon^2}\]
Moreover, if \(k = \epsilon /\sigma\) we can replace \(\epsilon\) with \(k\sigma\) and obtain:
\[Pr[|X - E[X]| \geq k\sigma] \leq \frac{1}{k^2}\]
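A deterministic check on a fair die (\(E[X]=7/2\), \(\sigma^2 = 35/12\)), our own example with exact rationals:

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)

EX = sum(v * p for v in values)               # E[X] = 7/2
var = sum((v - EX) ** 2 * p for v in values)  # Var[X] = 35/12

def dev_prob(eps):
    """P(|X - E[X]| >= eps)."""
    return sum(p for v in values if abs(v - EX) >= eps)

for eps in (1, 2, 3):
    assert dev_prob(eps) <= var / Fraction(eps) ** 2  # Chebyshev's bound
print(EX, var)  # 7/2 35/12
```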
It is very useful to see what happens when we define a new random variable \(Y\) as the sample mean of \(n\) i.i.d. (independent and identically distributed) random variables \(X_1 \dots X_n\), each distributed as \(X\): \(Y= \frac{1}{n}\sum_{i=1}^n X_i\). The expected value of \(Y\) is the same as the expected value of \(X\), but the variance is now linearly smaller in the number of samples:
\[E[Y] = \frac{1}{n} \sum_{i=1}^n E[X_i] = E[X]\] \[\text{Var}[Y] = \frac{1}{n^2} \sum_{i=1}^n \text{Var}[X_i] = \frac{\text{Var}[X]}{n}\]
This allows us to obtain the following bound:
Theorem C.5 (Chebyshev inequality for sample mean) Let \(Y= \frac{1}{n}\sum_i^n X_i\). Then,
\[Pr[|Y - E[Y]| \geq \epsilon]\leq \frac{\sigma^2}{n\epsilon^2}\]

C.3.3 (Weak) Law of large numbers
Theorem C.6 ((Weak) Law of large numbers) Let \(X_1, X_2, \dots, X_n\) be i.i.d. random variables with finite expected value \(E[X_i]=\mu < \infty\), and let \(\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i\). Then, for any \(\epsilon > 0\), we have that:
\[\lim_{n\to +\infty} P\left( |\overline{X} - \mu | \geq \epsilon \right) = 0\]
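A quick simulation with fair coin flips (our own illustration): the fraction of trials in which the sample mean deviates from \(\mu = 0.5\) by at least \(\epsilon\) shrinks as \(n\) grows.

```python
import random

random.seed(2)

def deviation_rate(n, eps=0.05, trials=500):
    """Fraction of trials with |sample mean - 0.5| >= eps over n fair coin flips."""
    bad = 0
    for _ in range(trials):
        mean = sum(random.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) >= eps:
            bad += 1
    return bad / trials

rate_small_n, rate_large_n = deviation_rate(10), deviation_rate(1000)
print(rate_small_n, rate_large_n)
assert rate_small_n > rate_large_n  # the deviation probability vanishes as n grows
```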
C.3.4 Chernoff bound
Theorem C.7 (Chernoff bound) Let \(X=\sum_i^n X_i\) where \(X_i =1\) with probability \(p_i\) and \(X_i=0\) with probability \((1-p_i)\), and all \(X_i\) are independent. Let \(\mu=E[X] = \sum_i^n p_i\). Then:
- Upper tail: \(P(X \geq (1+\delta)\mu) \leq e^{-\frac{\delta^2}{2+\delta}\mu}\) for all \(\delta > 0\)
- Lower tail: \(P(X \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}\) for all \(0 \leq \delta \leq 1\)
Theorem C.8 (Chernoff bound) Suppose \(X_1, \dots, X_t\) are independent random variables taking values in \(\{0,1\}\). Let \(M_t= (X_1 + \dots + X_t)/t\) denote their average value, and let \(\mu = E[M_t]\). Then for any \(0 < \epsilon < 1\),
- (Multiplicative) \(Pr[M_t - \mu \leq -\epsilon \mu] \leq e^{-\frac{t\mu\epsilon^2}{2}}\) and \(Pr[M_t - \mu \geq \epsilon \mu] \leq e^{-\frac{t\mu\epsilon^2}{3}}\)
- (Additive) \(Pr[M_t - \mu \leq -\epsilon ] \leq e^{-2t\epsilon^2}\) and \(Pr[M_t - \mu \geq \epsilon ] \leq e^{-2t\epsilon^2}\)
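A simulation comparing the empirical two-sided tail of \(M_t\) against the additive Chernoff bound \(2e^{-2t\epsilon^2}\) (our own setup: fair coins, \(t=100\), \(\epsilon = 0.1\)):

```python
import math
import random

random.seed(3)

t, eps, mu = 100, 0.1, 0.5
trials = 10_000

def avg_of_coins():
    """Average of t independent fair coin flips."""
    return sum(random.random() < mu for _ in range(t)) / t

empirical = sum(abs(avg_of_coins() - mu) >= eps for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * t * eps**2)  # two-sided additive Chernoff bound

print(empirical, bound)
assert empirical <= bound              # the bound holds (and is not tight here)
```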