# Chapter 6 Quantum algorithms for Monte Carlo

This notes are based on this paper (Montanaro 2015). Previous works (upon which (Montanaro 2015) is based) on the same topic can be found in (Brassard et al. 2011), (Wocjan et al. 2009) and (Heinrich 2002).

The main goal of Monte Carlo methods is to estimate the expected value \(\mu\) of a randomized algorithm \(A\). The idea is that we have a random variable \(X\) taking values in a subset \(\chi\) of \(\mathbb{R}^d\), for some dimension \(d\). The random variable \(X\) has probability density \(f\) and typically there is a function \(\Phi: \chi \to \mathbb{R}_+\). Monte Carlo methods want to approximate: \[\begin{equation} \mu = \mathbb{E}[\Phi(X)] = \int_{\chi}\Phi(x)f(x)dx \end{equation}\]

They do so by producing an appropriate number \(N\) of samples, each corresponding to an independent execution of \(A\). The sample mean \(\widetilde{\mu}\) is then used as an approximation of \(\mu\). In general, for some random variable \(X_i\), for \(i=1,\dots,N\) the sample mean is defined as (Kennedy 2016): \[\begin{equation} \widetilde{\mu}_N= \frac{1}{N}\sum_{i=1}^N X_i\, . \end{equation}\] The \(\widetilde{\mu}_N\) is a linear combination of the random variables \(X_i\), and so it is itself a random variable. In the language of statistics, the \(\widetilde{\mu}_N\) is called an estimator. Note that:

- we do not know how many samples \(N\) we need to get a good estimate of \(\mu\);
- even worse, we still don’t know whether this procedure leads to a good estimate of \(\mu\);
- if the estimate is good for some \(N\) we don’t know how much good it is.

Two fundamental theorems in probability theory help the way out. The law of large numbers ensures that the sample mean \(\widetilde{\mu}_N\) is a good approximation of \(\mu\) for large enough \(N\), while the central limit theorem precisely states how close is \(\widetilde{\mu}_N\) to \(\mu\) for a given value of \(N\).

**Law of Large Numbers** Let \(X_1,\dots,X_N\) be independent and identically distributed random variables. Assume that their expected value is finite and call it \(\mu\).
Then, as \(N\) tends to infinity:
\[ \widetilde{\mu}_N = \frac{1}{N}\sum_{i=1}^N X_i \to \mu, \, (a.s),\]

which means that: \[ P\left(\lim_{n\to\infty} \frac{1}{N}\sum_{k=1}^N X_k = \mu\right) = 1\, .\] Note that since all the \(X_i\) random variables have expected value \(\mu\) the expected value of the sample mean \(\widetilde{\mu}_N\) is: \[ \mathbb{E}[\widetilde{\mu}_N]=\frac{1}{N}\mathbb{E}[X_1+\dots+X_N] = \frac{1}{N}N\mu = \mu \, . \] When this happens we say that \(\widetilde{\mu}_N\) is an unbiased estimator of \(\mu\). The law of large numbers tells us that \(\widetilde{\mu}_N\) converges to \(\mu\) for \(N\) going to infinity. However in a simulation we can not have \(N\) going to infinity: we must choose a (hopefully large but) finite \(N\). How close is \(\widetilde{\mu}_N\) to \(\mu\) for a given value of \(N\)? The law of large numbers does not reply to that question (Kennedy 2016). To get a first answer, suppose that \(X_i\) have finite variance and call it \(\sigma^2\). The variance of the sample mean is then: \[\begin{equation} var(\widetilde{\mu}_N) = \frac{1}{N^2} var(X_1+\dots+X_N) = \frac{1}{N^2}N\sigma^2 = \frac{\sigma^2}{N}\, . \end{equation}\] This shows that the difference of \(\widetilde{\mu}_N\) from \(\mu\) should be of order \(\sigma/\sqrt{N}\) (Kennedy 2016). The central limit theorem gives a more refined statement on the accuracy of the approximation (Kennedy 2016).

**Central Limit Theorem** Let \(X_i\) be a sequence of independent and identically distributed random variables, such that they have finite mean \(\mu\) and finite variance \(\sigma^2\). The random variable:
\[\begin{equation}
\frac{1}{\sigma\sqrt{N}}\sum_{i=1}^N(X_i-\mu)
\end{equation}\]
converges in distribution to a standard normal random variable. This means that:
\[\begin{equation}
\lim_{n\to\infty} P(a\le \frac{1}{\sigma\sqrt{N}}\sum_{k=1}^N(X_k-\mu) \le b) = \int_{a}^{b}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\, dx \, .
\end{equation}\]
The central limit theorem can be used to construct confidence intervals for our estimate \(\widetilde{\mu}\), indeed the probability to estimate \(\mu\) up to additive error \(\epsilon\) is:
\[\begin{align}
P(-\epsilon \le \widetilde{\mu}-\mu\le \epsilon) & = P\bigg(-\frac{\epsilon\sqrt{N}}{\sigma}\le \frac{(\widetilde{\mu}-\mu)\sqrt{N}}{\sigma}\le \frac{\epsilon\sqrt{N}}{\sigma} \bigg)\\& \approx P(-\frac{\epsilon\sqrt{N}}{\sigma}\le Z \le \frac{\epsilon\sqrt{N}}{\sigma})\, ,
\tag{6.1}
\end{align}\]
where \(Z\) is a standard normal random variable.
We can find values of \(N\) such that the probability to have a good estimate \(\widetilde{\mu}_N\) up to fixed additive error \(\epsilon\) is almost 1. Usual values for this probability are \(95\%\) or \(99\%\). So for example if we want \(P(-z_c\le Z \le z_c) = 0.99\) where \(z_c = \epsilon\sqrt{N}/\sigma\) then \(z_c =\epsilon\sqrt{N}/\sigma= 2.58\) because of the properties of the normal distribution. Estimating \(\mu\) with additive error \(\epsilon\) would require \(N = 2.58\, \sigma^2/\epsilon^2\) samples.

You can feel the power of Monte Carlo!! All of the above does not depend on the dimension of the sample space of the random variables \(X_i\), but just on the number of repetitions \(N\) and on the variance \(\sigma^2\).

This is amazing at first sight but we want to make two remarks. First note we don’t know the value \(\mu\) because we are trying to estimate it. How can we then know the value of \(\sigma^2\), and use it to construct the confidence intervals? To address this problem we use another estimator: the sample variance. It is defined as: \[\begin{equation} s^2 = \frac{1}{N-1}\sum_{k=1}^N(X_i-\widetilde{\mu})^2 \end{equation}\] where the prefactor is chosen in such a way so that \(\mathbb{E}[s^2]=\sigma^2\), i.e. \(s^2\) is an unbiased estimator. One can prove that confidence intervals can be built as above but with \(\sigma\) replaced by \(s\) (Kennedy 2016).

Second we highlight the main flaw of the Monte Carlo procedure. We saw that estimating \(\mu\) up to additive error \(\epsilon\) with \(99\%\) success probability requires \(n = O(\sigma^2/\epsilon^2)\) repetitions, independently on the dimension of the sample space. This is remarkable but not so efficient. It means that if we want to maintain the confidence at \(99\%\) and we want to decrease the additive error \(\epsilon\) by a factor of \(10\) we need to increase the number of iteration by a factor of \(10^2\). Imagine if we want to estimate \(\mu\) up to four digits. We would need to run \(A\) more than \(100\) million times (Montanaro 2015).

## 6.1 Monte Carlo with quantum computing

Here is where quantum computing comes to help. The number of uses of \(A\) can be reduced almost quadratically beyond the classical bound (Montanaro 2015). The result is based on amplitude estimation.

We first show the quadratic advantage for an algorithm \(A\) whose output is bounded between 0 and 1. This speedup will be smartly used to speedup more general classes of algorithms. This section is also meant to give insights on why quantum speedup happens in the first place.

Without loss of generality, we assume that \(A\) is a quantum algorithm operated on an \(n\) qubit quantum register, whose initial state is the \(|0^{\otimes n}\rangle\) state. We assume that \(A\) makes no measurements until the end of the algorithm, when \(k \leq n\) qubits are measured in the computational basis. The outcome of the measurement \(x \in \{0,1\}^k\) is then plugged in a function \(\Phi :\{0,1\}^k \to [0,1]\). We call \(\nu(A)\) the random variable representing the output of \(A\). Our aim is to find an estimate of \(\mathbb{E}[\nu(A)]\). As usual, we also assume to have access to the inverse of the unitary part of the algorithm, which we write as \(A^{-1}\).

The algorithm \(A\), as a classical randomized algorithm, is built in such a way that when we run it k times, we get some random \(\Phi(x_1),\dots, \Phi(x_k)\) with probabilities \(|\alpha_{x_i}|^2\) (for some, in general, complex \(\alpha_{x_i}\)).

That is to say the random variable \(\nu(A)\) is distributed according to the probabilities \(p_x = |\alpha_x|^2\). We will see that with quantum computers, there is a smarter way to estimate \(\mathbb{E}[\nu(A)]\), instead of repeatedly sampling from \(A\)? Using a quantum computer we can assume that \(A\) is now a quantum algorithm (by taking the classical circuit, making it reversible, and then obtaining a quantum circuit for it). Then, we can use various tricks to encode the value of \(\nu(A)\) in the amplitude of a quantum register, and then estimate it using amplitude estimation.

## 6.2 Bounded output

The procedure when the output of \(A\) is bounded between \(0\) and \(1\) is the following. Instead of doing the measurement at the end of the circuit, we add an ancillary qubit and perform the unitary operation \(W\) on \(k+1\) qubits defined as follows: \[\begin{equation} W|x\rangle|0\rangle = |x\rangle(\sqrt{1-\Phi(x)}|0\rangle+\sqrt{\Phi(x)}|1\rangle) \end{equation}\]

This is a controlled operation which in general requires to first encode \(\Phi(x)\) in the \(|x\rangle\) register. This is done by performing \(k\) subsequent controlled rotations for some specific angle on the \(k+1\)th qubit for each of the \(k\) controlling qubits. Then it is easy to rotate back the \(k\) register to \(|x\rangle\) (more on this can be found in (Duan et al. 2020)).

Assuming that we know how to do \(W\), when \(A\) is run with its final measurement replaced with operation \(W\), we would formally end up in a state of the form: \[\begin{equation} |\psi\rangle=(I \otimes W)(A\otimes I)|0^n\rangle|0\rangle=\sum_x \alpha_x|\psi_x\rangle|x\rangle(\sqrt{1-\Phi(x)}|0\rangle+\sqrt{\Phi(x)}|1\rangle)\, . \end{equation}\] Why is this state useful?

When the sum over \(x\) is developed, it turns out that \(|\psi\rangle\) is such that it can be viewed as a superposition of two orthogonal states. A “good” state, \(|\Psi_G\rangle\) which is a superposition of all the states in the expansion whose qubit sequence ends up with a \(1\) (i.e. the ancilla qubit is in state \(|1\rangle\) and a “bad” state \(|\Psi_B\rangle\), which is a superposition of all the other states. Unfortunately there is no ugly state, otherwise we would have been screening a spaghetti western cult. The good and bad states are orthogonal to each other and thus they can be viewed as spanning two orthogonal sub-spaces in the Hilbert space, which without fantasy we’ll call good and bad sub-spaces.

Projecting \(|\psi\rangle\) onto the good subspace is equivalent to find the amplitude of the good state, which for construction is the expected value of \(\nu(A)\) (exaclty what we wanted!!). Indeed calling \(P\) the projector onto the good subspace, which is in our case \(P= I \otimes |1\rangle\langle 1|\), we have that: \[\langle \psi|P|\psi\rangle = \sum_x |\alpha_x|^2\Phi(x) =\mathbb{E}[\nu(A)]\,.\] Notice that the bound on the output of \(A\) between \(0\) and \(1\) and the form of the operation \(W\) leads to a normalized state \(|\psi\rangle\). This means that we can write: \[ |\psi\rangle = \sin{\theta}|\Psi_G\rangle + \cos{\theta}|\Psi_B\rangle\] for some angle \(\theta\). We can restate our problem as follows: we want to estimate the amplitude squared of the good state, i.e. \(\sin^2(\theta)\). Indeed we created a circuit that create a state where the expected value of \(\nu(A)\) is now encoded in that amplitude of the quantum state, more specifically it is encoded in the probability of measuring \(1\) in an ancilla qubit. The latter is estimated with amplitude estimation, see theorem 4.4. This algorithm actually returns an estimate of the angle \(\theta\) for which we can derive an estimate of \(\sin^2(\theta)\), which in turns return the expectation of the random variable \(\mathbb{E}[\nu(A)]\).

Let’s recap how to use this algorithm. Amplitude estimation takes as input a unitary operator on \(n+1\) qubits such that: \[|\psi\rangle = \sqrt{1-a}|\Psi_B\rangle|0\rangle + \sqrt{a}|\Psi_G\rangle|1\rangle\] The final qubit acts as a label to distinguish good states from bad states.

Then it applies for a number of times \(t\) the following sequence of projections: first apply \(V=I-2P\) where \(P\) is some projector and then \(U=2|\psi\rangle\langle\psi|-I\). The form of \(P\) depends in general on the good state (in our case \(P = I\otimes |1\rangle\langle 1|\)). The projector \(U\) could be seen as a reflection through \(|\psi\rangle\) and \(V\) as a reflection through the bad state \(|\Psi_B\rangle\) in a plane spanned by the good and the bad states (De Wolf 2019). The output of amplitude estimation is an estimate \(\widetilde{a}\) of \(a = \langle \psi|P|\psi\rangle = \sin^2(\theta)\).

\(U\) can also be written as \(AS_0A^{-1}\) where \(S_0\) changes the phase from \(+1\) to \(-1\) of all the basis states but the \(|0^n\rangle\) state. This is why before we required to have access to the inverse of \(A\). \(V\) is instead the operation of changing the phase of all the basis states of the good sub-space from \(+1\) to \(-1\). That is to say that the phase of the good state \(|\Psi_G\rangle\) is changed from \(+1\) to \(-1\). This operation is usually done by setting a target qubit to \(1/\sqrt{2}(|0\rangle-|1\rangle)\) as we have: \[U_f |x\rangle(\frac{|0\rangle-|1\rangle}{\sqrt{2}}) = (-1)^{f(x)}|x\rangle(\frac{|0\rangle-|1\rangle}{\sqrt{2}})\] for any function \(f(x)\) with binary values \(0\) or \(1\). The target qubit will always be just one and applying \(U_f\) is an eigenstate. So we can ignore its presence and consider \(U_f\) as acting only on the first register.

We are then ready to state the quantum algorithm to estimate \(\mathbb{E}[\nu(A)]\) when the output of \(A\) is bounded between \(0\) and \(1\).Applying amplitude estimation together with the powering lemma C.1, one can prove the following theorem valid for algorithm 6.1.

**Theorem 6.1 (Quantum Monte Carlo with bounded output)**Let \(|\psi\rangle = A|0^n\rangle\) and set \(U=2|\psi\rangle\langle\psi|-I\). Algorithm in figure 6.1 uses \(O(\log(1/\delta))\) copies of \(|\psi\rangle = A|0^n\rangle\), uses \(U\) \(O(t\log(1/\delta))\) times, and outputs an estimate \(\widetilde{\mu}\) such that \[\begin{equation*} |\widetilde{\mu}-\mathbb{E}[\nu(A)]|\leq C\bigg(\frac{\sqrt{\mathbb{E}[\nu(A)]}}{t}+\frac{1}{t^2}\bigg) \end{equation*}\] with probability at least \(1-\delta\), where \(C\) is a universal constant. In particular for any fixed \(\delta>0\) and any \(\epsilon\) such that \(0\leq \epsilon \leq 1\), to produce an estimate \(\widetilde{\mu}\) such that with probability at least \(1-\delta\), \(|\widetilde{\mu}-\mathbb{E}[\nu(A)]|\leq \epsilon\mathbb{E}[\nu(A)]\) it suffices to take \(t=O(1/(\epsilon\sqrt{\mathbb{E}[\nu(A)]}))\). To achieve \(|\widetilde{\mu}-\mathbb{E}[\nu(A)]|\leq \epsilon\) with probability at least \(1-\delta\) it suffices to take \(t=O(1/\epsilon)\).

We already described the main idea to get quantum speedup: encode the expected value of \(\nu(A)\) in a suitably defined amplitude of the quantum register and then estimate it. The error bound of theorem 6.1 is the error bound of the amplitude estimation algorithm 4.4. The same applies to the success probability which, in this 4.4 formulation of the amplitude estimation theorem, is \(8/\pi^2\). This can be improved to \(1-\delta\) with the powering lemma C.1. That is why, for example, \(U\) is used \(O(t\log(1/\delta))\) times: \(t\) times by amplitude estimation which is repeated \(\log(1/\delta)\) to apply the powering lemma. To get the absolute error to \(\epsilon\) we just need to set \(t=O(1/\epsilon)\), and to obtain a relative error we set \(t=O(1/(\epsilon\sqrt{\mathbb{E}[\nu(A)]}))\).

## 6.3 Bounded \(\ell_2\) norm

The algorithm described by theorem 6.1 is remarkable, but it is limited by the requirement of having an output value bounded between \(0\) and \(1\). We want more. We want a quantum algorithm to compute the expected value of any (as we will see, we will have anyway some restrictions) random variable.

To arrive at this result, the first step is to show quantum speedup for the class of algorithms whose output value is positive and has bounded \(l_2\) norm. This means that there is a bound on \(\|\nu(A)\|_2 = \sqrt{\mathbb{E}[\nu(A)^2]}\). The key idea is the following: given an algorithm \(A\) belonging to this class, we consider its possible output value as the sum of numbers belonging to intervals of the type \([0,1)\), \([1,2)\), \([2,4)\), \([4,8)\) and so on. Notice that besides the first interval (\([0,1]\)), all the others can be compactly written as \([2^{l-1},2^{l}]\) for \(l=1,...,k\). If we divide all of these intervals by \(2^l\) we would get \(k\) intervals bounded between \(0\) and \(1\) (actually \(1/2\) and \(1\)). In this way we can apply algorithm 6.1 for the interval \([0,1]\) and for all these other \(k\) intervals. If we multiply each of the \(k\) estimates we get by \(2^l\) (for \(l=1,...,k\) ) and add them together with the estimate for the \([0,1]\) interval, we would get an estimate for \(\mathbb{E}[\nu(A)]\).

You may have noticed that the index \(l\) goes from \(1\) to \(k\) and wondered what is this \(k\)? If not, we wonder for you. Set for example \(k=2\) and imagine that the expected value of the algorithm is \(16\). Surely we can not obtain \(16\) as a sum of three numbers in the intervals \([0,1]\), \([1,2]\), \([2,4]\). So how must \(k\) be chosen? And even if \(k\) is chosen properly, how we can be sure that what is left in this approximation is negligible? We will see in a moment that assuming a bound on the \(l_2\) norm of \(\nu(A)\) gives us a way to solve this problem elegantly.

Before going on, we introduce a useful notation. Given any algorithm \(A\) we write as \(A_{<x}\), \(A_{x,y}\), \(A_{\geq y}\) the algorithms defined by executing \(A\) to produce a value \(\nu(A)\) and:

- \(A_{<x}\): if \(\nu(A)<x\), output \(\nu(A)\) otherwise output 0;
- \(A_{x,y}\): if \(x\leq\nu(A)<y\), output \(\nu(A)\) otherwise output 0;
- \(A_{\geq y}\): if \(\nu(A)\geq y\), output \(\nu(A)\) otherwise output 0.

Moreover whenever we have an algorithm \(A\) and a function \(f:\mathbb{R}\to\mathbb{R}\), we write as \(f(A)\) the algorithm produced by evaluating \(\nu(A)\) and then computing \(f(\nu(A))\).

We can now state the second quantum algorithm which estimates the mean value of \(\nu(A)\) with bounded \(\ell_2\) norm.

**Theorem 6.2 (Quantum algorithms with bounded ell2 norm)**Let \(|\psi\rangle=A|0^n\rangle\), \(U = |\psi\rangle\langle \psi|-1\). Algorithm in figure 6.2 uses \(O(\log(1/\epsilon)\log\log(1/\epsilon))\) copies of \(|\psi\rangle\), uses \(U\) \(O((1/\epsilon)\log^{3/2}(1/\epsilon)\log\log(1/\epsilon))\) times, and estimates \(\mathbb{E}[\nu(A)]\) up to additive error \(\epsilon(||\nu(A)||+1)^2\) with probability at least \(4/5\).

We will give here a sketch of the proof for this theorem. The curious can check it on (Montanaro 2015). The resource bounds follow from the resource bounds of algorithm 6.1, multiplied by the number of times we use it. Indeed \(\delta =\Omega(1/log(1/\epsilon))\) and we are using algorithm 6.1 \(O(\log(1/\epsilon))\) times. So the overall number of copies of \(|\psi\rangle\) in all of the uses of algorithm 6.1 is \(O(\log(1/\epsilon)\log\log(1/\epsilon))\) and the total number of uses of \(U\) is \(O((1/\epsilon)\log^{3/2}(1/\epsilon)\log\log(1/\epsilon))\).

To estimate the total error (supposing that every use of algorithm in theorem 6.1 succeed) we write: \[\begin{equation} \mathbb{E}[\nu(A)] = \mathbb[\nu(A_{0,1})]+\sum_{l=1}^k 2^l \mathbb{E}[\nu(A_{2^{l-1},2^l})/2^l]+\mathbb{E}[\nu(A_{\geq 2^k})]\, . \end{equation}\]

Now substituting this expression in the one for the error and using the triangle inequality we have that: \[\begin{equation} |\widetilde{\mu}-\mathbb{E}[\nu(A)]|\leq |\widetilde{\mu}_0-\mathbb{E}[\nu(A_{0,1})]|+\sum_{l=1}^k 2^l|\widetilde{\mu}_l-\mathbb{E}[\nu(A_{2^{l-1},2^l})/2^l]|+\mathbb{E}[\nu(A_{\geq 2^k})]\, . \end{equation}\]

You can see that the last term is a term that we did not estimate with 6.1. This is exactly the leftover we were talking about some lines above. What the last equation says is that when choosing \(k\) as in algorithm 6.2, this term is bounded by the \(l_2\) norm squared of \(\nu(A)\). Indeed denoting with \(p(x)\) the probability that \(A\) outputs \(x\) we have that: \[\begin{equation} \mathbb{E}[\nu(A_{\geq 2^k})]= \sum_{x\geq 2^k}p(x)x \leq \frac{1}{2^k}\sum_x p(x)x^2 = \frac{\|\nu(A)\|^2}{2^k}, . \end{equation}\]

Getting the error bound of theorem 6.2 is not really interesting from the conceptual point of view. It is just a matter of math and the usage of previous results. Indeed, using the error bound obtained for algorithm in theorem 6.1 for \(|\widetilde{\mu}_0-\mathbb{E}[\nu(A_{0,1})]|\) and all the \(|\widetilde{\mu}_l-\mathbb{E}[\nu(A_{2^{l-1},2^l})/2^l]|\), Cauchy-Schwartz, and other techqnieus ( refer to (Montanaro 2015) for the full calculations), one could finally write that: \[\begin{equation} |\widetilde{\mu}-\mathbb{E}[\nu(A)]|= \epsilon\bigg(\frac{C}{D}\bigg(1+\frac{5}{D}+2||\nu(A)||_2\bigg) +||\nu(A)||^2_2\bigg) \end{equation}\] which for a sufficiently large constant \(D\) is upper bounded by \(\epsilon(||\nu(A)||_2+1)^2\). Here \(C\) another constant that comes from the error bound of algorithm of theorem 6.1, wich in turns comes from the constant factors of amplitude estimation.

Finally we comment on the success probability: it is at least \(4/5\), when one uses the union bound, i.e. theorem C.1 on the success probability of all the uses of theorem 6.1. Note that we can exploit the powering lemma to increase this success probability up to \(1-\delta\) for any \(\delta\), by repeating this algorithm \(O(\log(1/\delta)\) times and taking the median.

## 6.4 Bounded variance

We are ready to demonstrate quantum speedup for the most general case: algorithms whose random output have bounded variance. In this case one obtains quantum speedup combining the classical Chebyeshev inequality and the use of quantum speedup of algorithm 6.2. For clarity, we recall that the bound on the variance of \(\nu(A)\) means that \(Var(\nu(A))= \mathbb{E}[\left(\nu(A)-\mathbb{E}[\nu(A)]\right)^2]\leq \sigma^2\).

To get to our desired result, we will also need to scale and shift the random variables of interest. Starting with a random variable \(X\) distributed according to some distribution with mean \(\mu_x\) and standard deviation \(\sigma_x\), whenever we scale the random variable by a factor \(\lambda\), we obtain a random variable \(Z = \lambda X\) with a new distribution with mean \(\mu_z=\lambda\mu_x\) and standard deviation \(\sigma_z= \lambda\sigma_x\). When we shift the random variable by a scalar \(k\), we obtain a random variable \(Z= X+k\) with a new distribution with mean \(\mu_z=\mu_x+k\) and standard deviation \(\sigma_z = \sigma_x\).

The first step of our algorithm is to run the algorithm \(A'=A/\sigma\) obtained by scaling \(A\) with \(\lambda = 1/\sigma\). For what we said above \(\nu(A')\) will be a random variable with mean and standard deviation bounded by \(\mu' \leq \mu/\sigma\) and \(\sigma' \leq 1\) respectively. Having a standard deviation of order unity means that the output \(\widetilde{m}\) of a run of algorithm \(A'\) is with high probability pretty close to the actual mean \(\mu'\). This is because of Chebyeshev inequality, i.e. theorem C.4:

\[\begin{equation} P(|\nu(A')-\mu'|\geq 3) \leq \frac{1}{9}\, . \end{equation}\] Therefore we can assume with high confidence that \(|\widetilde{m}-\mu'|\leq 3\). The second step is to consider algorithm \(B\), which is produced by executing \(A'\) and shift it by subtracting \(\widetilde{m}\). The random variable \(\nu(B)\) has a bound on the \(\ell_2\) norm. Indeed: \[\begin{equation} \begin{split} ||\nu(B)||_2 &= \mathbb{E}[\nu(B)^2]^{1/2}=\mathbb{E}[((\nu(A')-\mu')+(\mu'-\widetilde{m}))^2]^{1/2}\\ &\leq \mathbb{E}[(\nu(A')-\mu')^2]^{1/2}+\mathbb{E}[(\mu'-\widetilde{m})^2]^{1/2} = \sigma'+\mathbb{E}[(\mu'-\widetilde{m})^2]^{1/2}\leq 4 \\ \end{split} \end{equation}\] This bound can also be stated as \(||\nu(B)/4||_2 \leq 1\).

This means that the expected value of \(\nu(B)/4\) could be estimated with algorithm 6.2. Actually algorithm 6.2 also requires a positive \(\nu(B)\geq 0\). We will estimate \(\mathbb{E}[\nu(B_{\geq 0})/4]\) and \(\mathbb{E}[\nu(-B_{< 0})/4]\) using algorithm 6.2 with accuracy \(\epsilon/(32\sigma)\) and failure probability \(1/9\). Notice that both \(\nu(B_{\geq 0})/4\) and \(\nu(-B_{\leq 0})/4\) are positive and the bound \(||\nu(B)/4||_2\leq 1\) implies \(||\nu(B_{<0})/4||_2\leq 1\) and \(||\nu(B_{\geq 0})/4||_2\leq 1\).

**Theorem 6.3 (Quantum Monte Carlo with bounded variance - additive error)**Let \(|\psi\rangle = A|0^n\rangle\), \(U=2|\psi\rangle\langle\psi|-I\). Algorithm 6.3 uses \(O(\log(\sigma/\epsilon)\log\log(\sigma/\epsilon))\) copies of \(|\psi\rangle\), uses \(U\) \(O((\sigma/\epsilon)\log^{3/2}(\sigma/\epsilon)\log\log(\sigma/\epsilon))\) times and estimates \(\mathbb{E}[\nu(A)]\) up to additive error \(\epsilon\) with success probability at least \(2/3\).

The additive error for this theorem is \(\epsilon\) because we required accuracy \(\epsilon/32\sigma\) for both uses of algorithm 6.2. Indeed, this implies that both the estimates we would get are accurate up to \((\epsilon/32\sigma)(||\nu(B_{\geq 0}/4)||_2+1)^2 \leq (\epsilon/32\sigma)(1+1)^2 = \epsilon/8\sigma\). Now we just multiply by a \(4\) factor the estimates of \(\mathbb{E}[\nu(B_{\geq 0})/4]\) and \(\mathbb{E}[\nu(B_{< 0})/4]\) to get the estimates of \(\mathbb{E}[\nu(B_{\geq 0})]\) and \(\mathbb{E}[\nu(B_{< 0})]\). The error then gets \(4\epsilon/8\sigma =\epsilon/2\sigma\). Combining these two errors, one has a total additive error for the estimate of \(A'\) given by \(\epsilon/\sigma\). Since \(A=\sigma A'\) the error for \(A\) is exactly \(\epsilon = \sigma (\epsilon/\sigma)\).

The success probability is at least \(2/3\) when using the union bound (theorem C.1 ) on the success probability of the uses of algorithm 6.2 and the success probability for \(\widetilde{m}\) to be close to \(\mu'\). Also in this case we can exploit the powering lemma to increase this probability up to \(1-\delta\) for any \(\delta\) by repeating this algorithm \(O(\log(1/\delta)\) times and take the median.