Chapter 4 Classical machine learning
Machine learning, also called narrow artificial intelligence, has been defined as ``the study of computer algorithms that allow computer programs to automatically improve through experience (Mitchell et al. 1997). Machine learning is often divided into supervised and unsupervised methods. We use supervised learning when the dataset is supervised, i.e. when the dataset consist of pairs of input objects (usually vectors) and a desired output value (called the supervised signal), which can be a label or a number. In case the output is a label the supervised problem is said to be called classification, and we call regression the other case. Supervised learning can be thought of as the task of learning a mapping or a function form pairs of input and output. When the dataset in unsupervised, the problem is called clustering, and consist in finding hidden structure of the process that has generated the dataset. Computationally, much of the machine learning algorithms can be described by operations on vectors and matrices. For instance, many machine learning algorithms are reduced to computing the eigenvectors of matrices obtained from the data. In the last 15 years machine learning has been applied in all the sectors of information technology. In this chapter, we review and introduce some classical machine learning. Special emphasis is put on formalizing the connection between the machine learning problems and their linear-algebraic formulation.
The dataset that we manipulate in this work are represented by a matrix \(V \in \mathbb{R}^{n \times d}\), i.e. each row can be thought as a vector \(v_i \in \mathbb{R}^{d}\) for \(i \in [n]\) that represents a single data point. We denote as \(V_k\) the optimal rank \(k\) approximation of \(V\), that is \(V_{k} = \sum_{i=0}^k \sigma_i u_i v_i^T\), where \(u_i, v_i\) are the row and column singular vectors respectively and the sum is over the largest \(k\) singular values \(\sigma_{i}\). We denote as \(V_{\geq \tau}\) the matrix \(\sum_{i=0}^\ell \sigma_i u_i v_i^T\) where \(\sigma_\ell\) is the smallest singular value which is greater than \(\tau\). For a matrix \(M\) and a vector \(x\), we define as \(M^{+}_{\leq \theta, \delta}M_{\leq \theta, \delta} x\) the projection of \(x\) onto the space spanned by the singular vectors of \(M\) whose corresponding singular values are smaller than \(\theta\), and some subset of singular vectors whose corresponding singular values are in the interval \([\theta, (1+\delta)\theta]\).
4.1 Supervised learning
Supervised (or predictive) machine learning is the part of machine learning that deals with supervised datasets, i.e. data where each sample \(x_i\) comes along with supervised information, i.e. a piece of data \(y_i\). It helps the intuition thinking that the supervised information comes from a stochastic process that maps vectors \(x_i\) to vectors \(y_i\). The goal is to model the mapping on the whole input space \(\mathcal{X}\) to the output space \(\mathcal{Y}\) given a set of input-output pairs \(D=\{(x_i, y_i) \}_{i=0}^n\). Usually, the input space is a subset of \(\mathbb{R}^d\), and the output space is usually either \(\mathbb{R}\) or a finite set \(K\) of small cardinality. It is practical, for the sake of exposition to consider the training set organized into a matrix \(X \in \mathbb{R}^{n \times d}\) and the matrix \(Y \in R^n\) or \(Y \in [K]^n\). The components of a vector \(x_i\), i.e. a row of \(X\) are called . The matrix \(X\) is called , or simply the dataset. The vector \(y_i\) is called the . If the response variable is categorical (or nominal), the problem is known as classification, or pattern recognition. If the response variable is real-valued we interpret this problem as learning a function \(f : \mathbb{R}^d \mapsto \mathbb{R}\) and we call this problem regression. Different assumptions on the structure of \(f\) lead to different machine learning models. Each model can be trained (or fitted) with different algorithms.
4.2 Unsupervised learning
Unsupervised learning (Murphy 2012), which sometimes goes under the name of knowledge discovery, is the part of machine learning that deals with understanding unlabeled data. In the dataset \(D=\{x_i\}_{i=0}^n\) we don’t have anymore any supervised information. In this case, it is common to understand the structure of the process generating the samples by formalizing a problem: we want to learn the parameters \(\theta\) of a function \(p(x_i|\theta)\) that models the distribution of the process that has generated the samples. The importance of unsupervised learning lies in the stunning similarity with human and animal learning. Furthermore, most of the dataset that we have are unsupervised, as it is costly to provide supervised information from experts or humans. The most common example of unsupervised learning is clustering, where we want to partition into groups a given dataset. As an example, imagine having a set comprising of images of cats and dogs, without knowing which image is a cat or which is a dog. An unsupervised learning algorithm is supposed to learn how to split the dataset correctly, by understanding the characteristics and features that allows discriminating between images of different kinds. Just to name a few of the more concrete examples, in astronomy, clustering is often used to discover new kinds of stars, in biology, it is used to find new kinds of cells, in cybersecurity, to perform anomaly detection, and so on.
We refer to the number of clusters in the dataset with a letter \(K\). The first goal in clustering is to understand the right number of different groups in the data (which might not be known a-priori). The second goal is to estimate which cluster each point \(x_i\) belongs to. We define \(z_i\) for \(z_i \in [K]\) as the cluster to which point \(x_i\) is assigned to. The value of \(z_i\) is often called or . Unsupervised learning can be seen as the task of guessing the value of the hidden variable, by computing \(z_i = \arg\max_k p(z_i = k| x_i, \theta)\). For this, an unsupervised learning algorithm has to model (implicitly or explicitly) the joint probability distribution \(p(x,y)\).
While latent variables have extensive applications, in this thesis we will focus on the case where latent variables are used to represent a discrete latent state (as in clustering).
4.3 Generative and discriminative models
There is another insightful way of organizing machine learning models. They can either be generative or discriminative. A discriminative model learns just the mapping \(p(y|x)\), and provides a way to classify points (i.e. infer the value of \(y_i\)), without actually knowing ``how’’ the point \(x_i\) has been generated. Examples of such models are: k-nearest neighbors, Logistic regression, Support Vector Machines, Decision Trees, Random Forest, Neural Networks, and so on. Examples of such models are the QFDC, QSFA in Chapter 9.2.1. On the other way, generative models output a model for the joint probability distribution \(p(x,y)\). This is similar to the unsupervised learning case, but in this cases the dependence on \(y\) (which can be a hidden variable) is made explicit. In general, discriminative models make fewer assumptions, as generative models often need to do some assumption on the structure of \(p(x)\). Such generative models are one of the most promising approaches to unsupervised problems. The goal of a generative model is to learn a probability distribution that is most likely to have generated the data collected in a training set. Fitting the model consists of learning the parameters of a probability distribution \(p\) in a certain parameterized family, that best describes our vectors \(x_i, y_i\). In case the data is unsupervised, generative models learn the probability distribution \(p(x_i)\) assuming the existence of some hidden variables \(z_i\). Examples of such models in this thesis are q-means and GMM, i.e. chapter @ref{chap-q-means} and @ref{chap-qem}. A possible way to fit a generative model is to formulate the problem of finding the parameters of the family of distribution as an optimization problem. This is often done using the so-called maximum likelihood estimation (MLE). One can think of the as the function that we use to measure how good a model is for explaining a given dataset. For a given machine learning model with parameters \(\gamma\), the likelihood of our data set \(X\) is the probability that the data have been generated by the model with parameters \(\gamma\), assuming each point is independent and identically distributed. We think of the likelihood as a function of \(\gamma\), holding the dataset \(X\) fixed. For \(p(x_i|\gamma)\) the probability that a point \(x_i\) comes from model \(\gamma\), the likelihood is defined as:
\[\begin{equation}\label{eq:likelihood} L(\gamma;X) := \prod_{i =1}^n p(x_i | \gamma) \end{equation}\]
From this formula, we can see that in order to find the best parameters \(\gamma^*\) of our model we need to solve an optimization problem. For numerical and analytical reasons, instead of maximizing the likelihood \(L\), it is common practice to find the best model by maximizing the log-likelihood function \(\ell(\gamma;X) = \log L(\gamma;X) = \sum_{i=1}^n \log p(x_i|\gamma).\) In this context, we want to find the model that maximizes the log-likelihood: \[\begin{equation} \gamma^*_{ML} := \arg\max_{\gamma} \sum_{i=1}^n \log p(x_i|\gamma). \end{equation}\]
The procedure to calculate the log-likelihood depends on the specific algorithm used to model the data. A possible solution would be to use a gradient-based optimization algorithm on \(\ell\). It is often the case that, due to the indented landscape of the likelihood function, gradient-based techniques often do not perform well. Therefore, it is common to find other strategies to find do maximum likelihood estimation. One of which is the Expectation-Maximization (EM) algorithm, detailed in chapter 11.
4.4 Dimensionality reduction
Dimensionality reduction (DR), a technique used both in supervised and unsupervised learning, refers to the procedure by which the dimension of the input data is reduced while retaining most of the meaningful information contained therein. It is often a necessary step when trying to solve practical problems in machine learning and there are many techniques for performing it. For instance, it is used to decrease the variance of a model, since it can lead to models with a fewer number of parameters, and it might just reduce the noise in the data. It is also necessary when the runtime of the algorithm has polynomial dependence on the number of features, as it is often the case for nowadays datasets. In the context of big data analysis, by removing features that carry low information (like features that are strictly proportional to other features, or features for which the data contains too little information), it is possible to optimize the storage space. It can be also used for data visualization. Most importantly, supervised algorithms often suffer from the : by allowing large dimensionality of data, the informative power of the data points in the training set decreases, thus leading to a degradation in classification performances. One solution to improve the accuracy would be to increase the number of elements in the training set, but this is not always possible nor desirable, so the common route is to decrease the dimension of the data. Mathematically, the idea of the dimensionality reduction algorithms is to map vectors from a high dimensional space \(\mathcal{X}\) to a low dimensional space \(\mathcal{Y}\), such that the most meaningful information (according to some criteria) is preserved. Of course, understanding which criterion to use is far from trivial.
The choice of the right DR algorithm depends on the nature of the data as well as on the type of algorithm that will be applied after the dimensionality reduction. A very well-known DR algorithm is the Principal Component Analysis (PCA), which projects the data points onto the subspace spanned by the eigenvectors associated to the \(k\) largest eigenvalues of the covariance matrix of the data. In this way, the projection holds ``most of the information’’ of the dataset. It is possible to show (Murphy 2012) that for a subspace of dimension \(k\), this choice of eigenvectors minimizes the reconstruction error, i.e. the distance between the original and the projected vectors. However, PCA is not always the best choice of dimensionality reduction. PCA projects the data into the subspace along which the data has more variance. This does not take into consideration the information that different points might belong to different classes, and there are cases in which PCA can worsen the performance of the classifier. Other methods, like Fisher Linear Discriminant and Slow Feature Analysis take into account the variance of every single class of points. Indeed, FLD projects the data in a subspace trying to maximize the distance between points belonging to different clusters and minimizing the distance between points belonging to the same cluster, thus preserving or increasing the accuracy.
4.5 Generalized eigenvalue problems in machine learning
Here we review the connection between the so-called Generalized Eigenvalue Problem (GEP) and some models in machine learning. In classical literature, this is a well-known subject (Ghojogh, Karray, and Crowley 2019), (De Bie, Cristianini, and Rosipal 2005), (Borga, Landelius, and Knutsson 1997).
Definition 4.1 (Generalized Eigenvalue Problem) Generalized Eigenvalue Problem] Let \(A,B \in \mathbb{R}^{d \times d}\) be two SPD matrices. The GEP is defined as: \[AW = BW\Lambda\] The columns \(w_i \in \mathbb{R}^d\) of \(W\) and the values \(\lambda_i = \Lambda_{ii} \in \mathbb{R}\) of the diagonal matrix \(\Lambda\) are the so-called generalized eigenvector and eigenvalues.
The generalized eigenvalue problem is denoted by \((A,B)\) (note that the order in the pair matters). As is evident, the canonical eigenvalue problem is a special case of the GEP where \(B=I\). In this work, we will often consider the case when matrices \(A\) and \(B\) consist of expectation values from stochastic processes, that is, these are covariance matrices of some sort. Furthermore, while \(A\) can be symmetric semi-positive definite, we require \(B\) to be invertible, and thus symmetric positive definite. The GEP is related to the so-called : a variational extremum problem related to the ratio of two quadratic forms involving matrix \(A\) and \(B\):
\[\begin{equation} \rho(w):=\frac{w^TAw}{w^TBw} \tag{4.1} \end{equation}\]
There many different optimization problems that can be reduced to a GEP, which we report here for completeness (De Bie, Cristianini, and Rosipal 2005), (Ghojogh, Karray, and Crowley 2019). One can see that the norm of \(w\) does not change the value of the optimization problem. Therefore, we can impose an additional constraint on \(w\). In this way, we can reformulate the problem as a constrained optimization problem, without losing any solution. This constraint is \(w^TBw=1\). We will describe the relation between Equation (4.1) and Equation in definition 4.1. (To appear)
4.6 How to evaluate a classifier
Practitioners in quantum machine learning should not only build their skills in quantum algorithms, and having some basic notions of statistics and data science won’t hurt. In the following the see some ways to evaluate a classifier. What does it means in practice? Imagine you have a medical test that is able to tell if a patient is sick or not. You might want to consider the behavior of your classier with respect to the following parameters: the cost of identifying a sick patient as healthy is high, and the cost of identifying a healthy patient as sick. For example, if the patient is a zombie and it contaminates all the rest of the humanity you want to minimize the occurrences of the first case, while if the cure for “zombiness” is lethal for a human patient, you want to minimize the occurrences of the second case. With P and N we count the number of patients tested Positively or Negatively. This is formalized in the following definitions, which consists in statistics to be calculated on the test set of a data analysis.
TP True positives (statistical power) : are those labeled as sick that are actually sick.
FP False positives (type I error): are those labeled as sick but that actually are healthy
FN False negatives (type II error) : are those labeled as healthy but that are actually sick.
TN True negative: are those labeled as healthy that are healthy.
Given this simple intuition, we can take a binary classifier and imagine to do an experiment over a data set. Then we can measure:
True Positive Rate (TPR) = Recall = Sensitivity: is the ratio of correctly identified elements among all the elements identified as sick. It answer the question: “how are we good at detecting sick people?”. \[\frac{ TP }{ TP + FN} + \frac{TP }{P} \simeq P(test=1|sick=1)\] This is an estimator of the probability of a positive test given a sick individual.
True Negative Rate (TNR) = Specificity is a measure that tells you how many are labeled as healthy but that are actually sick. \[\frac{ TN }{ TN + FP} = p(test = 0 | sick =0)\] How many healthy patients will test negatively to the test? How are we good at avoiding false alarms?
False Positive Rate = Fallout \[FPR = \frac{ FP }{ FP + TN } = 1 - TNR\]
False Negative Rate = Miss Rate \[FNR = \frac{ FN }{ FN + TP } = 1 - TPR\]
Precision, Positive Predictive Value (PPV): \[\frac{ TP }{ TP + FP} \simeq p(sick=1 | positive=1)\] How many positive to the test are actually sick?
\(F_1\) score is a more compressed index of performance which is a possible measure of performance of a binary classifier. Is simply the harmonic mean of Precision and Sensitivity: \[F_1 = 2\frac{Precision \times Sensitivity }{Precision + Sensitivity }\]
Receiver Operating Characteristic (ROC) Evaluate the TRP and FPR at all the scores returned by a classifier by changing a parameter. It is a plot of the true positive rate against the false positive rate for the different possible value (cutpoints) of a test or experiment.
The confusion matrix generalize these 4 combination of (TP TN FP FN) to multiple classes: is a \(l \times l\) where at row \(i\) and column \(j\) you have the number of elements from the class\(i\) that have been classified as elements of class \(j\).
Other links: here