Here we cover the most important deep learning concepts, aiming at a practical understanding of each concept rather than its mathematical details.
Supervised and Unsupervised Learning
In general, a learning problem uses a set of n data samples to predict properties of unknown data. Data are usually organized in tables where rows (first axis) represent the samples (or instances) and columns represent attributes (or features). For supervised learning, an additional array of classes or target variables is provided.
We can separate learning problems into a few large categories:
In SUPERVISED LEARNING, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features.
- We have a CLASSIFICATION task when the target variable is nominal (discrete) - examples:
- predicting the species of iris given a set of measurements of its flower
- given a multicolor image of an object through a telescope, determine whether that object is a star, a quasar, or a galaxy.
- We have a REGRESSION task when the target variable is continuous - examples:
- given a set of attributes, determine the selling price of a house
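A minimal supervised-learning sketch with scikit-learn (assumed available): a classifier on the iris dataset from the example above, and a regressor on made-up house-price-like data (the feature, target, and coefficients below are hypothetical):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Classification: predict the iris species (nominal target) from flower measurements
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Regression: predict a continuous target (e.g. a selling price) from an attribute;
# the "area" feature and the price relation are invented for illustration
rng = np.random.default_rng(0)
area = rng.uniform(50, 200, size=(100, 1))
price = 1000 * area[:, 0] + rng.normal(0, 5000, 100)
reg = LinearRegression().fit(area, price)
```

The only difference between the two tasks is the nature of the target: discrete labels for the classifier, a continuous value for the regressor.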
In UNSUPERVISED LEARNING the data has no labels, and we are interested in finding similarities between the samples.
Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. Some unsupervised learning problems are:
CLUSTERING is the task of grouping similar items together - examples:
- given observations of distant galaxies, determine which features are most important in distinguishing between them.
DENSITY ESTIMATION is the task where we want to find statistical quantities that describe the data
DIMENSIONALITY REDUCTION reduces the number of features while keeping most of the information
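A minimal unsupervised-learning sketch with scikit-learn (assumed available), covering two of the tasks above: clustering with k-means and dimensionality reduction with PCA on synthetic, unlabeled data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Unlabeled 2-D data drawn from 3 blobs (the labels are discarded)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Clustering: group similar samples together without any target variable
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: keep the single direction with the most variance
X_reduced = PCA(n_components=1).fit_transform(X)
```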
UNSUPERVISED / SUPERVISED LEARNING: in DL the two approaches are usually combined. In fact, the DL layers (Restricted Boltzmann Machines, Autoencoders, Convolutional Neural Networks) are used to learn the most significant features of the data. Those features are then used with standard ML regressors or classifiers.
Classification
Classification is the problem of identifying which of a set of categories a new observation belongs to, on the basis of a training set of observations whose category membership is known. It is important to notice that in classification the target variable (the category) is nominal (discrete). A subset of the classification algorithms is dedicated to outlier detection.
- Some classification examples:
- predicting the product characteristics or class from process measurements
- find groups of genes with related expression patterns
- Image classification: given a multispectral image of a crop, determine the vegetation and the state of health.
- Fault detection systems
Regression
When the target variable (the variable to be estimated) is continuous (i.e. it can be represented by any rational number in a given range) we are dealing with a regression problem. Regression analysis is widely used for prediction and forecasting. Regression can be performed with a wide range of algorithms: the simplest are linear models (general linear models), but more difficult problems require nonlinear approaches such as those made available by neural networks.
Particularly interesting for some engineering applications are the Locally Weighted Projection Regression (LWPR) algorithms: LWPR is a nonparametric learning algorithm designed to achieve nonlinear function approximation in high-dimensional spaces with redundant and irrelevant input dimensions. What is important is that it uses locally linear models, spanned by a small number of univariate regressions in selected directions in input space. LWPR supports incremental training, adjusts its weighting kernels based on local information only, and has a computational complexity that is linear in the number of inputs.
Support Vector Machines (SVM)
SVMs can be used both for classification and regression of linear and nonlinear systems. An SVM separates the given examples (the data provided to train the system) by a clear gap that is as wide as possible. SVMs are usually effective in high-dimensional spaces, including when the number of dimensions is greater (but not much greater) than the number of samples. Moreover, since SVMs use a subset of the training points in the decision function, they are usually also memory efficient.
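A hedged sketch of an SVM classifier with scikit-learn (assumed available; dataset sizes are illustrative), showing that the fitted decision function stores only a subset of the training points (the support vectors):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# 100 samples in a 20-dimensional space: dimensions comparable to sample count
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

clf = SVC(kernel="rbf").fit(X, y)
n_support = len(clf.support_)   # training points actually kept by the model
```

`n_support` is strictly smaller than the number of training samples, which is the source of the memory efficiency mentioned above.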
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (a cluster) are more similar to each other than to those in other groups. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms. The appropriate clustering algorithm and parameter settings depend on the individual data set and intended use of the results.
- Here’s a list of the most common clustering algorithms:
- K-Means: requires the number of clusters to be specified and assumes that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters or manifolds with irregular shapes. In higher-dimensional spaces it is advisable to run a dimensionality reduction algorithm such as PCA prior to k-means clustering.
- Affinity Propagation: creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples.
- Mean Shift: aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region.
- DBSCAN: it views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means, which assumes that clusters are convex.
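The contrast between k-means and DBSCAN on non-convex clusters can be sketched with scikit-learn's two-moons dataset (the `eps` value is an assumption tuned to this data scale):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moons: elongated, non-convex clusters
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# k-means forcibly splits the plane into two convex regions
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows the density and recovers the two moons (label -1 marks noise)
db_labels = DBSCAN(eps=0.25).fit_predict(X)
```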
t-SNE: is an advanced dimensionality reduction algorithm that works particularly well with high-dimensional data that lie on several different, but related, low-dimensional manifolds. For high-dimensional data lying on a nonlinear manifold it is usually important to keep the low-dimensional representations of very similar data points close together, which is usually not possible with a linear mapping. In fact, a linear method such as classical scaling (closely related to PCA) is not good at modeling curved manifolds, and it focuses on preserving the distances between widely separated data points rather than the distances between nearby data points. t-SNE captures much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales.
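A minimal usage sketch with scikit-learn's TSNE (the digits subset size is arbitrary, chosen only to keep the run short):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional handwritten-digit images, embedded into 2 dimensions
X, _ = load_digits(return_X_y=True)
emb = TSNE(n_components=2, random_state=0).fit_transform(X[:200])
```

Plotting `emb` typically shows the digit classes as separate clusters, even though no labels were used.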
De-Noising Autoencoders: auto-encoders learn an encoder function from input to representation and a decoder function back from representation to input space, such that the reconstruction (the composition of encoder and decoder) is good for training examples. Regularized auto-encoders prevent the network from simply learning the identity function, achieving a generally lower reconstruction error. Basically, autoencoders “compress” high-dimensional spaces into lower-dimensional spaces. The “code” synthesized by the autoencoder can be used to explore similarity in the data, or for applying clustering or regression algorithms in low-dimensional spaces.
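As a rough sketch of the idea (using scikit-learn's MLPRegressor as a stand-in for a dedicated deep-learning framework; the data, sizes, and noise level are made up), a network with a small hidden layer can be trained to reconstruct clean inputs from corrupted ones:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 5))                        # hidden low-dimensional factors
W = rng.normal(size=(5, 20))
X = Z @ W                                            # 20-d data lying on a 5-d subspace
X_noisy = X + rng.normal(scale=0.1, size=X.shape)    # corrupted version of the input

# Reconstruct the clean input from the corrupted one; the 5-unit hidden
# layer is the low-dimensional "code" mentioned above.
ae = MLPRegressor(hidden_layer_sizes=(5,), activation="identity",
                  solver="lbfgs", max_iter=2000, random_state=0)
ae.fit(X_noisy, X)
```

The noise injection is what prevents the network from learning the identity function: it must exploit the structure of the data to denoise.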
The modern algorithms for Feature Selection can handle hundreds or thousands of correlated, meaningless and noisy features. Nowadays this class of algorithms is extensively used in many research fields. As an example we can cite gene selection from microarray data: here the features are gene expression coefficients corresponding to the abundance of mRNA in a sample, for a number of patients. A typical classification task is to separate healthy patients from cancer patients based on their gene expression profile. Usually fewer than a few hundred patients are available, but the number of measured variables for each patient ranges from 5000 to 50000.
Variable Ranking: this is one of the simplest classes of methods. It scales very well to large datasets but has limited detection capability because it works on individual variables, independently of the context of the others. This class of algorithms calculates a score for each feature, where a high score indicates a highly valuable feature. Predictors are built by incorporating progressively more variables, starting from those with the highest scores. These algorithms detect only linear dependencies of individual variables and, in cases where a large number of variables separate the data perfectly, cannot distinguish between top-ranking variables. These characteristics strongly limit the predictive power. Moreover, when these algorithms are used as filters to select the features for a predictor, they can discard useful information for two important reasons: first, redundant features can work together to increase the predictive power; second, variables that are useless by themselves can be useful together with others. This leads us to consider a more powerful class of algorithms: Variable Subset Selectors.
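The univariate ranking step above can be sketched with scikit-learn's SelectKBest (the dataset and the value of k are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 30 features, only 5 of which actually carry information about the class
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)

# Score each feature independently (ANOVA F-test) and keep the top 5
sel = SelectKBest(f_classif, k=5).fit(X, y)
top = sel.get_support(indices=True)   # indices of the 5 highest-scoring features
```

Note that the scoring is per-feature, which is exactly the limitation discussed above: it cannot see features that are only useful in combination.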
Variable Subset Selectors: these algorithms divide into Wrappers and Embedded Methods. The wrapper methodology consists in using the prediction performance of a given learning machine to assess the relative usefulness of subsets of variables. Wrappers can be used to search the feature space when the number of features is not too large. When the number of features is too large for an exhaustive search, greedy search strategies (like forward selection or backward elimination) have demonstrated good performance. Embedded Methods are usually even more efficient than Wrappers, since they integrate the black-box learning machine approach of the wrappers with the ability to automatically perform variable selection during training.
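A wrapper-style subset selector can be sketched with scikit-learn's recursive feature elimination (RFE), a greedy backward elimination driven by the learning machine itself (dataset sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 20 features, 4 informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

# Repeatedly fit the learning machine and drop the least useful feature
# until only the requested subset survives (greedy backward elimination)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
selected = rfe.support_   # boolean mask of the 4 surviving features
```

Unlike per-feature ranking, the subset is evaluated through the predictor, so features that are only useful in combination can survive.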
Machine Learning - Deep Learning
Neural Networks (NN)
Artificial Neural Networks (ANNs), or more simply Neural Networks (NNs), are a family of models inspired by biological systems like the central nervous systems of animals. Artificial neural networks are systems of interconnected "neurons" which exchange messages with each other. In the most common topology, neurons are organized in layers: each layer receives signals from the previous layer, and the first layer receives the input signals. Those signals are multiplied by weights, summed together and then passed through a nonlinear function (usually a sigmoid). In practice the neural network applies a transformation to the input data (a homeomorphism, which does not affect its topology) in order to represent the data in a linearly separable form. Neural networks can be used for classification or regression.
A unit refers to the element of a layer that transforms its inputs via a nonlinear activation function (for example the logistic sigmoid function). Usually, a unit has several incoming connections and several outgoing connections. However, units can also be more complex, like long short-term memory (LSTM) units, which have multiple transformations and gating functions.
The term artificial neuron is equivalent to unit, but implies a close connection to neurobiology. The term neuron is now discouraged and the more descriptive term unit should be used instead.
An activation function takes in weighted data (matrix multiplication between input data and weights) and outputs a non-linear transformation of the data. The difference between units and activation functions is that units can be more complex, that is, a unit can have multiple activation functions (for example LSTM units) or a slightly more complex structure (for example maxout units).
The most powerful networks are based on Nonlinear Activation Units. In contrast, the features of 1000 layers of pure linear transformations can be reproduced by a single layer (because a chain of matrix multiplication can always be represented by a single matrix multiplication). This is why non-linear activation functions are so important in deep learning.
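The collapse of stacked linear layers can be verified in a few lines of NumPy (a minimal sketch; the vector and matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # an input vector
W1 = rng.normal(size=(4, 4))    # weights of a first linear layer
W2 = rng.normal(size=(4, 4))    # weights of a second linear layer

stacked = (x @ W1) @ W2         # two purely linear layers...
collapsed = x @ (W1 @ W2)       # ...equal one layer with the product matrix
```

A nonlinearity between the layers (e.g. `np.tanh(x @ W1) @ W2`) breaks this equivalence, which is why depth only adds expressive power with nonlinear activations.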
A layer is the highest-level building block in Neural Networks and consequently in Deep Learning. A layer is a container that usually receives weighted input, transforms it with a set of mostly non-linear functions and then passes these values as output to the next layer. A layer is usually uniform, that is it only contains one type of activation function, pooling, convolution etc. The first and last layers in a network are called input and output layers, respectively, and all layers in between are called hidden layers.
In machine learning we train a general model on data, and use the trained model to make predictions on new data. The process of training a model is a learning process where the model is exposed to new, unfamiliar data step by step.
It may take many iterations to train a model with good predictive performance. This iterative predict-and-adjust process continues until the predictions of the model no longer improve.
Feature engineering is the process of extracting useful patterns from data that make it easier for Machine Learning models to learn a specific task. For example, if you model a heating system, you can multiply the fuel quantity by its specific energy: the model will understand the problem better if you directly provide the energy input instead of the fuel flow.
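In code, the engineered feature from the heating-system example is just a product (all numbers here are hypothetical):

```python
# Raw measurement: fuel flow [kg/s] (hypothetical values)
fuel_flow_kg_s = [0.010, 0.012, 0.008]

# Domain knowledge: specific energy of the fuel [J/kg] (assumed value)
specific_energy_J_kg = 43e6

# Engineered feature: energy input [W], easier for the model to relate to heat output
energy_input_W = [f * specific_energy_J_kg for f in fuel_flow_kg_s]
```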
Feature engineering is the most important skill for achieving good results on most prediction tasks. However, it is difficult because it requires domain knowledge. The difficulty of feature engineering is the main reason to seek algorithms that engineer features automatically.
Feature learning algorithms find the common patterns that are important to disentangle the input data.
Feature learning can be thought of as Feature Engineering done automatically by the algorithms.
Deep Learning networks are exceptionally good at finding good features in data: each layer passes its features to the next to form a hierarchy of nonlinear features that grow in complexity. The final layer(s) use all these generated features for classification or regression.
In hierarchical Feature Learning, we extract multiple layers of non-linear features. We are interested in stacking such very deep hierarchies of non-linear features because we cannot learn complex features from a few layers.
You can imagine the Neural Network learning process to be like a student studying for an exam.
A shallow network is like a student who learns the lessons by heart without understanding them: if you add more neurons you increase the student's memory, but you do not increase the ability to understand the problems. This student will be exceptional at replicating examples already seen, but will not generalize the learned concepts.
If you make the network deeper instead, you increase its ability to understand complex concepts and to generalize them to quite different problems.
While hierarchical feature learning was used before the field of deep learning existed, these architectures suffered from major problems such as the vanishing gradient problem, where the gradients become too small to provide a learning signal for very deep layers, making these architectures perform poorly compared to shallow learning algorithms.
The term deep learning originated from new methods and strategies designed to generate these deep hierarchies of non-linear features by overcoming the problems with vanishing gradients.
Deep learning is not associated just with learning deep nonlinear hierarchical features, but also with learning to detect very long nonlinear time dependencies in sequential data: Long Short-Term Memory (LSTM) recurrent neural networks allow the network to pick up on activity hundreds of time steps in the past to make accurate predictions. While LSTM networks were mostly ignored for the past 10 years, their usage has grown rapidly since 2013 and, together with convolutional nets, they form one of the two major success stories of deep learning.
Random Forests (RF)
Random Forests (RF) is an ensemble method based on a modified version of Bagging (Bootstrap aggregating), invented by Leo Breiman in 2001. The rationale of this algorithm is that complex models have high variance, and averaging many of them contributes to reducing it. In principle a complex model is able to perfectly learn the true function, but this is also its drawback: in practice it adapts too much to the subset of the data it sees, so several complex models trained with different subsets of the data display high variability in the resulting model. This type of problem can be addressed by averaging many complex models, as RF does: it builds a large ensemble of trees and then averages their predictions to obtain the final model. The idea of Random Forests is to reduce the correlation of each tree (and thus decrease variance) by randomly selecting a subset of input variables for each split in the tree. This procedure builds trees that are less correlated with each other, so averaging them improves performance. Random Forests are a popular method because they work surprisingly well with the hyperparameters' default values.
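A minimal sketch with scikit-learn (assumed available), showing that a Random Forest with default hyperparameters already performs well on a standard dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Default hyperparameters: an ensemble of decorrelated trees whose
# predictions are averaged (majority-voted) into the final model
rf = RandomForestClassifier(random_state=0)
scores = cross_val_score(rf, X, y, cv=5)
```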
Gradient Boosted Trees
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a special case of Gradient Boosting where the weak learners are regression trees. The method was invented by Jerome H. Friedman in 1999. It is a form of boosting that learns a function sequentially: each member of the ensemble learns on the error of its predecessor. The final model is the aggregation of all individual predictions.
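A hedged sketch of gradient boosted regression trees with scikit-learn (assumed available); the estimator count and learning rate below are illustrative, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Each of the 100 shallow regression trees is fitted sequentially on the
# residual error left by its predecessors; their predictions are summed
gbrt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                 random_state=0).fit(X, y)
```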
Recurrent Neural Networks (RNN)
While standard (feed-forward) neural networks accept a fixed-size vector as input and produce a fixed-size vector as output with a fixed amount of computational steps (e.g. the number of layers in the model), recurrent neural networks operate over sequences of vectors. An RNN combines the input vector with its state vector through a learned function to produce a new state vector. While RNNs are nowadays used mainly for text and speech understanding, we use this kind of algorithm in our Virtual Sensors to process streaming data. In fact, RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network's output distribution, then feeding the sample back in as input at the next step.
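The core recurrence can be sketched in a few lines of NumPy (a minimal illustration with assumed sizes and untrained random weights; in a real RNN, Wx and Wh are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(3, 5))   # input-to-state weights (assumed sizes)
Wh = rng.normal(scale=0.1, size=(5, 5))   # state-to-state (recurrent) weights

def rnn_step(x, h):
    # Combine the input vector with the previous state vector to
    # produce the new state vector
    return np.tanh(x @ Wx + h @ Wh)

h = np.zeros(5)                           # initial state
for x in rng.normal(size=(10, 3)):        # a sequence of 10 input vectors
    h = rnn_step(x, h)
```

The same two weight matrices are reused at every step, which is what lets the network handle sequences of arbitrary length.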
Long-Short Term Memory
In theory RNNs can use their feedback connections and internal state to store any representation of input events, streamed over time, in the form of activations. In practice this is not possible, especially when the data streams are sampled at a high frequency but the information is spread over time (over hundreds of thousands of samples). The Long Short-Term Memory (LSTM) algorithms are designed exactly to overcome these problems: they can learn to bridge time intervals in excess of 1000 samples even in the case of noisy and incompressible input sequences.
ARX - ARMAX - NARMA - NARMAX - NLARX
These models represent one of the most important evolutions in the field of system identification, in particular black-box system identification. These algorithms are stochastic forecasting algorithms and can be divided into linear and nonlinear models. Linear models are the ARMA (AutoRegressive Moving Average) and the ARMAX (AutoRegressive Moving Average with eXogenous inputs), while the nonlinear counterparts are the NARMA (Nonlinear AutoRegressive Moving Average) and NARMAX (Nonlinear AutoRegressive Moving Average with eXogenous inputs). NLARX is a special name used for commercial reasons that is basically equivalent to NARMAX. Since these particular algorithms are so important in system identification, we developed a special proprietary version of the training software that improves the performance of the existing commercial implementation by two orders of magnitude and allows strong parallelization on heterogeneous nodes.
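As a minimal illustration of the black-box idea (a NumPy sketch, not our proprietary implementation), a first-order ARX model can be identified by ordinary least squares; the system parameters below are assumed for the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 0.8, 0.5                 # assumed "true" system parameters

# Simulate the system: y[t] = a*y[t-1] + b*u[t-1] + e[t]
u = rng.normal(size=300)                  # exogenous input
y = np.zeros(300)
for t in range(1, 300):
    y[t] = a_true * y[t - 1] + b_true * u[t - 1] + 0.01 * rng.normal()

# Identify a and b by least squares on the regressor matrix [y[t-1], u[t-1]]
Phi = np.column_stack([y[:-1], u[:-1]])
a_hat, b_hat = np.linalg.lstsq(Phi, y[1:], rcond=None)[0]
```

The nonlinear variants (NARMA/NARMAX) replace this linear regression with a nonlinear function approximator over the same lagged regressors.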
ARIMA - SARIMA
Auto Regressive Integrated Moving Average (ARIMA) and Seasonal Auto Regressive Integrated Moving Average (SARIMA) are extensions of the ARMA class that include more realistic dynamics: respectively, non-stationarity in mean and seasonal behaviours. These models are applicable to just a very limited set of problems, and in recent years their performance has been surpassed by newer approaches like LSTM.