Introduction to Machine Learning Models

Discover what it takes to set up machine learning models, learn about the types of models, and meet the ten most popular algorithms.

Over the last few decades, artificial intelligence has achieved what was once believed to be science fiction: cars that drive themselves, machine learning models that diagnose heart disease better than doctors can, and predictive customer analytics that lead to companies knowing their customers better than their parents do.

This machine learning revolution was sparked by a simple question: can a computer learn without explicitly being told *how*?

By joining statistical knowledge with the computer’s ability to sift through huge amounts of data faster than any human could, the field of artificial intelligence created machine learning models. These models could take in raw data, recognize an underlying governing pattern, and apply what they’d learned to novel situations. In other words, computers could learn by themselves to uncover the hidden truths within data.

In this article, we take a peek into the mechanisms of machine learning models.

Deploy machine learning models by connecting your favorite notebook with Keboola Connection.

A machine learning model is a mathematical representation of the patterns hidden in data. When the machine learning model is trained (or built or fit) to the training data, it discovers some governing structure within it. That governing structure is formalized into rules, which can be applied to new situations for predictions. So, if we train a model on some training data and then apply that model to new data, the model would be able to infer some relationship within it.

Take, for example, a weather machine learning model that has been trained to recognize the imminence of rain whenever the barometer falls under a certain threshold. This same model can also predict rain whenever this threshold in air pressure is crossed on a different barometer.
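The barometer example can be sketched in a few lines of Python. This is a minimal illustration, not a real forecasting model: the pressure readings are made up, and the function names `fit_pressure_threshold` and `predict_rain` are hypothetical. "Training" here simply means picking the pressure cut-off that misclassifies the fewest historical readings.

```python
def fit_pressure_threshold(readings):
    """Learn a rule of the form "rain if pressure < threshold".

    readings: list of (pressure_hpa, rained) pairs (illustrative, made-up data).
    Returns the candidate threshold with the fewest training mistakes.
    """
    pressures = sorted({p for p, _ in readings})
    # Candidate cut-offs: midpoints between neighbouring observed pressures.
    candidates = [(a + b) / 2 for a, b in zip(pressures, pressures[1:])]

    def errors(threshold):
        # Count readings where the rule's prediction disagrees with reality.
        return sum((p < threshold) != rained for p, rained in readings)

    return min(candidates, key=errors)


def predict_rain(threshold, pressure_hpa):
    """Apply the learned rule to a new reading, from any barometer."""
    return pressure_hpa < threshold


training = [(995, True), (1000, True), (1004, True),
            (1012, False), (1018, False), (1025, False)]
threshold = fit_pressure_threshold(training)

print(threshold)
print(predict_rain(threshold, 1001), predict_rain(threshold, 1020))
```

The learned threshold is the "governing structure" the model extracted from the training data; `predict_rain` is that structure applied to readings it has never seen.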

So, how does one create a machine learning model?

Whether you’re building a smart computer system capable of recognizing objects in a real-time camera feed, or trying to predict whether the stock market will go up, the process of building a machine learning model generally follows the same steps:

**1. Get input data**. This is usually the most time-consuming step when building a machine learning model. The input data needs to be collected, cleaned, and transformed into the appropriate form for the algorithm(s) you are going to use.

**2. Split data into training and test data sets**. Once the data set is ready, it is split into two: training data and test data. The model is *built upon* training data and *tested on* test data (data points that it has never seen before). This confirms that whatever it has learned on the training data generalizes well to novel situations. Sometimes, the data set is split into three parts, with the third part used for ‘validation’, i.e. hyperparameter tuning.

**3. Fit algorithm to training data**. In this step, we ‘train’ the model. One or more algorithms of your choice are fitted to the training data, so that the model learns the specific pattern within it. In mathematical terms, this means that an algorithm optimizes the mapping between input and output data to minimize a cost function, then records those settings so it can apply them to novel situations. In less technical language, this means that an algorithm identifies features that can help it to achieve its task. For example, we could say something like “I came up with a rule for distinguishing a table: if it has a flat surface and four legs sticking out from it, it is most likely a table.”

**4. Evaluate model**. The model is evaluated for accuracy (or sensitivity, specificity, or another metric of success) against the test data set. A good model does not overfit, that is, it does not learn the training data so closely that it fails to generalize to the test data. Different aspects of the model are changed (e.g. using a different algorithm altogether, tweaking the hyperparameters of the algorithm, additional feature engineering, etc.) until the evaluation of the model is satisfactory.

**5. Use the model in real life**. Once the model has been successfully evaluated, it can be shipped to production. Now your machine learning model can drive cars, label objects in videos, or trigger a warning if it suspects that a radiological image is displaying cancerous cells.
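The steps above can be walked through end to end in plain Python. The sketch below is only an illustration under stated assumptions: the data is synthetic (a made-up relationship between apartment area and price, plus noise), the helper names `split`, `fit_line`, and `mean_squared_error` are hypothetical, and the "algorithm" is a least-squares straight line, with mean squared error as the cost function.

```python
import random


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as test data."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_line(rows):
    """Closed-form least-squares fit of y = m*x + b (minimizes squared error)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def mean_squared_error(model, rows):
    """The cost function: average squared gap between prediction and truth."""
    m, b = model
    return sum((m * x + b - y) ** 2 for x, y in rows) / len(rows)


# Step 1: get input data (synthetic: price is roughly 3 * area + 50, plus noise).
rng = random.Random(42)
data = [(area, 3.0 * area + 50.0 + rng.gauss(0, 5)) for area in range(30, 110, 2)]

# Step 2: split into training and test sets.
train, test = split(data)

# Step 3: fit the algorithm to the training data only.
model = fit_line(train)

# Step 4: evaluate on data the model has never seen.
train_mse = mean_squared_error(model, train)
test_mse = mean_squared_error(model, test)

# Step 5: use the model on a new input (an apartment with area 75).
predicted_price = model[0] * 75 + model[1]

print(model, train_mse, test_mse, predicted_price)
```

If the test error were much larger than the training error, that would be a sign of overfitting, and a cue to go back and change the algorithm, its hyperparameters, or the features.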

We can broadly categorize machine learning models into three types based on the learning directives that we give to the model when training it:

**1. Supervised learning**. In supervised learning, we train machine learning models by giving them a set of inputs (training data) and expected outputs or labels. The model is tasked with discovering the patterns in the training data, which can be used to map inputs to outputs. For example, if we give a supervised learning model inputs about the area of an apartment and the general geographical location, the model can predict the selling price of that apartment. Supervised learning models can be broken down into two subcategories:

*Regression models*. Regression models output continuous numeric values, e.g. the likelihood of a customer churning (95%) or the best price for a new item based on price elasticity ($45).

*Classification models*. Classification models output categorical variables, such as classes and labels. For example, a model might tell us if a customer belongs to the ‘outdoor’ shopping group or if they’re more of a ‘tech aficionado’, which can be used for upselling and product suggestions.
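To make the difference between the two kinds of output concrete, here is a small sketch of a supervised model that produces both: a continuous churn likelihood and, derived from it, a categorical label. Everything in it is an assumption made for illustration: the data (months since a customer's last purchase, and whether they churned) is invented, the 0.5 cut-off is arbitrary, and the model is a one-input logistic regression fitted with plain gradient descent.

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Labeled training data (made up): inputs and expected outputs.
months_inactive = [0, 1, 1, 2, 3, 4, 5, 6, 7, 8]
churned = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(months_inactive, churned)


def churn_probability(months):
    """Continuous numeric output: a likelihood between 0 and 1."""
    return sigmoid(w * months + b)


def will_churn(months, cutoff=0.5):
    """Categorical output: a yes/no label derived from the likelihood."""
    return churn_probability(months) >= cutoff


print(churn_probability(0), churn_probability(8))
print(will_churn(0), will_churn(8))
```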

**2. Unsupervised learning**. Unlike supervised learning, unsupervised learning models aren’t trained with any outputs or labels. The model’s goal is to find the underlying structure within the data without any guidance. These techniques are mostly used in exploratory data analysis and data mining, where the goal is to discover new knowledge about the underlying data rather than to predict a known outcome. The models are still statistical ones tasked with pattern recognition, but the pattern is not known in advance. As such, unsupervised learning has been used in anomaly detection (e.g. identifying bank fraud), clustering (e.g. figuring out how many customer personas are in a customer base), dimensionality reduction (distilling complex data into fewer dimensions that keep most of the original information but are simpler to work with, which is often used in network or social media analyses to cut down the noise), and any other branch of data science where knowledge discovery is the guiding principle.
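As an illustration of learning without labels, the sketch below clusters a handful of invented customers into two personas with a bare-bones k-means loop. It is a simplified version written for readability: the data points are made up, the function name `k_means` is hypothetical, and the starting centroids are simply the first k points, whereas real implementations usually pick them randomly or with a smarter scheme.

```python
import math


def k_means(points, k, max_iterations=100):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # simplistic initialisation
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster: converged
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Made-up customers: (orders per month, average basket size). No labels given.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
             (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centroids, assignments = k_means(customers, k=2)
print(centroids)
print(assignments)
```

Note that the algorithm is never told which customer belongs to which persona; the grouping emerges from the distances between the points alone.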

**3. Reinforcement learning**. Similarly to supervised learning, reinforcement learning also trains models by mapping input data to outputs. However, unlike supervised learning, the directive is not to discover patterns in data and learn them. Rather, reinforcement learning models act as *agents* which need to perform actions. At any stage, the model can choose among multiple actions or decisions, and it gets rewarded or punished according to its chosen path. Though this might seem counterintuitive, reinforcement learning is one of the techniques used for teaching self-driving cars. It was also employed when successfully teaching computers to beat humans at games, such as chess.
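The reward-and-punishment loop can be sketched with tabular Q-learning, one common reinforcement learning method (chosen here for illustration; the article does not prescribe a specific one). The environment is a toy assumption: an agent in a five-cell corridor earns a reward for reaching the last cell and a small penalty for every other step. All names and parameter values are hypothetical.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # index 0 = step left, index 1 = step right


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True    # reward for reaching the goal
    return next_state, -0.01, False     # small punishment for every other move


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.randrange(2)                       # explore
            else:
                a = max((0, 1), key=lambda i: q[state][i])  # exploit
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy for the non-goal cells: the action with the higher learned value.
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(policy)
```

Nobody tells the agent that walking right is correct; it only ever sees rewards and penalties, and the preference for moving right emerges from them.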

This classification is rough and incomplete. ML models can also be categorized into ensemble models (which combine multiple different models to work towards a common goal) and feature learning models (which, instead of predicting the outcome, learn better representations, or features, of the input data). The fruitful advancements in neural networks also raise the question of whether deep learning should be an independent category.

We’ll now take a look at the 10 most popular machine learning algorithms, from the salt and pepper (linear and logistic regression) to the state-of-the-art neural networks. These popular algorithms are widely used to solve complex tasks with machine learning:

**Linear regression**. Linear regression is one of the best known supervised machine learning algorithms. It is used to predict a numeric value based on a set of inputs, e.g. the value of stock prices at the end of the day (output) based on the opening prices on the stock market (inputs). Linear regression tries to find a straight line of best fit (hence the name ‘linear’) between the input training variables and output training variables. The end model is a linear function, which for a single input can be written as y = mx + b. Linear regression comes in many variants: simple linear regression (one input), multiple linear regression (several inputs), usually fitted with Ordinary Least Squares (OLS), and regularized versions with an L1 (Lasso) or L2 (Ridge) penalty.

**Logistic regression**. Logistic regression is similar to the linear one, but instead of being used on regression tasks, it is mostly used for classification. Logistic regression fits a sigmoid curve to the training inputs to predict the *likelihood* of a data point belonging to a class. Because it predicts likelihood, and not the class directly, it is called a regression. Let us look at a practical example: if a customer has a 1% likelihood of being a ‘fraudster’, the logistic regression will classify them as the more likely ‘honest Joe’.

**Support vector machine (SVM)**. SVMs are extremely powerful supervised statistical modeling algorithms used for both regression and classification. In classification tasks, SVMs find a hyperplane (a flat boundary, the higher-dimensional analogue of a line or a plane) that separates the different classes of data with the widest possible margin, which is later used to classify novel examples. If a new data point falls on the side where ‘Class A’ was during training, SVMs classify the novel point as belonging to the same class.

**Principal component analysis (PCA)**. Principal component analysis is an unsupervised algorithm used for feature extraction. PCA re-expresses the data you have in terms of principal components (new axes, ordered by how much of the variance in the original data they capture). Its use case is most often dimensionality reduction, but also visualization and predictive modeling.

**Decision trees**. Decision trees are supervised algorithms designed to find a path to the target variable via a set of decisions on the input variables. For example: “Is the color of the fruit’s skin blue?” If yes, it is not an “apple”. The target can be either continuous (regression trees) or categorical (classification trees). Accordingly, the algorithm is technically called CART (Classification And Regression Tree). Decision trees are often favored as first-step algorithms when approaching new problems, as they are similar to decision charts and therefore easy to interpret. However, they can easily overfit the data, in which case they do not generalize well to novel situations and datasets.

**Random forests**. This is an ensemble method that combines multiple decision trees into a final decision. Because individual decision trees tend to overfit, random forests mitigate the errors of individual trees by combining the outputs of multiple decision trees (averaging them for regression, taking a majority vote for classification) into the final verdict.

**k-nearest neighbors (k-NN)**. k-NN is a common supervised classification algorithm. It stores the information about each training data point, as well as the class that the data point belongs to. Then, when new data points are added (for example, previously unseen data), k-NN uses the distance from existing points to find the k training points closest to the new one and assigns it the class that is most common among those neighbors. There are multiple measures of similarity which k-NN can use, from Euclidean distance (used for continuous variables) to Hamming distance (used for categorical variables).

**k-means**. k-means is similar to k-NN because it looks at distance to determine group membership. However, unlike k-NN, k-means is an unsupervised learning algorithm. Its goal is to discover how different points cluster together. The intuition behind this mathematical model is that similar data points will be closer together. k-means then tries to determine k points called *centroids*, each of which sits at the center (least cumulative distance) of the points in its own cluster, but further away from the points of other clusters. This algorithm is intuitive, but it compares every point with every centroid on each iteration, which can become computationally taxing, so it’s mostly used for exploratory analysis on smaller data sets.

**Naive Bayes**. The Naive Bayes classifier is a probabilistic classifier whose mathematical model rests upon Bayes’ theorem of conditional probability, together with the ‘naive’ assumption that the features are independent of one another given the class. It is often the first algorithm to be used in text classification (e.g. spam vs. legitimate email, sport vs. political news, etc.), and it is favored because its resource needs scale linearly with the number of examples. The Naive Bayes classifier first determines the probability of an example belonging to each class, then applies a decision rule to assign that example to the most probable class.

**Neural networks**. Neural networks deserve a special mention on this list, even though there’s no such thing as a single ‘neural network’ algorithm. Neural networks are a family of algorithms, which cover classification tasks, regression tasks, ensemble tasks, and feature discovery. Aptly, each neural network architecture has its own name, such as perceptron, autoencoders, Liquid State Machines, etc. The main architecture is similar across the different implementations: the algorithm is divided into multiple layers, from the input layer (where input examples are represented) to the output layer (where the resulting regression/classification/ensemble is represented), with optional (hidden) layers in between. The specific architecture of the in-between layers mostly determines how a neural network algorithm works, alongside its activation function and other characteristics. From these hidden layers comes the name ‘deep learning’: a network with many hidden layers is called deep, and what it has learned is stored in the weights on the connections between the layers, from input to output.
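To give a feel for how compact some of these algorithms are, here is a minimal sketch of k-NN classification with Euclidean distance and a majority vote. The training data (counts of hiking-gear and gadget purchases per shopper) and the function name `knn_predict` are invented for illustration; a production implementation would also handle ties, feature scaling, and efficient neighbor search.

```python
import math
from collections import Counter


def knn_predict(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points.

    training: list of (features, label) pairs; features are numeric tuples.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up shoppers: (hiking-gear purchases, gadget purchases) and their segment.
training = [
    ((5, 1), "outdoor"), ((6, 0), "outdoor"), ((4, 2), "outdoor"),
    ((1, 6), "tech aficionado"), ((0, 5), "tech aficionado"), ((2, 7), "tech aficionado"),
]

print(knn_predict(training, (5, 2)))
print(knn_predict(training, (1, 5)))
```

There is no separate fitting step here: k-NN simply stores the training points, and all of the work happens at prediction time.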

If you can’t wait to try some of these algorithms for yourself and build some machine learning models, check out Keboola’s feature-rich offering for data science.

Keboola is a platform for data scientists who are looking to build their own machine learning models. It comes with a one-click deployment of Jupyter Notebooks, where all of the modeling can be done using Julia, R, or Python.

But Keboola takes it a step further.

Because Keboola is a data-centric platform, you can also build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. You can deploy multiple models with different algorithms to version your work and compare them to see which one is the best performer. Check out the entire offering and test it for free.
