**What is Big Data? - Part 4**

**There are different techniques to work with big data**. Analysts can run queries on data sets and get exact answers to specific questions, or they can use statistical methods to extract the required information. **Machine learning is a subfield of computer science, closely related to computational statistics; it applies algorithms to masses of data to learn from them and make decisions and predictions.** One of its most acknowledged features is the ability to improve its outcomes as the amount of input grows: instead of following rigid mathematical functions, it builds models and searches for similarities and trends. As mentioned earlier for big data, machine learning techniques do not unveil the reasons behind an event, but with a massive amount of data they can build models on correlations that can be used to calculate the results (Alpaydın, 2010).

**Data mining is a component of machine learning.** It is defined as **“the analysis of (often large) data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner”** (Hand, Mannila, & Smyth, 2001). These data sets can come from different sources; they can be structured or unstructured. Before data mining techniques can be applied to them, they have to be transformed into a logical, organized format that can serve as input for the analyses (Reffat, Gero, & Peng, 2004).

There are several analysis techniques in data mining, and the choice depends on the purpose of the research. They can be classified into two broad categories: **supervised learning and unsupervised learning.** The first one assumes labeled data, which means that each instance is labeled by a response variable, while in unsupervised learning, variables are not divided into response and predictor variables (van der Aalst, 2011). Some authors mention semi-supervised learning as well, but I only focus on the two original groups in this research.

**Classification techniques are used to categorize instances based on their predictor variables.** The response variable is categorical, so the resulting groups take on discrete values. The simplest examples are binary values (tests that can have one of two possible values, for example, yes or no). However, multiclass targets can be defined as well (e.g., low, normal, high or unknown viscosity of engine oil). Classification does not imply any order among the instances; it uses algorithms to find relationships between the predictor and target values (van der Aalst, 2011).
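
As an illustration of a multiclass target, here is a minimal sketch of one possible classifier, k-nearest neighbors, which is not named in the cited source. The viscosity readings, their labels and the function name `knn_classify` are made up for this example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k closest training instances."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: (viscosity reading,) -> class label.
train = [
    ((8.0,), "low"), ((9.0,), "low"), ((10.0,), "low"),
    ((13.0,), "normal"), ((14.0,), "normal"), ((15.0,), "normal"),
    ((19.0,), "high"), ((20.0,), "high"), ((21.0,), "high"),
]

print(knn_classify(train, (14.5,)))  # normal
print(knn_classify(train, (20.5,)))  # high
```

The predictor here is a single number, but the same function works for instances with several attributes, because `math.dist` accepts points of any dimension.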

**Regression techniques, on the other hand, expect numerical response variables.** The aim of these techniques is to place the response somewhere on a continuous range. For example, a function could suggest that the wear of ball studs (in percent) is the number of days since their installation divided by 800 (e.g., a 600-day-old ball stud is assumed to be 75% worn out). Regression tries to find the function that predicts the response variable from the predictor values with the smallest error (van der Aalst, 2011).
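
The ball stud example can be sketched with ordinary least squares, one common way to fit such a function. The observations below are invented so that they follow the days/800 rule exactly, and `fit_line` is a name chosen for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: days since installation -> wear in percent.
days = [100, 200, 400, 600, 800]
wear = [12.5, 25.0, 50.0, 75.0, 100.0]

slope, intercept = fit_line(days, wear)
print(slope * 800)              # about 100 percent after 800 days
print(slope * 600 + intercept)  # about 75 percent for a 600-day-old stud
```

Real measurements would scatter around the line, and the fitted slope and intercept would then be the values that minimize the squared prediction error.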

**Decision trees are used to make models to predict specific values based on different input variables. The model goes through numerous attribute value tests with two or more possible outcomes.**

An inspected item enters the tree at the root node, which applies the first test and forwards the item to one of its branches. Each item must belong to exactly one branch. After the test, the next node can apply further tests until the recursive process reaches an end point, where the model can provide a reasonable estimate of the examined target value. The nodes that do not have further tests are called leaves, and a probability can be calculated for each of them (for example, the chance of oil leaking in the engine, in percent) (van der Aalst, 2011).
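
The traversal from root to leaf can be sketched as follows. The tree is hand-built for illustration, not learned from data; the attributes (`mileage_km`, `seal_age_years`), the thresholds and the leaf probabilities are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    leak_probability: float  # estimate stored at an end point of the tree

@dataclass
class Node:
    test: Callable[[dict], bool]  # attribute value test with two outcomes
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]

def predict(tree, item):
    """Follow the tests from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.if_true if node.test(item) else node.if_false
    return node.leak_probability

# Hypothetical tree: the root tests mileage, the second node tests seal age.
tree = Node(
    test=lambda engine: engine["mileage_km"] > 150_000,
    if_true=Node(
        test=lambda engine: engine["seal_age_years"] > 5,
        if_true=Leaf(0.60),
        if_false=Leaf(0.25),
    ),
    if_false=Leaf(0.05),
)

print(predict(tree, {"mileage_km": 180_000, "seal_age_years": 7}))  # 0.6
print(predict(tree, {"mileage_km": 90_000, "seal_age_years": 7}))   # 0.05
```

Because every test has exactly two outcomes, each item ends up in exactly one leaf. In practice the tests and leaf values would be derived from labeled training data by a tree-learning algorithm.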

**Clustering is the unsupervised technique of grouping together objects that belong to the same class (cluster)**. It uses unlabeled data and a function to determine the classes and the instances belonging to them. The instances can be described by two or more attributes (dimensions). The most common algorithm is k-Means clustering, which distributes the instances into a pre-defined number of groups based on their Euclidean distances. It is the basis of many other methods, such as pattern recognition, text mining, face recognition and diagnostics. It is a scalable technique, and as the examined data set grows, the clusters become more reliable (van der Aalst, 2011).
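
A minimal sketch of k-Means in plain Python may help here. It alternates between assigning each instance to its nearest centroid and moving each centroid to the mean of its group. The sample points and starting centroids are invented, and the starting centroids are fixed rather than random so that the run is reproducible.

```python
import math

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid (Euclidean)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, centroids, max_iterations=100):
    centroids = list(centroids)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster; keep it if empty.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical two-dimensional instances forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.5)])
print(centroids)
print(clusters)
```

The number of groups is fixed in advance by the number of starting centroids, which matches the pre-defined number of groups mentioned above.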

Human beings can easily recognize things or objects based on past learning experiences. **Pattern recognition aims to achieve similar results by computational methods. It uses different techniques and algorithms to find the most likely match for the inputs, considering their statistical variation.** Pattern recognition methods can be applied to supervised and unsupervised data sets. In supervised learning, the model is built on a set of training data with appropriately labeled instances and the resulting output. A learning procedure then generates a set of rules that can be generalized and applied to new data sets. The process is effective if the outcome for the new data is determined correctly. In unsupervised learning, the training data is not labeled, so different techniques are used to identify patterns in the data that can predict the correct output value for new data instances. Pattern recognition is used in multiple areas, of which the most relevant ones are speech recognition, optical character recognition, face recognition, landscape analysis in geology and other monitoring functions (Bishop, 2006).
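
The supervised workflow described above (learn from labeled training data, then check how well the learned rules generalize to new data) can be sketched with a nearest-centroid model. This is only one simple choice of learning procedure, and the feature values, the labels "A" and "B" and the function names are assumptions made for the example.

```python
import math
from collections import defaultdict

def train_centroids(labeled):
    """Learn one mean point per label from the labeled training instances."""
    groups = defaultdict(list)
    for features, label in labeled:
        groups[label].append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in groups.items()
    }

def predict(centroids, features):
    """Return the label whose learned mean point is closest to `features`."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

def accuracy(centroids, labeled):
    """Share of instances whose predicted label matches the true label."""
    hits = sum(predict(centroids, f) == label for f, label in labeled)
    return hits / len(labeled)

# Hypothetical training data and separate, previously unseen test data.
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
unseen = [((1.5, 1.5), "A"), ((6.5, 6.5), "B"), ((2, 2), "A"), ((5, 6), "B")]

model = train_centroids(training)
print(accuracy(model, unseen))  # 1.0 on this small, well-separated sample
```

Measuring accuracy on instances that were not used for training is what shows whether the learned rules generalize.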

In data mining, **an anomaly is a pattern that does not conform to an expected behavior, and anomaly detection is a method that aims to identify these peculiarities.** It has many use cases, such as bank fraud detection, cyber intrusion detection or predictive maintenance. Most algorithms start by defining the normal values, but the boundary between normal and deviant behavior is usually not exact. Anomaly detection uses different functions to examine which instances represent extreme values and fall outside the boundaries, but there are some challenges: the data might contain noise, and it might evolve with time, resulting in different groups of normal behavior. Both supervised and unsupervised techniques can be used, depending on whether the data set is labeled (Chandola, Banerjee, & Kumar, 2009).
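
One simple way to define the normal values is a band of a few standard deviations around the mean of known-good readings. The sketch below assumes such a baseline is available; the sensor readings and the factor `k=3.0` are illustrative choices, not values from the cited source.

```python
import statistics

def fit_normal_range(baseline, k=3.0):
    """Define 'normal' as mean +/- k standard deviations of the baseline."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return mean - k * spread, mean + k * spread

def find_anomalies(readings, low, high):
    """Return the readings that fall outside the normal range."""
    return [r for r in readings if r < low or r > high]

# Hypothetical sensor readings recorded during normal operation.
baseline = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.49]
low, high = fit_normal_range(baseline)

new_readings = [0.50, 0.53, 1.20, 0.47]
print(find_anomalies(new_readings, low, high))  # [1.2]
```

The fixed band illustrates both challenges mentioned above: noisy baselines widen it, and if normal behavior drifts over time the band has to be re-estimated.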

All the previous techniques and methods are based on the concept that larger data sets can result in better generic models and reduce uncertainty in predicting the outcome. Moreover, data sets are constantly evolving, and data mining analytics always has to reflect the current inputs. However, the high number of instances also requires tremendous computing capacity that only computers can provide.