Nobody can tell exactly how much data is generated each day, but there are reasonable estimates: in 2012, 2.5 billion gigabytes of data were generated every single day according to IBM (IBM, 2012), and the company predicts that this number will increase to 2.3 trillion gigabytes per day by 2020 (IBM, 2013). We live in a world where information overload – a term coined by the social scientist Bertram Gross – is no longer a futuristic fantasy, but part of our everyday lives. The available data and information are impossible for one person to collect, verify and understand (Hunt, 2014). As a result of the boom in available data, new and original methods have been developed to extract useful information from it. In every organization within every industry, a vast amount of data is generated, containing various kinds of information from internal and external sources such as transaction data, corporate documents, social media, sensors and other devices. Companies can take advantage of analyzing their data to satisfy customer needs, optimize their operations or obtain new sources of revenue (IBM, 2013).
Big data has many definitions, of which the most extensive was written by Frank J. Ohlhorst: “Big Data defines a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or the scale and growth of the data set. In other words, the data set has grown so large that it is difficult to manage and even harder to garner value out of it” (Ohlhorst, 2012). A survey of 154 C-suite global executives showed that the respondents had different ideas about what big data is. Some followed a business approach and looked at it as a new source of opportunities, some defined it from a technological perspective, others focused on the legal requirements of storing data or on the boom of social data, but they all agreed on one thing: the amount of data to be dealt with is huge (Gandomi & Haider, 2015) and, as Ohlhorst defined, too big to be handled effectively. In a world with 4 166 667 Facebook likes or 300 hours of YouTube video uploaded per minute (DOMO, 2015), we can agree that this amount of data is impossible to analyze manually.
However, there are scholars who do not regard big data as a novelty. Gupta (2015) argues that enterprises were already using large data sets before the Google Flu Trends article was published in 2008 and triggered an avalanche of big data related research and analyses (Gupta, 2015). Mark Barrenechea states that big data’s potential lies in unstructured data, such as conversations, social data and documents, and that the utilization of these types of information is the real innovation; but he also argues that with time it will become business as usual, just as structured data became the basis of everyday processes (Barrenechea, 2013).
Mayer-Schönberger and Cukier mention other characteristics of big data that differentiate it from previously available approaches. According to their research, the vast amount of data enables extensive analysis of the different features of an examined subject. As different datasets become available, combining them might result in a comprehensive perspective on the same phenomenon. Previously, researchers were forced to use samples for their investigations, relying on statistical sampling methods to select the individuals to be examined. Now they can analyze the whole population, and they can even access data that they could not have measured in the traditional way (Mayer-Schönberger & Cukier, 2013).
The second attribute is the messiness of the data. The authors state that big data encourages analysts to take the complexity and diversity of the world into consideration when they examine data sets, instead of striving for precise and perfectly accurate results from analyses made in an artificial and controlled environment.
Furthermore, they explain that it is impossible to create a big data setup that is precise and universal, but that is not its purpose. Analysts should accept that big data is messy, with varied quality and complex distribution. However, this imperfection is exactly the property that best reflects the real world, so it has to be incorporated into the analyses: researchers should not aim for precise, detailed and exact results, but rather provide general directions (Mayer-Schönberger & Cukier, 2013). Machine-generated data – as I will describe later – also reflects this messiness, but it is somewhat more structured than, for example, social data.
The theory of messiness is closely related to the third attribute of big data, namely correlation – the statistical connection between two values. If one of the values changes, the chances are high that the other will change too. Formerly, researchers had reservations about analyses built on correlation and did not consider them entirely reliable, as they could lead to inaccurate conclusions: the findings were either the result of coincidence or were distorted by external factors. However, big data can ease these concerns, because inferences can be drawn from enormous data sets. The authors explain that correlation cannot always unveil the precise causes of a phenomenon, but it can indicate the effects of an event, which can already be sufficient (Mayer-Schönberger & Cukier, 2013).
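To make the notion of correlation more concrete, the following minimal Python sketch (an illustration of my own, with invented values, not taken from the cited works) computes the Pearson coefficient between two hypothetical variables. A coefficient close to +1 indicates that the two series tend to rise together, without saying anything about which one causes the other – exactly the distinction between correlation and causation discussed above.

    # Illustrative sketch with invented values: Pearson correlation between two
    # hypothetical variables (e.g. daily temperature and ice cream sales).
    import numpy as np

    temperature = np.array([14.0, 18.5, 21.0, 24.5, 27.0, 30.5])  # degrees Celsius
    sales = np.array([120, 150, 180, 230, 260, 310])               # units sold per day

    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal element
    # is the Pearson coefficient between the two series (here close to +1).
    r = np.corrcoef(temperature, sales)[0, 1]
    print(f"Pearson correlation: {r:.2f}")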
Ohlhorst introduced the concept of the 4 V’s, four dimensions that need to be considered to get value out of big data: volume, velocity, variety and veracity (Ohlhorst, 2012). His model builds on a 2001 article by Gartner analyst Doug Laney that forecast the explosion of data. Laney mentioned three of the four V’s – data volume, data velocity and data variety (Laney, 2001) – which can be seen as the first description of big data.
Volume is the amount of data. It is no longer a matter of a few gigabytes; there are numerous cases where the available data takes up terabytes or even larger scales. To gain insight from this mass of data, special tools are required (Ohlhorst, 2012).
Velocity means that data does not have a stable state: it is constantly changing, and new data is generated and transferred in a matter of milliseconds. On average, around 6 000 tweets are sent every second (Twitter, Inc., 2015), and around 10 million trades are executed each day on NASDAQ (NASDAQ, 2015). Under such fast-paced conditions, real-time information can be retrieved from the data, which always reflects the current trends. However, Ohlhorst suggests that data should also be stored and made available from archival sources (Ohlhorst, 2012).
Variety means that data comes in many forms. Users can post text, pictures and videos, or share links. Even in more controlled environments, such as a factory, data can be generated by temperature sensors, productivity logs and reports made by technicians. Big data is often unstructured, and it encourages analysts to consider all data sources and types and to find correspondences between their values (Ohlhorst, 2012).
Veracity refers to abnormalities, noise and statistical errors in the data. When there are millions of records, it is unavoidable that some of them are irrelevant or incorrect and would distort the big picture, which could lead to false conclusions. Examples include inaccurate sensor measurements or the lack of credentials in social media. One of the greatest challenges of big data analysis is to clean the data and remove uncertainty about its veracity (Ohlhorst, 2012).
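As a minimal illustration of such cleaning – assuming a hypothetical temperature sensor log and an invented plausibility threshold, not an example from Ohlhorst – the sketch below simply discards readings that fall outside a physically plausible range. Real pipelines use far more sophisticated validation, but the principle of filtering out implausible records before analysis is the same.

    # Illustrative sketch with invented data: drop sensor readings whose values
    # fall outside an assumed plausible range before any further analysis.
    readings = [
        {"sensor": "T1", "temperature": 21.4},
        {"sensor": "T2", "temperature": -999.0},  # error code reported by the sensor
        {"sensor": "T3", "temperature": 22.1},
        {"sensor": "T4", "temperature": 480.0},   # physically implausible spike
    ]

    PLAUSIBLE_RANGE = (-50.0, 60.0)  # assumed valid range for this hypothetical sensor

    clean = [r for r in readings
             if PLAUSIBLE_RANGE[0] <= r["temperature"] <= PLAUSIBLE_RANGE[1]]
    print(f"Kept {len(clean)} of {len(readings)} records")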
Other V’s have also been mentioned lately in connection with big data: validity, volatility, viscosity, virality and value. The first is an extension of veracity and means that the data is correct and accurate for the intended use. Volatility refers to questions such as how long data should be stored and for how long it can be used in analyses. Viscosity measures the resistance in data flows, which can be caused by friction when different data sources are integrated; virality refers to the speed at which information is spread and shared from node to node (Wang, 2012). Value refers to the outcome, the value that can be extracted from the data sources through big data analysis.
However, these many V’s do not help to define big data; rather, they overcomplicate it. Seth Grimes argues that these “wanna-V backers and the contrarians mistake interpretive, derived qualities for essential attributes” (Grimes, 2013). He describes why the original 3 V’s (by Doug Laney, 2001) are sufficient to define big data, and argues that the additional properties are “analytics-derived qualities that relate more to data uses than to the data itself” (Grimes, 2013).
In this thesis I use Ohlhorst’s four dimensions, as his model seems to be the most widely used one for defining big data, and it includes veracity in addition to Laney’s dimensions, which is necessary to consider when dealing with noisy and highly unstructured data.