What is Big Data? - Part 2
Big data is messy because it is drawn from the real world and reflects its messiness. However, data can be generated in many forms, some of them more structured than others. Data is structured when it “has a predictable and regularly occurring format of data” (Inmon & Linstedt, 2015). It follows a particular schema and is typically managed by a database management system (DBMS). Structured data is built on a deliberate structure; it is well-defined and predictable. It is stored in relational databases as records, attributes, and indexes, so it is easier to analyze and query, even when it contains millions of individual items (Inmon & Linstedt, 2015). Examples include customer data (with fields of defined length and type for zip codes, country codes, email addresses, etc.), website visit statistics, and most of the systems in use today that store individual elements as individual records in a relational database. Structured data is not necessarily generated by people; in many cases it comes from machine logs or sensor data (which will be described in the following pages).
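As a minimal sketch of what a fixed schema buys you, the snippet below builds a small in-memory relational table with Python's standard `sqlite3` module and runs an aggregate query against it. The table name, column names, index name, and sample rows are illustrative assumptions, not taken from any real system.

```python
import sqlite3

# Hypothetical customer table: names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        zip_code TEXT NOT NULL,
        country_code TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO customers (email, zip_code, country_code) VALUES (?, ?, ?)",
    [
        ("anna@example.com", "1011", "HU"),
        ("bob@example.com", "10115", "DE"),
        ("carla@example.com", "1052", "HU"),
    ],
)
# An index on a well-defined attribute is what helps queries stay fast
# as the number of records grows.
conn.execute("CREATE INDEX idx_customers_country ON customers (country_code)")


def count_by_country(connection):
    """Return a dict mapping each country code to its number of customers."""
    rows = connection.execute(
        "SELECT country_code, COUNT(*) FROM customers "
        "GROUP BY country_code ORDER BY country_code"
    ).fetchall()
    return dict(rows)


counts = count_by_country(conn)
print(counts)
```

Because every record has the same typed fields, the query needs no knowledge of the individual rows; the schema alone is enough to write it.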
On the other hand, unstructured data is more natural, typically text-heavy data. It is “unpredictable and has no structure that is recognizable to a computer” (Inmon & Linstedt, 2015). It does not follow a pre-defined data structure or hierarchy, which results in slower querying and data analysis. Examples of unstructured data are books, electronic documents, more recently social media data, and the internet itself with its millions of web pages. Pictures, audio, and video files are also hard to process and organize, and they belong to this category as well. Specialized tools are required to analyze these types of sources, such as natural language processing with artificial intelligence, or special databases, such as NoSQL databases.
Semi-structured data is not organized into structured databases itself, but it carries some related information that computers can process. Examples are the metadata of pictures and documents, or tags used to categorize them (Inmon & Linstedt, 2015).
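To make the idea of metadata and tags concrete, here is a small sketch using Python's standard `json` module. The records, field names, and the `files_with_tag` helper are hypothetical; the point is only that the records do not share a fixed schema, yet the tags they do carry can still be processed.

```python
import json

# Hypothetical metadata records: each item carries different fields,
# and some carry no tags at all.
raw_records = """
[
  {"file": "beach.jpg", "tags": ["holiday", "sea"], "camera": "X100"},
  {"file": "report.pdf", "tags": ["finance"], "author": "J. Doe", "pages": 12},
  {"file": "scan_001.png"}
]
"""


def files_with_tag(records, tag):
    """Return the file names whose metadata contains the given tag."""
    # .get() with a default handles records that have no "tags" field.
    return [r["file"] for r in records if tag in r.get("tags", [])]


metadata = json.loads(raw_records)
holiday_files = files_with_tag(metadata, "holiday")
print(holiday_files)
```

The file contents themselves (the image or the document body) remain unstructured; only the attached metadata is queried here.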
While big data is usually associated with processing massive unstructured data sources, such as social networks, websites, and documents, it does not have to be restricted to human-generated information. A significant amount of data is generated by systems and machines. They create structured records, usually in the form of logs or measurements. If this data is collected, it can grow to enormous sizes, and techniques similar to those used on other big data sets can be applied to it. There is no exact definition of machine data (or machine-generated data), but most authors refer to it simply as data generated by machines during their operation. This kind of data is much less messy, as it follows a pre-defined structure rather than the real world’s disorder. Machine data is generated in every industry, from healthcare equipment through handheld devices to industrial machines, and it can be used to find patterns and clusters, predict trends, and unveil previously hidden connections. It has several business use cases, such as debugging, performance analysis, root-cause analysis, prediction, and fraud detection (Surange & Bansal, 2013).
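The pre-defined structure of machine logs is what makes them straightforward to parse. The sketch below assumes a hypothetical log format (timestamp, level, component, free-text message separated by spaces) and made-up sample lines; real systems will use their own formats. It parses each line with a regular expression and counts errors per component, a very small instance of the root-cause analysis use case mentioned above.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <LEVEL> <component> <message>".
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<component>\S+) (?P<message>.*)$"
)

# Illustrative sample data, not output from a real device.
SAMPLE_LOG = """\
2024-01-01T10:00:00Z INFO pump-1 pressure=2.1
2024-01-01T10:00:05Z ERROR pump-2 pressure sensor timeout
2024-01-01T10:00:10Z INFO pump-1 pressure=2.2
2024-01-01T10:00:15Z ERROR pump-2 pressure sensor timeout
"""


def parse_log(text):
    """Turn raw log text into a list of dicts, skipping lines that do not match."""
    parsed = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            parsed.append(match.groupdict())
    return parsed


def errors_per_component(log_records):
    """Count ERROR-level records for each component."""
    return Counter(r["component"] for r in log_records if r["level"] == "ERROR")


log_records = parse_log(SAMPLE_LOG)
error_counts = errors_per_component(log_records)
print(error_counts.most_common())
```

At real scale the same parse-then-aggregate pattern would typically run on a distributed processing framework rather than in a single Python process.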