In the 2018 Global Data Management Benchmark Report (Experian, 2018), Experian reported that 95% of C-level executives believe that data is an integral part of forming their business strategy.
Data is a persistent backdrop for all analytic activities in organizations, yet the tools, technologies, and processes surrounding its care and feeding are often overshadowed by the sexier aspects of its use. There is a burgeoning interest in real-time processing, IoT, and data as a service (DaaS), all of which rely on modernizing our analytic efforts through advances in automated and semi-automated processing. Enter machine learning.
Data management and its constituent parts (data integration, data quality, data governance, and master data management) are necessary but insufficient to extract full value from data, the lifeblood of the modern enterprise. The point is this: as I wrote in 2015 (Nelson, 2015), precious little attention is given to how good, clean, usable data gets to us, only that it does.
With the recent attention on machine learning, deep learning, and artificial intelligence, I want to highlight some opportunities for applying machine learning techniques to the painful processes that we still encounter in data management.
In that same 2018 Experian benchmark report, 89% of C-level executives agreed that inaccurate data is undermining their ability to provide an excellent customer experience. Furthermore, 84% agreed that the increasing volumes of data make it difficult to meet their regulatory obligations.
“With data volumes continuing to outpace our ability to manage it in traditional ways, new methods must be adopted to improve how we collect, ingest, prepare, transform, persist and experience data.” - Gregory S. Nelson, Author, The Analytics Lifecycle Toolkit
Data continues to grow at unprecedented rates. Intel has estimated that an autonomous car will generate roughly 4,000 GB of data per day. Modern multiplayer games generate over 50 billion rows of data per day. Nick Ismail, a reporter for Information Age, recently suggested, “Now that those tools are available, the pendulum will swing back to the demand side of the equation and force businesses to pay more attention to collection, management and storage of that increasingly valuable data.” (Ismail, 2017)
Organizations that use machine learning for the automated and semi-automated processing of data will set the standard; those that fail to adopt strategies to keep up with the appetite for analytics will lose in this modern-day data arms race.
Although data management has been practiced since the early 1960s, it wasn’t until the late 1980s that we began to modernize our approaches and truly manage data rather than simply “sucking its exhaust.” Bill Inmon and Ralph Kimball had competing approaches to how one should organize data for advantage in reporting and on-line analytical processing. In his 1996 book (Kimball, 1996), Ralph Kimball highlighted 34 critical subsystems that form the architecture for every ETL system. This book and the subsequent work formed the basis for how I think about modern data management activities. While discussions of IoT and sensor data were not yet part of the daily vernacular in the mid-1990s, these subsystems are ever-present in the way that we ingest, manage, and exploit data today.
In the figure below (from Nelson, 2018), I highlight the major components of the data pipeline and their relation to the analytics lifecycle. The figure illustrates the activities in the data pipeline that are undertaken every day to feed the analytics beast, that is, to get good, clean, reliable data to those who can turn the raw product (data) into value.
Figure 1: The data pipeline supports the analytics lifecycle (Source: ©The Analytics Lifecycle Toolkit, Wiley 2018)
The components of this pipeline are part of the data value chain. At a high level, the data value chain includes the following processes:
1. Data Engineering
o Collection or acquisition of data (e.g., sensors, web crawling, Internet of Things)
o Motion management (data in motion, real time / event stream processing, landing
2. Data Preparation
o Data organization and storage (databases, storage engines, file systems, models and
o Data processing (data warehousing, data integration, image processing, natural
3. Data Use
o Learning from data (machine learning, data mining, natural language understanding)
o Making predictions and decisions (e.g., information retrieval, intelligent systems,
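As a rough illustration of how these three stages hand off to one another, the following Python sketch chains them as plain functions. It is only a toy under stated assumptions: the function names (`engineer`, `prepare`, `use`, `run_pipeline`) and the sensor readings are hypothetical, and the "learning" step is nothing more than an average-based threshold standing in for a real model.

```python
from statistics import mean


def engineer(raw_readings):
    """Data Engineering: collect readings, dropping any that never arrived."""
    return [r for r in raw_readings if r is not None]


def prepare(readings):
    """Data Preparation: organize readings into records with a consistent type."""
    return [{"sensor": name, "value": float(value)} for name, value in readings]


def use(records):
    """Data Use: learn a simple baseline, then flag the records above it."""
    baseline = mean(rec["value"] for rec in records)
    return [rec["sensor"] for rec in records if rec["value"] > baseline]


def run_pipeline(raw_readings):
    """Chain the three stages in the order given by the data value chain."""
    return use(prepare(engineer(raw_readings)))


# Hypothetical raw input: mixed types and one missing reading.
raw = [("s1", "10.0"), None, ("s2", "30"), ("s3", 20)]
flagged = run_pipeline(raw)
print(flagged)
```

Each stage consumes only the output of the stage before it, which is why poor quality introduced early in the chain flows through to every later step.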
In the remainder of this three-part article, I will highlight some potential use cases for machine learning (as well as other techniques) to aid in the processing of data. First, however, it is important to understand what we mean by machine learning and some of the problems that we can solve with these approaches.