From self-driving cars to natural language understanding (think Siri and Alexa) and natural language generation (automated performance reviews, baseball game summaries), artificial intelligence (AI) has come into its own in recent years, thanks in large part to the availability of the data and computing power needed to train models. Machine learning is the part of AI built on the notion that computers can learn from data. What distinguishes machine learning from traditional computer programs is that the computer must learn patterns that it was not explicitly programmed to identify.
Examples of machine learning include:
• Pattern recognition: identify the type of event depicted in an image (event detection; e.g., a child with a basket and brightly colored eggs is determined to be a picture of an Easter egg hunt) or determine whether an image shows a known person (facial recognition)
• Prediction: predict the risk of hospital readmission based on electronic health record data or the transaction price for real estate
• Classification: identify the language and/or meaning of a given text (language identification); a binary outcome (fraud or no fraud); or authorship of a given text
• Recommender systems: identify similar products (e.g., Amazon) or movies (e.g., Netflix) based on past behaviors
• Sentiment analysis: determine whether a given text expresses a positive or negative sentiment towards some person or thing
• Anomaly detection: identify values that are out of range or otherwise anomalous in the data
In the language of machine learning, we often distinguish approaches by the task, or specific objective, that we intend to achieve with our algorithm. The two most common categories of tasks are supervised learning and unsupervised learning, a distinction that essentially refers to how we “teach” the machine. In supervised learning, we teach the machine by giving it examples, referred to as labeled data. Unsupervised learning is often used either as a form of automated data analysis or as automated signal extraction: we aren’t explicitly training the machine with known good examples, but rather letting the algorithm find the interesting nuggets.
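The difference between the two categories can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the labels, and the function names (`train_nearest_centroid`, `predict`, `two_means`) are hypothetical, and the two techniques shown (a nearest-centroid classifier and a one-dimensional two-cluster k-means) are simple stand-ins chosen for brevity, not methods prescribed by this article.

```python
from statistics import mean

# Hypothetical toy data: one numeric feature (say, a transaction amount)
# paired with a label supplied by a human.
labeled = [(12.0, "normal"), (15.0, "normal"), (14.0, "normal"),
           (95.0, "fraud"), (102.0, "fraud"), (99.0, "fraud")]


def train_nearest_centroid(examples):
    """Supervised: learn one centroid (mean value) per label."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Return the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (1-D k-means, k=2)."""
    low, high = min(values), max(values)
    if low == high:
        # All values are identical, so there is nothing to separate.
        return sorted(values), []
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(group_low), mean(group_high)
    return sorted(group_low), sorted(group_high)


# Supervised: the labels drive what the model learns.
centroids = train_nearest_centroid(labeled)
print(predict(centroids, 20.0))   # normal
print(predict(centroids, 90.0))   # fraud

# Unsupervised: the same values with the labels thrown away.
unlabeled = [value for value, _ in labeled]
print(two_means(unlabeled))       # ([12.0, 14.0, 15.0], [95.0, 99.0, 102.0])
```

In the supervised case the machine is told what "fraud" looks like. In the unsupervised case it finds the two groups on its own, and a person still has to decide what those groups mean.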
At this point, you might be asking yourself, “What does this have to do with data management? Isn’t machine learning used to develop things like predictive models?” If we boil machine learning down to its essential goals, they include:
• Predict an outcome
• Categorize similar things
• Identify patterns and relatedness among entities
• Detect anomalies
Given this, there are ample opportunities to predict, categorize, identify and detect in the world of data management. Consider where we tend to spend time and resources today in data management:
• Finding data that might be useful in solving a problem
• Combining and restructuring data so that it is suitable for analysis
• Determining what features are important to an analysis or automated algorithm
• Quickly integrating new data into our analysis
• Determining the quality of our data
• Identifying and eliminating incorrect values
• Prioritizing new data sources
• Defining the rules which govern data access and security
• Deciding how long to keep data before we archive
• Cataloging business rules for master data
• Determining data ownership
• Making use of unstructured data (without the painful NLP tasks)
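To make one item on this list concrete, identifying incorrect values is a natural fit for anomaly detection. The sketch below is a minimal illustration, not a full machine learning model: it flags values using a modified z-score based on the median absolute deviation. The function name `flag_outliers`, the sample ages, and the conventional 0.6745 scaling constant and 3.5 cutoff are all assumptions made for the example.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    so a single extreme value does not distort the baseline the way it
    would with a mean and standard deviation.
    """
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median, so the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical "age" column containing one data-entry error.
ages = [34, 29, 41, 38, 33, 36, 420, 31]
print(flag_outliers(ages))  # [420]
```

A rule like this does not need to be told in advance what range of ages is valid. It derives the expected range from the data itself, which is the property that makes the approach attractive when new data sources arrive faster than explicit business rules can be written.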
Machine Learning for Data Management
In the previous section, we highlighted a number of challenges that we deal with in our everyday use of data. Traditional methods for managing data are no longer sufficient. For example, manual mapping of data sources, explicit business rules for their transformation, and pre-programmed responses to poor data quality are not sustainable in a world of distributed data sources (Spark, Hadoop), cloud-based data and compute resources (Amazon Web Services, Microsoft Azure, and Google Cloud Platform), real-time event stream processing, and IoT data coming from a growing number of sensors.
Let us now turn our attention to practical use cases for machine learning in data management. I will focus this discussion by highlighting opportunities in each of the four areas in the data value chain:
Table 1: Opportunities for machine learning in the data value chain