Most organizations have been using data stored in systems designed for reporting and analysis, in some form or another, for at least two decades. The basis for the examples that follow comes directly from the challenges that organizations deal with on a regular basis. These examples should serve only to incubate ideas and novel approaches to improving data operations within organizations.
OPERATIONAL LOAD TIMES
One of the goals of any data operations team should be to meet or exceed service level agreements (SLAs) with data consumers. A common SLA might specify, for example, the maximum latency before new data is available for use. A challenge with such SLAs is that conditions change over time, especially the volume and veracity of the data. This often manifests as increasingly slow load times that exceed the operational window.
Figure 2: Example of Total Data Volume Over Time
To maintain adherence with our SLA, we want to proactively monitor load times, data growth, and seasonality characteristics for various data types throughout an organization. As data volumes increase, or load times vary from the expected, we can act in automated or semi-automated ways to remediate any issues before they adversely affect business operations.
One possible approach to predicting daily or weekly data volumes might include the use of regression to predict the size of, and/or time to transfer, data from a source system to a target system (or the data flow rate for a real-time system). In this example, we could use a supervised machine learning method such as regression, trained on historical information (load times, data volumes), to predict expected outcomes. Load times or data volumes that differ significantly from the expected value could be used to trigger an alert for investigation. Segmented regression or time series analysis could be used to handle varying data types or domains, or to predict seasonal differences in expected outcomes.
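As a minimal sketch of this idea, the snippet below fits an ordinary least-squares line to hypothetical (data volume, load time) history and flags any load whose observed duration deviates from the trend by more than a tolerance. The data values, the 20% tolerance, and the `load_time_alert` helper are all illustrative assumptions, not a prescribed implementation.

```python
from statistics import mean

def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
            sum((x - x_bar) ** 2 for x in xs)
    return slope, y_bar - slope * x_bar

def load_time_alert(volume_gb, observed_minutes, slope, intercept, tolerance=0.20):
    """Flag a load whose duration deviates more than `tolerance` from the prediction."""
    predicted = slope * volume_gb + intercept
    return abs(observed_minutes - predicted) / predicted > tolerance

# Hypothetical historical (volume in GB, load time in minutes) pairs.
volumes = [10, 20, 30, 40, 50]
minutes = [12, 22, 33, 41, 52]
slope, intercept = fit_linear(volumes, minutes)

print(load_time_alert(60, 61, slope, intercept))  # close to trend -> False
print(load_time_alert(60, 95, slope, intercept))  # well above trend -> True
```

The same structure extends naturally to segmented regression: fit one model per data type or domain and alert against the matching segment's trend.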
AUTOMATIC DATA VALUE CLASSIFICATION
Figure 3: Training and Prediction Phase for Supervised Learning Classification Problem
Historically, methods for managing data values often require manual intervention when new data values arrive. For example, new sales territories, counterparties, or discrete lots or methods in manufacturing may appear without notice in the data. Determining whether new values are suspect is left to manual data stewardship methods or coding changes to downstream business rules.
The goal of automatic classification is to determine whether a value is acceptable or suspect. Ideally, data investigation efforts would then be focused on high-value activities, with exception management prioritized accordingly.
There are a number of approaches to the classification of values. One approach would be to use a supervised machine learning method called k-NN, or k-nearest neighbors. You train the "machine" by supplying the model with known examples (labeled data); when faced with a new, never-before-seen value, the algorithm can classify it correctly. Similarly, for real-time applications, neural networks can be used to classify data points as they arrive.
For example, we might want to automatically determine the industry code for a new counterparty in a credit risk application. By looking at the patterns of known values for a company, we can use distance scores to identify the closest match without having to expressly lookup and code each new value.
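A minimal illustration of that idea follows, using made-up counterparty features (revenue and headcount) and hypothetical industry labels; a real credit risk application would use far richer features, but the distance-and-vote mechanics are the same.

```python
from collections import Counter
import math

def knn_classify(query, labeled_points, k=3):
    """Classify `query` by majority vote of its k nearest labeled neighbors."""
    by_distance = sorted(labeled_points,
                         key=lambda pair: math.dist(query, pair[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical counterparty features: (revenue in $M, employees in hundreds).
training = [
    ((500, 20), "banking"),
    ((450, 18), "banking"),
    ((50, 300), "manufacturing"),
    ((60, 280), "manufacturing"),
    ((55, 310), "manufacturing"),
]

print(knn_classify((52, 290), training))  # -> manufacturing
```

Note that in practice features on very different scales should be normalized before computing distances, or the largest-magnitude feature will dominate the vote.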
IDENTIFICATION OF DATA GAPS
Figure 4: Data Gaps Aren't Always an Obvious Problem
A particularly troublesome aspect of data management is when we have gaps in data. Unlike other types of data quality issues such as unexpected values, data gaps often require an appreciation of what's missing rather than what's there.
To support accurate use and analysis of data, we need to ensure that all records that should be present are indeed reflected in the dataset. Ideally, machine learning algorithms would predict whether a human decision maker would flag data points as suspect, and potentially predict the missing values themselves.
An example of data gaps might include the absence of sales data for a specific SKU in a retail application. Day in and day out, we get data values for all SKUs but due to an upstream issue, we might just not get the data feed from certain stores or for certain products.
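The simplest version of that check is a set difference between the SKUs we expect and the SKUs a day's feed actually delivered; the sketch below uses made-up SKU identifiers and a hypothetical feed format. A learned model, as in the Cagala study cited next, goes further by predicting what the missing values should have been.

```python
def find_gaps(expected_skus, daily_feed):
    """Return SKUs expected in a day's feed but absent from it (a data gap)."""
    received = {row["sku"] for row in daily_feed}
    return sorted(set(expected_skus) - received)

# Hypothetical feed: SKU-2 is silently missing from today's load.
expected = ["SKU-1", "SKU-2", "SKU-3"]
feed = [{"sku": "SKU-1", "units": 12}, {"sku": "SKU-3", "units": 4}]

print(find_gaps(expected, feed))  # -> ['SKU-2']
```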
Tobias Cagala (Cagala, 2017) applied supervised machine learning algorithms to determine whether the securities holdings data that is reported by German banks to the German central bank (Deutsche Bundesbank) are missing and what the missing values might be. In this paper, he compared the performance of two models: logistic regression and a random forest algorithm.
SELF-ORGANIZING ENTERPRISE DATA DICTIONARY FOR DATA DOMAINS
Another challenge in data management is determining the content of a data source. Often, when faced with a dataset that is new, or simply novel to an analyst, extensive exploratory data analysis is required to figure out its content and importance.
The goal of self-organizing data domains is to automatically classify new fields and encode them correctly in an enterprise data dictionary that can be used to quickly search for features that might be relevant to an analysis.
We can imagine a few ways to automatically interrogate new data: one based on brute-force methods of analyzing a dataset and cataloguing its content, and another based on automated self-discovery. In the case of the latter, imagine having a set of records automatically encoded to capture their topic or entity. For example, if we see a test type, date, patient ID, and value, we might classify this as lab results. Similarly, if we see constructs such as customer, product, and amounts, then we might consider this to be in the domain of orders.
In natural language processing, we can use methods such as Named Entity Recognition, Co-Reference Resolution, and Topic Modeling along with summary statistics for our categorical and numeric features. Furthermore, we can adopt supervised machine learning methods for classifying content.
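As a crude baseline before any NLP is applied, field names alone can suggest a domain. The sketch below scores each candidate domain by Jaccard overlap between a dataset's field names and a hypothetical signature set per domain; the `DOMAIN_SIGNATURES` keywords are assumptions for illustration, and a production dictionary would be learned from labeled examples rather than hand-coded.

```python
# Hypothetical keyword signatures per data domain (illustrative assumptions).
DOMAIN_SIGNATURES = {
    "lab_results": {"test_type", "date", "patient_id", "value"},
    "orders": {"customer", "product", "amount", "order_date"},
}

def classify_dataset(field_names):
    """Pick the domain whose signature best overlaps the field names (Jaccard)."""
    fields = set(field_names)
    def jaccard(signature):
        return len(fields & signature) / len(fields | signature)
    best = max(DOMAIN_SIGNATURES, key=lambda d: jaccard(DOMAIN_SIGNATURES[d]))
    return best if jaccard(DOMAIN_SIGNATURES[best]) > 0 else None

print(classify_dataset(["patient_id", "test_type", "value", "lab_name"]))
# -> lab_results
```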
Figure 5: Intelligent Entity Discovery
RECOMMENDATION OF NEW POTENTIAL DATA SOURCES
Often in analytics, we source data for explicit use in an analytics exercise. While often reliant on traditional, curated data sources that exist in the enterprise, tertiary data such as social media, public record data and other data sources can prove useful in analysis to add much needed context or additional content. In the figure below, we highlight this as it exists outside the traditional data pipeline.
Figure 6: Tertiary Data Outside the Data Pipeline
A goal of the Chief Data Officer should be the optimization of new data sources. That is, how shall we prioritize and streamline the onboarding of new data sources useful in analytics processes?
As an example application, a clustering model can be used to provide data domain or content recommendations, such as novel data sources or trending utilization of data domains that had not previously been on the radar.
Clustering is an unsupervised machine learning task that groups a set of data points into a cluster. A distance function determines the similarity between data points.
For example, we might use a centroid-based clustering (k-means) technique to learn the center position of each cluster. To classify a data point, you compare it to each centroid and assign it to the most similar one. When a data point is matched to a centroid, the centroid moves slightly toward the new data point. Applied continuously, this keeps the clusters updated in real time so the model always reflects the latest data points.
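The assign-then-nudge step described above can be sketched in a few lines. This is a simplified online k-means update with a fixed learning rate; real implementations typically decay the rate per centroid as more points are assigned, and the starting centroids here are arbitrary assumptions.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def online_kmeans_update(point, centroids, lr=0.1):
    """Assign `point` to its nearest centroid, then nudge that centroid toward it."""
    i = nearest(point, centroids)
    centroids[i] = tuple(c + lr * (p - c)
                         for p, c in zip(point, centroids[i]))
    return i

centroids = [(0.0, 0.0), (10.0, 10.0)]
cluster = online_kmeans_update((9.0, 11.0), centroids)
print(cluster, centroids[cluster])  # assigned to the second centroid, now shifted
```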
An application of this might include analyzing metrics about the data used in various model development activities and serving these as trending clusters of data domains.
Analysts or data scientists working on similar problem domains would be clustered together, and recommendations on data domains or features (from various feature engineering efforts) that have proven useful to others could be surfaced as part of a team knowledge management process.
Figure 7: Outlier Detection Problem
Finally, a common use for machine learning algorithms is outlier detection, more generally known as anomaly detection. As with other problems we experience in data management, the insidious nature of outliers is such that it often requires a thoughtful examination of the data to determine whether an anomaly is interesting or merely problematic for our analysis.
Much like our example earlier of gap detection, the goal is to correctly classify data points as anomalies.
We can use both statistical and machine learning approaches to detect data outliers and anomalies. One example might be determining whether the activity of a user or group of users is interesting (that is, they became aware of some potential new data or have recently been granted access) versus problematic (indicative of spyware or other malicious activity).
We might consider the use of unsupervised machine learning to create a multi-dimensional model of user activities (typical pattern of data queries, network traffic, time-of-day logins and activities, number of requests, size of query result sets, etc.). We can use Principal Component Analysis (PCA) for dimensionality reduction, or hierarchical clustering, to find users whose behavior differed during a given period. To determine whether actions and actors were malicious, we can then apply distance- and density-based outlier detection methods.
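As a simplified sketch of the final distance-based step, the snippet below standardizes two hypothetical activity features (queries per day, megabytes returned) into z-scores and flags users whose distance from the typical profile exceeds a threshold. The PCA reduction is omitted for brevity, and the activity numbers and threshold are illustrative assumptions.

```python
from statistics import mean, stdev
import math

def zscores(column):
    """Standardize one feature column to zero mean and unit variance."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def distance_outliers(rows, threshold=2.0):
    """Indices of rows whose Euclidean distance in z-score space exceeds `threshold`."""
    columns = list(zip(*rows))
    standardized = list(zip(*[zscores(c) for c in columns]))
    return [i for i, row in enumerate(standardized)
            if math.hypot(*row) > threshold]

# Hypothetical per-user activity: (queries per day, MB returned).
activity = [(20, 50), (22, 48), (19, 52), (21, 49), (200, 900)]
print(distance_outliers(activity))  # -> [4]
```

Whether user 4's spike is a newly granted access or exfiltration is exactly the interesting-versus-problematic judgment the text describes; the model only surfaces the candidate.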
In this article, we wanted to highlight some potential applications for modern analytic methods such as those in machine learning to help solve some of the challenges in data management. Annually, organizations spend millions of dollars in an attempt to acquire, ingest, transform and store data for use by data scientists. While I have merely touched on the potential range of applications for data management including data quality, data stewardship, and data governance, I hope this has spurred some ideas for how to best deliver on the promise of data for organizational use.