Redefining Data Quality: A Paradigm Shift in the Machine Learning Pipeline

Data quality issues are among the central challenges in machine learning, but what do we mean by “quality”? Here we redefine and reframe the term to clarify both the problem and its potential solutions.

Data quality can mean everything from how accurately a dataset reflects real-world events, to how consistently it is formatted, to whether it has missing values, missing records, or class imbalances.

What all of these definitions hint at (and what we ultimately care about in machine learning) is whether or not the data is useful for decision-making.

We propose a more fundamental reframing that encompasses all of these to some degree: Data quality means how well your data represents what you’re trying to predict.

This gets directly at our end goal while implying everything else that leads to it: accuracy, completeness, timeliness, relevance, consistency, validity, granularity, intelligibility, etc.

It likewise clarifies possible solutions by reorienting us towards using data and features that correlate with our prediction task, capturing their statistical behavior and relationships, and representing them explicitly to our model in terms it can understand.
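As a loose illustration of what “correlates with our prediction task” can mean in practice (this is not part of Dark Matter, just a generic sketch using scikit-learn), one might score each raw feature’s linear and non-linear dependence on the target:

```python
# Illustrative only: score how strongly each raw feature relates to the target,
# one simple proxy for "does this data represent what we're trying to predict?"
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)

# Linear signal: absolute Pearson correlation between each feature and the target.
linear_signal = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Non-linear signal: mutual information, which also picks up non-linear dependence.
nonlinear_signal = mutual_info_classif(X, y, random_state=0)

for j, (r, mi) in enumerate(zip(linear_signal, nonlinear_signal)):
    print(f"feature {j}: |corr|={r:.2f}  mutual_info={mi:.2f}")
```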

Addressing data quality issues has to date required increasingly complex modeling approaches.

Machine learning is used to make predictions when we need to uncover patterns and relationships in complex datasets that may not be immediately apparent or fully understood. Even with extensive data, the underlying relationships can be too intricate or dynamic for traditional methods to capture effectively.

Machine learning enables us to model these complexities and make informed predictions even when the data is incomplete or imperfect, or the problem is too complex for simpler approaches.

By leveraging advanced modeling techniques, machine learning can generate accurate predictions, effectively bridging gaps in the data and compensating for its limitations. So it’s no surprise that the industry has placed a great deal of emphasis on developing ever more sophisticated models that can capture the underlying patterns in data.

Deep learning techniques like CNNs, GANs, GNNs, autoencoders, and transformers excel at uncovering complex, non-obvious relationships, yielding high predictive accuracy. But they’re resource-intensive in terms of cost, talent, and time, and they don’t always provide significant gains over classical methods.

Even massive foundation models are limited by the quality of their data, compensating by consuming vast amounts of it. While there have been consistent improvements to date, thanks to increasing processing power, new datasets, and inference optimizations like RAG, they will eventually plateau without fundamental innovation.

At the end of the day, models will hit a wall in terms of the “new” information they can extract, and long before that the marginal gains will be far too small to justify the amount of additional data required to improve performance.

Shifting the focus to creating richer representations of data is both more effective and more efficient.

This all raises the question: instead of relying on modeling to capture relationships in our data, what if we could surface them in the data itself? What if, before the model even sees the data, we could make those relationships explicit in a simplified format that’s more interpretable to the model?

And what if that could be done for complex, non-linear relationships without requiring resource-intensive deep learning approaches?

Those questions frame a new approach to our main focus, improving data quality, one centered on representation learning.

Instead of relying on the end model to do all this work, we could create an intermediate model that learns to build that representation from the data first. Its goal would differ from that of the end model responsible for making predictions: its focus would be solely on learning how to produce data that meets the criteria for accurate prediction.

This representation learning model would take the input data after feature engineering and use a new objective function to approximate relationships in the data, converging on nearly orthogonal features that are strongly correlated with the prediction task while generalizing readily to new data without overfitting.
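To make that idea concrete, here is a minimal, hypothetical sketch of what such an objective could look like. It is not Dark Matter’s actual objective function; it simply illustrates rewarding learned features that correlate with the target while penalizing redundancy between them, written here in PyTorch with a toy “enhancer” network:

```python
# A hedged sketch (not Dark Matter's objective): learn features that are
# predictive of the target while staying close to mutually orthogonal.
import torch

def representation_loss(Z, y, ortho_weight=0.1):
    """Z: (n, k) learned features; y: (n,) binary target."""
    Zc = Z - Z.mean(dim=0, keepdim=True)
    yc = (y.float() - y.float().mean()).unsqueeze(1)   # (n, 1)

    # Predictive term: reward each learned feature for correlating with the target.
    cov_zy = (Zc * yc).mean(dim=0)
    corr_zy = cov_zy / (Zc.std(dim=0) * yc.std() + 1e-8)
    predictive = -corr_zy.abs().mean()

    # Redundancy term: penalize off-diagonal correlation between learned features,
    # pushing them towards near-orthogonality.
    std = Zc.std(dim=0, keepdim=True)                   # (1, k)
    corr_zz = (Zc.T @ Zc) / (Zc.shape[0] - 1) / (std.T @ std + 1e-8)
    off_diag = corr_zz - torch.diag(torch.diagonal(corr_zz))
    redundancy = (off_diag ** 2).mean()

    return predictive + ortho_weight * redundancy

# Toy usage: a small "feature enhancer" trained with this objective.
torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] * X[:, 1] > 0).long()                      # a non-linear relationship
enhancer = torch.nn.Sequential(
    torch.nn.Linear(20, 16), torch.nn.Tanh(), torch.nn.Linear(16, 4)
)
optimizer = torch.optim.Adam(enhancer.parameters(), lr=1e-2)
for _ in range(200):
    optimizer.zero_grad()
    loss = representation_loss(enhancer(X), y)
    loss.backward()
    optimizer.step()
```

In a setup like this, the learned features produced by the intermediate model, rather than the raw inputs, would become the training data handed to the end model.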

These new features would roll up the granular, non-linear relationships present in the data, relationships that would otherwise be difficult to extract without a deep learning approach. They could then be represented simply and clearly to the end model, presenting it with statistically optimal training data that is both more predictively powerful and more intelligible to the model.

From the model’s perspective, this would look like “new” information. Of course, no new information would have been created; the information that matters would simply have been made more explicit to the model, correcting for the simplifying assumptions every model must make to make sense of its training data.

A transformative new step in the data science pipeline.

These questions sparked the research that led us to create Dark Matter, which represents a fundamentally new step in the data science pipeline that we call Feature Enhancement: elevating the predictive power of your data through representation learning before it’s presented to your end model.

Feature Enhancement gives your model a more complete picture for making sound predictions by closing in on more of the relationships that matter and helping the model clearly differentiate between signal and noise before it ever starts to train on your data.
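As a rough sketch of where such a step would sit in a pipeline, here is a generic scikit-learn setup with a placeholder representation step in front of the end model. The `LearnedRepresentation` transformer is hypothetical and uses PCA purely as a stand-in for the representation-learning step; it is not Dark Matter’s method or API:

```python
# Generic sketch of a "feature enhancement" slot in a modeling pipeline.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class LearnedRepresentation(BaseEstimator, TransformerMixin):
    """Placeholder for a representation-learning step; PCA is a stand-in only."""
    def __init__(self, n_components=8):
        self.n_components = n_components

    def fit(self, X, y=None):
        self.model_ = PCA(n_components=self.n_components).fit(X)
        return self

    def transform(self, X):
        return self.model_.transform(X)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("enhance", LearnedRepresentation(n_components=8)),  # representation step before the end model
    ("predict", LogisticRegression(max_iter=1000)),       # the end model trains on enhanced features
])
# pipeline.fit(X_train, y_train) would fit the representation step, then the end model, in sequence.
```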

It differs from other representation learning and embedding approaches, and it provides a more scalable and powerful way of addressing data quality than techniques like data labeling and synthetic data.

Our hope is that it will become an essential part of every data scientist’s toolset and every pipeline, lowering barriers to state-of-the-art performance and enabling new machine learning capabilities downstream.

If you’re interested in learning more, check out our case studies, or get in touch to find out if Dark Matter is a good fit for your goals.

Ready for better model performance?

Get in touch to learn more and book a demo. 
