Confounding occurs when the relationship between two variables is distorted by one or more additional variables that are related to both. For example, if you’re studying the effects of exercise on weight loss but don’t take diet into account, you could mistakenly infer that exercise has a stronger or weaker relationship with weight than it really does.
Once diet is included in the model, it becomes an observed confounder: a known variable that can be controlled for in order to isolate the relationship between the other variables, even if you don’t yet understand why it matters.
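To make this concrete, here is a small simulation of the exercise, diet, and weight example. The effect sizes and distributions are invented purely for illustration: diet drives both exercise and weight change, so a regression that omits diet misattributes part of diet’s effect to exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process: diet quality influences both
# exercise habits and weight change (all effect sizes are made up).
diet = rng.normal(size=n)                                # confounder
exercise = 0.5 * diet + rng.normal(size=n)               # exposure
weight_change = -1.0 * exercise - 2.0 * diet + rng.normal(size=n)  # outcome

# Omitting the confounder: the exercise coefficient absorbs part of
# diet's effect and is biased away from the true value of -1.0.
naive = LinearRegression().fit(exercise.reshape(-1, 1), weight_change)
print("without diet:", naive.coef_[0])

# Controlling for the observed confounder recovers the true effect.
adjusted = LinearRegression().fit(np.column_stack([exercise, diet]), weight_change)
print("with diet:   ", adjusted.coef_[0])
```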
But what about variables that we don’t know exist? Or variables that we can’t measure, or that are otherwise missing from our dataset? These unobserved confounders are difficult to deal with because they can’t be directly controlled for.
Randomized controlled trials, instrumental variables, and sensitivity analysis are all ways of coping with unobserved confounders, but machine learning requires even more powerful tools to account for them, especially with sparse, noisy, or otherwise problematic datasets. For ML to work in the real world, it needs to capture the real relationships at play.
Neural Nets Excel, But Often Prove Impractical at Capturing Confounders
Sophisticated deep learning models and techniques excel at capturing complex, non-linear relationships. Feedforward neural nets, CNNs, RNNs, and transformers all have strengths in this context. They can capture layers and hierarchies of abstraction from the data, leverage non-linear activation functions, and model dependencies over time or in sequence.
Neural networks automatically learn representations of the input data through multiple layers, with each layer learning increasingly abstract features of the data. For example, in image recognition, early layers might learn simple features like edges, while deeper layers learn more complex patterns like shapes or objects. This approach allows the model to capture complex, non-linear relationships that may include confounding effects.
Deep learning models use non-linear activation functions (e.g., ReLU, sigmoid, tanh) in their neurons. These functions introduce non-linearity into the model, allowing it to fit complex relationships between inputs and outputs. This capability is crucial for handling confounders that interact with multiple variables.
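As a minimal sketch of the idea, assuming PyTorch and arbitrary layer sizes, a feedforward network is just a stack of linear layers separated by non-linear activations, with each layer building on the representation produced by the one before it:

```python
import torch
import torch.nn as nn

# A small feedforward network: each Linear layer re-combines the features
# produced by the previous layer, and the ReLU non-linearity lets the
# model represent interactions a purely linear model cannot.
model = nn.Sequential(
    nn.Linear(20, 64),   # raw features -> first learned representation
    nn.ReLU(),
    nn.Linear(64, 64),   # increasingly abstract intermediate features
    nn.ReLU(),
    nn.Linear(64, 1),    # final prediction head
)

x = torch.randn(8, 20)   # a batch of 8 examples with 20 features each
print(model(x).shape)    # torch.Size([8, 1])
```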
Theoretically, given enough data and time, a deep learning model built on these architectures could uncover all of the hidden relationships present in a dataset. But for large, complex datasets that is impractical due to time and compute constraints, and it is ruled out entirely for applications with limited, high-dimensional, or sparse data, which are common challenges in the real world. In practice, neural nets require constant retraining and tuning to overcome the effects of unobserved confounders.
The Curse of Dimensionality
Further complicating the situation, as the number of dimensions (features) in a dataset increases, the volume of the space increases exponentially. Data points become sparse and distant from one another, making it harder to identify patterns or clusters in the data.
As data spreads out over a larger and larger space, the notion of “nearness” loses its meaning, and algorithms like k-nearest neighbors (k-NN) become less effective. Common distance metrics match our low-dimensional intuition, but in high-dimensional space the distances between points tend to concentrate, making it difficult to distinguish near neighbors from far ones.
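A toy simulation (random uniform points, no particular dataset in mind) shows how quickly this happens: as the number of dimensions grows, the relative gap between a point’s nearest and farthest neighbor shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))          # 1,000 uniform points in d dimensions
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast between farthest and nearest neighbor: as d grows,
    # this ratio approaches 0 and "near" vs. "far" becomes indistinguishable.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.2f}")
```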
With more space also comes more ways to partition it, so models become overly complex and fit the noise in the data rather than the underlying patterns. The result is overfitting, where models perform well on training data but poorly on unseen data.
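The effect is easy to reproduce: give a high-capacity model many noise features and it will happily memorize the training set while learning nothing that generalizes. This contrived scikit-learn example uses labels that have no relationship to the features at all.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 500))        # many noise features, few samples
y = rng.integers(0, 2, size=300)       # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # near 1.0
print("test accuracy: ", model.score(X_test, y_test))    # near 0.5 (chance)
```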
While there are techniques for addressing high dimensionality, like PCA, t-SNE, feature selection, and regularization, all of them require expertise and time that teams may not have to spare. And unfortunately, they simply may not work for every problem, even if you have the requisite expertise.
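For example, one common mitigation is to project the data onto a small number of principal components before modeling. The sketch below uses scikit-learn on synthetic data, with an arbitrary choice of 20 components; finding the right number for a real problem is part of the expertise and tuning burden described above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic high-dimensional data: 500 features, only a handful informative.
X, y = make_classification(n_samples=500, n_features=500, n_informative=10,
                           random_state=0)

# Reduce to 20 principal components, then fit a regularized linear model.
model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())
```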
Thankfully, there is an alternative approach that we’ve found to work across domains, requiring less feature engineering and data prep while yielding the same or better model performance.
A New Form of Representation Learning
Instead of relying on resource-intensive techniques to compensate for unobserved confounders (especially in sparse, limited, high-dimensional datasets), we can achieve better results by assuming the existence of these unknown variables and learning to approximate them.
This assumption lets us create richer representations of your data before modeling, accounting for variables that are missing from your dataset yet indirectly influence it, and thus “show up” in it in ways that machine learning can capture.
We’ve developed a data representation model that serves as an intermediate step between feature engineering and modeling. It uses a new objective function that learns how to extract relationships in your data to build statistically optimal representations from multiple “angles.”
In the process, we can capture complex, non-linear relationships (including unobserved confounders) in simple, statistical terms your model can understand.
These representations form a new embedding that conveys these relationships to your model clearly and simply, providing it with more usable information without requiring extensive feature engineering or domain expertise, reducing engineering time and freeing your team to focus on experimentation.
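The details of our objective function are beyond the scope of this post, but the overall pattern, learning an intermediate representation and handing it to a downstream model, may already look familiar. The sketch below uses a plain PyTorch autoencoder purely as an analogy; it is not our method, and every layer size and hyperparameter here is arbitrary.

```python
import torch
import torch.nn as nn

# Purely illustrative: learn a compact latent representation of the raw
# features, then hand that embedding to the downstream model.
class Autoencoder(nn.Module):
    def __init__(self, n_features, n_latent):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                    nn.Linear(64, n_latent))
        self.decode = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(),
                                    nn.Linear(64, n_features))

    def forward(self, x):
        z = self.encode(x)
        return z, self.decode(z)

X = torch.randn(256, 50)                  # raw, possibly sparse features
model = Autoencoder(n_features=50, n_latent=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):                      # learn the intermediate representation
    z, recon = model(X)
    loss = nn.functional.mse_loss(recon, X)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    embedding, _ = model(X)               # pass `embedding` to the downstream model
```

The key design choice this analogy shares with our approach is the separation of concerns: the representation is learned as its own step between feature engineering and modeling, so the downstream model consumes a compact embedding rather than the raw, sparse feature space.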