A growing number of causal machine learning estimators target conditional average treatment effects (CATEs). This paper reviews methods for testing whether these estimators detect genuine systematic effect heterogeneity or merely produce sophisticated noise. We present a unifying theoretical framework that encompasses various approaches in the literature, such as Generic ML and rank-weighted average treatment effects, as special cases. Using both simulated and real-world datasets, we evaluate the statistical power of these methods, offering practical guidance for researchers and practitioners seeking to validate their CATE models.
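To illustrate the kind of validation exercise surveyed here, the following is a minimal sketch (not taken from the paper) of a BLP-style heterogeneity test in the spirit of the Generic ML framework: a CATE proxy is fit on one half of a simulated experiment, and on the held-out half the outcome is regressed on the demeaned treatment interacted with the demeaned proxy. A significant interaction coefficient indicates that the proxy tracks genuine heterogeneity rather than noise. All variable names and the simple difference-of-regressions proxy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 1))
p = 0.5                                  # known treatment probability
D = rng.binomial(1, p, size=n)
tau = 1.0 + 2.0 * X[:, 0]                # true heterogeneous effect
Y = X[:, 0] + tau * D + rng.normal(size=n)

# Sample split: fit a simple CATE proxy on the first half only.
train, test = np.arange(n // 2), np.arange(n // 2, n)

def ols(Xm, y):
    # OLS of y on [1, Xm]; returns coefficient vector.
    Z = np.column_stack([np.ones(len(Xm)), Xm])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

# Illustrative proxy: separate outcome regressions by treatment arm.
b1 = ols(X[train][D[train] == 1], Y[train][D[train] == 1])
b0 = ols(X[train][D[train] == 0], Y[train][D[train] == 0])
Zt = np.column_stack([np.ones(len(test)), X[test]])
S = Zt @ b1 - Zt @ b0                    # predicted CATE on held-out half

# BLP regression on the held-out half:
#   Y = a + c1*(D - p) + c2*(D - p)*(S - mean(S)) + error
# H0: c2 = 0 (no detectable systematic heterogeneity).
Dc = D[test] - p
Sc = S - S.mean()
Z = np.column_stack([np.ones(len(test)), Dc, Dc * Sc])
beta, *_ = np.linalg.lstsq(Z, Y[test], rcond=None)
resid = Y[test] - Z @ beta
sigma2 = resid @ resid / (len(test) - Z.shape[1])
cov = sigma2 * np.linalg.inv(Z.T @ Z)
t_het = beta[2] / np.sqrt(cov[2, 2])
print(f"heterogeneity coefficient: {beta[2]:.2f}, t-stat: {t_het:.1f}")
```

Because the simulated effect is strongly heterogeneous and the proxy is consistent for it, the interaction coefficient comes out near one with a large t-statistic; with a pure-noise proxy it would be centered at zero, which is exactly the distinction the tests reviewed in this paper formalize.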










