Abstract. The Earth’s land surface and the atmosphere are strongly interlinked through the exchange of energy and matter. This coupled behaviour causes various land–atmosphere feedbacks, and an insufficient understanding of these feedbacks contributes to uncertain global climate model projections. For example, a crucial role of the land surface in exacerbating summer heat waves in midlatitude regions has been identified empirically for high-impact heat waves, but individual climate models differ widely in their respective representation of land–atmosphere coupling. Here, we compile an ensemble of 54 combinations of observations-based temperature (T) and evapotranspiration (ET) benchmarking datasets and investigate coincidences of T anomalies with ET anomalies as a proxy for land–atmosphere interactions during periods of anomalously warm temperatures. First, we demonstrate that a large fraction of state-of-the-art climate models from the Coupled Model Intercomparison Project (CMIP5) archive produces systematically too frequent coincidences of high T anomalies with negative ET anomalies in midlatitude regions during the warm season and in several tropical regions year-round. These coincidences (high T, low ET) are closely related to the representation of temperature variability and extremes across the multi-model ensemble. Second, we derive a land-coupling constraint based on the spread of the T–ET datasets and consequently retain only a subset of CMIP5 models that produce a land-coupling behaviour that is compatible with these benchmark estimates. The constrained multi-model simulations exhibit more realistic temperature extremes of reduced magnitude in present climate in regions where models show substantial spread in T–ET coupling, i.e. biases in the model ensemble are consistently reduced. Multi-model simulations for the coming decades likewise display decreased absolute temperature extremes in the constrained ensemble.
On the other hand, the differences between projected and present-day climate extremes are affected to a lesser extent by the applied constraint, i.e. projected changes are reduced locally by around 0.5 to 1 °C – but this remains a local effect in regions that are highly sensitive to land–atmosphere coupling. In summary, our approach offers a physically consistent, diagnostic-based avenue to evaluate multi-model ensembles and subsequently reduce model biases in simulated and projected extreme temperatures.
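As an illustration only, the two steps described above (a T–ET coincidence diagnostic, then a constraint derived from the benchmark spread) might be sketched as follows. The abstract does not specify the warm-anomaly threshold or the exact form of the constraint, so the quantile threshold `t_quantile`, the min–max acceptance range, and the function names `coincidence_rate` and `constrain_models` are all assumptions made for this sketch, not the method used in the study.

```python
import numpy as np


def coincidence_rate(t_anom, et_anom, t_quantile=0.9):
    """Fraction of anomalously warm time steps with a negative ET anomaly.

    ASSUMPTION: "anomalously warm" means T anomaly above the given quantile
    of the series; the source does not state the threshold.
    """
    t_anom = np.asarray(t_anom, dtype=float)
    et_anom = np.asarray(et_anom, dtype=float)
    if t_anom.shape != et_anom.shape:
        raise ValueError("T and ET anomaly series must have the same shape")

    # Select the warm time steps, then count how often ET is below normal.
    warm = t_anom > np.quantile(t_anom, t_quantile)
    if not warm.any():
        return float("nan")
    return float(np.mean(et_anom[warm] < 0.0))


def constrain_models(model_rates, benchmark_rates, tolerance=0.0):
    """Keep models whose coincidence rate lies within the benchmark spread.

    ASSUMPTION: the "spread" is taken as the min-max range of the benchmark
    estimates, optionally widened by `tolerance`.
    """
    lower = min(benchmark_rates) - tolerance
    upper = max(benchmark_rates) + tolerance
    return sorted(
        name for name, rate in model_rates.items() if lower <= rate <= upper
    )


if __name__ == "__main__":
    # Synthetic series for one grid cell: ET tends to drop when T is high.
    rng = np.random.default_rng(0)
    t = rng.normal(size=500)
    et = -0.5 * t + rng.normal(size=500)
    print("coincidence rate:", coincidence_rate(t, et))

    # Hypothetical model names and rates, purely for demonstration.
    models = {"model_a": 0.45, "model_b": 0.70, "model_c": 0.95}
    benchmarks = [0.40, 0.55, 0.72]
    print("retained models:", constrain_models(models, benchmarks))
```

In practice such a diagnostic would be evaluated per grid cell and season for each of the 54 T–ET dataset combinations and for each model, and the retained subset would then be used to recompute the ensemble statistics of temperature extremes.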