research

Our research interests.

Methodological interests in trustworthy AI and beyond:

  • Interpretable machine learning
  • Performance generalisation and performance prediction
  • Subgroup discovery
  • Bias detection and bias mitigation in medical AI
  • Disentangled representation learning
  • Simulation-based inference for physiological models

Application areas:

  • Medical image analysis, for example image classification, segmentation, and disease progression modelling
  • Digital twins in diabetes
  • Interpretation of continuous glucose monitoring (CGM) data, for example CGM forecasting

Check out some example projects (past and ongoing) below. For a comprehensive list of publications, see Google Scholar.

Subgroup discovery for monitoring ML performance in hidden stratifications

Traditional subgroup analysis, a common practice in medical research, often falls short when evaluating deep learning models for medical imaging. Clinical trials and medical AI studies typically stratify results by demographic attributes (e.g., age, sex, or ethnicity). While useful, this practice can mask significant performance variations: the failure modes of medical AI models do not necessarily align with these standard metadata categories, so evaluations restricted to them may miss poorly served subgroups.

We study an alternative approach: using subgroup discovery methods to enrich performance analysis. Subgroup discovery methods uncover hidden patterns and systematic groupings beyond traditional metadata, providing deeper and more meaningful insights into AI model performance. Many challenges remain, in particular the validation of subgroup discovery, since ground-truth subgroup labels inherently do not exist in real data. We argue that subgroup discovery can be an effective and easily implemented tool to enhance the performance validation and monitoring of ML systems in medicine.
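
To make the idea concrete, here is a minimal, hypothetical sketch of subgroup discovery for performance analysis: a shallow decision tree is fit to the per-sample correctness of a trained classifier, and its low-accuracy leaves are read off as candidate underperforming subgroups. This is an illustrative stand-in assuming scikit-learn and generic input features, not the specific discovery method used in our paper.

```python
# A minimal, hypothetical sketch: surface candidate underperforming subgroups by
# fitting a shallow decision tree to the per-sample correctness of a trained model.
# This is an illustrative stand-in, not the subgroup discovery method from our paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def discover_candidate_subgroups(features, y_true, y_pred, max_depth=3, min_leaf=50):
    """Return human-readable rules whose leaves are candidate subgroups."""
    correct = (y_true == y_pred).astype(int)             # per-sample correctness
    tree = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=min_leaf)
    tree.fit(features, correct)                           # partition the data by correctness
    names = [f"f{i}" for i in range(features.shape[1])]   # placeholder feature names
    return export_text(tree, feature_names=names)

# Hypothetical usage with random placeholder data (embeddings or metadata):
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)        # ground-truth labels
y_hat = rng.integers(0, 2, size=1000)    # model predictions on the same samples
print(discover_candidate_subgroups(X, y, y_hat))
```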

Read more and try our interactive visualization in the blog post!

  • Bissoto, A., Hoang, T.-D., Flühmann, T., Sun, S., Baumgartner, C. F., & Koch, L. M. (2025). Subgroup Performance Analysis in Hidden Stratifications. Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).

Simulation-Based Inference for digital twins in Type 1 Diabetes

Type 1 Diabetes (T1D) affects over 9 million people worldwide and requires frequent insulin injections and continuous monitoring of blood glucose levels with wearable continuous glucose monitoring (CGM) devices. The dynamics between glucose, meal intake, and insulin over time can be described by complex physiological models consisting of systems of differential equations, where model parameters (e.g., insulin sensitivity) can be highly patient-specific. Identifying these parameters from observed data enables the creation of a digital twin (DT) of an individual’s metabolic system, supporting treatment planning, prediction, and real-time adaptation.

Parameter estimation in such models is a challenging inverse problem. Existing approaches, such as Markov Chain Monte Carlo (MCMC), are computationally expensive, non-amortized, and often rely on steady-state assumptions for the initial conditions that may not hold in practice. We are therefore working on a Simulation-Based Inference (SBI) method based on Neural Posterior Estimation (NPE), which enables efficient inference of both physiological parameters and initial conditions. Unlike traditional methods, SBI provides fast, amortized inference and produces full posterior distributions, allowing uncertainty quantification and more reliable decision-making. Future extensions will explore robustness to model misspecification and missing CGM data.
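
As an illustration of the inference workflow, the sketch below runs Neural Posterior Estimation with the open-source `sbi` package on a deliberately simplified, hypothetical glucose model with only two parameters (insulin sensitivity and basal glucose). The actual physiological model, parameterisation, and inference pipeline in our work are considerably more involved.

```python
# A minimal sketch of Neural Posterior Estimation for a toy glucose model,
# assuming the open-source `sbi` package. The simulator below is a deliberately
# simplified stand-in, not the physiological model used in our work.
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

def simulate_cgm(theta, n_steps=48, dt=5.0):
    """Toy glucose trace: exponential return to basal glucose after a meal.
    theta = (insulin_sensitivity, basal_glucose); units are illustrative."""
    s_i, g_b = theta[..., 0], theta[..., 1]
    g = g_b + 80.0                                   # glucose excursion after a meal
    trace = []
    for _ in range(n_steps):
        g = g - dt * s_i * (g - g_b)                 # decay towards the basal level
        trace.append(g)
    return torch.stack(trace, dim=-1)

prior = BoxUniform(low=torch.tensor([0.005, 70.0]),
                   high=torch.tensor([0.05, 140.0]))

theta = prior.sample((5000,))                        # simulated training set
x = simulate_cgm(theta)

inference = SNPE(prior=prior)                        # amortized posterior estimator
density_estimator = inference.append_simulations(theta, x).train()
posterior = inference.build_posterior(density_estimator)

x_obs = simulate_cgm(torch.tensor([0.02, 100.0]))    # stand-in for observed CGM data
samples = posterior.sample((1000,), x=x_obs)         # posterior over (s_i, g_b)
print(samples.mean(dim=0), samples.std(dim=0))
```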

  • Hoang, T.-D., Bissoto, A., Naik, V. V., Flühmann, T., Shlychkov, A., Garcia Tirado, J., & Koch, L. M. (2025). A Real-Time Digital Twin for Type 1 Diabetes using Simulation-Based Inference. Proc. Digital Twin for Healthcare (DT4H).

Model monitoring through label-free performance estimation

The performance of classification models can deteriorate significantly between training and deployment data because of distribution shifts. Such shifts may arise for various reasons, for example when a model is adopted from a different institution or simply over time. For safe clinical deployment, continuous performance monitoring is therefore essential and can anticipate model failures before they affect patients. Ideally, performance is assessed by computing task-specific metrics such as accuracy, recall, and the area under the receiver operating characteristic curve (AUC). However, this is only possible if the new data is labeled, and labeling is costly and, especially in the clinical setting, rarely done. Consequently, performance monitoring becomes a problem of label-free metric estimation.

Different approaches have emerged to meet this challenge. Some methods use distance metrics to quantify the dataset shift and relate it to changes in the model’s accuracy. Others retrain the model on the deployment data with pseudo-labels and then assess the new model’s performance on the initial data. Because both groups of methods require additional resources in the form of extra labeled datasets or model retraining, confidence-based performance estimation is a promising alternative: the model’s softmax outputs on the target data are leveraged either to estimate accuracy directly or by comparison with the softmax outputs on the initial data.

These methods are predominantly designed to estimate accuracy, yet in the clinical setting we are also interested in PPV, NPV, and prevalence-independent metrics such as recall, specificity, and AUC. We therefore focus on estimating the confusion matrix and, consequently, all relevant counting metrics. For binary classification, we achieve this by designing estimators for PPV and NPV based on the positive and negative class predictions, and then deriving point estimates for the confusion-matrix elements. Future work will pair our confusion-matrix estimators with calibration and domain-adaptation techniques to ensure reliable estimates under various distribution shifts.
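
To illustrate the final step, the sketch below derives confusion-matrix point estimates and the resulting counting metrics from label-free PPV and NPV estimates. The PPV/NPV estimators shown (mean softmax confidence among predicted positives and negatives) are a simple, assumption-laden choice for illustration and are not necessarily the estimators proposed in our paper.

```python
# A minimal sketch of deriving confusion-matrix point estimates from label-free
# PPV/NPV estimates. The estimators used here (mean softmax confidence among
# predicted positives / negatives) are an illustrative choice that assumes a
# reasonably calibrated model; they are not necessarily the estimators from our paper.
import numpy as np

def estimate_confusion_matrix(p_pos):
    """p_pos: predicted probability of the positive class for each unlabeled sample."""
    pred_pos = p_pos >= 0.5
    n_pos, n_neg = pred_pos.sum(), (~pred_pos).sum()

    ppv_hat = p_pos[pred_pos].mean()             # label-free PPV estimate
    npv_hat = (1.0 - p_pos[~pred_pos]).mean()    # label-free NPV estimate

    tp = ppv_hat * n_pos
    fp = (1.0 - ppv_hat) * n_pos
    tn = npv_hat * n_neg
    fn = (1.0 - npv_hat) * n_neg
    return tp, fp, tn, fn

def derived_metrics(tp, fp, tn, fn):
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv":         tp / (tp + fp),
        "npv":         tn / (tn + fn),
    }

# Hypothetical usage with random softmax outputs on unlabeled deployment data:
p_pos = np.random.default_rng(0).uniform(size=2000)
print(derived_metrics(*estimate_confusion_matrix(p_pos)))
```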

  • Flühmann, T., Bissoto, A., Hoang, T.-D., & Koch, L. M. (2025). Label-free estimation of clinically relevant performance metrics under distribution shifts. Proc. Uncertainty for Safe Utilization of Machine Learning in Medical Imaging (UNSURE).

Distribution shift detection for postmarket surveillance of medical AI algorithms

Distribution shifts remain a problem for the safe application of regulated medical AI systems and may impact their real-world performance if undetected. Postmarket shifts can occur, for example, if algorithms developed on data from diverse acquisition settings and a heterogeneous population are predominantly applied in hospitals with lower-quality data acquisition or other centre-specific acquisition factors, or in populations where some ethnicities are over-represented. Distribution shift detection could therefore be an important tool for monitoring AI-based medical products during postmarket surveillance. We investigated, implemented, and evaluated various deep-learning-based shift detection techniques on medical imaging datasets: we simulated population shifts and data acquisition shifts and analysed how well the detectors identified both subgroup and out-of-distribution shifts.
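
As an example of the family of detectors we studied, the sketch below implements a kernel two-sample test (MMD with an RBF kernel and a permutation test) on image embeddings. It assumes the embeddings have already been extracted with some backbone; the bandwidth heuristic and sample sizes are illustrative and not the exact configuration from our studies.

```python
# A minimal sketch of one representative shift detector: a kernel two-sample test
# (MMD with an RBF kernel and a permutation test) applied to image embeddings from
# a pretrained backbone. Embedding extraction is assumed to happen elsewhere;
# the bandwidth heuristic and sample sizes are illustrative.
import numpy as np

def rbf_mmd2(x, y, bandwidth):
    """Biased MMD^2 estimate between samples x (n, d) and y (m, d)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def mmd_permutation_test(source_emb, target_emb, n_perm=200, seed=0):
    """Return the MMD^2 statistic and a permutation p-value for H0: no shift."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([source_emb, target_emb])
    # median heuristic for the kernel bandwidth
    bandwidth = np.median(np.linalg.norm(pooled[:, None] - pooled[None, :], axis=-1)) + 1e-8
    stat = rbf_mmd2(source_emb, target_emb, bandwidth)
    n = len(source_emb)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(rbf_mmd2(pooled[idx[:n]], pooled[idx[n:]], bandwidth))
    p_value = (np.sum(np.array(null) >= stat) + 1) / (n_perm + 1)
    return stat, p_value

# Hypothetical usage: reference embeddings vs. a (shifted) deployment batch
rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, size=(100, 32))
tgt = rng.normal(0.3, 1.0, size=(100, 32))   # simulated shift in embedding space
print(mmd_permutation_test(src, tgt))
```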

  • Koch, L. M., Baumgartner, C. F., & Berens, P. (2024). Distribution Shift Detection for the Postmarket Surveillance of Medical AI Algorithms: A Retrospective Simulation Study. npj Digital Medicine. https://doi.org/10.1038/s41746-024-01085-w
  • Koch, L. M., Schürch, C. M., Gretton, A., & Berens, P. (2022). Hidden in Plain Sight: Subgroup Shifts Escape OOD Detection. Proceedings of Machine Learning Research, 172, 726–740.

Timeseries transformers for analysing continuous glucose monitoring data

To treat diabetes, individuals need to manage their blood glucose levels through diet, exercise, weight loss, and medications. Many people with diabetes, in particular those with Type 1 diabetes, require frequent insulin injections throughout the day to maintain a healthy glucose profile.

Frequently measuring glucose, for example with continuous glucose monitoring (CGM) devices, is therefore a crucial component of diabetes care. CGMs are small wearable devices widely used by people with diabetes to continuously monitor their blood glucose levels. They provide valuable information that helps patients make informed decisions about their diet and insulin dosing. However, controlling blood glucose levels remains challenging, as a myriad of complex factors influence the dynamics of a patient’s glucose profile, including cardiometabolic risk factors such as obesity, age, sex, and exercise. These complex relationships are not yet fully understood, but they will need to be incorporated into effective next-generation treatment systems.

Our research interests include:

  • Transformer-based approaches for training large CGM models
  • Glucose forecasting (see the sketch below)
  • Cardiometabolic risk factor prediction and biomarker discovery
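
As a minimal sketch of the forecasting direction above, the following PyTorch snippet defines a small transformer encoder that maps a window of past CGM readings to a short forecast horizon. The architecture and hyperparameters are purely illustrative and not those of our models.

```python
# A minimal sketch of a transformer-based CGM forecaster, assuming PyTorch.
# Architecture and hyperparameters are illustrative, not those of our models:
# a window of past glucose readings is embedded, passed through a transformer
# encoder, and mapped to a short forecast horizon.
import torch
import torch.nn as nn

class CGMForecaster(nn.Module):
    def __init__(self, context_len=72, horizon=12, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.value_embed = nn.Linear(1, d_model)                   # per-timestep embedding
        self.pos_embed = nn.Parameter(torch.zeros(context_len, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, horizon)                    # forecast from the last token

    def forward(self, glucose):
        # glucose: (batch, context_len) past CGM readings, e.g. at 5-minute intervals
        x = self.value_embed(glucose.unsqueeze(-1)) + self.pos_embed
        h = self.encoder(x)
        return self.head(h[:, -1, :])                              # (batch, horizon)

# Hypothetical usage: predict the next hour (12 x 5 min) from the past 6 hours
model = CGMForecaster()
past = torch.randn(8, 72)           # stand-in for normalised CGM windows
forecast = model(past)
print(forecast.shape)               # torch.Size([8, 12])
```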

Interpretable methods for diabetic retinopathy detection

Deep learning models typically lack interpretability, thereby raising ethical concerns and preventing wide adoption in clinical practice. Interpreting deep learning models typically relies on post-hoc saliency map techniques. However, these techniques often fail to serve as actionable feedback to clinicians, and they do not directly explain the decision mechanism. In our research, we are interested in two approaches to mitigate the shortcomings of saliency maps:

  1. Inherently interpretable models, which combine the feature extraction capabilities of deep neural networks with the interpretability advantages of sparse linear models (a minimal sketch follows after this list).

  2. Visual counterfactual explanations, which provide realistic counterfactuals (“what would this image have looked like, were the patient healthy?”) to illustrate an ML model’s internal reasoning.
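
As a minimal sketch of the first idea, the following PyTorch/torchvision snippet trains an L1-regularised linear head on features from a frozen CNN backbone, so that each class score is a sparse and hence inspectable combination of features. This is a simplified stand-in for the sparse architectures developed in the papers below, not their actual implementation.

```python
# A minimal sketch of idea (1), assuming PyTorch and torchvision: a frozen CNN
# backbone provides features, and a single linear layer with an L1 penalty is
# trained on top, so that each class score is a sparse combination of features.
# This is a simplified stand-in for the sparse architectures in the papers below.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=None)           # pretrained weights would be used in practice
backbone.fc = nn.Identity()                        # expose the 512-dim features
for p in backbone.parameters():
    p.requires_grad = False

sparse_head = nn.Linear(512, 5)                    # e.g. 5 diabetic retinopathy grades
optimizer = torch.optim.Adam(sparse_head.parameters(), lr=1e-3)
l1_weight = 1e-3                                   # strength of the sparsity penalty

def training_step(images, labels):
    with torch.no_grad():
        feats = backbone(images)                   # frozen feature extraction
    logits = sparse_head(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    loss = loss + l1_weight * sparse_head.weight.abs().sum()   # L1 -> sparse weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage with random placeholder data:
imgs = torch.randn(4, 3, 224, 224)
labs = torch.randint(0, 5, (4,))
print(training_step(imgs, labs))
```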

Relevant publications:

  • Djoumessi, K. R. D., Ilanchezian, I., Kühlewein, L., Faber, H., Baumgartner, C. F., Bah, B., Berens, P., & Koch, L. M. (2023). Sparse Activations for Interpretable Disease Grading. Proceedings of Machine Learning Research, 6, 1–17.
  • Sun, S., Koch, L. M., & Baumgartner, C. F. (2023). Right for the Wrong Reason: Can Interpretable ML Techniques Detect Spurious Correlations? Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 14221 LNCS, 425–434. https://doi.org/10.1007/978-3-031-43895-0
  • Sun, S., Woerner, S., Maier, A., Koch, L. M., & Baumgartner, C. F. (2023). Inherently Interpretable Multi-Label Classification Using Class-Specific Counterfactuals. Proceedings of Machine Learning Research, 227.
  • Boreiko, V., Ilanchezian, I., Ayhan, M. S., Müller, S., Koch, L. M., Faber, H., Berens, P., & Hein, M. (2022). Visual Explanations for the Detection of Diabetic Retinopathy from Retinal Fundus Images. Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 13432 LNCS, 539–549. https://doi.org/10.1007/978-3-031-16434-7