Evaluating Prototype Explanations in Machine Learning

Prototype-based post-hoc explanations aim to make model predictions interpretable by presenting representative examples (prototypes) that illustrate how the model arrives at its decisions. Their evaluation often relies on quantitative metrics such as fidelity (how closely prototypes approximate the model’s decision function), coverage (how much of the input space they represent), stability (whether explanations remain consistent under small perturbations), and diversity (whether the prototypes capture distinct, non-redundant behaviours rather than near-duplicates).
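A minimal sketch of how these four metrics could be computed for a given prototype set follows; the nearest-prototype assignment, the toy model, and the `radius` and `noise_scale` parameters are illustrative assumptions rather than a fixed evaluation protocol.

```python
# Sketch: evaluating a prototype set with fidelity, coverage, stability, diversity.
# The prototype choice (class centroids) and all parameters are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

def nearest_prototype(X, prototypes):
    """Index of and distance to the closest prototype (Euclidean) for each row of X."""
    dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1), dists.min(axis=1)

def fidelity(model, X, prototypes, proto_labels):
    """Fraction of points where the nearest prototype's label matches the model's prediction."""
    idx, _ = nearest_prototype(X, prototypes)
    return np.mean(proto_labels[idx] == model.predict(X))

def coverage(X, prototypes, radius):
    """Fraction of points lying within `radius` of some prototype."""
    _, d = nearest_prototype(X, prototypes)
    return np.mean(d <= radius)

def stability(X, prototypes, noise_scale=0.05, seed=0):
    """Fraction of points whose assigned prototype survives a small Gaussian perturbation."""
    rng = np.random.default_rng(seed)
    idx, _ = nearest_prototype(X, prototypes)
    idx_pert, _ = nearest_prototype(X + rng.normal(scale=noise_scale, size=X.shape), prototypes)
    return np.mean(idx == idx_pert)

def diversity(prototypes):
    """Mean pairwise distance between prototypes (higher = more diverse)."""
    d = np.linalg.norm(prototypes[:, None, :] - prototypes[None, :, :], axis=-1)
    n = len(prototypes)
    return d.sum() / (n * (n - 1))

# Toy usage: prototypes chosen naively as per-class centroids of the training data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
prototypes = np.vstack([X[y == c].mean(axis=0) for c in np.unique(y)])
proto_labels = model.predict(prototypes)
print(fidelity(model, X, prototypes, proto_labels),
      coverage(X, prototypes, radius=2.0),
      stability(X, prototypes),
      diversity(prototypes))
```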

Fairness and Robustness in Risk Detection Models

Risk detection models (such as IBM’s Granite Guardian) are increasingly used to flag harmful prompts and responses in large language model pipelines. These systems are trained on human and synthetic data to identify risks across multiple dimensions, but their reliability and fairness are not guaranteed. They may over-flag certain groups, miss subtle harms, or fail in other unanticipated ways.
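One simple way to probe over-flagging is to compare flag rates across annotated groups. The sketch below assumes a detector exposed as a `detector(prompt) -> bool` callable and a small evaluation set with group labels; the `toy_detector` and example data are placeholders, not Granite Guardian's actual interface.

```python
# Sketch: per-group flag rates for a risk detector (illustrative data and detector).
from collections import defaultdict

def flag_rates_by_group(detector, examples):
    """examples: iterable of (prompt, group) pairs.
    Returns the fraction of prompts flagged as risky per group."""
    counts, flagged = defaultdict(int), defaultdict(int)
    for prompt, group in examples:
        counts[group] += 1
        flagged[group] += int(detector(prompt))
    return {g: flagged[g] / counts[g] for g in counts}

# Toy usage with a placeholder keyword-based "detector".
def toy_detector(prompt):
    return "attack" in prompt.lower()

examples = [
    ("How do I bake bread?", "group_a"),
    ("Describe a phishing attack.", "group_a"),
    ("Plan a birthday party.", "group_b"),
    ("Explain a DDoS attack step by step.", "group_b"),
]
rates = flag_rates_by_group(toy_detector, examples)
disparity = max(rates.values()) - min(rates.values())
print(rates, disparity)
```

The max-minus-min gap used here is only one of several possible disparity measures; false-positive and false-negative rates per group would need labelled harms as well.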

Debugging Classifications with Counterfactual Explanations

This project investigates how post-hoc counterfactual explanations can be used to debug opaque models such as deep neural networks by revealing which feature changes most influence predictions. In applications like anomaly detection, counterfactuals help clarify why certain cases are flagged as abnormal and expose when models rely on spurious correlations or biased patterns. By using counterfactuals as a debugging tool, developers can trace a flagged prediction back to the specific feature changes that would alter it.
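As a rough illustration of the idea, the sketch below runs a greedy, single-feature counterfactual search against a toy classifier; the search strategy, step size, and dataset are assumptions for illustration rather than a specific counterfactual method from the literature.

```python
# Sketch: greedy counterfactual search that perturbs one feature at a time
# until the predicted class flips; the resulting per-feature changes indicate
# which features most influenced the original prediction.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

def greedy_counterfactual(model, x, step=0.25, max_iters=50):
    """Return a counterfactual for x and the per-feature changes that produced it."""
    x_cf = x.copy()
    original = model.predict(x.reshape(1, -1))[0]
    for _ in range(max_iters):
        if model.predict(x_cf.reshape(1, -1))[0] != original:
            break  # prediction has flipped
        best_feat, best_delta = None, 0.0
        best_prob = model.predict_proba(x_cf.reshape(1, -1))[0, original]
        for j in range(len(x_cf)):
            for delta in (-step, step):
                trial = x_cf.copy()
                trial[j] += delta
                p = model.predict_proba(trial.reshape(1, -1))[0, original]
                if p < best_prob:  # move that most reduces confidence in the original class
                    best_feat, best_delta, best_prob = j, delta, p
        if best_feat is None:  # no single-feature move helps; stop
            break
        x_cf[best_feat] += best_delta
    return x_cf, x_cf - x

# Toy usage: the largest entries of `changes` point to the most influential features.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
model = GradientBoostingClassifier(random_state=1).fit(X, y)
x_cf, changes = greedy_counterfactual(model, X[0])
print(model.predict(X[:1])[0], model.predict(x_cf.reshape(1, -1))[0], np.round(changes, 2))
```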

Intersectional Fairness in Machine Learning

This project focuses on the rich field of algorithmic fairness, where the goal is to ensure that predictions are not biased against subgroups of the population whilst maximising predictive performance. One key challenge arises when multiple protected attributes are considered together: subgroups defined by their intersections (for example, sex and age band combined) can experience bias that remains invisible when each attribute is audited in isolation, as in the auditing sketch below.
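The sketch below shows one simple intersectional audit: computing the positive-prediction rate for every intersection of the protected attributes. The column names, toy data, and the max-minus-min gap measure are illustrative assumptions, not a prescribed fairness criterion.

```python
# Sketch: positive-prediction rates over intersections of protected attributes.
import pandas as pd

def intersectional_rates(df, protected_cols, pred_col="prediction"):
    """Positive-prediction rate and size for every intersection of the protected attributes."""
    grouped = df.groupby(protected_cols)[pred_col]
    return pd.DataFrame({"rate": grouped.mean(), "n": grouped.size()})

# Toy usage with two protected attributes; the gap between the highest and
# lowest subgroup rate is one simple measure of intersectional disparity.
df = pd.DataFrame({
    "sex":        ["F", "F", "M", "M", "F", "M", "F", "M"],
    "age_band":   ["<30", "30+", "<30", "30+", "<30", "<30", "30+", "30+"],
    "prediction": [1, 0, 1, 1, 0, 1, 0, 1],
})
rates = intersectional_rates(df, ["sex", "age_band"])
print(rates)
print("max subgroup gap:", rates["rate"].max() - rates["rate"].min())
```

Small intersections can have very few samples, so reporting the subgroup size `n` alongside each rate matters in practice.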