Fairness and Robustness in Risk Detection Models

Risk detection models (such as IBM’s Granite Guardian) are increasingly used to flag harmful prompts and responses in large language model pipelines. These systems are trained on human and synthetic data to identify risks across multiple dimensions, but their reliability and fairness are not guaranteed. They may over-flag certain groups, miss subtle harms, or be vulnerable to adversarial manipulation. This project invites the student to critically evaluate the limitations of risk detection models (or hate-speech detection models), with a focus on:

  • Fairness testing: checking whether detection outcomes vary across sensitive attributes or linguistic styles.
  • Adversarial evaluation: probing the system with crafted prompts to bypass safeguards or trigger false positives.
  • Distribution shift: assessing performance on text from domains or communities outside the training set.

The project provides exposure to challenges in NLP, robustness, and fairness & trustworthiness. Strong programming skills in Python are essential. Experience with NLP libraries (e.g., Hugging Face Transformers) and adversarial/fairness evaluation frameworks will be highly beneficial. Evidence of project contribution e.g., on GitHub is also very nice. Prior projects demonstrating applied machine learning or model evaluation are a strong advantage.