Explanations targeted at non-expert users of AI systems are necessary to encourage collaboration and ensure user trust in black-box systems. Counterfactuals are user-friendly explanations that offer the user actionable advice on how to change their input features in order to achieve a desired output. While studied in depth in supervised learning, counterfactual explanations are seldom applied to reinforcement learning (RL) tasks. In this project, we will examine a number of approaches for generating counterfactual explanations in supervised learning and aim to adapt them to RL.
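To make the supervised-learning starting point concrete, the sketch below generates a counterfactual for a toy classifier by gradient descent on a loss that trades off reaching a target prediction against staying close to the original input (in the spirit of standard counterfactual formulations such as Wachter et al.). The logistic "black box", its weights, and all hyperparameters here are hypothetical illustrations, not taken from the referenced papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """A fixed logistic-regression model standing in for any black-box classifier."""
    return sigmoid(w @ x + b)

def counterfactual(x, w, b, target=0.8, lam=10.0, lr=0.02, steps=2000):
    """Minimise lam * (f(x') - target)^2 + ||x' - x||^2 by gradient descent.

    The first term pushes the prediction toward the desired output; the
    second keeps the counterfactual close (i.e. actionable) to the input.
    """
    x_cf = x.copy()
    for _ in range(steps):
        p = predict(x_cf, w, b)
        # Gradient of the prediction term: 2*lam*(p - target) * p*(1-p) * w
        grad = 2.0 * lam * (p - target) * p * (1.0 - p) * w
        # Gradient of the proximity term: 2*(x' - x)
        grad += 2.0 * (x_cf - x)
        x_cf -= lr * grad
    return x_cf

w = np.array([1.5, -2.0])       # hypothetical model weights
b = -0.5
x = np.array([0.0, 1.0])        # original input, classified negative
x_cf = counterfactual(x, w, b)  # nearby input pushed toward the positive class
print(predict(x, w, b), predict(x_cf, w, b), x_cf - x)
```

Adapting this recipe to RL is what makes the project non-trivial: the "output" is a policy's action or return rather than a single prediction, and a counterfactual state must also be reachable under the environment dynamics, which is the gap the RACCER paper addresses.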
References:
J. Gajcin and I. Dusparic. RACCER: Towards Reachable and Certain Counterfactual Explanations for Reinforcement Learning. https://arxiv.org/abs/2303.04475
J. Gajcin and I. Dusparic. Counterfactual Explanations for Reinforcement Learning. https://arxiv.org/pdf/2210.11846