Taken – Investigating the Reproducibility of Studies That Conducted Data Mining or Machine Learning on Educational Data

*If a student on the MSc in Statistics and Sustainability is interested in this project, a variant would be to complete it using sustainability data sets.

“Reproducibility, closely related to replicability and repeatability, is a major principle underpinning the scientific method. For the findings of a study to be reproducible means that results obtained by an experiment or an observational study or in a statistical analysis of a data set should be achieved again with a high degree of reliability when the study is replicated. … With a narrower scope, reproducibility has been defined in computational sciences as having the following quality: the results should be documented by making all data and code available in such a way that the computations can be executed again with identical results.” – Wikipedia 

This project is about examining the reproducibility of studies that have conducted educational data mining/machine learning (e.g., clustering, prediction modelling, classification).  

The first step of this project is to identify multiple studies that have conducted educational data mining or machine learning and whose data are publicly available. The second step is to repeat the statistical analysis undertaken in each study.

For each study, the aim is to identify: 

  • Whether the method is detailed enough to be reproducible. 
  • Whether code has been provided. 
  • Whether any cleaning of the data can be reproduced. 
  • Whether the results can be duplicated within an acceptable margin of error.
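
One possible way to record these four checks for each study is sketched below in Python. This is only an illustration, not a prescribed method for the project: the class name, field names, and tolerance values are all assumptions, and the numbers in the usage example are placeholders rather than results from any real study. What counts as an "acceptable margin of error" would need to be decided and justified for each study.

```python
from dataclasses import dataclass, field


@dataclass
class StudyAssessment:
    """Reproducibility record for one study (all names here are illustrative)."""
    study: str
    method_detailed: bool        # is the method detailed enough to reproduce?
    code_provided: bool          # did the authors provide code?
    cleaning_reproducible: bool  # can the data cleaning be reproduced?
    # metric name -> (value reported in the paper, value obtained on re-analysis)
    metrics: dict = field(default_factory=dict)

    def results_match(self, tolerance: float) -> bool:
        """Return True if at least one metric was compared and every reproduced
        value lies within `tolerance` (absolute difference) of the reported one."""
        if not self.metrics:
            return False  # nothing was compared, so no match can be claimed
        return all(
            abs(reported - reproduced) <= tolerance
            for reported, reproduced in self.metrics.values()
        )


# Placeholder values for illustration only; not taken from any real study.
example = StudyAssessment(
    study="Hypothetical study A",
    method_detailed=True,
    code_provided=False,
    cleaning_reproducible=True,
    metrics={"accuracy": (0.85, 0.84), "auc": (0.90, 0.80)},
)

print(example.results_match(tolerance=0.02))
```

An absolute tolerance is used here for simplicity. A relative tolerance, or a comparison against reported confidence intervals, may be more appropriate depending on the metric.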