Posted on 2024-12-01, authored by Francesco Vaina
As automated decision-making systems become increasingly prevalent in critical
domains like education, ensuring fairness in these systems is paramount. Missing data
presents a unique challenge to fairness in machine learning (ML), particularly in high-
stakes applications such as predicting student outcomes. This research investigates the
effects of missing data and various preprocessing methods on the fairness and
accuracy of ML models within educational datasets. Using data from the 2012
Education Longitudinal Study, the study predicts bachelor’s degree attainment
with Random Forest, Logistic Regression, and Support Vector Classifier models.
It examines several imputation techniques, particularly in settings where data
are not Missing Completely at Random (MCAR), and evaluates their influence on
model fairness and performance, with a focus on mitigating bias against
vulnerable student groups.
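As an illustration of this setup, a minimal sketch assuming scikit-learn, with synthetic data standing in for the ELS variables (the imputers, hyperparameters, and feature matrix below are illustrative choices, not the study's exact configuration):

```python
# Minimal sketch: compare imputation strategies across the three
# classifiers named in the abstract. scikit-learn is an assumption;
# the synthetic X, y below stand in for the real ELS variables.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}
models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svc": SVC(),
}

# Synthetic stand-in: 500 students, 8 features, ~20% of values missing,
# binary label = bachelor's degree attained or not.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X[rng.random(X.shape) < 0.2] = np.nan
y = rng.integers(0, 2, size=500)

# Impute inside the pipeline so each cross-validation fold is imputed
# only from its own training split, avoiding leakage.
for imp_name, imputer in imputers.items():
    for model_name, model in models.items():
        pipe = make_pipeline(imputer, StandardScaler(), model)
        acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
        print(f"{imp_name:>9} + {model_name:<19} accuracy={acc:.3f}")
```

Each imputer and model pairing is cross-validated on the imputed data; in a real analysis the comparison would also stratify results by student subgroup to surface fairness effects.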
The study underscores the importance of feature handling in data preprocessing,
highlighting how improper treatment during imputation can introduce or exacerbate
biases that affect model predictions. Through an analysis of feature importance and its
impact on fairness, this work identifies the features most likely to contribute to bias,
supporting the design of more equitable predictive models. Findings reveal
trade-offs between accuracy and fairness, and they illustrate why a
context-sensitive metric such as Equalized Odds, which conditions on the true
outcome, can be more appropriate than a simpler metric like Statistical Parity.
This research addresses gaps in the existing literature by clarifying the
relationship between missing-data handling, fairness, and accuracy in
educational ML applications and by offering practical recommendations for
developing fairer, more reliable models.
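To make the metric contrast concrete: Statistical Parity compares selection rates P(Ŷ = 1 | A = a) across groups regardless of true outcomes, while Equalized Odds also conditions on the true label, comparing true-positive and false-positive rates. A minimal sketch in plain NumPy, with synthetic predictions standing in for the study's model outputs:

```python
# Minimal sketch of Statistical Parity vs. Equalized Odds differences
# for a binary sensitive attribute. The arrays below are synthetic;
# in practice y_true, y_pred, and group come from the fitted model.
import numpy as np

def statistical_parity_diff(y_pred, group):
    """|P(y_hat=1 | A=0) - P(y_hat=1 | A=1)|: gap in selection rates."""
    rates = [y_pred[group == g].mean() for g in (0, 1)]
    return abs(rates[0] - rates[1])

def equalized_odds_diff(y_true, y_pred, group):
    """Largest gap in TPR or FPR between groups: conditions on y_true."""
    gaps = []
    for y in (0, 1):  # y=1 gives the TPR gap, y=0 the FPR gap
        rates = [y_pred[(group == g) & (y_true == y)].mean() for g in (0, 1)]
        gaps.append(abs(rates[0] - rates[1]))
    return max(gaps)

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)    # sensitive attribute A
y_true = rng.integers(0, 2, size=1000)   # degree attained or not
# Predictions deliberately skewed toward the A=1 group.
y_pred = (rng.random(1000) < 0.5 + 0.1 * group).astype(int)

print("Statistical Parity diff:", statistical_parity_diff(y_pred, group))
print("Equalized Odds diff:   ", equalized_odds_diff(y_true, y_pred, group))
```

Both differences are zero under perfect parity; the point of the comparison above is that a model can look fair under Statistical Parity while its error rates still diverge across groups, a gap that Equalized Odds exposes. (fairlearn's demographic_parity_difference and equalized_odds_difference compute the same quantities.)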