Deidentifying Student Writing with Rules and Transformers

Abstract

As education increasingly takes place in technologically mediated settings, it has become easier to collect student data that would be valuable to researchers. However, much of this data is not available because of concerns about protecting student privacy. Deidentification of student data is a partial solution, but student-generated text, a form of unstructured data, poses a major challenge for deidentification strategies. To address this challenge, we develop and evaluate two approaches to the automatic detection of student names: a rule-based system and a transformer-based system that relies on fine-tuning a pretrained large language model. Our findings indicate that the transformer-based approach to student name detection shows more promise, especially when there is a high degree of variation between texts in a dataset.