TY - GEN
T1 - Automated Long Answer Grading with RiceChem Dataset
AU - Sonkar, Shashank
AU - Ni, Kangqi
AU - Tran Lu, Lesa
AU - Kincaid, Kristi
AU - Hutchinson, John S.
AU - Baraniuk, Richard G.
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024
Y1 - 2024
N2 - This research paper introduces a new area of study in the field of educational Natural Language Processing (NLP): Automated Long Answer Grading (ALAG). Distinguishing itself from traditional Automated Short Answer Grading (ASAG) and open-ended Automated Essay Grading (AEG), ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. To facilitate the study of ALAG, we introduce RiceChem, a specialized dataset derived from a college-level chemistry course, featuring real student responses to long-answer questions with an average word count notably higher than typical ASAG datasets. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models to verify whether each criterion, represented by a rubric item, is addressed in the student’s response. This formulation enables the effective use of large-scale datasets like MNLI for transfer learning, significantly improving the performance of models on the RiceChem dataset. We demonstrate the importance of rubric-based formulation in ALAG, showcasing its superiority over traditional score-based approaches in capturing the nuances and multiple facets of student responses. Furthermore, we investigate the performance of models in cold start scenarios, providing valuable insights into the data efficiency and practical deployment considerations in educational settings. Lastly, we benchmark state-of-the-art open-source Large Language Models (LLMs) on RiceChem and compare their results to GPT models, highlighting the increased complexity of ALAG compared to ASAG. Despite leveraging the benefits of a rubric-based approach and transfer learning from MNLI, the lower performance of LLMs on RiceChem underscores the significant difficulty posed by the ALAG task.
With this work, we offer a fresh perspective on grading long, fact-based answers and introduce a new dataset to stimulate further research in this important area. The code and dataset can be found at https://github.com/luffycodes/Automated-Long-Answer-Grading.
AB - This research paper introduces a new area of study in the field of educational Natural Language Processing (NLP): Automated Long Answer Grading (ALAG). Distinguishing itself from traditional Automated Short Answer Grading (ASAG) and open-ended Automated Essay Grading (AEG), ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. To facilitate the study of ALAG, we introduce RiceChem, a specialized dataset derived from a college-level chemistry course, featuring real student responses to long-answer questions with an average word count notably higher than typical ASAG datasets. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models to verify whether each criterion, represented by a rubric item, is addressed in the student’s response. This formulation enables the effective use of large-scale datasets like MNLI for transfer learning, significantly improving the performance of models on the RiceChem dataset. We demonstrate the importance of rubric-based formulation in ALAG, showcasing its superiority over traditional score-based approaches in capturing the nuances and multiple facets of student responses. Furthermore, we investigate the performance of models in cold start scenarios, providing valuable insights into the data efficiency and practical deployment considerations in educational settings. Lastly, we benchmark state-of-the-art open-source Large Language Models (LLMs) on RiceChem and compare their results to GPT models, highlighting the increased complexity of ALAG compared to ASAG. Despite leveraging the benefits of a rubric-based approach and transfer learning from MNLI, the lower performance of LLMs on RiceChem underscores the significant difficulty posed by the ALAG task.
With this work, we offer a fresh perspective on grading long, fact-based answers and introduce a new dataset to stimulate further research in this important area. The code and dataset can be found at https://github.com/luffycodes/Automated-Long-Answer-Grading.
KW - Automated Long Answer Grading
KW - Large Language Models
KW - Natural Language Inference
KW - Rubric-based Grading
UR - http://www.scopus.com/inward/record.url?scp=85200257336&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85200257336&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-64302-6_12
DO - 10.1007/978-3-031-64302-6_12
M3 - Conference contribution
AN - SCOPUS:85200257336
SN - 9783031643019
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 163
EP - 176
BT - Artificial Intelligence in Education - 25th International Conference, AIED 2024, Proceedings
A2 - Olney, Andrew M.
A2 - Chounta, Irene-Angelica
A2 - Liu, Zitao
A2 - Santos, Olga C.
A2 - Bittencourt, Ig Ibert
PB - Springer Science and Business Media Deutschland GmbH
T2 - 25th International Conference on Artificial Intelligence in Education, AIED 2024
Y2 - 8 July 2024 through 12 July 2024
ER -