TY - JOUR
T1 - A Case Study Using Large Language Models to Generate Metadata for Math Questions
AU - Bainbridge, Katie
AU - Walkington, Candace
AU - Ibrahim, Armon
AU - Zhong, Iris
AU - Mallick, Debshila Basu
AU - Washington, Julianna
AU - Baraniuk, Rich
N1 - Funding Information:
The research reported here was supported by philanthropic foundations.
Publisher Copyright:
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
PY - 2023
Y1 - 2023
N2 - Creating labels for assessment items, such as the concept used, difficulty, or vocabulary used, can improve the quality and depth of research insights and help target the right kinds of questions to students depending on their needs. However, traditional metadata-tagging processes are resource intensive in terms of labor, time, and cost, and the resulting metadata quickly become outdated with any changes to question content. Given thoughtful prompts, Large Language Models (LLMs) such as GPT-3.5 and GPT-4 can efficiently automate the generation of assessment metadata, helping scale the process to larger volumes of questions and accommodate updates to question content that would otherwise be tedious to reanalyze. With a human subject matter expert in the loop, we analyzed recall and precision for LLM-generated tags on two metadata variables: problem context and math vocabulary. We conclude that LLMs like GPT-3.5 and GPT-4 are highly reliable at generating assessment metadata, and we make actionable recommendations for others intending to apply the technology to their own assessment items.
KW - Assessments
KW - Human-in-the-loop
KW - Large Language Models
KW - Metadata
UR - http://www.scopus.com/inward/record.url?scp=85174178353&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85174178353&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85174178353
SN - 1613-0073
VL - 3487
SP - 34
EP - 42
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 1st Annual Workshop on Empowering Education with LLMs - the Next-Gen Interface and Content Generation, AIEDLLM 2023
Y2 - 7 July 2023
ER -