TY - JOUR
T1 - Classification of cervical biopsy free-text diagnoses through linear-classifier based natural language processing
AU - Hsu, Jim Wei Chun
AU - Christensen, Paul
AU - Ge, Yimin
AU - Long, S. Wesley
N1 - © 2022 The Authors. Published by Elsevier Inc. on behalf of Association for Pathology Informatics.
PY - 2022
Y1 - 2022/01//
AB - Routine cervical cancer screening has significantly decreased the incidence and mortality of cervical cancer. As selection of proper screening modalities depends on well-validated clinical decision algorithms, retrospective review correlating cytology and HPV test results with cervical biopsy diagnosis is essential for validating and revising these algorithms to changing technologies, demographics, and optimal clinical practices. However, manual categorization of the free-text biopsy diagnosis into discrete categories is extremely laborious due to the overwhelming number of specimens, which may lead to significant error and bias. Advances in machine learning and natural language processing (NLP), particularly over the last decade, have led to significant accomplishments and impressive performance in computer-based classification tasks. In this work, we apply an efficient version of an NLP framework, FastText™, to an annotated cervical biopsy dataset to create a supervised classifier that can assign accurate biopsy categories to free-text biopsy interpretations with high concordance to manually annotated data (>99.6%). We present cases where the machine-learning classifier disagrees with previous annotations and examine these discrepant cases after referee review by an expert pathologist. We also show that the classifier is robust on an untrained external dataset, achieving a concordance of 97.7%. In conclusion, we demonstrate a useful application of NLP to a real-world pathology classification task and highlight the benefits and limitations of this approach.
KW - Cervical biopsy
KW - Computational pathology
KW - FastText
KW - Linear classifier
KW - Machine learning
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85133903560&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85133903560&partnerID=8YFLogxK
U2 - 10.1016/j.jpi.2022.100123
DO - 10.1016/j.jpi.2022.100123
M3 - Article
C2 - 36268101
AN - SCOPUS:85133903560
SN - 2229-5089
VL - 13
SP - 100123
JO - Journal of Pathology Informatics
JF - Journal of Pathology Informatics
M1 - 100123
ER -