TY - GEN
T1 - GS4
T2 - International Workshops on Data Mining and Decision Analytics for Public Health, Biologically Inspired Data Mining Techniques, Mobile Data Management, Mining, and Computing on Social Networks, Big Data Science and Engineering on E-Commerce, Cloud Service Discovery, MSMV-MBI, Scalable Data Analytics, Data Mining and Decision Analytics for Public Health and Wellness, Algorithms for Large-Scale Information Processing in Knowledge Discovery, Data Mining in Social Networks, Data Mining in Biomedical Informatics and Healthcare, Pattern Mining and Application of Big Data in conjunction with 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2014
AU - Moutafis, Panagiotis
AU - Kakadiaris, Ioannis A.
N1 - Funding Information:
This research was funded in part by the US Army Research Lab (W911NF-13-1-0127) and the UH Hugh Roy and Lillie Cranz Cullen Endowment Fund. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of the sponsors.
Publisher Copyright:
© Springer International Publishing Switzerland 2014.
PY - 2014
Y1 - 2014
AB - In this paper, we propose a method to improve nearest neighbor classification accuracy under a semi-supervised setting. We call our approach GS4 (i.e., Generating Synthetic Samples Semi-Supervised). Existing self-training approaches classify unlabeled samples by exploiting local information. These samples are then incorporated into the training set of labeled data. However, errors are propagated and misclassifications at an early stage severely degrade the classification accuracy. To address this problem, the proposed method exploits the unlabeled data by using weights proportional to the classification confidence to generate synthetic samples. Specifically, our scheme is inspired by the Synthetic Minority Over-Sampling Technique. That is, each unlabeled sample is used to generate as many labeled samples as the number of classes represented by its k-nearest neighbors. In particular, the distance of each synthetic sample from its k-nearest neighbors of the same class is proportional to the classification confidence. As a result, the robustness to misclassification errors is increased and better accuracy is achieved. Experimental results using publicly available datasets demonstrate that statistically significant improvements are obtained when the proposed approach is employed.
KW - Classification
KW - K-nearest neighbor
KW - Semi-supervised learning
KW - Synthetic samples
UR - http://www.scopus.com/inward/record.url?scp=84915818958&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84915818958&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-13186-3_36
DO - 10.1007/978-3-319-13186-3_36
M3 - Conference contribution
AN - SCOPUS:84915818958
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 393
EP - 403
BT - Trends and Applications in Knowledge Discovery and Data Mining - PAKDD 2014 International Workshops
A2 - Peng, Wen-Chih
A2 - Wang, Haixun
A2 - Zhou, Zhi-Hua
A2 - Ho, Tu Bao
A2 - Tseng, Vincent S.
A2 - Chen, Arbee L.P.
A2 - Bailey, James
PB - Springer-Verlag
Y2 - 13 May 2014 through 16 May 2014
ER -