Exploring The Impact of Stemming on Text Topic-Based Classification Accuracy
DOI:
https://doi.org/10.61320/jolcc.v2i2.204-224
Keywords:
stemming, classification, clustering, hierarchical, SOM, genre, content words
Abstract
Text classification attempts to assign written texts to groups that share the same linguistic features. One class of features that has been widely employed across a range of classification tasks is lexical features. This study explores the impact of stemming on text classification using lexical features. It is based on a corpus of thirty texts written by six authors on topics that focus on politics, history, science, prose, sport, and food. These texts are stemmed using a light stemming algorithm. To classify the texts by topic by means of lexical features, linear hierarchical clustering and non-linear clustering with a self-organizing map (SOM) are carried out on both the stemmed and the unstemmed texts. Both clustering methods are able to classify the texts by topic, and the two models produce accurate and stable results. However, the results suggest that light stemming has no appreciable effect on the accuracy of text classification by topic: accuracy is neither increased nor decreased on the stemmed texts, although the stemming algorithm helped reduce the dimensionality of the feature vector space.
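The pipeline described in the abstract (light stemming, lexical feature vectors, grouping by topic) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the suffix list, the `light_stem` and `build_matrix` helpers, and the four toy sentences are hypothetical and are not the study's stemmer or corpus. Cosine similarity stands in for the hierarchical clustering and SOM used in the study, so the sketch shows only how stemming shrinks the vocabulary while topically related texts stay close to each other.

```python
import math
from collections import Counter

# Hypothetical suffix list; the study's light stemmer is not specified here.
SUFFIXES = ("ing", "ed", "es", "s")


def light_stem(word: str) -> str:
    """Strip at most one suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_matrix(texts, stem=False):
    """Return (vocabulary, rows); each row holds the term counts of one text."""
    counts_per_text = []
    for text in texts:
        tokens = [tok for tok in text.lower().split() if tok.isalpha()]
        if stem:
            tokens = [light_stem(tok) for tok in tokens]
        counts_per_text.append(Counter(tokens))
    vocab = sorted(set().union(*counts_per_text))
    rows = [[counts[term] for term in vocab] for counts in counts_per_text]
    return vocab, rows


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy texts invented for this example: two about sport, two about food.
texts = [
    "the team played and the players scored goals",
    "the player plays and the team scores a goal",
    "the cook baked breads and cooked soups",
    "the cooks bake bread and cook soup",
]

raw_vocab, raw_rows = build_matrix(texts, stem=False)
stem_vocab, stem_rows = build_matrix(texts, stem=True)

print("vocabulary size, unstemmed:", len(raw_vocab))
print("vocabulary size, stemmed:  ", len(stem_vocab))
print("sport vs sport (stemmed):", round(cosine(stem_rows[0], stem_rows[1]), 3))
print("sport vs food  (stemmed):", round(cosine(stem_rows[0], stem_rows[2]), 3))
```

On this toy input the stemmed vocabulary is smaller than the unstemmed one, and the two sport texts are more similar to each other than to a food text under both representations. The sketch does not reproduce the study's accuracy results.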
License
Copyright (c) 2024 Refat Ahmed
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
License and Copyright Agreement
In submitting the manuscript to the journal, the authors certify that:
- Their co-authors authorize them to enter into these arrangements.
- The work described has not been formally published before, except in the form of an abstract or as part of a published lecture, review, thesis, or overlay journal.
- The work is not under consideration for publication elsewhere.
- Its publication has been approved by all the authors and by the responsible authorities, tacitly or explicitly, of the institutes where the work has been carried out.
- They secure the right to reproduce any material that has already been published or copyrighted elsewhere.
- They agree to the following license and copyright agreement.
Copyright
Authors who publish in the Journal of Linguistics, Culture, and Communication agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) before and during the submission process, as it can lead to productive exchanges and earlier and greater citation of published work.
Licensing for Data Publication
Journal of Linguistics, Culture, and Communication uses a variety of waivers and licenses that are specifically designed for and appropriate for the treatment of data:
- Open Data Commons Attribution License, http://www.opendatacommons.org/licenses/by/1.0/ (default)
- Creative Commons CC-Zero Waiver, http://creativecommons.org/publicdomain/zero/1.0/
- Open Data Commons Public Domain Dedication and Licence, http://www.opendatacommons.org/licenses/pddl/1-0/
Other data publishing licenses may be allowed as exceptions (subject to approval by the editor on a case-by-case basis) and should be justified with a written statement from the author, which will be published with the article.