Exploring The Impact of Stemming on Text Topic-Based Classification Accuracy

Authors

  • Refat Ahmed, Independent researcher

DOI:

https://doi.org/10.61320/jolcc.v2i2.204-224

Keywords:

stemming, classification, clustering, hierarchical, SOM, genre, content words

Abstract

Text classification attempts to assign written texts to groups that share the same linguistic features. One class of features that has been widely employed across a range of classification tasks is lexical features. This study explores the impact of stemming on text classification using lexical features. It is based on a corpus of thirty texts written by six authors on topics covering politics, history, science, prose, sport, and food. These texts are stemmed using a light stemming algorithm. To classify the texts by topic by means of lexical features, linear hierarchical clustering and non-linear clustering with a self-organizing map (SOM) are carried out on the stemmed and unstemmed texts. Both clustering methods are able to classify the texts by topic, and the two models produce accurate and stable results. However, the results suggest that light stemming has no appreciable effect on the accuracy of text classification by topic. Accuracy is neither increased nor decreased on the stemmed texts, although the stemming algorithm helped reduce the dimensionality of the feature vector space model.
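
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the suffix list, the stop-word list, the four toy documents, the cosine distance, and the average-linkage rule are all assumptions chosen for illustration, and only the hierarchical clustering step is shown, not the SOM.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical suffix and stop-word lists; the study's light stemmer is not specified here.
SUFFIXES = ("ing", "ed", "es", "s")
STOP_WORDS = {"a", "the", "and", "when"}

def light_stem(word):
    """Strip at most one suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def vectorize(docs, stem=False):
    """Build content-word frequency vectors over a shared vocabulary."""
    counts = []
    for doc in docs:
        tokens = [t for t in doc.lower().split() if t.isalpha() and t not in STOP_WORDS]
        if stem:
            tokens = [light_stem(t) for t in tokens]
        counts.append(Counter(tokens))
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(pair):
        left, right = clusters[pair[0]], clusters[pair[1]]
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in left for j in right)
        return total / (len(left) * len(right))

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] += clusters[b]  # a < b, so deleting b leaves index a valid
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Toy corpus (invented): two sport texts followed by two politics texts.
docs = [
    "the team wins matches and players scored goals",
    "players score goals when teams win a match",
    "voters elected ministers and parties formed governments",
    "a party forms a government when voters elect a minister",
]

raw_vocab, raw_vectors = vectorize(docs, stem=False)
stem_vocab, stem_vectors = vectorize(docs, stem=True)
raw_clusters = agglomerate(raw_vectors, k=2)
stem_clusters = agglomerate(stem_vectors, k=2)

print(len(raw_vocab), len(stem_vocab))
print(raw_clusters, stem_clusters)
```

On this toy corpus the stemmed vocabulary is smaller than the unstemmed one while the two-cluster partition is the same in both cases, which mirrors the pattern the abstract reports. A four-document example cannot confirm the finding; it only shows the mechanics of the comparison.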

Published

2024-06-30

How to Cite

Ahmed, R. (2024). Exploring The Impact of Stemming on Text Topic-Based Classification Accuracy. Journal of Linguistics, Culture and Communication, 2(2), 204–224. https://doi.org/10.61320/jolcc.v2i2.204-224
