YouTube Transcripts Word Frequency Measure


  • Vincent Smith University of Charleston
  • Michael Garrett University of Charleston
  • Austin Harwood University of Charleston, USA
  • James Shamblin University of Charleston, USA



Word Frequency, YouTube, Computer Science, Data, Data Analytics, Analysis


Many YouTube videos include written transcripts of their audio, and these transcripts offer information about the language used on YouTube. One important measure of language usage is word frequency. Using student-developed software and libraries in R and Python, together with Microsoft Excel, the transcripts of one million YouTube videos from the YouTube-8M data set were scraped and analyzed. The word frequency of the YouTube data set was shown to correlate with commonly used word frequency measures from established studies, such as the subtitle word frequency and the HAL word frequency.
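As an illustration of the kind of analysis summarized above, the following Python sketch counts words across a set of transcripts, converts the counts to log10 frequency per million words, and computes a Pearson correlation against an existing set of norms. It is not the authors' software. The function names, the simple regular-expression tokenizer, the log-per-million scaling, and the choice of Pearson correlation are all assumptions made for this example, and the abstract does not state which of these the study used.

```python
import math
import re
from collections import Counter


def word_counts(transcripts):
    """Count lowercase word tokens across an iterable of transcript strings.

    The tokenizer (runs of letters and apostrophes) is an assumption,
    not the tokenization used in the study.
    """
    counts = Counter()
    for text in transcripts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def log_frequency_per_million(counts):
    """Convert raw counts to log10 of occurrences per million words."""
    total = sum(counts.values())
    return {w: math.log10(c * 1_000_000 / total) for w, c in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def correlate_with_norms(log_freq, norms):
    """Correlate two word -> value mappings over the words they share."""
    shared = sorted(set(log_freq) & set(norms))
    return pearson([log_freq[w] for w in shared], [norms[w] for w in shared])
```

Here `norms` stands for any hypothetical mapping from words to values on an established scale, for example a table of subtitle-based or HAL-based log frequencies loaded from a file. Only words present in both sources enter the correlation.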



Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). YouTube-8M: A large-scale video classification benchmark. arXiv preprint.

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39(3), 445-459.

Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990.

Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated review. Current Directions in Psychological Science, 27(1), 45-50.

Ceci, L. (2022). YouTube – Statistics & Facts. Statista.

Cicconet, M. (2013, April 7). YouTube is not just a site for entertainment, but education. Washington Square News.


Johns, B. T., Dye, M., & Jones, M. N. (2016). The influence of contextual diversity on word learning. Psychonomic Bulletin & Review, 23(4), 1214-1220.

Mohan, S., & Punathambekar, A. (2019). Localizing YouTube: Language, cultural regions, and digital platforms. International Journal of Cultural Studies, 22(3), 317-333.

Zhao, K., Shi, N., Sa, Z., Wang, H. X., Lu, C. H., & Xu, X. Y. (2020). Text mining and analysis of treatise on febrile diseases based on natural language processing. World Journal of Traditional Chinese Medicine, 6(1), 67.




How to Cite

Smith, V., Garrett, M., Harwood, A., & Shamblin, J. (2023). YouTube Transcripts Word Frequency Measure. Journal of Linguistics, Culture and Communication, 1(2), 91–99.