YouTube Transcripts Word Frequency Measure
Keywords: Word Frequency, YouTube, Computer Science, Data, Data Analytics, Analysis
Many YouTube videos include written transcripts of their audio, which offer a record of the language used on the platform. One important measure of language usage is word frequency. Using student-developed software and libraries in R, Python, and Microsoft Excel, the transcripts of one million YouTube videos from the YouTube-8M data set were scraped and analyzed. The resulting word frequencies were shown to correlate with commonly used word frequency measures from established studies, such as the subtitle word frequency and the HAL word frequency.
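As a rough illustration of the kind of analysis described, the following Python sketch counts word frequencies across a handful of transcripts and correlates them with a reference norm. It is not the authors' software: the tokenizer, the log10 per-million transform, the use of Pearson's r, the function names, and the toy reference values are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Assumption: a word is a run of lowercase letters or apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def word_frequencies(transcripts):
    """Count how often each lowercased word occurs across all transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def log_per_million(counts):
    """Convert raw counts to log10(occurrences per million tokens + 1)."""
    total = sum(counts.values())
    return {w: math.log10(c * 1_000_000 / total + 1) for w, c in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlate_with_norm(counts, reference):
    """Correlate log frequencies over the words found in both lists."""
    ours = log_per_million(counts)
    shared = sorted(set(ours) & set(reference))
    return pearson([ours[w] for w in shared], [reference[w] for w in shared])


# Toy transcripts and made-up reference log frequencies (hypothetical values,
# standing in for a published norm such as a subtitle-based measure).
transcripts = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
reference = {"the": 4.5, "cat": 2.1, "dog": 2.3, "sat": 1.8, "a": 4.4}

counts = word_frequencies(transcripts)
r = correlate_with_norm(counts, reference)
print(counts.most_common(3), round(r, 3))
```

Restricting the correlation to words present in both lists keeps the comparison well defined; a real analysis would also have to decide how to treat words missing from one source.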
Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). YouTube-8M: A large-scale video classification benchmark. arXiv preprint. https://doi.org/10.48550/arXiv.1609.08675
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39(3), 445-459. https://doi.org/10.3758/bf03193014
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990. https://doi.org/10.3758/brm.41.4.977
Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated review. Current Directions in Psychological Science, 27(1), 45-50. http://doi.org/10.1177/0963721417727521
Ceci, L. (2022). YouTube – Statistics & Facts. Statista. https://www.statista.com/topics/2019/youtube/#topicHeader__wrapper
Cicconet, M. (2013, April 7). YouTube is not just a site for entertainment, but education. Washington Square News. https://nyunews.com/2013/04/07/cicconet-13/
Johns, B. T., Dye, M., & Jones, M. N. (2016). The influence of contextual diversity on word learning. Psychonomic Bulletin & Review, 23(4), 1214-1220. https://doi.org/10.3758/s13423-015-0980-7
Mohan, S., & Punathambekar, A. (2019). Localizing YouTube: Language, cultural regions, and digital platforms. International Journal of Cultural Studies, 22(3), 317-333. https://doi.org/10.1177/1367877918794681
Zhao, K., Shi, N., Sa, Z., Wang, H. X., Lu, C. H., & Xu, X. Y. (2020). Text mining and analysis of treatise on febrile diseases based on natural language processing. World Journal of Traditional Chinese Medicine, 6(1), 67. https://doi.org/10.4103/wjtcm.wjtcm_28_19
Copyright (c) 2023 Vincent Smith, Michael Garrett, Austin Harwood, James Shamblin
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
License and Copyright Agreement
In submitting the manuscript to the journal, the authors certify that:
- Their co-authors authorize them to enter into these arrangements.
- The work described has not been formally published before, except in the form of an abstract or as part of a published lecture, review, thesis, or overlay journal.
- It is not under consideration for publication elsewhere.
- Its publication has been approved by all the author(s) and by the responsible authorities – tacitly or explicitly – of the institutes where the work has been carried out.
- They secure the right to reproduce any material that has already been published or copyrighted elsewhere.
- They agree to the following license and copyright agreement.
Authors who publish in the Journal of Linguistics, Culture, and Communication agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) before and during the submission process, as it can lead to productive exchanges and earlier and greater citation of published work.
Licensing for Data Publication
The Journal of Linguistics, Culture, and Communication uses a variety of waivers and licenses that are specifically designed for and appropriate to the treatment of data:
- Open Data Commons Attribution License, http://www.opendatacommons.org/licenses/by/1.0/ (default)
- Creative Commons CC-Zero Waiver, http://creativecommons.org/publicdomain/zero/1.0/
- Open Data Commons Public Domain Dedication and Licence, http://www.opendatacommons.org/licenses/pddl/1-0/
Other data publishing licenses may be allowed as exceptions (subject to approval by the editor on a case-by-case basis) and should be justified with a written statement from the author, which will be published with the article.