NATURAL LANGUAGE PROCESSING IN UKRAINIAN: CHALLENGES AND PROSPECTS FOR THE USE OF ARTIFICIAL INTELLIGENCE IN EDUCATION
Abstract
The article is devoted to the study of problems that arise when natural language processing (NLP) technologies are applied to the analysis and generation of educational materials in the Ukrainian language. The purpose of the study is to analyze the results of test generation from proposed content and to identify possible causes of incorrect behavior in NLP models that process Ukrainian-language educational materials. The study employs token filtering methods based on self-attention algorithms, and the BLEU score is used to evaluate the results obtained with BERT. The authors focus on the challenges arising from the limited resources available for the Ukrainian language, in particular the insufficient number of text corpora for training artificial intelligence models. The article examines the main reasons for the low quality of results produced by NLP models, including irrelevant training data, incorrect tokenization, a lack of contextual analysis, and weak logical connections in the text. The study compares the performance of the OpenAI and BERT language models in terms of accuracy, contextual understanding, and adaptability to the Ukrainian language. The authors propose using bidirectional context analysis, as implemented in BERT, to improve text comprehension and test generation. The experimental part of the study demonstrates that adjusting tokenization settings, applying stop-word filtering, and using self-attention algorithms significantly improve output quality. The article emphasizes the need to develop specialized models adapted to the peculiarities of the Ukrainian language and to expand the volume of training data for professional domains. Based on an analysis of different token filtering methods, the study concludes that tokenization should be configured individually for each task, since this significantly affects model performance. The conclusions highlight the potential of NLP in education, provided that the technology is further improved and adapted to linguistic realities. The study may serve as a foundation for the further adaptation of language models to the development of test tasks.
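To illustrate the evaluation step described above, the following minimal Python sketch scores a model-generated Ukrainian sentence against a reference sentence with BLEU after tokenization and stop-word filtering. It is a hedged approximation, not the authors' pipeline: the tokenizer is a simple regular expression, the stop-word list is a small illustrative sample rather than a complete resource, and the example sentences are invented; only the nltk BLEU utilities are real library calls.

import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative subset of Ukrainian function words (an assumption, not a full list).
UK_STOP_WORDS = {"і", "та", "й", "у", "в", "на", "з", "до", "що", "це", "як"}

def tokenize(text):
    # Lowercase and keep runs of word characters and apostrophes.
    return re.findall(r"[\w']+", text.lower())

def filter_stop_words(tokens):
    return [t for t in tokens if t not in UK_STOP_WORDS]

def bleu_score(reference, candidate):
    # Compare a candidate against a single reference after filtering.
    ref_tokens = filter_stop_words(tokenize(reference))
    cand_tokens = filter_stop_words(tokenize(candidate))
    smoothing = SmoothingFunction().method1  # avoids zero scores on short texts
    return sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smoothing)

# Invented example: a reference sentence vs. a model-generated paraphrase.
reference = "Фотосинтез перетворює світлову енергію на хімічну енергію."
candidate = "Фотосинтез перетворює енергію світла на хімічну енергію."
print(round(bleu_score(reference, candidate), 3))

Filtering high-frequency function words before scoring keeps the metric focused on content-bearing tokens; in practice the stop-word list and tokenizer would need to be tuned per task, consistent with the article's conclusion that tokenization should be configured individually.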