Dhaka University Journal of Applied Science & Engineering

Issue: Vol. 7, No. 2, July 2022
Title:

Context-based Bengali Next Word Prediction: A Comparative Study of Different Embedding Methods

Authors:
  • Mahir Mahbub
    Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh
  • Suravi Akhter
    Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh
  • Ahmedul Kabir*
    Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh
  • Zerina Begum
    Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh
DOI:
Keywords:

Context-based next word prediction, word embedding, sequence model, word2vec, fastText

Abstract:

Next word prediction is a helpful feature for many typing systems: suggestions offered while typing speed up the writing of digital documents. Researchers have therefore long tried to improve the capability of such prediction systems. Knowledge of the inner meaning of words, together with a contextual understanding of the word sequence, can enhance next word prediction, and with the advancement of Natural Language Processing (NLP) these ideas have become applicable in practice. NLP techniques such as word embedding and sequential contextual modeling provide the necessary tools: word embeddings capture various relations among words and encode their inner meaning, while sequence modeling captures contextual information. In this paper, we investigate which embedding method works best for Bengali next word prediction. The embeddings compared are word2vec skip-gram, word2vec CBOW, fastText skip-gram and fastText CBOW. Each is applied in a deep learning sequential model based on LSTM, trained on a large corpus of Bengali text. The results reveal useful insights about contextual and sequential information gathering that will help in implementing a context-based Bengali next word prediction system.
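
As a rough illustration of the pipeline described above, the following Python sketch (not the authors' code) trains word2vec or fastText embeddings with gensim and plugs them into a Keras LSTM that predicts the next word. The toy corpus, sequence length, and hyperparameters are purely illustrative assumptions.

# Illustrative sketch: pre-trained embeddings + LSTM next-word predictor (assumed settings).
import numpy as np
from gensim.models import Word2Vec, FastText
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.initializers import Constant

# Tokenized Bengali sentences; the paper trains on a large corpus instead.
sentences = [
    ["আমি", "প্রতিদিন", "সকালে", "ভাত", "খাই"],
    ["সে", "প্রতিদিন", "বিকালে", "বই", "পড়ে"],
    ["আমি", "সকালে", "বই", "পড়ি"],
]

# Any of the four compared embeddings can be trained here:
# sg=1 gives skip-gram, sg=0 gives CBOW, for both Word2Vec and FastText.
emb_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
# emb_model = FastText(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Word index (0 reserved for padding) and the pre-trained embedding matrix.
vocab = {w: i + 1 for i, w in enumerate(emb_model.wv.index_to_key)}
emb_matrix = np.zeros((len(vocab) + 1, emb_model.wv.vector_size))
for w, i in vocab.items():
    emb_matrix[i] = emb_model.wv[w]

# Training pairs: a window of seq_len previous words -> the next word.
seq_len = 2
X, y = [], []
for sent in sentences:
    ids = [vocab[w] for w in sent]
    for i in range(seq_len, len(ids)):
        X.append(ids[i - seq_len:i])
        y.append(ids[i])
X, y = np.array(X), np.array(y)

# LSTM sequence model on top of the frozen pre-trained embeddings.
model = Sequential([
    Embedding(input_dim=emb_matrix.shape[0], output_dim=emb_matrix.shape[1],
              embeddings_initializer=Constant(emb_matrix), trainable=False),
    LSTM(128),
    Dense(emb_matrix.shape[0], activation="softmax"),  # distribution over the next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=50, verbose=0)

# Predict the next word for the first context window.
inv_vocab = {i: w for w, i in vocab.items()}
probs = model.predict(X[:1], verbose=0)
print([inv_vocab[i] for i in X[0]], "->", inv_vocab.get(int(probs.argmax()), "<pad>"))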

References:
  1. N. Garay-Vitoria and J. Gonzalez-Abascal, "Intelligent word prediction to enhance text input rate (a syntactic analysis-based word-prediction aid for people with severe motor and speech disability)," in Proceedings of the 2nd International Conference on Intelligent User Interfaces, pp. 241--244, 1997.
  2. M. Haque, M. Habib, M. Rahman et al., "Automated word prediction in Bangla language using stochastic language models," arXiv preprint arXiv:1602.07803, 2016.
  3. N. Garay-Vitoria and J. Abascal, "Text prediction systems: a survey," Universal Access in the Information Society, vol. 4, no. 3, pp. 188--203, 2006.
  4. T. S. Rani and R. S. Bapi, "Analysis of n-gram based promoter recognition methods and application to whole genome promoter prediction," In Silico Biology, vol. 9, no. 1-2, pp. S1--S16, 2009.
  5. O. F. Rakib, S. Akter, M. A. Khan, A. K. Das, and K. M. Habibullah, "Bangla word prediction and sentence completion using GRU: an extended version of RNN on n-gram language model," in 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI). IEEE, pp. 1--6, 2019.
  6. S. Sarker, M. E. Islam, J. R. Saurav, and M. M. H. Nahid, "Word completion and sequence prediction in Bangla language using trie and a hybrid approach of sequential LSTM and n-gram," in 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT). IEEE, pp. 162--167, 2020.
  7. S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107--116, 1998.
  8. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735--1780, 1997.
  9. K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
  10. M. Sundermeyer, R. Schluter, and H. Ney, "LSTM neural networks for language modeling," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
  11. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137--1155, 2003.
  12. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, pp. 3111--3119, 2013.
  13. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135--146, 2017.
  14. A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint arXiv:1607.01759, 2016.
  15. S. Bickel, P. Haider, and T. Scheffer, "Predicting sentences using n-gram language models," in Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 193--200, 2005.
  16. G. L. Prajapati and R. Saha, "REEDS: Relevance and enhanced entropy based Dempster-Shafer approach for next word prediction using language model," Journal of Computational Science, vol. 35, pp. 1--11, 2019.
  17. K. Trnka, J. McCaw, D. Yarrington, K. F. McCoy, and C. Pennington, "User interaction with word prediction: The effects of prediction quality," ACM Transactions on Accessible Computing (TACCESS), vol. 1, no. 3, pp. 1--34, 2009.
  18. H. X. Goulart, M. D. Tosi, D. S. Goncalves, R. F. Maia, and G. A. Wachs-Lopes, "Hybrid model for word prediction using naive Bayes and latent information," arXiv preprint arXiv:1803.00985, 2018.
  19. G. Szymanski and Z. Ciota, "Hidden Markov models suitable for text generation," in WSEAS International Conference on Signal, Speech and Image Processing (WSEAS ICOSSIP 2002). Citeseer, pp. 3081--3084, 2002.
  20. T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," in Interspeech, vol. 2, no. 3. Makuhari, pp. 1045--1048, 2010.
  21. C. Zhou, C. Sun, Z. Liu, and F. Lau, "A C-LSTM neural network for text classification," arXiv preprint arXiv:1511.08630, 2015.
  22. S. Sarker, M. E. Islam, J. R. Saurav, and M. M. H. Nahid, "Word completion and sequence prediction in Bangla language using trie and a hybrid approach of sequential LSTM and n-gram," in 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT). IEEE, pp. 162--167, 2020.
  23. M. Bhuyan and S. Sarma, "An n-gram based model for predicting of word-formation in Assamese language," Journal of Information and Optimization Sciences, vol. 40, no. 2, pp. 427--440, 2019.
  24. P. P. Barman and A. Boruah, "A RNN based approach for next word prediction in Assamese phonetic transcription," Procedia Computer Science, vol. 143, pp. 117--123, 2018.
  25. R. Sharma, N. Goel, N. Aggarwal, P. Kaur, and C. Prakash, "Next word prediction in Hindi using deep learning techniques," in 2019 International Conference on Data Science and Engineering (ICDSE). IEEE, pp. 55--60, 2019.
  26. K. Shakhovska, I. Dumyn, N. Kryvinska, and M. K. Kagita, "An approach for a next-word prediction for Ukrainian language," Wireless Communications and Mobile Computing, vol. 2021, 2021.
  27. R. Rahman, "Robust and consistent estimation of word embedding for Bangla language by fine-tuning word2vec model," in 2020 23rd International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 1--6, 2020.
  28. Z. S. Ritu, N. Nowshin, M. M. H. Nahid, and S. Ismail, "Performance analysis of different word embedding models on Bangla language," in 2018 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, pp. 1--5, 2018.
  29. O. F. Rakib, S. Akter, M. A. Khan, A. K. Das, and K. M. Habibullah, "Bangla word prediction and sentence completion using GRU: an extended version of RNN on n-gram language model," in 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI). IEEE, pp. 1--6, 2019.
  30. M. S. Islam, S. S. S. Mousumi, S. Abujar, and S. A. Hossain, "Sequence-to-sequence Bangla sentence generation with LSTM recurrent neural networks," Procedia Computer Science, vol. 152, pp. 51--58, 2019.
  31. A. Joulin, M. Cisse, D. Grangier, H. Jegou et al., "Efficient softmax approximation for GPUs," in International Conference on Machine Learning. PMLR, pp. 1302--1310, 2017.
  32. T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
  33. A. Pal and A. Mustafi, "Vartani Spellcheck -- automatic context-sensitive spelling correction of OCR-generated Hindi text using BERT and Levenshtein distance," arXiv preprint arXiv:2012.07652, 2020.
  34. Y. Hong, X. Yu, N. He, N. Liu, and J. Liu, "FASPell: A fast, adaptable, simple, powerful Chinese spell checker based on DAE-decoder paradigm," in Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pp. 160--169, 2019.