TEXT DOCUMENT CLASSIFICATION SYSTEM WITH BIG DATA TECHNOLOGIES USAGE

Authors

DOI:

https://doi.org/10.32782/IT/2023-2-4

Keywords:

Big Data, Hadoop, MapReduce, Apache Spark, Machine Learning Algorithm, systems of classification, Bayes Classifier.

Abstract

The aim. The paper considers a model of a document classification system built on Big Data technology. In such a system, a large array of documents accumulates on the server and must be pre-processed and loaded into a database. Keywords must be identified in each document and used to assign it to one or more thematic sections. The system must also operate quickly and support automatic learning. Developing models and methods for classifying text documents in real time is therefore an urgent task. These methods have developed intensively in recent years, driven by rapid progress in computer technology and by the transition of many organizations to electronic document management. As a result of the study, a method and a system model were developed, a combination of approaches to model training was proposed, and the most productive model for training the system was determined.

Scientific novelty. The paper proposes a new solution for performing accurate Bayesian classification on Spark. The classifier relies on a large number of in-memory server operations to classify a large number of text documents against a large training dataset using MapReduce. The map phase counts the occurrences of keywords in different partitions of the training data. Several reducers then use these counts to calculate the probability that a document belongs to each class. The key point of the proposal is to keep the set of text documents in memory whenever possible.

Conclusions. The results of this work can be used to implement an effective classification system for text documents that uses an accurate Bayesian classifier developed in the Python programming language in combination with the Hadoop Big Data service.
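
The map and reduce steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' Spark implementation: the function names (map_partition, merge_counts, train, classify), the toy training data, whitespace tokenization, and Laplace (add-one) smoothing are all assumptions made for illustration, and the partitions are processed in a single process rather than on a cluster.

```python
import math
from collections import Counter
from functools import reduce


def map_partition(partition):
    """Map phase: count word occurrences per class in one data partition."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in partition:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def merge_counts(left, right):
    """Reduce phase: merge the partial counts produced by two partitions."""
    words = {label: Counter(counts) for label, counts in left[0].items()}
    for label, counts in right[0].items():
        words.setdefault(label, Counter()).update(counts)
    return words, left[1] + right[1]


def train(partitions):
    """Run the map step on every partition, then fold the results together."""
    return reduce(merge_counts, map(map_partition, partitions))


def classify(model, text):
    """Return the class with the highest naive Bayes log-probability."""
    word_counts, doc_counts = model
    vocabulary = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Log prior of the class plus smoothed log likelihood of each word.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set, split into two partitions.
partitions = [
    [("sports", "match goal team"), ("finance", "bank loan rate")],
    [("sports", "team wins match"), ("finance", "bank rate rises")],
]
model = train(partitions)
print(classify(model, "team goal"))
```

Because the merged counts are plain dictionaries held in memory, classifying further documents requires no new pass over the training data, which loosely mirrors the in-memory emphasis of the proposal.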

Published

2023-09-12