Comparison of text vectorization techniques for machine learning applied to binary classification
Abstract
Text vectorization is indispensable for natural language processing (NLP). It transforms text from comments, reviews, documents, and even complete books into numerical representations that can be used as input to machine learning models. In this paper, we present a comparative analysis of several text vectorization methods. The goal is to identify which of these methods offers superior performance when combined with classification models on binary classification datasets. Each vectorization method, in combination with the models used, is evaluated with standard machine learning metrics, namely Accuracy, Precision, Recall, F1-Score, and area under the ROC curve (AUC), for a robust evaluation. CountVectorizer achieved the best performance among all vectorizers, with an average Accuracy of 0.9441 and F1-Score of 0.8577, demonstrating that the newest and most complex vectorization methods are not necessarily better for all binary classification tasks.
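For illustration only, the following is a minimal sketch (not the authors' experimental code) of the kind of pipeline the abstract describes: a CountVectorizer paired with a classifier (logistic regression is assumed here) and scored with the listed metrics; the texts, labels, and split parameters are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Placeholder binary-classification data (e.g., positive vs. negative reviews).
texts = ["great product, loved it", "terrible, broke after a day",
         "works as expected", "worst purchase ever",
         "excellent quality", "very disappointing"]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0)

# Vectorize: learn the vocabulary on the training texts only.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train a classifier on the count features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

# Evaluate with the metrics named in the abstract.
y_pred = clf.predict(X_test_vec)
y_prob = clf.predict_proba(X_test_vec)[:, 1]
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall   :", recall_score(y_test, y_pred, zero_division=0))
print("F1-Score :", f1_score(y_test, y_pred, zero_division=0))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```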