Automatic Categorization of Multi Marketplace FMCGs Products using TF-IDF and PCA Features

Abstract: The growing number of internet users and the accompanying use of technology have shifted the product sales ecosystem toward electronic commerce (e-commerce). A total of 73.23 percent of customers made purchase transactions using e-commerce, and the most purchased products were those classified as Fast Moving Consumer Goods (FMCGs). The increasingly varied FMCGs data, coupled with the growing number of marketplaces, needs to be broken down into specific groups. This is done by analyzing e-commerce product information, especially product names and descriptions. In this study, we propose an automatic categorization of FMCGs products using data from multiple marketplaces. Text data is converted into structured data through a series of preprocessing steps, and comprehensive experiments are carried out to compare the performance of three variable extraction methods: TF-IDF, BOW, and N-Gram. All three methods are evaluated on the text dataset through the resulting K-Means groupings, with PCA used to reduce the data dimensions. The results show that TF-IDF with a dimension reduction value of 70, implemented in Python, gives the best results in terms of the percentage of data assigned to each group.


I. INTRODUCTION
At present, companies use information technology as an online buying and selling platform. This is in line with the number of internet users, which has reached 62.10 percent of Indonesia's total population, an increase of 22.2 percentage points from 39.90 percent in 2018 [1]. The use of electronic commerce (e-commerce) lets a company market its products instantly and makes it easier for consumers to find a product. This encourages companies to carry out digital transformation and invest more in online purchasing platforms so that they can compete in meeting customer satisfaction and needs. Companies can optimize existing business processes by improving user experience [2]. According to the Indonesian Central Bureau of Statistics, as many as 71.23 percent of customers make purchase transactions using e-commerce services [3]. These numbers indicate that users tend to turn to e-commerce when searching for what they need.
The products offered on e-commerce platforms vary. One type is products with a relatively short useful life and a relatively high consumption volume, namely Fast Moving Consumer Goods (FMCGs). FMCGs is a term for products that are bought often, consumed quickly, sold in bulk, and relatively low in price [4]. FMCGs are "non-durable" items required for daily use. Examples of products classified as FMCGs are shown in Fig. 1. The FMCGs product sector is the largest industry in the world, consisting of various products such as food, beverages, electronic devices, household appliances, medicines, and others [5]. Consumers usually buy products in this category at least once a month. Based on data from the Central Statistics Agency, the most sold products via the Internet in 2020 were the food, beverage, and grocery categories at 40.86 percent, clothing at 20.71 percent, household needs at 20.30 percent, cosmetics at 8.05 percent, and other product categories at 38.40 percent [3].
New FMCGs products continue to emerge, resulting in more diverse data for customers to consume. This is coupled with the emergence of various e-commerce platforms, each offering its own conveniences. Customers therefore need multi-marketplace product grouping when searching for products and comparing prices between marketplaces. Each product has detailed information in the form of product name, price, rating, description, number of products sold, store name, store address, and others. This information can be used to identify similar products across multiple marketplaces.
The process of grouping products based on multi-marketplace categories generally uses soft computing methods. Several studies on product categorization have used machine learning. For example, one study on e-commerce product categorization proposes a machine translation paradigm that uses classification techniques over the nodes of a taxonomic tree [6]. Meanwhile, Chavaltada et al. compared the performance of various machine learning techniques for product categorization within a proposed framework [7].
This study aims to group products into categories based on the available product information (e.g. product names and descriptions), emphasizing tokenization and word weighting so that an automatic product grouping is obtained. Several algorithms can be used for grouping, especially unsupervised learning algorithms. A series of studies have applied unsupervised learning, for example in analyzing mass spectrometry imaging [8], health [9][10][11], agriculture [12][13], and various other fields. Some use K-Means clustering, hierarchical clustering, principal component analysis, or factor analysis. This research uses text clustering because it identifies and groups products based on product information (for example product names and descriptions), which is unstructured data. This choice is supported by research stating that text clustering based on similarities between texts is an efficient technique for partitioning documents into groups [14]. The technique is therefore tested here on the grouping of FMCGs product categories. In addition, the grouping focuses on the K-Means algorithm, a non-hierarchical grouping method that identifies similarities between objects based on distance vectors and is efficient, concise, and fast to implement [15]. This method is considered capable of carrying out the machine learning process quickly and providing optimal grouping results.
However, product information (product name and description) is unstructured text data, so it must first be converted into structured data before unsupervised learning is applied. Experiments were therefore conducted and analyzed to determine how well TF-IDF, BOW, and N-Gram convert unstructured data into structured data, so that text grouping of FMCGs products can be more accurate. This conversion turns text data into numeric data by tokenizing each record and extracting it into structured values. The Principal Component Analysis (PCA) method is also used to reduce the dimensions of the resulting token data and to find the relationships between variables.
Based on the explanation above, this study is expected to provide an automatic grouping of product name and description data according to the categorization of FMCGs products, by applying machine learning that divides the data into clusters so that groups of similar data are formed.

II. THEORETICAL BASIS
Source data can be found in various digital forms, such as web pages, media, articles, blogs, and more. Most of this information can be retrieved from a web page by accessing the website's URL and configuring the retrieval for its structure. This research retrieves data using a crawling technique on e-commerce websites. This approach has already been used in various studies to gather image data [16][17][18] and text data [19][20][21][22]. Crawling is an approach that can gather text and image data from a website; this research focuses on the product names and product descriptions of items classified as FMCGs.
Machine Learning (ML) is a series of processes that can perform automated analysis of large amounts of data by finding patterns in data. In ML, developers strive to make the processes implemented in machines resemble human thought processes. ML is becoming an excellent tool for innovation due to its low computational costs, short development cycles, robust data analysis, and predictive capabilities [23]. Today, ML has been used in sensor utilization [24], digital transformation [25], manufacturing [26], healthcare [27] and has even entered the financial field [28] [29].
In ML, clustering is an unsupervised method: it has no labeled inputs, and problem-solving is based on the experience that algorithms gain from solving similar problems [30]. Because the data in this study is text, the study focuses on converting text data into vectors derived from the tokenization process. Clustering then groups the vector data based on similarity.
Sen Xu et al. [31] used text data for clustering and found that the applied algorithm gives optimal results on small data but does not yet give good results on large-scale data. In addition, Diallo et al. [32] compared Cosine and Euclidean distances as similarity measures for text clustering. This study, in contrast, focuses on comparing text weighting methods, namely TF-IDF, BOW, and N-Gram, to obtain optimal clustering results.
In addition, this study performs dimension reduction using Principal Component Analysis (PCA). PCA can be used to analyze relationships between variables. In clustering, PCA is useful for identifying the main variables and their effect on the target variable [33]. Here it is used to reduce the number of variables that enter the clustering process.

III. RESEARCH METHODOLOGY
This research involves the process of preprocessing, variable selection using text weighting methods based on words contained in the data, dimensional reduction, and K-Means clustering techniques for grouping data based on the similarity of each record, as in Fig. 2.

A. Internet, Multimarketplace, Web Crawler & Files
A wide range of information is available on the internet, from the latest news, stock prices, and products for sale to predictions about the future. The data for this study was obtained from several marketplaces, namely Shopee, Bukalapak, and Lazada, which were accessed via the internet using web crawlers. A web crawler is a data collection technique that captures information from e-commerce websites; here the captured information consists of product names and descriptions.

B. Preprocessing
Preprocessing is a series of stages carried out before the data is processed. It aims to normalize the form of the data, for example by removing unimportant symbols or numbers, eliminating frequently occurring words, and returning words to their base form. In this study, preprocessing consists of cleaning the data, removing punctuation, symbols, and numbers (remove punctuation), removing unused words (remove stopwords), and reducing words to their base form (stemming).
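The preprocessing chain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stoplist and the suffix-stripping rule are hypothetical stand-ins, since the study itself relies on the Indonesian stopword list in nltk and on Porter stemming, and the function names are invented for this example.

```python
import re

# Hypothetical hand-written stoplist for illustration only; the study uses
# the Indonesian stopword list from nltk plus manually added words.
STOPLIST = {"dan", "yang", "di", "untuk", "termurah", "premium"}

# Hypothetical suffix list standing in for a real stemmer; it only strips
# a few common Indonesian suffixes and is not the Porter algorithm.
SUFFIXES = ("nya", "kan", "an")


def remove_punctuation(text):
    """Lowercase the text and keep only the letters a-z and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(tokens):
    """Drop every token that appears in the stoplist."""
    return [t for t in tokens if t not in STOPLIST]


def naive_stem(token):
    """Strip one known suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run punctuation removal, stopword removal, and stemming in order."""
    tokens = remove_punctuation(text).split()
    tokens = remove_stopwords(tokens)
    return [naive_stem(t) for t in tokens]


example = "Kaos Polos Pria & Wanita TERMURAH!!! Bahan Premium 100%"
print(preprocess(example))
```

Each step is a separate function so that a real stopword list or stemmer can replace the placeholder without touching the rest of the chain.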

C. Variable Extraction
Variable extraction is a process for converting text data into numeric form. This study will test and compare the results of variable extraction between TF-IDF, BOW, and N-Gram with the resulting K-Means grouping.
1) TF-IDF: applied to measure term frequency and then filter out words that appear with very low frequency [32]. This algorithm performs TF and IDF calculations over the documents. The TF and IDF formulas are given in equations (1) and (2).
2) BOW (Bag of Words): a way of extracting variables from text into numbers by representing textual documents as sparse vectors of word counts [34].
3) N-Gram: a text processing model in which n indicates the number of consecutive terms or characters that are grouped into one unit (bi-gram, tri-gram, quad-gram), so that a document is represented by the collection of such units it contains [35].
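The three extraction methods listed above can be illustrated with a small pure-Python sketch. It assumes one common TF-IDF variant, tf = count / document length and idf = log(N / df), which may differ from the exact form of equations (1) and (2); the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return contiguous word n-grams, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(docs):
    """Build a sorted vocabulary and one raw count vector per document."""
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors


def tf_idf(docs):
    """Weight counts with tf = count / doc length and idf = log(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    weights = []
    for doc, row in zip(docs, counts):
        length = len(doc) or 1
        weights.append([
            (row[j] / length) * math.log(n_docs / df[j])
            for j in range(len(vocab))
        ])
    return vocab, weights


# Hypothetical tokenized product names after preprocessing.
docs = [
    ["kaos", "polos", "pria"],
    ["kaos", "polos", "wanita"],
    ["sabun", "cuci", "muka"],
]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
bigram_docs = [ngrams(doc, 2) for doc in docs]
```

In this toy example "pria" occurs in one document and "kaos" in two, so TF-IDF gives "pria" the larger weight in the first document, while BOW gives both a count of 1.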

D. Dimensionality Reduction
Dimension reduction is a stage used to minimize the number of input variables before clustering. The dimension reduction used in this study is Principal Component Analysis (PCA). PCA transforms high-dimensional data into a lower-dimensional space by keeping only a number of principal components, so that the dimensions of the transformed data are reduced [36].
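A minimal NumPy sketch of this reduction is shown below. It assumes that a "PCA value of 70" means keeping 70 principal components, which the text does not state explicitly, and it uses a random matrix as a hypothetical stand-in for the real document-term matrix.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of the total variance carried by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]


rng = np.random.default_rng(0)
# Hypothetical stand-in for a 300 x 500 TF-IDF matrix.
X = rng.random((300, 500))
X_reduced, ratio = pca_reduce(X, 70)
print(X_reduced.shape)  # (300, 70)
```

The returned variance ratios show how much information the retained components keep, which is one way to judge whether a chosen number of components is reasonable.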

E. Clustering
Clustering is a type of unsupervised machine learning. This study focuses on K-Means, which identifies similarities between variables based on distance vectors and is efficient, concise, and fast to implement. In general, K-Means clustering groups variables through the following stages:
1) Specify the value of K, the number of initial clusters (centroids) to be formed.
2) Calculate the distance between each data point and each centroid using the Euclidean Distance formula, in order to find the centroid closest to each data point. The Euclidean Distance is given in equation (3):
$$d(x_i, \mu_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - \mu_{jk})^2} \quad (3)$$
where $x_i$ is a data point, $\mu_j$ is a centroid, and $n$ is the number of attribute values of the variable.
3) Assign each data point to a cluster based on its proximity to the centroids.
4) Update each centroid to the average of its cluster, using the formula:
$$\mu_j(t+1) = \frac{1}{N_{S_j}} \sum_{i \in S_j} x_i$$
where $\mu_j(t+1)$ is the new centroid at iteration $t+1$ and $N_{S_j}$ is the number of data points in cluster $S_j$.
5) Repeat steps 2 through 4 until the centroid values no longer change.
In practice, Python already provides a library for running the K-Means algorithm: the KMeans class can be imported from the sklearn.cluster module.
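To make the five stages concrete, the following from-scratch NumPy sketch follows them one by one. It is an illustration rather than the implementation used in the study, which relies on sklearn.cluster; the random initialization, the stopping test with np.allclose, and the toy data are assumptions made for this example.

```python
import numpy as np


def k_means(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k groups following the five stages above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid.
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = k_means(X, k=2)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself, not the label numbers, should be compared with the ground-truth categories.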

IV. RESULTS AND DISCUSSION
A total of 300 e-commerce product records were collected in this research from three e-commerce platforms, namely Lazada, Shopee, and Bukalapak. The data from each platform covers the categories of clothing, beauty equipment, electronics, and health. The category data is used as ground truth to analyze and compare the results of the groupings carried out in this research design.
The following preprocessing stages were carried out in this study:
1) Data Cleaning: removes unnecessary columns from the crawled dataset. This streamlines the data so that only the data to be processed remains. Tables I and II compare the data before and after cleaning.
2) Remove Punctuation: removes punctuation, symbols, and numbers, and converts all letters to lowercase. This homogenizes the form of the data so that subsequent processing is more effective. Table III compares the data before and after punctuation removal.
3) Stopwords: removes unimportant words, such as connecting words, words that refer to people, or words that occur very frequently. A set of Indonesian stopwords is available in the Python nltk library, and it can be extended manually with a list of additional words to remove. The set of words to be removed is called a stoplist. Table IV compares the data before and after stopword removal. The difference arises because the words "termurah", "pendek", "premium", "pria", "dan", and "wanita" are included in the stoplist and are therefore removed.
4) Stemming: reduces a word to its base form by removing the affixes attached to it. The Porter stemming algorithm is applied at this stage. Table V compares the data before and after stemming.
The next stage is variable extraction, in which the stemmed data is processed with three methods, namely TF-IDF, BOW, and N-Gram. The experiments also examined the PCA dimensionality and its effect on the clustering results for the FMCGs product dataset, and compared the clustering results obtained with the Weka and Python tools. From Table VII it can be concluded that TF-IDF with a PCA value of 70 comes closest to the balanced data division of 25% per cluster. In contrast to the WEKA tool, the Python implementation was able to provide better clustering on the dataset for variable extraction using BOW and N-Gram. These results corroborate the research by Huong et al. comparing several word vector representation techniques applied to Vietnamese sentiment analysis [37]. They also agree with a study that used Arabic Twitter data to analyze attitudes toward distance education, which found that TF-IDF variable extraction works better than the AraVec model.
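The comparison with a balanced 25% division can be made explicit with a short sketch. The mean absolute gap used here is an assumed summary measure chosen for illustration, not one named in the study, and the cluster assignments are hypothetical.

```python
from collections import Counter


def cluster_shares(labels, k):
    """Percentage of records assigned to each of the k clusters."""
    counts = Counter(labels)
    return [100.0 * counts.get(j, 0) / len(labels) for j in range(k)]


def deviation_from_balanced(shares):
    """Mean absolute gap, in percentage points, from an even split."""
    target = 100.0 / len(shares)
    return sum(abs(s - target) for s in shares) / len(shares)


# Hypothetical cluster assignments for 300 records and 4 clusters.
labels = [0] * 80 + [1] * 70 + [2] * 75 + [3] * 75
shares = cluster_shares(labels, 4)
print(shares)
print(deviation_from_balanced(shares))
```

A smaller deviation means the cluster sizes are closer to the even 25% split expected when the four ground-truth categories are equally represented.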
Furthermore, the clustering was evaluated with the Elbow technique, which determines the optimal number of clusters, and with the Silhouette Coefficient and the Davies Bouldin Index. The Elbow method compares the percentage change in the result as the number of clusters is increased [39]. The Davies Bouldin Index measures cluster validity by summing the proximity of data points to the center point of the cluster they belong to [40], while the Silhouette Coefficient shows how well a particular cluster is separated from the others [41]. The results of the Elbow method can be seen in Fig. 3, and the results of the Davies Bouldin Index and Silhouette Coefficient can be seen in Table VIII. According to Table VIII, the optimal result is obtained at K=4, that is, with a division into four clusters.
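The three evaluation measures can be sketched in NumPy using their standard definitions, which are assumed here because the text does not spell out its formulas. The within-cluster sum of squares is the quantity usually plotted for the Elbow method; the function names and the four-point dataset are hypothetical, and at least two clusters are required.

```python
import numpy as np


def wcss(X, labels):
    """Within-cluster sum of squares, the quantity plotted by the Elbow method."""
    return float(sum(
        np.sum((X[labels == j] - X[labels == j].mean(axis=0)) ** 2)
        for j in np.unique(labels)
    ))


def silhouette(X, labels):
    """Mean silhouette coefficient; values near 1 mean well-separated clusters."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to the other points of the same cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


def davies_bouldin(X, labels):
    """Davies-Bouldin index; lower values mean compact, well-separated clusters."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Average distance of each cluster's points to its own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(clusters)
    ])
    ratios = []
    for i in range(len(clusters)):
        ratios.append(max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(clusters)) if j != i
        ))
    return float(np.mean(ratios))


# Hypothetical toy data: two tight pairs of points far apart.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([0, 0, 1, 1])
bad = np.array([0, 1, 0, 1])
print(wcss(X, good), silhouette(X, good), davies_bouldin(X, good))
```

Running the same functions for several values of K and comparing the scores mirrors the selection of the number of clusters described above: a higher silhouette and a lower Davies-Bouldin value both indicate a better partition.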

V. CONCLUSION
Based on the description above, this research proposes the automatic categorization of multi-marketplace FMCGs products. Because a relatively large amount of data is used, the removal of unused words needs careful attention, especially at the preprocessing stage, since it affects the results of variable extraction and the groupings formed. Experiments were conducted on three variable extraction algorithms, namely TF-IDF, BOW, and N-Gram. In addition, this study applies PCA to summarize large tables of variables into smaller sets of variables (summary indices).
The experimental analysis above shows that TF-IDF variable extraction yields a multi-marketplace product categorization that is close to the optimal cluster value when four clusters are formed. This is in line with research [22] showing that variable extraction strongly influences clustering models, and with the finding that TF-IDF is a variable extraction method that can provide significant results [42]. The application of PCA also has a significant influence on the cluster results, specifically for TF-IDF with a PCA value of 70, and the use of Python can optimize the data so that better results are obtained. This research will continue to be developed, both by testing the results when the amount of product data is expanded and by conducting multilevel grouping to determine multi-marketplace product subcategories automatically, refining the category grouping produced in this study.