Implementation of Modified Backpropagation with Conjugate Gradient as Microarray Data Classifier with Binary Particle Swarm Optimization as Feature Selection for Cancer Detection

Cancer is one of the deadliest diseases in the world and needs to be handled as early as possible. One method to detect the presence of cancer cells early on is microarray data, which stores human gene expression and can be used to classify cancer cells. One of the challenges of using microarray data, however, is its vast number of features relative to its small number of samples. To resolve that problem, dimensionality reduction is needed to reduce the number of features stored in microarray data. Binary Particle Swarm Optimization (BPSO) is one method of reducing the dimensionality of microarray data that can increase classification performance, although when combined with Backpropagation, BPSO still shows relatively low performance. In this research, Modified Backpropagation with Conjugate Gradient is used to classify data that has been reduced with BPSO. The average accuracy of BPSO+CGBP is 86.1%, an improvement over BPSO+BP, which averaged 80.8%.

Keywords— cancer, microarray, binary particle swarm optimization, backpropagation, conjugate gradient


INTRODUCTION
Cancer is one of the deadliest diseases in the world, responsible for the loss of 9.6 million lives in 2018. To date, there is no method of treatment that can effectively cure cancer. However, if cancer is detected early, patients can be treated in time to prevent cancer deaths [1].
One technology that can detect the presence of cancer cells early on is microarray data. Microarray data collects and stores human gene expressions in an array of floating-point numbers [2]. The data can then be used to classify whether a gene sample is cancerous or not [3]. With early detection of cancerous tissue, infected patients can be treated more quickly, reducing the chance of the cancerous tissue spreading.
Microarray data is characterized by a large number of features, which can reach tens of thousands per sample, combined with only a small number of samples. Not all of these features carry relevant information for the classification process [4]. Therefore, dimensionality reduction is needed to reduce the number of features in microarray data. One dimensionality reduction method that can be used is Binary Particle Swarm Optimization (BPSO). In previous studies, BPSO was shown to produce accuracy almost twice that of Information Gain, 99.5% compared to 54.5% [5]. However, BPSO still gives a poor result when combined with Backpropagation, with an accuracy of 44% when used for cancer data classification [6].
Based on research conducted in 2015, Modified Backpropagation with Conjugate Gradient can improve accuracy on cancer data compared to classification using classic Backpropagation [7]. Therefore, this study implements Modified Backpropagation with Conjugate Gradient to classify data that has been reduced with BPSO and compares the result with classic Backpropagation.

A. Microarray Data
The growth of cancer in cells can cause genetic aberrations. One widely studied technique to detect those aberrations is microarray data, as it enables rapid genetic analysis [8]. Microarray data is a technology that can collect and store thousands of extracted human gene expressions in an array of numbers at the same time [9]. Each entry in the microarray is stored as a floating-point number that represents the composition of gene expression in the human body [2]. The data can then be compared on a genomic scale with other samples to search for biological relevance such as cancer [8].
The data is observed with a DNA microarray, a microscope slide that stores mRNA extracted from human cells. The genes are arranged in specific positions on the glass plate and reacted with cDNA that has been labeled with fluorescent dye. This reaction produces bright colors, which can be translated into the data stored in the microarray [10].

B. Binary Particle Swarm Optimization
After the data has gone through the pre-processing stage, the number of features in the data will be reduced with Binary Particle Swarm Optimization (BPSO). BPSO is a modified version of the original Particle Swarm Optimization (PSO), which works by imitating the movement of a flock of birds or a school of fish. In PSO, a particle is defined as a member of the population that represents a solution to the current problem. The movement of the flock is simulated on the population until a solution that fulfills the stopping condition is found [11]. Each particle moves based on a velocity value that is updated based on the particle with the best fitness.
Many problems occur in discrete space, which encouraged Kennedy and Eberhart (1997) to modify PSO into a binary form [12]. In binary space, particles define their change in coordinates by turning the bits in the particles on or off. Velocity is defined as the probability that a bit will take the value 1. In PSO, velocity is updated as follows [13]:

v_id = w · v_id + c1 · r1 · (p_id − x_id) + c2 · r2 · (p_gd − x_id)    (1)

where r1 and r2 are random numbers in the range [0, 1], c1 is the cognitive learning factor, c2 is the social learning factor, and w is the inertia weight. The velocity and position of the i-th particle in the d-th dimension of the search space are v_id and x_id, respectively. The previous best position of the i-th particle in the d-th dimension is p_id, while p_gd is the best position found by the swarm in the d-th dimension [11,14].
In BPSO, velocity is updated in the same fashion as in PSO. However, because the values of x_id, p_id, and p_gd can only be 0 or 1, velocity is interpreted as the probability of a bit in the position vector taking the value 1. This means the velocity must be limited to the range [0.0, 1.0], which is obtained by applying the sigmoid function:

S(v_id) = 1 / (1 + e^(−v_id))    (2)

Once the velocity is found, the position is updated with the following rule:

x_id = 1 if rand() < S(v_id), otherwise x_id = 0    (3)

where rand() is a function that generates a random value in the range [0.0, 1.0] and S(v_id) is the sigmoid function in equation (2) [9,11].
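As an illustration, the velocity, sigmoid, and position update rules above can be sketched in NumPy as follows. This is a minimal sketch, not the paper's implementation; the velocity clamp `v_max` and the default parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    # Squashes velocity into [0.0, 1.0] so it can act as a probability.
    return 1.0 / (1.0 + np.exp(-v))

def bpso_step(x, v, p_best, g_best, w=0.5, c1=1.0, c2=2.0, v_max=4.0):
    """One BPSO update for a single particle.

    x: current binary position (0/1 vector, one bit per feature)
    v: current velocity vector
    p_best: particle's own best position; g_best: swarm's best position
    """
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    # Velocity update, identical in form to continuous PSO.
    v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    v = np.clip(v, -v_max, v_max)  # clamp is a common practice, assumed here
    # Position update: each bit becomes 1 with probability sigmoid(v).
    x = (rng.random(x.shape) < sigmoid(v)).astype(int)
    return x, v
```

In a full run, `bpso_step` would be applied to every particle each iteration, with `p_best` and `g_best` refreshed from the cost function after each move.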

C. Backpropagation Neural Network
Backpropagation is an algorithm to train Artificial Neural Networks (ANN). It is used as a method to update the weights and biases of each neuron in the network. There are three main stages in training an ANN with Backpropagation, namely forward propagation, backward propagation, and weight update [15]. Forward propagation calculates the predicted output, backward propagation calculates the loss of the output to find the gradient of each weight and bias, and the weight update adjusts the weights and biases based on those gradients. The loss is calculated with the Mean Squared Error, defined as follows:

MSE = (1 / (N · C)) · Σ_n Σ_c (t_nc − y_nc)²    (4)

where t_nc is the target output and y_nc is the predicted output for the n-th data point and c-th class, respectively. N is the number of data points and C is the number of classes [7,16].
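For concreteness, the loss in equation (4) can be computed as below. This assumes, per the definitions of N and C above, that the squared error is averaged over both samples and classes.

```python
import numpy as np

def mse_loss(targets, outputs):
    # Mean squared error over N samples and C classes (equation (4)).
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    n, c = targets.shape
    return np.sum((targets - outputs) ** 2) / (n * c)
```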

D. Modified Backpropagation with Conjugate Gradient
Conjugate Gradient (CG) is an optimization method that aims to minimize a function. CG works by searching along conjugate directions, which are mutually orthogonal [17]. Two nonzero vectors u and v are said to be orthogonal if their inner product is 0, which can be formulated as follows [7]:

u^T · v = 0    (5)

Because CG searches along orthogonal directions, it can converge to a solution quickly [7]. This allows CG to train neural networks faster and with less memory than classic Backpropagation [16]. CG minimizes the objective function in equation (4) by updating the weights as:

w(t+1) = w(t) + α(t) · d(t)    (6)
d(t+1) = −g(t+1) + β(t) · d(t)    (7)

where α and β are momentum parameters that help the solution avoid getting stuck at a local optimum, d is the search direction, and g is the gradient of the objective function. The parameter α can be found using the line search technique [17]. Meanwhile, the parameter β can be found using one of the following methods:

a) Powell-Beale: β(t) = g(t+1)^T (g(t+1) − g(t)) / (d(t)^T (g(t+1) − g(t)))    (8)
b) Fletcher-Reeves: β(t) = g(t+1)^T g(t+1) / (g(t)^T g(t))    (9)
c) Polak-Ribiere: β(t) = g(t+1)^T (g(t+1) − g(t)) / (g(t)^T g(t))    (10)

where g(t+1) and d(t+1) are the gradient and direction in the current iteration, while g(t) and d(t) are those of the previous iteration.

The training procedure follows the stages of Backpropagation. Forward propagation:
3. Calculate the output of the hidden layer (j = 1, 2, …, p).
Backward propagation:
5. Calculate the error factor of the output layer based on the difference between the actual and predicted results.
6. Calculate the error factor of the hidden layer based on the errors propagated from the layer after it.
7. Calculate the gradient on the output units.
8. Calculate the gradient on the hidden units.
9. Calculate the parameter β for all neurons in the hidden and output units with equation (8), (9), or (10).
10. Calculate the direction d for all neurons in the hidden and output units. For the initial iteration, the direction is the negative gradient, d(0) = −g(0).
11. Calculate the parameter α for all neurons in the hidden and output units.

Weight update:
12. Update the weights with w(t+1) = w(t) + α(t) · d(t).

III.

A. System Overview
This research aims to build a system with classic Backpropagation and Modified Backpropagation with Conjugate Gradient as classifier method and Binary Particle Swarm Optimization (BPSO) as dimensionality reduction method. Both classification results will be compared and analyzed.
The classification system will be built in several stages. The first stage is pre-processing, where the microarray data will be normalized into the interval [0.0, 1.0]. After the data has been normalized, it will be divided into training data and testing data with K-Fold Cross Validation. The next stage is dimensionality reduction using the BPSO method. After the features in the microarray data have been reduced, the data will be used to train models with two methods, namely classic Backpropagation and Modified Backpropagation with Conjugate Gradient. After the models are trained, they will be used to classify the test data. The accuracies of the two classification results will be calculated and compared with one another. The flowchart of the system can be seen in Figure 1.

B. Dataset
This research uses five cancer datasets originating from the Kent-Ridge Biomedical Data Set Repository, namely Breast Cancer, Colon Tumor, Lung Cancer, Ovarian Cancer, and Prostate Cancer [18]. The specifications of the datasets can be seen in Table I.

C. Pre-processing
The pre-processing stage refines the data that will be used for classification. In this stage, the data goes through a normalization process. Normalization maps data from its original range to a range such as [0.0, 1.0] to ensure that the data being tested can be compared with one another [19].
The method used to normalize the data is Min-Max Normalization, which performs a linear transformation on the data. Therefore, the relationships between data points remain the same as in the original data [20]. To perform Min-Max Normalization, the data is converted with the following formula:

v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A

where v is the currently observed value of an attribute A with the range [min_A, max_A] that will be mapped to the new range [new_min_A, new_max_A].
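A minimal NumPy sketch of this per-attribute mapping; the guard against division by zero for constant columns is an added assumption, not part of the formula above.

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    # Linearly maps each column (attribute) from [min_A, max_A]
    # to [new_min, new_max].
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    # Avoid division by zero when a column is constant (assumed guard).
    scale = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (x - col_min) / scale * (new_max - new_min) + new_min
```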

D. Data Split with K-Fold Cross Validation
To divide the data into training data and test data, the K-Fold Cross Validation method will be used. This method divides the data into k parts; in each of the k iterations, one part acts as the test data while the remaining parts act as the training data.
In this study, the value used is k = 5, meaning the data is divided into five subsets, each of which takes a turn as test data. The ratio of training data to test data for each fold is therefore 4:1, or 80%:20%.
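The 5-fold split described above can be sketched with plain NumPy as follows. The shuffle and seed are assumptions for illustration; the paper does not state whether the data is shuffled before splitting.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    # Shuffle the sample indices once, then split them into k
    # roughly equal folds; each fold serves as the test set in turn.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```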

E. Feature Selection with Binary Particle Swarm Optimization
The first stage in feature selection with BPSO is the determination of parameter values. The population size is tested with several values, namely 10, 20, 30, and 50, so that small, medium, and large populations are covered [22]. Other parameters that must be initialized are the cognitive learning factor (c1), the social learning factor (c2), and the inertia weight (w). There are eight test combinations of c1, c2, and w, which can be seen in Table II.

In BPSO, the particles seek their direction based on the particle with the best fitness in the population. A particle is considered to have the best fitness when it has the smallest cost value. Since the purpose of feature selection is to improve classification performance, the best fitness is obtained by particles with the smallest error rate and number of features. The cost function is adopted from the research conducted by Vieira et al. in 2013, adjusted for a minimization process [23,24]:

cost = α · (1 − P) + (1 − α) · (N_f / N_t)

where P is the performance value of the classification model, N_f is the number of selected features on the particle, and N_t is the number of original features. The constant α is a weight that determines the relative importance of P and N_f.
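The minimization cost described above can be sketched as follows. The exact weighting form and the default α value are assumptions based on the description, not a verbatim reproduction of the paper's formula.

```python
def bpso_cost(performance, n_selected, n_total, alpha=0.9):
    """Particle cost for feature selection: lower is better.

    performance: classification performance P in [0, 1]
    n_selected:  number of features switched on in the particle (N_f)
    n_total:     number of original features (N_t)
    alpha:       weight trading off error rate vs. subset size (assumed value)
    """
    return alpha * (1.0 - performance) + (1.0 - alpha) * (n_selected / n_total)
```

A perfect classifier using no features gets cost 0; the cost grows with both the error rate and the fraction of features kept, so the swarm is pushed toward small, accurate subsets.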

F. Classification with Classic Backpropagation
The first step in classic Backpropagation is to initialize a network with three main layer types, namely the input layer, hidden layer, and output layer. Each layer has a fixed number of neurons, each with weight and bias values that are initialized with random numbers. The number of neurons in the input layer equals the number of features in the data, while the number of neurons in the output layer equals the number of classes. The number of neurons in the hidden layer is tested over the range 5-100 in steps of 5, to determine the effect of the number of neurons on accuracy and to find the optimal number of neurons.
Tests will also be conducted on the number of hidden layers in the network. The number of hidden layers will be tested with a value of 1 to 10, with an interval of 1. The number of neurons that will be used in this test refers to the best results obtained in the previous test.
During the training process, each neuron updates its weights and bias based on the gradient and the learning rate. The learning rate is also tested, with values of 0.05, 0.1, and 0.5, to determine which learning rate produces the best accuracy.

p-ISSN 2301-7988, e-ISSN 2581-0588, DOI: 10.32736/sisfokom.v9i3.978, Copyright ©2020. Submitted: 4 September 2020, Revised: 4 September 2020, Accepted: 7 September 2020, Published: 10 September 2020.

G. Classification with Conjugate Gradient Backpropagation
The Conjugate Gradient method changes the way Backpropagation updates the weight and bias values. The learning rate constant is no longer required for Conjugate Gradient Backpropagation; instead, it is substituted with the parameters α and β, which are used to find the search direction.
The main stages performed in Conjugate Gradient Backpropagation are the same as in classic Backpropagation, beginning with defining the network. The number of neurons in the hidden layer is tested under the same conditions as classic Backpropagation, over the range 5-100 in steps of 5. To update the weights, three search methods are used, namely Powell-Beale, Fletcher-Reeves, and Polak-Ribiere, to compare the results of each method.
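A small NumPy sketch of the three β update rules and the resulting search direction, following the standard Powell-Beale, Fletcher-Reeves, and Polak-Ribiere definitions; the line search for α is omitted here.

```python
import numpy as np

def beta_fletcher_reeves(g_new, g_old, d_old):
    # beta = g_{t+1}^T g_{t+1} / (g_t^T g_t)
    return (g_new @ g_new) / (g_old @ g_old)

def beta_polak_ribiere(g_new, g_old, d_old):
    # beta = g_{t+1}^T (g_{t+1} - g_t) / (g_t^T g_t)
    return (g_new @ (g_new - g_old)) / (g_old @ g_old)

def beta_powell_beale(g_new, g_old, d_old):
    # beta = g_{t+1}^T (g_{t+1} - g_t) / (d_t^T (g_{t+1} - g_t))
    return (g_new @ (g_new - g_old)) / (d_old @ (g_new - g_old))

def next_direction(g_new, g_old, d_old, beta_fn=beta_fletcher_reeves):
    # d_{t+1} = -g_{t+1} + beta_t * d_t; the initial direction is -g_0.
    return -g_new + beta_fn(g_new, g_old, d_old) * d_old
```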

H. Accuracy Calculation
The test data classification results are compared to the actual class of the data to obtain the accuracy value. Accuracy is calculated with the confusion matrix represented in Table III. The table has four terms in four different cells. A term that starts with true means the model successfully predicted the original class: the label is true positive (TP) when the original class is cancer positive and true negative (TN) when the original class is cancer negative. Conversely, a label that starts with false means the model failed to predict the original class: a false positive (FP) occurs when the model predicts cancer positive but the original class is negative, and a false negative (FN) occurs when the model predicts cancer negative but the original class is positive. The accuracy value is obtained by:

accuracy = (TP + TN) / (TP + TN + FP + FN)
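The confusion-matrix terms and the accuracy formula above can be sketched as:

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count TP, TN, FP, FN for a binary classification result.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    # accuracy = (TP + TN) / (TP + TN + FP + FN)
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)
```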
IV. RESULT AND DISCUSSION
The tests conducted in this study aim to find the best parameter values for each method and to compare the classification results obtained from the classic Backpropagation method and the Modified Backpropagation with Conjugate Gradient method. Both scenarios are carried out with Binary Particle Swarm Optimization feature selection applied. The data used for classification can be seen in Table I.

A. The effect of the number of neurons
In this testing stage, the network architecture is evaluated. Tests were conducted on the number of neurons in the hidden layer to determine their effect on classification performance. The network was tested with 5-100 neurons in steps of 5, for classification with classic Backpropagation and with Modified Backpropagation with Conjugate Gradient. The data tested here have not been feature-selected. Training stops when the difference in cost value over the last five iterations is less than 0.0001, with a maximum of 1000 iterations.
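The stopping rule above can be expressed as a small helper. Interpreting "difference in cost value in the last five iterations" as the spread of the last five recorded costs is an assumption; the paper does not spell out the exact formula.

```python
def should_stop(cost_history, tol=1e-4, patience=5, max_iter=1000):
    # Stop when the spread of the last `patience` costs falls below `tol`,
    # or when the iteration budget is exhausted.
    if len(cost_history) >= max_iter:
        return True
    if len(cost_history) < patience:
        return False
    recent = cost_history[-patience:]
    return max(recent) - min(recent) < tol
```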
Based on the test results in Figure 2, the accuracy does not show any consistent pattern of change across the 5-100 neuron experiments.
The accuracy values tend to stay in the same range, but with fluctuations in the results for the Breast Cancer, Colon Tumor, and Lung Cancer data. The Breast Cancer data experienced the greatest decrease in accuracy when the number of neurons was 60, a drop of 8.42% from the 73.26% accuracy obtained with 55 neurons. In the Colon Tumor data, the greatest decrease in accuracy occurred with 40 neurons, with an accuracy value of 74.36%; this is the smallest value across all neuron-count tests on the Colon Tumor data, which on average stays above 80%. The Lung Cancer data showed drastic decreases at 20 and 80 neurons, with the largest occurring at 80 neurons, where accuracy fell to 92.79% from 98.89% at 75 neurons.
The test results with the Conjugate Gradient Backpropagation method, shown in Figure 3, are more stable than those using classic Backpropagation. The Colon Tumor and Lung Cancer data that fluctuated in the previous test look much more stable with Conjugate Gradient Backpropagation. Fluctuation still occurs slightly in the Breast Cancer data, but within a narrow range, with the smallest value of 60.81% at 60 neurons and the largest value of 67.7% at 25 neurons. The Ovarian Cancer data shares one trait with the previous test: both experienced a drastic increase when the number of neurons was 10.

B. The effect of the number of hidden layers
In this test, the classification is evaluated based on the number of hidden layers in the network. The test is done 10 times, with the number of hidden layers ranging from 1 to 10, using both classic Backpropagation and Conjugate Gradient Backpropagation. Classification stops under the same criterion as before, when the difference in cost value over the last five iterations is less than 0.0001. In the test results with classic Backpropagation, shown in Figure 4, one similarity occurs across all data: the largest average accuracy is obtained when the number of hidden layers is 1, followed by a fairly drastic decrease in accuracy when the number of hidden layers reaches 3. The most drastic decreases occurred in the Ovarian Cancer and Prostate Cancer data, where Ovarian Cancer accuracy dropped from 97.63% to 64.02% and Prostate Cancer from 93.39% to 56.61%. Almost all data produce flat results when the number of hidden layers is greater than 3, except for the Breast Cancer data, which shows a slight increase at 6, 8, and 9 hidden layers.
The test results using Conjugate Gradient Backpropagation can be seen in Figure 5. As the figure shows, the results are not much different from those using classic Backpropagation. One difference appears in the Ovarian Cancer and Prostate Cancer data, whose accuracy decreases drastically at 4 hidden layers in this test, in contrast to the previous test, where the decrease occurred at 3. The greatest average accuracy is obtained with 2 hidden layers, averaging 86.02%, in contrast to the previous test, where the best average occurred with 1 hidden layer.

C. The effect of BPSO parameters
In this stage, the BPSO parameters are tested to measure their effect on accuracy. In the first test, the values tested are the cognitive learning factor (c1), the social learning factor (c2), and the inertia weight (w). The values tested for c1 and c2 are 1 and 2, combined so that all conditions are covered, while the values tested for w are 0.1 and 0.5.
In the test results in Table IV, the accuracy tends to be higher when the cognitive learning parameter c1 is 1. This can be seen in the average accuracy, which is greater when c1 = 1: the average accuracy when c1 = 1 is 83.49%, while the average accuracy when c1 = 2 is only 81.49%. The accuracy is also fairly high when the social learning parameter c2 is 2, as the average accuracy is consistently greater when c2 = 2 than when c2 = 1: averaged, the accuracy with c2 = 2 is 82.99%, while the accuracy with c2 = 1 is only 81.99%.
The same can also be found in the Breast Cancer, Lung Cancer, and Ovarian Cancer tests. In the Breast Cancer data, the accuracy when c2 = 1 is always less than 60%, while when c2 = 2, the accuracy can reach more than 60%; averaged, the accuracy when c2 = 1 is 56.55%, while when c2 = 2 the average accuracy increases to 63.63%. However, in the Lung Cancer and Ovarian Cancer data, this does not occur in every case. For example, in the Lung Cancer data, the accuracy when c2 = 1 is greater than when c2 = 2 if c1 and w are 1 and 0.5, respectively. In the Ovarian Cancer data, the pattern does not hold when c1 = 2 and w = 0.5.
The choice of inertia weight (w) does not show a significant change in accuracy. The average accuracy when w = 0.1 is 82.89%, not far from the average accuracy when w = 0.5, which is 82.09%. However, accuracy is higher in most cases when w = 0.1 is combined with a cognitive learning factor of 1. This can be seen in the Breast Cancer, Colon Tumor, and Prostate Cancer data, where w = 0.1 combined with c1 = 1 always shows better results than w = 0.5. In contrast, w = 0.5 shows higher values when combined with a cognitive learning factor of 2.
The next BPSO parameter tested is the population size, which affects the size of the swarm's search space: the larger the population, the larger the search space covered. In this test, the population sizes tested are 10, 20, 30, and 50, to test the system on small to large search spaces.
The data in Table V shows the test results based on population size. The effect of population size on accuracy differs for each dataset. In the Breast Cancer and Lung Cancer data, the best accuracy is obtained with the smallest population in the test, 10. The opposite holds for the Ovarian Cancer and Prostate Cancer data, which achieve the best accuracy with a population of 50. The only exception is the Colon Tumor data, which gets the highest accuracy with a population of 20.
Although a population size of 50 did not produce the highest accuracy on the Breast Cancer, Colon Tumor, and Lung Cancer data, the accuracy obtained was also not the smallest. In the Breast Cancer and Colon Tumor data, the smallest value is obtained with a population of 30, while in the Lung Cancer data, the smallest value occurs with populations of 20 and 30, both producing the same value of 97.78%.

D. The effect of learning rate constant
In the classic Backpropagation classification process, updating the weights and biases still requires a learning rate constant that determines how large a step Backpropagation takes. In this stage, learning rates of 0.05, 0.1, and 0.5 are tested, with classification run for 100 iterations. In the results presented in Table VI, the highest average accuracy is obtained with a learning rate of 0.5, which yields 83.11%, and 80.85% when the classification is combined with BPSO. This can also be seen in the Colon Tumor and Lung Cancer data, which received a significant increase in accuracy: in the Colon Tumor data, a learning rate of 0.5 increases the average accuracy from 64.1% to 76.85%, and in the Lung Cancer data the average increases from 83.29% to 93.06%. The average accuracy also increases in the Ovarian Cancer data, although not significantly.
However, a learning rate of 0.5 does not always provide the best value. In Prostate Cancer data, the best average accuracy is obtained when the learning rate is 0.1, which is 82.28%. Whereas for Breast Cancer data, a learning rate of 0.1 provides the best accuracy results when the testing is not combined with BPSO.

E. The effect of classification method
This testing phase compares the classic Backpropagation classification method with Conjugate Gradient Backpropagation. The three Conjugate Gradient update methods, namely Powell-Beale, Fletcher-Reeves, and Polak-Ribiere, were each tested to compare their results. Classification with and without BPSO was also tested to compare the classification results on data that had passed through the dimensionality reduction process.
Before classification, the data goes through the dimensionality reduction process with BPSO. The parameters used in BPSO are a cognitive learning factor (c1) of 1, a social learning factor (c2) of 2, an inertia weight (w) of 0.5, and a population size of 50. The performance measure used to assess the selected feature subset corresponds to the classification method being evaluated. The results of dimensionality reduction with BPSO can be seen in Table VII.

F. Test results analysis
Based on the conducted tests, classification performance is highly dependent on the type of data being classified. The Lung Cancer and Ovarian Cancer data always show quite high results even without feature selection, in contrast to the Breast Cancer data, which shows fairly low accuracy in all tests. This can happen due to several factors, one of which is the number of features in the data. As can be seen in Table I, the Breast Cancer data has the highest number of features, namely 24481 features with only 97 samples. Even after BPSO feature selection, the number of features in the Breast Cancer data is still fairly high, with an average of 11962 features selected, as can be seen in Table VII.
The application of BPSO as feature selection also does not guarantee an increase in accuracy. In some cases, the accuracy decreased when the data had been feature-selected by BPSO. For example, in Breast Cancer data, BPSO decreases the accuracy when combined with classic Backpropagation and Conjugate Gradient Powell-Beale. The average value of 68.37% decreased to 64.37% when BPSO was used. The most drastic decrease can be seen in Colon Tumor data, where BPSO combined with classic Backpropagation reduces the accuracy from 82.18% to 71.53%.
The application of Conjugate Gradient Backpropagation is seen to increase the accuracy quite drastically in Prostate Cancer data. The average accuracy which was only 75.69% was successfully increased to 90.08% by using Conjugate Gradient Backpropagation. However, the combination of Conjugate Gradient Backpropagation and BPSO seems to reduce the accuracy on Fletcher-Reeves and Polak-Ribiere.

V. CONCLUSION
In this study, a classification system with Modified Backpropagation with Conjugate Gradient combined with Binary Particle Swarm Optimization (BPSO) feature selection has been successfully developed. Several BPSO parameters such as cognitive learning, social learning, inertia weight, and population size as well as Backpropagation parameters such as number of neurons, number of hidden layers, and learning rate have also been tested to determine the best parameter values to be used for the final test.
The classification results show that the Modified Backpropagation with Conjugate Gradient succeeded in increasing the average accuracy when compared to classic Backpropagation. Conjugate Gradient Backpropagation succeeded in increasing the average accuracy to 85.89%. Combining Conjugate Gradient Backpropagation with BPSO can also increase the average accuracy even higher. The average accuracy obtained from all tests with Conjugate Gradient Backpropagation and BPSO is 86.11%. However, combining BPSO with classic Backpropagation reduces the accuracy, which decreases the average value to 80.85%.
Classification method and parameter selection are not the only factors that determine accuracy; the type of data also affects classification performance. Data with a very large number of features can reduce classification performance, as with the Breast Cancer data, which always produces lower accuracy than the other data.