Forecasting GPU Prices Using the Transformer Method

— A GPU or VGA (graphics processing unit) is a vital component of computers and laptops, used for tasks such as rendering videos, creating game environments, and compiling large amounts of code. GPU prices have fluctuated significantly since the start of the COVID-19 pandemic in 2020, due in part to the increased demand for GPUs for remote work and online activities. Furthermore, accurate GPU price forecasting can have broader implications beyond the computer hardware industry, with potential applications in investment decision-making, production planning, and pricing strategies for manufacturers. This research aims to forecast future GPU prices using deep learning-based time series forecasting with the Transformer model. We use daily prices of the NVIDIA RTX 3090 Founders Edition as a test case and use historical GPU prices to forecast 8, 16, and 30 days ahead. Moreover, we compare the results of the Transformer model with two other models, RNN and LSTM. We found that for the 30-day forecast, the Transformer model achieves a higher coefficient of correlation (CC) of 0.8743, a lower root mean squared error (RMSE) of 34.68, and a lower mean absolute percentage error (MAPE) of 0.82 compared to the RNN and LSTM models. These results suggest that the Transformer is an effective and efficient method for predicting GPU prices.


I. INTRODUCTION
In today's world, the shortage of graphics cards has caused much concern and frustration for people who use computers as their primary work tools. The high cost of these cards makes it difficult for people to afford them, hindering their ability to play games or create content. The fluctuating prices of GPUs further exacerbate the problem, and NVIDIA, one of the leading producers of these cards, must rely on third-party manufacturers for its chipsets. Disruptions in those manufacturers' operations have led to a scarcity of graphics processing units (GPUs) and longer production wait times. Furthermore, massive buying of GPUs for cryptocurrency mining further limits their availability. As a result, the cost of GPUs such as the NVIDIA RTX 3090, which originally had a suggested price of $699, has skyrocketed to as much as $2,400 almost overnight. This scarcity and cost of graphics cards is a pressing issue that requires attention [1].
For these reasons, people are trying to find the perfect time to buy a GPU. Forecasting can be a way to solve this problem. Forecasting is the process of predicting future values based on historical data and extracted trends, and it can be approached using statistical or machine learning methods [2]. In [3], the authors study the NVIDIA GTX 1060, whose price is affected by the bitcoin price. That research uses linear regression models to forecast upcoming GPU prices and finds that the bitcoin price does affect the GPU price. Other research, conducted by researchers from CEEJ, showed that GPU stock prices can be forecast using an optimal machine learning technique, the Nested Cross Validation algorithm [4].
One way to approach forecasting GPU prices is to use a deep learning model. This paper uses the recently developed transformer model, initially designed to solve NLP problems. In the transformer, the model determines, at each step, which other parts of the input sequence are essential [5]. The transformer has two parts: the encoder and the decoder. Theoretically, the transformer can use historical data to predict upcoming prices [6]. In another experiment comparing the transformer and LSTM, the transformer came out with a huge benefit because it is more stable and requires less training time [7]. For this research, we will use the encoder layer to forecast the time series data that we have collected from Keepa.
In this study, we focus on how well a transformer can predict GPU prices. Besides that, we also compare the model with other architectures, such as LSTM and RNN. The transformer itself is designed for modeling sequential data, and prediction with it can be done faster since the process in this model runs only once. We limit the GPU variation so that only one GPU model is used in this research. Additionally, we only take data from the first time the GPU was launched, September 2020, until November 2022. The prediction results will be quantified using the Coefficient of Correlation (CC), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). All these metrics will evaluate the results of the models we have built.
The transformer model has been used for time series forecasting before. In [8], the authors found that the transformer is effective for forecasting, as it produced good results. In [9], the transformer is used for solving time series forecasting; in their study, the transformer performs better than LSTM- and RNN-based methods. In [10], the transformer was used for both univariate and multivariate time series forecasting. In the end, this research will help people determine the right time to buy a GPU and will show the power of the transformer for forecasting. This paper is divided into several sections. The remainder of this section has discussed past work related to forecasting GPU prices. Section II explains the methodology. Section III presents the results and analysis. Lastly, Section IV concludes the paper.

II. METHODOLOGY
Forecasting with a transformer involves several steps, shown in Fig. 1, starting from data collection until the research stops. The forecasting method used is time series forecasting. After the data has been collected, we preprocess it to make it as clean as possible so that it is ready to use. Then, the data is split into training, validation, and test sets; a minimal sketch of the scaling and splitting steps is shown below. Next, we train, validate, and test using the transformer, LSTM, and RNN models. After that, we evaluate the results; if the accuracy is more than 95%, we continue the analysis to obtain the optimum result for each model. Moreover, we evaluate the predictions using the Coefficient of Correlation, RMSE, and MAPE to find the optimum result. After all the procedures have been completed and the criteria are satisfied, the research stops.
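As a sketch of the scaling and splitting steps, the following Python snippet normalizes the series to the 0-1 range and performs a chronological 80/10/10 split. The placeholder price array and the scikit-learn scaler are assumptions for illustration; the paper does not specify its exact tooling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# prices: the daily price series (a placeholder array here for illustration).
prices = np.linspace(1500, 700, 783).reshape(-1, 1)

# Scale the values into the 0-1 range, as described in Section II-B.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(prices)

# Chronological 80/10/10 split into train, validation, and test sets.
n = len(scaled)
train = scaled[: int(0.8 * n)]
val = scaled[int(0.8 * n): int(0.9 * n)]
test = scaled[int(0.9 * n):]
print(len(train), len(val), len(test))  # 626 78 79
```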

A. Dataset
The dataset used in this research is a time series dataset obtained from the Keepa website [11]. The dataset contains 783 rows and 4 attributes, covering the daily price from the first time the GPU was launched until November 2022, as shown in Table 1. From Table 1, we can see that the data consists of the daily prices of the GPU. The Last Future and Difference columns are additional features that were added manually. The attributes used in this paper are the first two columns, Date and Price.
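As an illustration, the dataset can be loaded and reduced to the two attributes used here with pandas. The file name and column labels below are assumptions for illustration, not the exact export from Keepa.

```python
import pandas as pd

# Load the Keepa price history (file name and column names are assumed;
# adjust them to match the exported dataset).
df = pd.read_csv("rtx3090_keepa.csv", parse_dates=["Date"])

# Keep only the two attributes used in this paper: Date and Price.
df = df[["Date", "Price"]].sort_values("Date").reset_index(drop=True)

print(df.shape)   # expected: (783, 2)
print(df.head())
```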

B. Preprocessing Data
In this research, we apply several preprocessing techniques. Because the data is recorded daily, we first examine the dataset so that we can choose a proportional preprocessing technique. We then reshape the input so that it can go through the models, and apply scaling (normalization) so that the input values fall in the range 0 to 1. Finally, we split the data with a ratio of 80% training data, 10% validation data, and 10% testing data. A visualization of these splits is shown in the corresponding figure.

C. Models
a. Transformer
The Transformer architecture has an encoder and a decoder. The encoder maps an input sequence of symbols to a continuous representation z, and the decoder then generates an output sequence of symbols; at each step, the model uses the previously generated symbol as additional input when generating the next symbol. The layers of the transformer are fully connected to each other [12]. The architecture of the transformer is presented in Fig. 2: the left side is the encoder, and the right side is the decoder. The architecture contains two main attention mechanisms, scaled dot-product attention and multi-head attention; attention itself maps a query and a set of key-value pairs to an output, where all of these variables are vectors [12]. Scaled dot-product attention computes a softmax-weighted value, represented by formula (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)$$

The formula shows the attention with parameters Q (queries), K (keys), and V (values). Scaled dot-product attention is calculated in several steps. First, the alignment scores are computed by multiplying the set of queries packed in matrix Q with the keys $K^{T}$. Next, the alignment scores are scaled by $1/\sqrt{d_k}$. A softmax operation is then applied to obtain a set of weights; the softmax converts the scores into a vector of probabilities. After we obtain the desired weights, we multiply them with the values in matrix V.

Multi-head attention allows the model to focus on information from different representation subspaces at the same time [12]. This method is represented by equation (2):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \quad (2)$$

From the equation, with the same parameters, the value of multi-head attention is calculated as follows. First, we compute the linearly projected versions of the queries, keys, and values through multiplication with the parameter matrices $(W_i^{Q}, W_i^{K}, W_i^{V})$. Then we apply an attention function on each head by multiplying the query and key matrices, applying the softmax, and calculating the weights for the output. The outputs of the heads $\mathrm{head}_i$ $(i = 1, \ldots, h)$ are concatenated, and to obtain the final result we multiply them with the weight matrix $W^{O}$.

From the training flowchart, it can be seen that after the data is split, the transformer model is initialized. Next, the parameters for the model are initialized and a look-back value is set for the input. A nested loop is then implemented, with the outer loop iterating over the number of transformer blocks and the inner loop iterating over the number of attention heads. The start time is recorded, and the model is trained. Once training is complete, the stop time is recorded and the validation loss is extracted. The computation time for training is then calculated. The best score is determined and the model is saved in the HDF5 ("h5") file format. The model's performance is then evaluated and predictions are made. If the results are satisfactory, the research is concluded. If not, hyperparameter tuning is carried out to achieve the desired outcome.
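As an illustration of the encoder-only setup described above, the following is a minimal Keras sketch of one transformer block and a forecasting model built from it. The layer sizes, feed-forward dimension, and single-step prediction head are illustrative assumptions rather than the exact configuration of this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def transformer_block(x, head_size=128, num_heads=2, ff_dim=64, dropout=0.0):
    # Multi-head self-attention (equations (1)-(2)) with a residual connection.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size,
                                     dropout=dropout)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
    # Position-wise feed-forward sublayer with a residual connection.
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + ff)

def build_model(look_back, num_blocks=1, num_heads=2):
    # Input: a window of `look_back` normalized daily prices.
    inputs = layers.Input(shape=(look_back, 1))
    x = inputs
    for _ in range(num_blocks):
        x = transformer_block(x, num_heads=num_heads)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(1)(x)  # predicted next normalized price
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_model(look_back=30)
model.summary()
```

In this sketch, a stack of `num_blocks` encoder blocks is followed by pooling and a dense layer that outputs the next normalized price; no decoder is needed since only the encoder is used for forecasting.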

b. Recurrent Neural Network
Recurrent Neural Network (RNN) is a type of artificial neural network that can process sequential data. It consists of a series of interconnected units that pass their output as input to the next unit, forming a directed graph. This allows the network to have an internal state or memory, enabling it to exhibit temporal dynamic behavior. RNNs are particularly useful for recognizing patterns in sequential data, such as handwriting or speech, or for predicting time series data [13]. The simple RNN architecture can be seen in Fig. 5 [14]. From the figure, we can see that every output is passed on to the next RNN cell; the RNN cell has its own internal architecture, which is shown in Fig. 6 [15]. The hidden state is computed as (3):

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h) \quad (3)$$

where $h_t$ is the state, $W_h$ and $W_x$ are weight matrices, $x_t$ is the input, and $b_h$ is the bias term. The activation function used in the equation is the hyperbolic tangent function, which is defined as follows (4):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad (4)$$

The range of the hyperbolic tangent function is from -1 to 1. The output value is calculated using the following formula (5):

$$y_t = W_y h_t + b_y \quad (5)$$
In equation (5), $y_t$ represents the output, $W_y$ is the weight matrix at the output layer, $h_t$ is the state, and $b_y$ is the bias term. The flowchart for the RNN model is shown in the corresponding figure.
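To make equations (3)-(5) concrete, the following is a minimal NumPy sketch of a single RNN step; the dimensions and random weights are toy assumptions for illustration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, b_h, W_y, b_y):
    # Equation (3): new hidden state from previous state and current input.
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b_h)
    # Equation (5): output is a linear map of the hidden state.
    y_t = W_y @ h_t + b_y
    return h_t, y_t

# Toy dimensions: 1 input feature (price), hidden size 4, 1 output.
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 1))
b_h, W_y, b_y = np.zeros(4), rng.normal(size=(1, 4)), np.zeros(1)

h = np.zeros(4)
for x in [0.51, 0.53, 0.50]:          # a short normalized price sequence
    h, y = rnn_step(np.array([x]), h, W_h, W_x, b_h, W_y, b_y)
print(y)                               # prediction after the last step
```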
c. Long Short-Term Memory
From the LSTM architecture, we can see that there are 3 main gates: the forget gate, the input gate, and the output gate. Each gate receives two input vectors: the current input, $x_t$, and the previous output, $h_{t-1}$. The input gate determines which input values will be used in the current time step, the forget gate determines which values from the previous time step should be forgotten, and the output gate determines which values should be output in the current time step. Together, these gates allow the LSTM cell to effectively store and retrieve information over long periods of time [17].
From the architecture, we can also see that the first step of the LSTM is the forget gate, which is computed with the sigmoid function (6):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (6)$$

In this function (6), x is the input value, and the output is a value between 0 and 1. If the output of the sigmoid function is close to 0, the data will be discarded; if the output is close to 1, the data will be updated or passed through.

In the context of an LSTM (Long Short-Term Memory) cell, the sigmoid function is used to determine the values of the input, forget, and output gates. The input x is a combination of the current input value, $x_t$, and the hidden state of the previous time step, $h_{t-1}$. The coefficients W and b are learned during the training process and are used to weight the input values. Overall, the sigmoid function plays a crucial role in the LSTM cell's ability to effectively store and retrieve information over long periods of time.

After the first step is done, we move on to the next step, which is the input gate. The output of the input gate and the candidate cell state can be calculated using formulas (7) and (8), and the new cell state is obtained from (9):

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (7)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad (8)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (9)$$

Finally, the last gate is the output gate, which can be expressed by equations (10) and (11):

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (10)$$

$$h_t = o_t \odot \tanh(C_t) \quad (11)$$
The output gate is calculated using the sigmoid function, as defined in equation (10). The cell state value $C_t$ is then forwarded to the next memory cell calculation, and the current hidden state is generated using the hyperbolic tangent function, as defined in equation (11).
Overall, the output gate plays a crucial role in the LSTM cell's ability to store and retrieve information over long periods of time. It allows the cell to selectively pass information from one time step to the next, enabling it to capture long-term dependencies in the data. The LSTM training flowchart can be seen in the corresponding figure.
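To make equations (6)-(11) concrete, here is a minimal NumPy sketch of one LSTM step; the dimensions and random weights are toy assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t].
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate, eq. (6)
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate, eq. (7)
    c_hat = np.tanh(W["c"] @ z + b["c"])    # candidate state, eq. (8)
    c_t = f_t * c_prev + i_t * c_hat        # cell state update, eq. (9)
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate, eq. (10)
    h_t = o_t * np.tanh(c_t)                # hidden state, eq. (11)
    return h_t, c_t

# Toy dimensions: 1 input feature, hidden size 4.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 5)) for k in "fico"}  # 5 = hidden (4) + input (1)
b = {k: np.zeros(4) for k in "fico"}

h, c = np.zeros(4), np.zeros(4)
for x in [0.51, 0.53, 0.50]:                      # normalized price sequence
    h, c = lstm_step(np.array([x]), h, c, W, b)
print(h)
```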

D. Evaluation Metrics
To evaluate and compare the models, we use three evaluation metrics: CC, RMSE, and MAPE. The Coefficient of Correlation (CC) is a statistical measure of the relationship between two variables. When two variables are correlated, a change in the value of one variable is associated with a change in the value of the other. The direction of this association can be positive, meaning that the two variables increase or decrease together, or negative, meaning that one variable increases as the other decreases.
The Pearson correlation coefficient is a common measure of correlation used to quantify the strength and direction of a linear relationship between two continuous variables. It is typically used when the data follows a bivariate normal distribution, meaning that the variables are jointly normally distributed. The Pearson correlation coefficient ranges from -1 to 1, with values closer to -1 indicating a strong negative correlation, values closer to 1 indicating a strong positive correlation, and values closer to 0 indicating a weak or no correlation [18]. It can be written as:

$$CC = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

In the equation, $\hat{y}_i$ is the predicted value and $y_i$ is the actual value from the data, while $n$ is the number of predictions [19].
Root Mean Squared Error (RMSE) measures the average magnitude of the prediction error and is defined as:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Another method to measure the accuracy of prediction models is MAPE. MAPE, or Mean Absolute Percentage Error, is a performance metric besides RMSE that can be used to measure forecasting accuracy; the difference is that it uses a percentage as the benchmark. It is represented by the following equation:

$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%$$

In the equation, the $\frac{1}{n}$ factor is combined with 100% since the result is a percentage value. The variable $y_i$ is the actual value from the data, $\hat{y}_i$ is the forecast value, and, just like in RMSE, $n$ is the total number of observations [20].
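As a concrete illustration, the three metrics can be computed with NumPy as follows; the example arrays are illustrative, not results from this study.

```python
import numpy as np

def evaluate(y_true, y_pred):
    # Pearson coefficient of correlation (CC).
    cc = np.corrcoef(y_true, y_pred)[0, 1]
    # Root Mean Squared Error (RMSE).
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    # Mean Absolute Percentage Error (MAPE), in percent.
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return cc, rmse, mape

y_true = np.array([1500.0, 1480.0, 1495.0, 1510.0])
y_pred = np.array([1490.0, 1485.0, 1500.0, 1505.0])
print(evaluate(y_true, y_pred))
```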

III. RESULT AND ANALYSIS
The daily GPU price data used in this research covers the period from the GPU's launch in September 2020 until November 2022. The data is then split, and when forecasting for 8, 16, and 30 days, we use only the last 8, 16, and 30 points of the test split, respectively.
Before training, we fix the hyperparameters so that the test is fair and equal for all models, after performing some analysis to find the optimum values. We vary the number of epochs and the batch size: the optimum for this study is 300 epochs with a batch size of 16, and Table II shows the analysis of epochs and batch size. After finding the optimum epochs and batch size, we determine the best dropout rate. Dropout is used to prevent overfitting in a neural network model; increasing the dropout rate decreases the model's ability to fit the training data and lowers accuracy, and in this case even a small amount of dropout made the model less complex, leading to underfitting. Therefore, we set the dropout rate to 0 for all models. We also selected a head size of 128 and a patience of 100. We tested transformer block counts of [1, 2, 3] and head counts of [1, 2], and the optimal combination was 1 transformer block with 2 heads; a sketch of this search appears below.
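As an illustration, the nested grid search over transformer blocks and heads described above could look like the following sketch. It reuses the hypothetical build_model() from the earlier transformer sketch, and the dummy arrays stand in for the real windowed train and validation splits.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Dummy windowed data standing in for the real train/validation splits.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 30, 1)), rng.random((500, 1))
X_val, y_val = rng.random((60, 30, 1)), rng.random((60, 1))

best = (None, np.inf)
for num_blocks in [1, 2, 3]:
    for num_heads in [1, 2]:
        model = build_model(look_back=30, num_blocks=num_blocks,
                            num_heads=num_heads)
        hist = model.fit(X_train, y_train,
                         validation_data=(X_val, y_val),
                         epochs=300, batch_size=16, verbose=0,
                         callbacks=[EarlyStopping(patience=100,
                                                  restore_best_weights=True)])
        val_loss = min(hist.history["val_loss"])
        if val_loss < best[1]:
            best = ((num_blocks, num_heads), val_loss)
            model.save("transformer_best.h5")  # saved in "h5" format

print("best (blocks, heads):", best[0])
```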
All hyperparameter tuning is done using the 10% test split, and the optimum settings are then applied to the other models. The prediction results from the three models are compared using the Coefficient of Correlation (CC), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE); the CC, RMSE, and MAPE for each prediction method are presented in Table IV, Table V, and Table VI. These results are used to evaluate the performance of each method and determine the most accurate and efficient one for predicting the target variable. From the tables, we can see that the accuracy of the transformer is better than that of LSTM and RNN, surpassing the other models, and its forecasts are very close to the real data. This happens due to its optimized layer structure and, moreover, its self-attention mechanism, which allows the transformer to assign each input token a weight and produce the best weighting across several input tokens. Overall, these results indicate that the Transformer model is a more effective and efficient method for predicting the target variable than the LSTM and RNN models.
From the results, we can see that they differ from the previous result in [3], where using only a linear regression technique gave an accuracy of 98.57%, compared with 99.18% here. The transformer can also forecast more than one day ahead with more accurate results.

IV. CONCLUSION
The main goal of this paper is to investigate the ability of the Transformer model to predict daily GPU prices over periods of 8, 16, and 30 days. To do this, we used a dataset of daily prices of the NVIDIA RTX 3090 Founders Edition and applied the Transformer, RNN, and LSTM models to predict prices based on the past 2 years of data. Our results showed that the Transformer model was the most accurate in predicting the prices of the GPU, with higher CC, lower RMSE, and lower MAPE values compared to the RNN and LSTM models.
In recent years, GPU prices have become an increasingly popular topic of interest, with many studies focused on forecasting them. The use of Transformer networks for this purpose has shown promising results, but there is still much room for further research in this area. Compared to traditional methods like LSTM and RNN, transformer-based models can provide better performance and accuracy. For further research, it would be beneficial to evaluate the computational time and cost-effectiveness of the transformer-based model and compare it with other popular models. In addition, using more data or utilizing multiple GPUs during training could potentially improve the performance of the transformer model. Lastly, tuning other parameters could lead to better results.