Project Question:
Methodology 1.1:
Methodology 1.2:
Methodology 1.3:
Methodology 1.4:
Methodology 2.1:
Methodology 2.2:
Methodology 2.3:
Methodology 2.4:
Conclusion:

Project Question:

This project investigates whether it is possible to predict the daily direction of financial asset prices by using past price and volume information. This problem is central to both academic finance and practical trading, as accurate short-term price prediction could provide insights into market efficiency and help develop systematic trading strategies.

Understanding the predictability of daily price direction is important because it directly tests the limits of available information and the potential for excess returns. While traditional finance theory suggests that price movements are largely random in the short term, advances in machine learning offer new tools that may uncover subtle patterns because they can handle non-linear relationships and high-dimensional feature spaces more effectively than traditional models. This project aims to explore whether modern classification algorithms can extract useful signals from historical price behavior.

Review of Literature:

According to Random Walk Theory (RWT), prices cannot be predicted from past data. However, many researchers have presented evidence that challenges RWT. Jegadeesh & Titman (1993) suggested that there is a “momentum effect” in the stock market, whereby prices tend to keep moving in the same direction. In addition, “volatility clustering”, in which high volatility tends to be followed by high volatility and vice versa, appears to be a common phenomenon across financial markets, as shown by many researchers such as Thomas Lux (1999). Trading volume is also relevant for volatility, as suggested by Timothy J Brailsford (1996).

Apart from the lagged terms, there are also various exogenous factors which are correlated with the price volatility. Schwert (1989) indicated that macroeconomic factors like inflation and interest rates have important impacts on price volatility. However, the scope of this study will not cover those macroeconomic factors since our objective is short-term price change, while economic variables are appropriate for mid-term to long-term changes.

Since there is considerable conflicting evidence on the characteristics of price movements, this study aims to provide a comprehensive overview by investigating financial markets with long-term daily data (2005-2024).

Data:

Given the time series nature of this problem, it is crucial to prevent data leakage, which would distort the prediction outcomes. We ensure that every explanatory variable is computed solely from information available before the response variable, the daily price move direction, is realized, and that it is properly aligned with the response in the data frame.

Our research focus is to analyze the short-term daily price movement direction through a series of price and volume factors. We cover the movement direction over the recent short and medium term, a classic technical indicator (the 20-day simple moving average compared with the close price), market sentiment as measured by the VIX index, and the volatility, close price change, and volume change of the previous day. Last but not least, we include two interaction terms in order to capture potential non-linearity.

•close, high, low, open, volume, vix: six original features which will generate all explanatory variables below.

Explanatory Variables:

•updown1: the direction of D-1 (up is 1; down is 0) (compare the close and open price).

•updown2: the direction of D-2.

•updown3: the direction of D-3.

•updown4: the direction of D-4.

•updown5: the direction of D-5.

•20d_direction: the direction from open of D-20 to close of D-1 (up is 1; down is 0).

•above20ma: whether the close price of D-1 is above or below 20ma, the simple moving average of the close price calculated from D-20 to D-1 (above is 1; below is 0).

•vix1: the vix index of D-1.

•volatility1: high of D-1 divided by low of D-1.

•volumechange: the percentage change in volume from D-2 to D-1 (a change rate avoids the scaling problem; in a realistic setting, the change from D-2 to D-1 is the most recent one available when predicting D-0).

•pricechange: the absolute value of percentage change in close price from D-2 to D-1 (because the direction is already captured by “updown1”).

•pricevolume_interact: pricechange * volumechange.

•volatilityvolume_interact: volatility1 * volumechange.

Response Variable:

•response: the direction of close price from D-1 to D-0 (up is 1; down is 0).
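To make the alignment concrete, the variable definitions above can be sketched in Python. This is only an illustration and not the code used in the study (whose output comes from R): the function name build_row, the list-of-dicts input layout, and the treatment of an unchanged price as 0 are assumptions made for the example.

```python
def build_row(days, t):
    """Build one observation for target day t (D-0).

    `days` is a chronologically ordered list of dicts with the keys
    open, high, low, close, volume, vix (hypothetical layout).
    Every feature reads only days[t-20] .. days[t-1]; only the
    response reads days[t].
    """
    if t < 20:
        raise ValueError("need at least 20 prior trading days")
    d1, d2 = days[t - 1], days[t - 2]
    row = {}

    # updown1..updown5: open-to-close direction of D-1 .. D-5.
    # Assumption: an unchanged price counts as 0 ("down").
    for k in range(1, 6):
        d = days[t - k]
        row[f"updown{k}"] = int(d["close"] > d["open"])

    # Direction from the open of D-20 to the close of D-1.
    row["20d_direction"] = int(d1["close"] > days[t - 20]["open"])

    # 20-day simple moving average of the close, D-20 .. D-1.
    ma20 = sum(days[t - k]["close"] for k in range(1, 21)) / 20
    row["above20ma"] = int(d1["close"] > ma20)

    row["vix1"] = d1["vix"]
    row["volatility1"] = d1["high"] / d1["low"]
    row["volumechange"] = d1["volume"] / d2["volume"] - 1
    row["pricechange"] = abs(d1["close"] / d2["close"] - 1)
    row["pricevolume_interact"] = row["pricechange"] * row["volumechange"]
    row["volatilityvolume_interact"] = row["volatility1"] * row["volumechange"]

    # Response: close-to-close direction from D-1 to D-0.
    row["response"] = int(days[t]["close"] > d1["close"])
    return row
```

Because the response is the only field that touches days[t], changing the target day's prices cannot alter any explanatory variable, which is the leakage property described in the Data section.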

Source of data: Yahoo Finance
Link: https://finance.yahoo.com/quote/%5EGSPC/history/
https://drive.google.com/file/d/1PkzfrXVRcge38UsnhMNAVS3RVZsDNZgN/view?usp=drivesdk

Methodology 1.1:

Since the response variable is binary, this is a classification problem; we first fit a logistic regression with a basic train-test split. The training set covers 2005-2019, about 75% of the data, while the remaining data from 2020-2024 serve as an out-of-sample test set to determine how well the model generalizes to unseen data.

Single Split (train: 2005-2019; test: 2020-2024)

Logistic Regression

## 
## Call:
## glm(formula = response ~ updown1 + updown2 + updown3 + updown4 + 
##     updown5 + X20d_direction + above20ma + vix1 + volatility1 + 
##     volumechange + pricechange + pricevolume_interact + volatilityvolume_interact, 
##     family = binomial, data = train)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                1.281e+00  8.471e+00   0.151  0.87977   
## updown1                   -2.159e-01  6.867e-02  -3.144  0.00167 **
## updown2                   -4.493e-02  6.870e-02  -0.654  0.51310   
## updown3                   -6.244e-02  6.813e-02  -0.916  0.35944   
## updown4                   -3.244e-02  6.769e-02  -0.479  0.63173   
## updown5                   -5.577e-03  6.722e-02  -0.083  0.93387   
## X20d_direction             5.899e-02  9.517e-02   0.620  0.53538   
## above20ma                 -4.747e-02  1.005e-01  -0.473  0.63657   
## vix1                      -4.431e-04  6.594e-03  -0.067  0.94642   
## volatility1               -8.898e-01  8.471e+00  -0.105  0.91635   
## volumechange               5.544e+01  2.632e+01   2.106  0.03519 * 
## pricechange                8.091e-01  6.958e+00   0.116  0.90744   
## pricevolume_interact       6.242e+01  2.800e+01   2.229  0.02579 * 
## volatilityvolume_interact -5.511e+01  2.620e+01  -2.104  0.03539 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5199.8  on 3774  degrees of freedom
## Residual deviance: 5179.1  on 3761  degrees of freedom
## AIC: 5207.1
## 
## Number of Fisher Scoring iterations: 4

##         Actual
## glm.pred   0   1
##        0  28  32
##        1 554 644

## [1] 0.5341812

## Area under the curve: 0.4952

The output shows that only four predictors are statistically significant at the 5% level: updown1, volumechange, and the two interaction terms. The direction of D-1 is the most significant (p = 0.002), while the other three are only marginally significant (p between 0.026 and 0.035). Despite the inclusion of multiple variables, the model’s residual deviance (5179.1) is only slightly below the null deviance (5199.8), indicating limited explanatory power.
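The size of this improvement can be quantified with a few lines of arithmetic on the figures printed in the summary. The sketch below is illustrative Python, not part of the original analysis; it computes the deviance drop, the deviance-based (McFadden) pseudo R-squared, which for a 0/1 response equals one minus the ratio of residual to null deviance, and the odds ratio implied by the updown1 coefficient.

```python
import math

# Figures copied from the glm summary of the single-split model.
null_deviance = 5199.8
residual_deviance = 5179.1
coef_updown1 = -0.2159

# Reduction in deviance achieved by the 13 predictors
# (3774 - 3761 = 13 degrees of freedom).
deviance_drop = null_deviance - residual_deviance

# Share of the null deviance explained by the model.
pseudo_r2 = 1.0 - residual_deviance / null_deviance

# Multiplicative change in the odds of an up day after an up day on D-1.
odds_ratio_updown1 = math.exp(coef_updown1)

print(round(deviance_drop, 1), round(pseudo_r2, 4), round(odds_ratio_updown1, 3))
```

The model explains roughly 0.4% of the null deviance, and the updown1 odds ratio of about 0.81 points to a mild tendency for an up day to be followed by a down day within the training period.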

Model performance is evaluated on the test set using accuracy, the confusion matrix, and the Area Under the ROC Curve (AUC). Accuracy measures the proportion of correctly predicted observations, but it can be misleading when one class dominates the other. In contrast, AUC evaluates the ability of a model to discriminate between positive and negative classes across all thresholds, reflecting the trade-off between the true positive rate and the false positive rate, and so provides a more comprehensive view of model performance. AUC values closer to 1.0 indicate excellent model performance, while a value of 0.5 suggests that the model performs no better than random guessing. In most applied settings, an AUC above 0.7 is considered acceptable, while values above 0.8 indicate strong discriminatory power.
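The two metrics can be illustrated with a short, self-contained Python sketch. It is not the code used in the study; the function names are hypothetical. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Share of observations whose predicted class matches the actual class."""
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


y = [0, 0, 1, 1]
print(auc(y, [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ranked correctly
print(auc(y, [0.6, 0.6, 0.6, 0.6]))   # constant score: every pair is a tie
```

A classifier that assigns the same score to every day therefore has an AUC of exactly 0.5, whatever its accuracy, which is why accuracy alone can flatter a model that always predicts the majority class.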

On the test set the model reaches an accuracy of 53.4%, slightly below the 53.7% (676 of 1,258 days) that one would achieve by always guessing the majority class. More importantly, the AUC (Area Under the ROC Curve) is 0.4952, which is indistinguishable from random guessing (AUC = 0.5). The confusion matrix shows that the model predicts an upward move on 1,198 of the 1,258 test days, producing 554 false positives against only 32 false negatives, so it does not separate the two classes meaningfully. This may be due to insufficient predictive power in the features, class overlap, or the mild imbalance in the data.

Methodology 1.2:

One of the main objectives is to evaluate the consistency of price behavior across different years. In this section, we use a shorter period as the training set and each remaining year as a separate test set. The training set is therefore 2005-2009, and there are fifteen test sets, one for each year from 2010 to 2024. This forecast design has several important implications. It tests the model’s ability to generalize beyond its training period, simulating a realistic forecast over a long future horizon. In addition, by checking the prediction performance year by year, we can see whether prediction accuracy deteriorates over time because of structural changes in the financial market. This framework also indicates whether a model needs to be retrained regularly or remains usable without retraining. We expect predictive power to decline over time, since few would expect the financial market of recent months to resemble that of a few years ago, let alone that of almost two decades ago.

Decay of Prediction Power (train first 5 years, test every single year)

(training set: 1, test set: 15)

##         Actual
## glm.pred   0   1
##        0  44  36
##        1  64 108
## [1] 0.6031746

## Area under the curve: 0.5764
##         Actual
## glm.pred  0  1
##        0 37 53
##        1 77 85
## [1] 0.484127

## Area under the curve: 0.5308
##         Actual
## glm.pred  0  1
##        0 35 41
##        1 83 91
## [1] 0.504

## Area under the curve: 0.5438
##         Actual
## glm.pred   0   1
##        0  28  37
##        1  77 110
## [1] 0.547619

## Area under the curve: 0.5096
##         Actual
## glm.pred   0   1
##        0  32  37
##        1  76 107
## [1] 0.5515873

## Area under the curve: 0.5278
##         Actual
## glm.pred   0   1
##        0  29  28
##        1 104  91
## [1] 0.4761905

## Area under the curve: 0.5423
##         Actual
## glm.pred   0   1
##        0  36  28
##        1  85 103
## [1] 0.5515873

## Area under the curve: 0.5507
##         Actual
## glm.pred   0   1
##        0  20  19
##        1  88 124
## [1] 0.5737052

## Area under the curve: 0.585
##         Actual
## glm.pred   0   1
##        0  24  32
##        1  95 100
## [1] 0.4940239

## Area under the curve: 0.5386
##         Actual
## glm.pred   0   1
##        0  31  46
##        1  71 104
## [1] 0.5357143

## Area under the curve: 0.5078
##         Actual
## glm.pred  0  1
##        0 51 53
##        1 57 92
## [1] 0.5652174

## Area under the curve: 0.569
##         Actual
## glm.pred  0  1
##        0 38 45
##        1 71 98
## [1] 0.5396825

## Area under the curve: 0.514
##         Actual
## glm.pred   0   1
##        0  40  40
##        1 103  68
## [1] 0.4302789

## Area under the curve: 0.4916
##         Actual
## glm.pred  0  1
##        0 34 39
##        1 79 98
## [1] 0.528

## Area under the curve: 0.4864
##         Actual
## glm.pred   0   1
##        0  21  38
##        1  88 105
## [1] 0.5

## Area under the curve: 0.5669

## 
## Call:
## glm(formula = response ~ updown1 + updown2 + updown3 + updown4 + 
##     updown5 + X20d_direction + above20ma + vix1 + volatility1 + 
##     volumechange + pricechange + pricevolume_interact + volatilityvolume_interact, 
##     family = binomial, data = train)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -4.664603  11.784895  -0.396  0.69224    
## updown1                    -0.422397   0.120777  -3.497  0.00047 ***
## updown2                    -0.025816   0.122336  -0.211  0.83287    
## updown3                     0.007452   0.120292   0.062  0.95060    
## updown4                    -0.114295   0.119633  -0.955  0.33938    
## updown5                     0.023231   0.117918   0.197  0.84382    
## X20d_direction              0.282832   0.163561   1.729  0.08377 .  
## above20ma                  -0.193833   0.170167  -1.139  0.25467    
## vix1                       -0.009106   0.008488  -1.073  0.28337    
## volatility1                 5.183911  11.769980   0.440  0.65962    
## volumechange               31.458878  29.836479   1.054  0.29171    
## pricechange                 0.496005   9.830219   0.050  0.95976    
## pricevolume_interact       64.186154  36.086217   1.779  0.07529 .  
## volatilityvolume_interact -31.377277  29.635312  -1.059  0.28970    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1735.5  on 1258  degrees of freedom
## Residual deviance: 1710.9  on 1245  degrees of freedom
## AIC: 1738.9
## 
## Number of Fisher Scoring iterations: 4

Graph of Decay (AUC and Accuracy)

From the model trained only on 2005-2009, we can see that the direction of D-1 has an even more significant effect than in our first model (p < 0.001); however, no other feature is significant at the 5% level. We visualize the performance on the fifteen test sets so that the outcomes are easier to analyze. The first plot illustrates the yearly AUC values of the classifier, with a dashed line marking random prediction (AUC = 0.5). Thirteen of the fifteen yearly values are above 0.5 and a few come close to 0.6; however, the model still shows poor ability to distinguish between the two classes. Interestingly, except for the lowest points in 2022 and 2023, the prediction power shows no decline over the 15-year period, even though the model is trained only on a 5-year period that lies almost 20 years in the past. The second plot presents the prediction accuracy over the years (blue line) together with the actual upward proportion of each year (grey dashed line) for comparison. The close alignment between the two lines indicates that much of the classifier’s accuracy can be attributed to class imbalance rather than true predictive skill. In particular, years with a higher proportion of positive cases tend to show higher accuracy, highlighting the model’s tendency to follow class frequency instead of learning meaningful patterns. Again, the model demonstrates no meaningful predictive ability but exhibits no obvious decline in predictive accuracy over time.
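The comparison between accuracy and the upward proportion can be reproduced directly from any of the confusion matrices. The Python sketch below is an illustration only (the function name is hypothetical); it uses the first matrix of this section, which corresponds to the 2010 test year if the matrices are printed in chronological order, and follows the layout of the output, with predictions in rows and actual classes in columns.

```python
def summarize(tn, fn, fp, tp):
    """Compare a classifier with the naive 'always up' rule.

    Layout of the printed matrices: row 0 = predicted down (tn, fn),
    row 1 = predicted up (fp, tp).
    """
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        # Accuracy of always predicting up = share of actual up days.
        "always_up_accuracy": (fn + tp) / total,
        # How often the model itself predicts up.
        "predicted_up_share": (fp + tp) / total,
    }


first_year = summarize(tn=44, fn=36, fp=64, tp=108)
for name, value in first_year.items():
    print(f"{name}: {value:.4f}")
```

For this year the model's accuracy (about 0.603) exceeds the always-up rule (about 0.571) by only a few percentage points, while the model predicts an upward move on roughly two thirds of the days.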

Methodology 1.3:

Additionally, the following sections apply two further schemes, the rolling window and the expanding window. The rolling window approach is a common technique in time series analysis. In this method a fixed-size training window is moved forward step by step across the data set. At each step, the model is re-estimated using the most recent data, with the oldest data dropped, and is then evaluated on the subsequent fixed-size test set. We train on each four-year window and test on the following year. The first training set is 2005-2008 with 2009 as the test year, and the last training set is 2020-2023 with 2024 as the test year. This framework produces sixteen models, and we expect to see an improvement from regularly retraining the model on new data. We can also observe the significance of each explanatory variable to evaluate whether it varies over time. Although only four of our initial variables were significant in the preliminary model, we continue to use the full model in the following analysis, because that model is based on a single split and we do not know whether the same subset of variables is significant in every time period.
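The rolling scheme described above can be written down as a small generator of (training years, test years) pairs. This Python sketch is an illustration, not the authors' implementation; the function name and arguments are hypothetical.

```python
def rolling_splits(first_year, last_year, train_years=4, test_years=1):
    """Return (train, test) year lists for a fixed-size rolling window."""
    splits = []
    start = first_year
    # Stop once the test block would run past the last available year.
    while start + train_years + test_years - 1 <= last_year:
        train = list(range(start, start + train_years))
        test = list(range(start + train_years,
                          start + train_years + test_years))
        splits.append((train, test))
        start += 1  # slide the window forward by one year
    return splits


splits = rolling_splits(2005, 2024)
print(len(splits))
print(splits[0])
print(splits[-1])
```

With the study's settings this yields sixteen windows, the first training on 2005-2008 to test 2009 and the last training on 2020-2023 to test 2024.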

Rolling Window (train every 4 years; test every next 1 year)

(training set: 16, test set: 16)

##         Actual
## glm.pred  0  1
##        0 62 65
##        1 50 75
## [1] 0.5436508

## Area under the curve: 0.5311
##         Actual
## glm.pred   0   1
##        0  40  41
##        1  68 103
## [1] 0.5674603

## Area under the curve: 0.5591
##         Actual
## glm.pred  0  1
##        0 27 44
##        1 87 94
## [1] 0.4801587

## Area under the curve: 0.5257
##         Actual
## glm.pred   0   1
##        0  12   8
##        1 106 124
## [1] 0.544

## Area under the curve: 0.5089
##         Actual
## glm.pred   0   1
##        0  21  27
##        1  84 120
## [1] 0.5595238

## Area under the curve: 0.506
##         Actual
## glm.pred   0   1
##        0   5   8
##        1 103 136
## [1] 0.5595238

## Area under the curve: 0.5377
##         Actual
## glm.pred   0   1
##        0   9   6
##        1 124 113
## [1] 0.484127

## Area under the curve: 0.567
##         Actual
## glm.pred  0  1
##        0 36 33
##        1 85 98
## [1] 0.531746

## Area under the curve: 0.4908
##         Actual
## glm.pred  0  1
##        0 49 60
##        1 59 83
## [1] 0.5258964

## Area under the curve: 0.5504
##         Actual
## glm.pred  0  1
##        0 36 51
##        1 83 81
## [1] 0.4661355

## Area under the curve: 0.5182
##         Actual
## glm.pred  0  1
##        0 41 72
##        1 61 78
## [1] 0.4722222

## Area under the curve: 0.5304
##         Actual
## glm.pred   0   1
##        0   3   5
##        1 105 140
## [1] 0.5652174

## Area under the curve: 0.568
##         Actual
## glm.pred   0   1
##        0   4   5
##        1 105 138
## [1] 0.5634921

## Area under the curve: 0.5111
##         Actual
## glm.pred   0   1
##        0   9   1
##        1 134 107
## [1] 0.4621514

## Area under the curve: 0.5317
##         Actual
## glm.pred   0   1
##        0  36  37
##        1  77 100
## [1] 0.544

## Area under the curve: 0.5345
##         Actual
## glm.pred  0  1
##        0 32 55
##        1 77 88
## [1] 0.4761905

## Area under the curve: 0.5829

Graph of Rolling Logistic Regression (AUC and Accuracy)

Across the sixteen logistic regressions generated by the rolling window approach, we can observe changes in the significant variables over each overlapping 4-year period. updown1 is significant in half of the sixteen models; however, some models have no significant predictors at all. This shows that price action is not consistent over a long period: some variables are significant during specific time periods while appearing irrelevant in others.

While AUC values fluctuate over time, they consistently stay within a narrow range of approximately 0.49 to 0.58. This suggests that the model has no stable or substantial ability to discriminate between classes, and its performance is only marginally better than random in most years. In the second graph, the close co-movement between the two lines suggests that the classifier may be biased toward guessing the majority class, and that accuracy is largely driven by class imbalance rather than true model effectiveness.

Since the rolling window approach re-estimates the effect of each predictor in each window, we expect it to perform better than a fixed model that is never retrained. For a clearer comparison, we overlay the AUC and accuracy of the two frameworks, the fixed model and the rolling model.

Comparison of Decay and Rolling Window

The first plot, however, does not meet our expectations. Neither approach demonstrates consistently superior performance across the entire time span. This suggests that while the rolling window and the fixed model capture slightly different data structures, neither significantly improves the model’s ability to separate classes, and both remain within the bounds of marginal predictability. In other words, the additional computational effort of the rolling window approach does not deliver significant extra performance.

In the second plot, the two approaches show no obvious difference in predictive accuracy, and both are closely aligned with the proportion of upward days. In many years the accuracy is lower than the actual upward proportion, meaning that we would obtain better and more stable results by simply predicting an upward move every day.

Methodology 1.4:

We found that the outcomes vary considerably from year to year when the test set is a single year. Hence, we now extend the test set to a rolling two-year period to see whether the results become smoother and more stable. In addition, we use the expanding window approach in this section. This method is similar to the rolling window in that the length of the test set remains fixed; however, new observations are added to the training set at each iteration without dropping old ones, allowing the model to learn from an increasing history of past data.

We start with the first four years (2005-2008) as the training set and the next two years (2009-2010) as the test set. One year is added to the training set at each iteration. The last training set is 2005-2022 and the last test set is 2023-2024, giving fifteen models in total. Since the training set grows over time, we expect the results to be better than those of all the approaches discussed above.
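The expanding scheme differs from the rolling one only in that the start of the training set stays fixed. A minimal Python sketch, with a hypothetical function name and arguments and not taken from the original analysis, is:

```python
def expanding_splits(first_year, last_year, min_train_years=4, test_years=2):
    """Return (train, test) year lists for an expanding training window."""
    splits = []
    train_end = first_year + min_train_years - 1
    # Stop once the test block would run past the last available year.
    while train_end + test_years <= last_year:
        train = list(range(first_year, train_end + 1))  # always starts at first_year
        test = list(range(train_end + 1, train_end + 1 + test_years))
        splits.append((train, test))
        train_end += 1  # grow the training set by one year
    return splits


splits = expanding_splits(2005, 2024)
print(len(splits))
print(splits[0][1], splits[-1][1])
```

Note that consecutive two-year test sets overlap by one year (2009-2010, 2010-2011, and so on), so the fifteen test results are not independent of each other.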

Expand Window (increasing training set, test set remains 2 years)

(training set: 15, test set: 15)

##         Actual
## glm.pred   0   1
##        0  16  17
##        1 204 267
## [1] 0.5615079

## Area under the curve: 0.556
##         Actual
## glm.pred   0   1
##        0  20  16
##        1 202 266
## [1] 0.5674603

## Area under the curve: 0.5367
##         Actual
## glm.pred   0   1
##        0  12  11
##        1 220 259
## [1] 0.5398406

## Area under the curve: 0.5075
##         Actual
## glm.pred   0   1
##        0   3   4
##        1 220 275
## [1] 0.5537849

## Area under the curve: 0.4977
##         Actual
## glm.pred   0   1
##        0   4   8
##        1 209 283
## [1] 0.5694444

## Area under the curve: 0.5285
##         Actual
## glm.pred   0   1
##        0   6   9
##        1 235 254
## [1] 0.515873

## Area under the curve: 0.5527
##         Actual
## glm.pred   0   1
##        0   6   5
##        1 248 245
## [1] 0.4980159

## Area under the curve: 0.5667
##         Actual
## glm.pred   0   1
##        0   9   4
##        1 220 270
## [1] 0.554672

## Area under the curve: 0.5663
##         Actual
## glm.pred   0   1
##        0   8   4
##        1 219 271
## [1] 0.5557769

## Area under the curve: 0.5207
##         Actual
## glm.pred   0   1
##        0   9   6
##        1 212 276
## [1] 0.5666004

## Area under the curve: 0.5119
##         Actual
## glm.pred   0   1
##        0  14  13
##        1 196 282
## [1] 0.5861386

## Area under the curve: 0.5266
##         Actual
## glm.pred   0   1
##        0  12  14
##        1 205 274
## [1] 0.5663366

## Area under the curve: 0.5235
##         Actual
## glm.pred   0   1
##        0  15   9
##        1 237 242
## [1] 0.5109344

## Area under the curve: 0.5025
##         Actual
## glm.pred   0   1
##        0  13  11
##        1 243 234
## [1] 0.493014

## Area under the curve: 0.5012
##         Actual
## glm.pred   0   1
##        0   6  14
##        1 216 266
## [1] 0.5418327

## Area under the curve: 0.5264

Among the fifteen models, updown1 is strongly significant in every window. As the training set accumulates more historical data, volumechange and the two interaction terms gradually become significant, as in the preliminary model. However, even with the expanding training set and the longer test set, the AUC of every two-year test period remains similar to the previous results.

Methodology 2.1:

Since none of the logistic regression variants met our expectations, we proceed to explore non-parametric machine learning classifiers for further experimentation and analysis.

To illustrate how the explanatory variables relate to the response, in this part we use the decision tree algorithm. A decision tree predicts the value of a target variable by learning simple decision rules inferred from the data features. We use data before 2020 to train the model and data from 2020 onwards to test it.

Decision Tree

## 
## Regression tree:
## tree(formula = response ~ updown1 + updown2 + updown3 + updown4 + 
##     updown5 + X20d_direction + above20ma + vix1 + volatility1 + 
##     volumechange + pricechange + pricevolume_interact + volatilityvolume_interact, 
##     data = train, method = "class")
## Variables actually used in tree construction:
## character(0)
## Number of terminal nodes:  1 
## Residual mean deviance:  0.2479 = 935.4 / 3774 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.547  -0.547   0.453   0.000   0.453   0.453

However, the output shows that no variables were selected for splitting. The model does not split at all, implying that it could not find a feature that improves the fit enough to justify a split. The output labels the fit a regression tree with a single terminal node, i.e. it predicts the training mean of about 0.547 for every observation. We therefore conclude that the model has no predictive power. This might be caused by the weak relationship between the features and the response and by noise in the data.
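Why a tree may refuse to split can be seen from the quantity it tries to improve. The Python sketch below is an illustration with hypothetical function names, not the criterion code of the R package used above; it computes the Gini impurity of a node and the impurity reduction of a candidate split. For a 0/1 response the node variance p(1 - p) is exactly half the Gini impurity 2p(1 - p), so a regression tree on a 0/1 response ranks splits in the same order.

```python
def gini(labels):
    """Gini impurity 2p(1-p) of a node holding 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def split_gain(labels, goes_left):
    """Impurity reduction when observations are sent left/right by a rule."""
    left = [y for y, m in zip(labels, goes_left) if m]
    right = [y for y, m in zip(labels, goes_left) if not m]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


labels = [0, 0, 1, 1]
print(split_gain(labels, [True, True, False, False]))  # perfectly informative rule
print(split_gain(labels, [True, False, True, False]))  # uninformative rule
```

A rule that leaves the share of up days the same on both sides has zero gain, and a tree with a minimum-improvement requirement will then stop at the root, as in the output above.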

Methodology 2.2:

To capture complex patterns with non-linear relationships, we have chosen to use the gradient boosting machine (GBM) model to analyze the binary response variable alongside market-related features. The GBM is an ensemble learning method that constructs multiple shallow decision trees in a sequential manner. Each new tree is designed to correct the residual errors of the trees that were built previously. In this study, the model is trained on data from before 2020 and tested on data from 2020 onwards. Specifically, we train a total of 5,000 trees, with each tree having a maximum depth of 3.

Gradient Boost Machine

##           
## boost_pred   0   1
##          0 225 266
##          1 357 410

## Accuracy:  0.5047695

## Area under the curve: 0.4957

In summary, the accuracy is about 50.5% and the AUC is 0.4957, both close to random guessing, indicating low predictive power on the test set. We propose some potential causes for this outcome. Overfitting is one of the main candidates: with 5,000 trees the model may fit noise in the training period. Moreover, the influence of noisy features and weak distinguishing ability cannot be neglected: such features may not sufficiently distinguish between classes, so the model cannot separate them.
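The sequential error-correcting idea behind gradient boosting can be sketched in plain Python. This is a deliberately simplified illustration and not the configuration used in the study: it uses depth-1 trees (stumps) instead of depth 3, a handful of rounds instead of 5,000, and a plain gradient step with leaf means; all function names and defaults are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Best single-feature threshold split, by squared error on the residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best  # None if no feature admits a split


def fit_boosting(X, y, n_trees=50, learning_rate=0.1):
    """Boost stumps on the log-loss; assumes both classes occur in y."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1.0 - p0))  # initial log-odds
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        # Pseudo-residuals: what the current ensemble still gets wrong.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        _, j, t, lm, rm = stump
        stumps.append((j, t, lm, rm))
        scores = [s + learning_rate * (lm if row[j] <= t else rm)
                  for row, s in zip(X, scores)]
    return f0, stumps, learning_rate


def predict_proba(model, row):
    f0, stumps, lr = model
    score = f0 + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)
    return sigmoid(score)
```

Each round fits a new stump to the residuals left by the previous rounds, so with many rounds and noisy features the ensemble can keep reducing training error without improving out-of-sample discrimination.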

Methodology 2.3:

In this part, we use extreme gradient boosting (XGBoost) to fit another model. XGBoost is a powerful ensemble learning algorithm based on gradient boosting. Its purpose is to combine multiple weak learners, typically decision trees, to create a strong predictive model. It is widely used in classification and regression problems for its speed and accuracy. As before, the model is trained on data from before 2020 and tested on data from 2020 onwards.

XGBoost

##          Actual
## Predicted   0   1
##         0  93 121
##         1 489 555

## Accuracy:  0.5151033

## Area under the curve: 0.5022

The accuracy result turns out to be 0.5151, which is barely better than random guessing. Moreover, the AUC of 0.5022 suggests no discriminatory power, since in practice, “AUC > 0.7” is usually considered acceptable. Thus, the outcome indicates the model is not learning any meaningful pattern from the data.

Methodology 2.4:

Last but not least, we utilize the support vector machine algorithm to test the data. The SVM is a supervised learning algorithm used primarily for classification. Its main goal is to find the optimal decision boundary (hyperplane) that separates different classes in the feature space with the maximum margin. In this part, we use the SVM with a linear kernel to solve this binary classification problem.

SVM

##      Actual
## class   0   1
##     1 582 676

## [1] 0.5373609

## Area under the curve: 0.5

The accuracy is 53.74%, similar to previous outcomes that settled around 50%. However, the confusion matrix shows that the model predicts an upward move for every test day, so the accuracy simply equals the share of upward days in the test set, and the AUC of 0.5 confirms that the model again fails to find a meaningful decision boundary. This might be caused by class imbalance, non-informative features, or a linear kernel that is too simple. Thus, to look for a better outcome, we next try a non-linear kernel.

We use an SVM with a radial basis function (RBF) kernel as a non-linear classifier. Its purpose is to capture complex, non-linear relationships between the input features and the target variable. The RBF kernel implicitly maps the input space into a higher-dimensional space, allowing the model to draw curved decision boundaries. However, the outcome is no better: the accuracy is 50.87% and the AUC falls below 0.5 (0.4829), which reflects poor predictive power. This indicates that the model fails to generalize or capture meaningful patterns in the data.

non-linear SVM

##      Actual
## class   0   1
##     0  80 116
##     1 502 560

## [1] 0.508744

## Area under the curve: 0.4829
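For reference, the radial basis function kernel mentioned above measures similarity between two feature vectors as a decaying function of their squared distance. The Python sketch below is illustrative only; the function name is hypothetical and the value of gamma is an assumption, since the setting used in the study is not stated.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points: similarity 1.0
print(rbf_kernel([0.0], [1.0]))            # exp(-1) for unit distance
```

Larger values of gamma make the similarity fall off faster, which allows more flexible decision boundaries but also increases the risk of fitting noise.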

Distribution of up & down in each year and all data

## 2005 : 0.5595238 
## 2006 : 0.561753 
## 2007 : 0.5458167 
## 2008 : 0.4980237 
## 2009 : 0.5555556 
## 2010 : 0.5714286 
## 2011 : 0.547619 
## 2012 : 0.528 
## 2013 : 0.5833333 
## 2014 : 0.5714286 
## 2015 : 0.4722222 
## 2016 : 0.5198413 
## 2017 : 0.5697211 
## 2018 : 0.5258964 
## 2019 : 0.5952381 
## 2020 : 0.5731225 
## 2021 : 0.5674603 
## 2022 : 0.4302789 
## 2023 : 0.548 
## 2024 : 0.5674603

## all :  0.5446056

Comparison among Classifiers

Comparison of Model Performance (Accuracy and AUC)
Model	Accuracy	AUC
Logistic Regression	0.5341812	0.4952
GBM	0.5047695	0.4957
XGBoost	0.5151033	0.5022
SVM	0.5373609	0.5000
non-linear SVM	0.5087440	0.4829

Conclusion:

Some of the prediction accuracies observed in our logistic models exceed 50%, with the highest reaching about 60%, but these can largely be attributed to a relatively high proportion of upward movements in the particular test year. Therefore, they do not necessarily indicate that the classifier itself has effective prediction capabilities. Even with many different combinations of training and test sets, including the rolling model, we still conclude that the model has no meaningful predictive power. When we train on a small early sample and test every single year, the predictive power does not decline; similarly, under the expanding window the predictive power does not increase as the training set becomes larger. From the confusion matrices, it is evident that the models predict upward movements much more frequently than downward movements, i.e. they are heavily biased toward one class. This bias stems from the underlying distribution of the data, in which upward days are more common, at 54.46% of all days.

After reviewing different combinations of training and test sets, we turned to non-parametric models to see whether there is any significant improvement. Because our main model is logistic regression and the scope of the study is limited, all non-parametric models use the single split for a fair comparison (train 2005-2019, test 2020-2024). Taken together, the single-split outputs reported above show that no classifier achieves an AUC meaningfully different from 0.5: the values range from 0.4829 (non-linear SVM) to 0.5022 (XGBoost), far below the threshold of 0.7 that is usually considered acceptable. Logistic regression, the linear and interpretable baseline, has an AUC of 0.4952 on this split. The linear SVM achieves the highest accuracy (0.5374), but only because it predicts an upward move on every test day. Overall, the results suggest that more complex models do not outperform the simpler method in terms of true discriminative power as measured by AUC.

In conclusion, we conducted a comprehensive study involving a well-rounded selection of price and volume based features, rigorous data processing procedures designed to strictly avoid look-ahead bias, multiple training and testing set configurations, and several classical non-parametric classification methods. We adopted AUC rather than accuracy as our primary performance measure to account for potential class imbalance and threshold sensitivity. Despite these careful methodological choices, our findings suggest that the daily price direction behaves largely like a random process, lacking significant predictability. This indicates that predicting single-day price movements may not be feasible within our current framework.

Future research should consider redefining the response variable to capture multi-day price trends, or to classify moves by their magnitude, possibly with a more refined operational definition that captures the essence of price action rather than just the simple daily direction. The inability to predict one-day returns does not imply that all aspects of market behavior are unpredictable, nor does it deny the existence of profitable trading strategies based on other market features.

ECON590 ML Final Project

Hsiang Lee (hsiangl2@illinois.edu), Hanzhe Dong (hanzhe2@illinois.edu)

2025-04-21