Step by Step Walkthrough with S&P 500 Index
Disclaimer: This is a deep-dive tutorial on the Hampel Filter. The S&P 500 Index is just an example to demonstrate how to implement the Hampel Filter. Nothing here should be considered financial advice.
Outliers can significantly impact the results of data analysis, causing incorrect conclusions and decisions. The Hampel filter is a powerful tool for detecting and handling outliers in data. It is widely used in many fields, including finance, engineering, and statistics.
This tutorial will cover all the details about the Hampel Filter to get you started using it. Hopefully, it will also help you understand when and why the Hampel Filter does or doesn’t work. We will also include a step-by-step walkthrough of applying some basic ML models to the Hampel Filter smoothed S&P 500 Index, so make sure you read it through to see a real-life application.
Key Topics Discussed (TL;DR):
- Step-by-step dissection of Hampel Filter
- Key assumptions of Hampel Filter
- Apply Normal and Laplacian Hampel Filter on S&P 500 Index
- Compare ML results on raw S&P 500 Index and Hampel Filter smoothed S&P 500 Index
- Reflection on the walkthrough
The Hampel filter is an algorithm that tries to smooth out a data series by replacing outliers with an appropriate value.
It calculates the “appropriate value” using the following parameters:
- Parameter 1: the half-size of the sliding window, m. The window is symmetrical, i.e. if m = 2 (illustrated in the diagram above), the actual sliding window size is 2 + 2 + 1 = 5.
- Parameter 2: the outlier threshold n, expressed in multiples of the rolling standard deviation.
- Parameter 3: the scaling constant k for estimating the rolling standard deviation from the median absolute deviation. As this is assumed to be 1.4826 most of the time, the constant is often overlooked when applying the Hampel Filter.
Note: The need for a scaling constant is such that E(k MAD) ≈ E(S) ≈ 𝜎 where 𝜎 is the population standard deviation, S is sample standard deviation and MAD is median absolute deviation.
- Step 1: Find the rolling median of the sliding window. Repeat that for every timestamp t in the data series.
- Step 2: Calculate the sliding window’s rolling Median Absolute Deviation (MAD), i.e. the median of the absolute deviations from the rolling median: MAD_t = median(|x_i − median_t|) over every x_i in the window. Repeat that for every timestamp t in the data series.
- Step 3: Estimate the rolling standard deviation as the rolling MAD multiplied by the scaling constant k. Repeat that for every timestamp t in the data series.
- Step 4: Calculate the difference between the data point at the middle of the sliding window (i.e. x_t) and the rolling median.
- Step 5: Whenever the difference calculated in Step 4 is greater than n times the rolling standard deviation, replace x_t with the rolling median.
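The steps above can be sketched with pandas rolling aggregations. This is a minimal single-pass sketch, not the author’s original code; the function name and defaults are illustrative:

```python
import numpy as np
import pandas as pd

def hampel_filter(series: pd.Series, m: int = 2, n: float = 3.0, k: float = 1.4826):
    """Single-pass Hampel Filter.

    m: half-size of the sliding window (full window = 2*m + 1)
    n: outlier threshold, in multiples of the estimated rolling std
    k: scaling constant turning the rolling MAD into a std estimate
    """
    window = 2 * m + 1
    # Step 1: rolling median of each centred window
    rolling_median = series.rolling(window, center=True).median()
    # Step 2: rolling MAD = median absolute deviation within each window
    rolling_mad = series.rolling(window, center=True).apply(
        lambda x: np.median(np.abs(x - np.median(x))), raw=True
    )
    # Step 3: estimated rolling std = k * MAD
    rolling_std = k * rolling_mad
    # Steps 4-5: flag points too far from the rolling median and replace them
    is_outlier = (series - rolling_median).abs() > n * rolling_std
    return series.where(~is_outlier, rolling_median), is_outlier
```

For example, a lone spike in an otherwise flat series gets flagged and pulled back to the local median.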
While the algorithm sounds simple enough, there are a few critical points that can easily go unnoticed:
- It uses the Median Absolute Deviation and median instead of the standard deviation and mean. This makes the Hampel filter more robust to outliers than a filter based on the mean, since the standard deviation (like the mean) is easily skewed by outliers. When our sample data contains outliers, using the mean and standard deviation for denoising could mean that we are implicitly accepting some of the outliers as normal data points.
Mean is more susceptible to outliers than Median.
- Hampel Filter works better with symmetrical datasets. Like any denoising algorithm that defines outliers as data points deviating too far from the “middle” of the dataset, it will not perform as well when the data is heavily skewed. This may not always be an issue, as the volatility of data can be naturally higher at certain periods. But the patterns in the rolling standard deviation and in skewness/kurtosis are definitely metrics we should keep an eye on when applying the Hampel Filter.
- The scaling constant of 1.4826 assumes data to be Gaussian-like. This is arguably one of the most forgotten assumptions of the Hampel Filter. As mentioned before, the scaling constant exists such that the MAD can be scaled to approximate the population standard deviation (E(k MAD) ≈ E(S) ≈ 𝜎). For a normal distribution, k is the reciprocal of the 75th percentile of the standard normal distribution (i.e., 1/0.67449 ≈ 1.4826). But that may not be the same for other distributions. Simulating 1,000 samples of 1,000 observations each from the Uniform, Laplace, and Exponential distributions, below is the code and the results of the estimated scaling constants:
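A sketch of such a simulation (the function and variable names, and the exact estimator, are mine): for each distribution we draw 1,000 samples of 1,000 observations and average the ratio of sample standard deviation to sample MAD, which estimates the k that makes k · MAD ≈ 𝜎:

```python
import numpy as np

rng = np.random.default_rng(42)

def estimated_k(sampler, n_sims=1000, n_obs=1000):
    """Average of (sample std / sample MAD) over many simulated samples."""
    ratios = []
    for _ in range(n_sims):
        x = sampler(n_obs)
        mad = np.median(np.abs(x - np.median(x)))
        ratios.append(np.std(x, ddof=1) / mad)
    return float(np.mean(ratios))

samplers = {
    "normal": lambda size: rng.normal(size=size),  # sanity check, ~1.4826
    "uniform": lambda size: rng.uniform(-1, 1, size=size),
    "laplace": lambda size: rng.laplace(size=size),
    "exponential": lambda size: rng.exponential(size=size),
}

for name, sampler in samplers.items():
    print(f"{name}: k ≈ {estimated_k(sampler):.4f}")
```

The Normal estimate should land near 1.4826, and the Laplace estimate near the 2.04 used later in this walkthrough.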
- Hampel Filter defaults to 3 standard deviations because that covers roughly 99.7% of the data under a Normal distribution. Following up on the previous point on the scaling constant, this is another parameter that we should consider adjusting according to the underlying distribution. For instance: 2.3 standard deviations for the Laplace distribution, 3.4 standard deviations for the Cauchy distribution, etc.
Enough theoretical talking. Let’s try to apply it to some real data!
This walkthrough will use S&P 500 daily prices from 2nd Jan 2020 until 7th Mar 2023, made available on Investing.com.
Once you have downloaded the data, the CSV file should be named S&P 500 Historical Data.csv.
Load & Preprocess Data
Once we have loaded the data, let’s handle the date column appropriately. We should end up with a dataframe of 800 rows from 2nd Jan 2020 to 7th Mar 2023.
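A sketch of that loading step. The column names, date format, and thousands-separator handling below are assumptions about the Investing.com export, so adjust them to your file:

```python
import pandas as pd

def load_prices(csv_source) -> pd.DataFrame:
    """Load the S&P 500 CSV and index it by date, oldest first."""
    df = pd.read_csv(csv_source)
    # Assumed date format, e.g. "03/07/2023"
    df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
    # Assumed price format with thousands separators, e.g. "3,986.37"
    df["Price"] = df["Price"].astype(str).str.replace(",", "").astype(float)
    return df.sort_values("Date").set_index("Date")

# prices = load_prices("S&P 500 Historical Data.csv")["Price"]
```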
Implement Hampel Filter
We will apply Laplacian Hampel Filter and Normal Hampel Filter in this walkthrough. To start with, we will need to define a couple of parameters:
- Window size: 5 days
- MAD scaling constant: 2.04 for Laplace, 1.4826 for Normal
- Threshold of deviation: 2.3x standard deviation for Laplace, 3x standard deviation for Normal
Note: For simplicity’s sake, window size is defined arbitrarily. More detailed time-series analysis is recommended to come up with a more justified window size.
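Wiring the two filters up could look like this self-contained sketch, where `prices` is an assumed name for the Price series loaded earlier:

```python
import numpy as np
import pandas as pd

def hampel(series: pd.Series, m: int, n: float, k: float):
    """Hampel Filter with all rolling aggregations centred on the window."""
    window = 2 * m + 1
    med = series.rolling(window, center=True).median()
    mad = series.rolling(window, center=True).apply(
        lambda x: np.median(np.abs(x - np.median(x))), raw=True
    )
    # Outlier when the deviation from the rolling median exceeds n estimated stds
    outlier = (series - med).abs() > n * (k * mad)
    return series.where(~outlier, med), outlier

# Window size 5 days -> m = 2 points on either side of the centre
# smoothed_normal, out_normal = hampel(prices, m=2, n=3.0, k=1.4826)   # Normal
# smoothed_laplace, out_laplace = hampel(prices, m=2, n=2.3, k=2.04)   # Laplacian
```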
In the code above, all rolling aggregations have been applied with the argument center=True. This ensures that the Hampel Filter is applied to the center of the sliding window.
We should end up with 86 outliers for the Normal Hampel Filter and 213 for Laplacian Hampel Filter.
As Laplacian Hampel Filter has tighter parameters than the Normal counterpart, more outliers have been flagged, and the resultant data series is much smoother.
Upon further investigation, you will notice that the set of outliers flagged by the Normal Hampel Filter is, in fact, a subset of that flagged by the Laplacian Hampel Filter.
Apply ML Model
To keep the walkthrough simple, we have chosen XGBRegressor as the machine learning model for predicting movement in the S&P 500 Index.
The model will take 11 days of historical prices to predict the next day’s percentage change in S&P 500 Index.
When creating our training & testing sets, we rebased each window of 11 historical prices to the first day of that window to avoid data leakage.
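The feature construction might be sketched as follows, with each 11-day window divided by its first price so the model only sees relative moves (function and variable names are mine):

```python
import numpy as np
import pandas as pd

def make_features(prices: pd.Series, lookback: int = 11):
    """X: lookback days of prices rebased to the first day of each window.
    y: the following day's percentage change."""
    vals = prices.to_numpy()
    X, y = [], []
    for i in range(len(vals) - lookback):
        window = vals[i:i + lookback]
        X.append(window / window[0])  # rebase: remove the absolute price level
        y.append(vals[i + lookback] / vals[i + lookback - 1] - 1)  # next-day % change
    return np.array(X), np.array(y)
```

The resulting X and y can then be split chronologically and passed to XGBRegressor through its usual fit/predict interface.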
We have obtained the following results from our models:
- RMSE on Raw S&P 500 Index: 1.5850%
- RMSE on Normal Hampel Filter smoothed S&P 500 Index: 1.5177% (4% better than raw)
- RMSE on Laplacian Hampel Filter smoothed S&P 500 Index: 1.5717% (0.8% better than raw)
Bonus: Is Dynamic Hampel Filter A Thing?
We know that stock markets do not necessarily fit well with Normal distributions. Could there be any chance that we can have Hampel Filter adjusted according to the detected distribution, i.e., dynamic Hampel Filter?
Let’s add two more functions to the mix: one fits the Normal distribution, and another fits the Laplace distribution. Both functions would return the fit’s negative log-likelihood value for determining whether Laplace or Normal is a better fit.
Note: Negative log-likelihood measures how likely the observed data are, assuming the events follow some pre-defined distribution. It is a quantity that we want to minimise, i.e. the lower the value, the better the fit.
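Those two fitting functions might look like this sketch built on scipy’s MLE fitters (the function names are mine):

```python
import numpy as np
from scipy import stats

def normal_nll(x) -> float:
    """Fit a Normal distribution by MLE and return the negative log-likelihood."""
    loc, scale = stats.norm.fit(x)
    return float(-np.sum(stats.norm.logpdf(x, loc, scale)))

def laplace_nll(x) -> float:
    """Fit a Laplace distribution by MLE and return the negative log-likelihood."""
    loc, scale = stats.laplace.fit(x)
    return float(-np.sum(stats.laplace.logpdf(x, loc, scale)))

# Per window: use the Laplacian parameters (k=2.04, n=2.3) when
# laplace_nll(window) < normal_nll(window), otherwise the Normal defaults.
```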
- RMSE on Dynamic Hampel Filter smoothed S&P 500 Index: 1.5928% (0.4% worse than raw)
Surprise, surprise! The result seems to be worse than using the raw S&P 500 Index! In the next section, we will touch on some observed flaws in our application of the Hampel Filter, so let’s reflect on the results we have observed so far!
Reflection on Prediction Results
Note: As this is not a blog post about algorithmic trading, the reflection will be focused on Hampel Filter and its application instead.
Obviously, the models aren’t superb at predicting the S&P 500 Index. But how can we make them better?
- Data Symmetry: As mentioned earlier, the Hampel Filter works better on symmetrical datasets, which the S&P 500 Index certainly is not. Techniques like the Box-Cox transformation could be applied to improve data symmetry.
- Cyclical Patterns: When defining the window size, we have ignored any possible temporal patterns in the dataset. Analysis of autocorrelation, partial autocorrelation, or even Fourier-transformed series could be considered for a more appropriate window size.
- Value in Outliers: By removing outliers, we assume that outlier data would bring more harm than good to our final model. However, statistics such as the magnitude of deviation, frequency of outliers, and temporal pattern of outliers could all be valuable to the ultimate problem we want to solve. Before we try to replace the outliers, looking at just the outliers is worth some effort.
- Over-engineering: It is one thing to understand how the default parameters of the Hampel Filter were derived, but a completely different thing to tune those parameters to our needs. In the bonus section, we put forward the hypothesis that the Hampel Filter should be adjusted based on the behavior of the underlying data. While that may sound reasonable on the surface, we might have added unnecessary complexity to our model.
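To illustrate the Box-Cox idea from the first point, here is a small example on synthetic right-skewed data (hypothetical data, not the S&P 500):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=5000)  # right-skewed positive data

# stats.boxcox picks the lambda that maximises the log-likelihood of normality
transformed, lam = stats.boxcox(skewed)

print(f"skewness before: {stats.skew(skewed):.2f}")
print(f"skewness after:  {stats.skew(transformed):.2f} (lambda = {lam:.2f})")
```

Note that Box-Cox requires strictly positive inputs, which holds for price levels but not for returns.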
Simple models and a lot of data trump more elaborate models based on less data — Peter Norvig