Today, we will continue our exploration of developments in the world of tidy models, and we will stick with our usual Fama French modeling flow to do so. For new readers who want to get familiar with Fama French before diving into this post, see here where we covered importing and wrangling the data, here where we covered rolling models and visualization, and here where we covered managing many models. If you’re into Shiny, this flexdashboard might be of interest, as well.
Let’s get to it.
First, we need our data and, as usual, we’ll import daily prices for five ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the Fama French five-factor data and join it to our five ETF returns data. Here’s the code to make that happen (this code was covered in detail in this post):
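symbols <- ...  # the five ETF tickers, "SPY" among them (list not shown here)

prices <-
  getSymbols(symbols, ...) %>%  # download arguments not shown here
  map(~Ad(get(.))) %>%          # keep the adjusted close of each ETF
  reduce(merge) %>%
  `colnames<-`(symbols)

# Object name assumed for illustration; long format with daily log returns
asset_returns_long <-
  prices %>%
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, prices, -date) %>%
  group_by(asset) %>%
  mutate(daily_returns = (log(prices) - log(lag(prices)))) %>%
  na.omit()

factors_data_address <- ...  # address of the Fama French factor file (not shown here)

Global_5_Factors <-
  read_csv(...) %>%  # read the downloaded factor CSV (arguments not shown here)
  rename(date = X1, MKT = `Mkt-RF`) %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m%d"))) %>%
  mutate_if(is.numeric, funs(. / 100)) %>%  # percentages to decimals
  select(-RF)

data_joined_tidy <-
  asset_returns_long %>%
  left_join(Global_5_Factors, by = "date") %>%
  na.omit()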
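The return calculation in that pipeline is a daily log return, log(price today) minus log(price yesterday), computed within each asset. Here is a minimal sketch on made-up data (the labels "AAA" and "BBB" and every price are invented for illustration) showing why group_by() comes before lag() and why na.omit() then drops the first row of each asset:

```r
library(dplyr)

# Toy long-format prices; all values are invented for illustration only.
toy_prices <- tibble(
  date   = rep(as.Date("2020-01-01") + 0:2, times = 2),
  asset  = rep(c("AAA", "BBB"), each = 3),
  prices = c(100, 110, 121, 50, 50, 25)
)

toy_returns <- toy_prices %>%
  group_by(asset) %>%
  # lag() runs within each asset, so the first row per asset has no prior price
  mutate(daily_returns = log(prices) - log(lag(prices))) %>%
  na.omit() %>%   # drops those first rows (NA returns)
  ungroup()

toy_returns
```

Without the group_by(), the first "BBB" return would be computed against the last "AAA" price, which is not what we want.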
For today, let’s work with just the SPY data by filtering our data set by asset.
spy_2013_2017 <-
  data_joined_tidy %>%
  filter(asset == "SPY")
Next, we re-sample this five years’ worth of data into smaller training and testing sets. This is frequently done by k-fold cross validation (see here for an example), where random samples are taken from the data, but since we are working with time series, we will use a time-aware technique. The rsample package has a function for exactly this purpose, rolling_origin(). We covered this process extensively in this previous post. Here’s the code to make it happen.
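rolling_origin_spy_2013_2017 <-
  rolling_origin(
    data       = spy_2013_2017,
    initial    = 100,    # 100 days in each analysis set
    assess     = 1,      # one day in each assessment set
    cumulative = FALSE   # slide the window rather than grow it
  )

rolling_origin_spy_2013_2017 %>%
  dim()
[1] 1159    2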
We now have a data object called rolling_origin_spy_2013_2017 that holds 1159 splits of data. Each split consists of an analysis data set with 100 days of return and factor data, and an assessment data set with one day of return and factor data.
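To make the shape of those splits concrete, here is a small sketch on ten made-up observations (the toy_series data frame is invented; only rolling_origin(), analysis(), and assessment() come from rsample). With a sliding window, the number of splits is rows - initial - assess + 1:

```r
library(rsample)

# Ten made-up observations standing in for a time series.
toy_series <- data.frame(day = 1:10, value = seq(0.1, 1, by = 0.1))

toy_splits <- rolling_origin(
  toy_series,
  initial    = 4,      # rows in each analysis (training) window
  assess     = 1,      # rows in each assessment (testing) window
  cumulative = FALSE   # slide the window instead of growing it
)

nrow(toy_splits)                        # 10 - 4 - 1 + 1 = 6 splits
analysis(toy_splits$splits[[1]])$day    # days 1 to 4
assessment(toy_splits$splits[[1]])$day  # day 5
analysis(toy_splits$splits[[2]])$day    # days 2 to 5
```

The same arithmetic is consistent with our 1159 splits if the SPY data has 1259 rows: 1259 - 100 - 1 + 1 = 1159.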
Now, we can start using that collection of data splits to fit a model on the analysis data, and then test our model on the assessment data. That means it’s time to introduce a relatively new addition to the R tool chain, the parsnip package.
parsnip is a unified model interface that allows us to create a model specification, set an analytic engine, and then fit a model. It’s a ‘unified’ interface in the sense that we can use the same scaffolding but insert different models, or different engines, or different modes. Let’s see how that works with linear regression.
Recall that in the previous post, we piped our data into a linear model like so:
analysis(rolling_origin_spy_2013_2017$splits[[1]]) %>%
  do(model = lm(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = .)) %>%
  tidy(model)
# A tibble: 6 x 6
# Groups:   asset [1]
  asset term         estimate std.error statistic  p.value
1 SPY   (Intercept)  0.000579  0.000338      1.71 8.98e- 2
2 SPY   MKT          0.909     0.0739       12.3  2.79e-21
3 SPY   SMB         -0.495     0.112        -4.43 2.52e- 5
4 SPY   HML         -0.609     0.208        -2.92 4.38e- 3
5 SPY   RMW         -0.591     0.259        -2.28 2.47e- 2
6 SPY   CMA         -0.395     0.206        -1.92 5.81e- 2
Now, we will pipe into the parsnip scaffolding, which will allow us to quickly change to a different model and specification further down in the code.
Since we are running a linear regression, we first create a specification with linear_reg(), then set the engine with set_engine("lm"), and finally fit the model with fit(five_factor_model, data = one of our splits).
lm_model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(daily_returns ~ MKT + SMB + HML + RMW + CMA,
      data = analysis(rolling_origin_spy_2013_2017$splits[[1]]))

lm_model
parsnip model object

Call:
stats::lm(formula = formula, data = data)

Coefficients:
(Intercept)          MKT          SMB          HML          RMW
  0.0005794    0.9086303   -0.4951297   -0.6085088   -0.5910375
        CMA
 -0.3954515
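Those coefficients match the ones from the plain lm() call above, which is what we would expect if parsnip simply hands the work to the engine. A minimal sketch on simulated data (the f1 and f2 factors and their coefficients are invented for illustration) compares the two routes; it assumes the engine's own object sits in the fit element of the parsnip object:

```r
library(parsnip)
library(dplyr)

set.seed(123)
# Simulated data: y depends linearly on two made-up factors plus noise.
sim <- data.frame(f1 = rnorm(100), f2 = rnorm(100))
sim$y <- 0.5 + 0.9 * sim$f1 - 0.4 * sim$f2 + rnorm(100, sd = 0.1)

# parsnip route: specification, then engine, then fit
parsnip_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ f1 + f2, data = sim)

# Direct call to the engine
base_fit <- lm(y ~ f1 + f2, data = sim)

# Compare the coefficients of the wrapped lm object with the direct fit
all.equal(coef(parsnip_fit$fit), coef(base_fit))
```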
Now that we’ve fit the model on our training set, let’s see how well it predicted the test set. We can use the predict() function and pass it the results of our parsnip code flow, along with the assessment data from the same split.
assessment(rolling_origin_spy_2013_2017$splits[[1]]) %>%
  select(daily_returns) %>%
  bind_cols(predict(lm_model, new_data = assessment(rolling_origin_spy_2013_2017$splits[[1]])))
# A tibble: 1 x 3
# Groups:   asset [1]
  asset daily_returns   .pred
1 SPY         0.00597 0.00737
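The pattern here is that predict() on a parsnip fit returns a tibble with a .pred column and one row per row of new_data, so it can be bound next to the actual values. A small sketch on simulated data (the data, the train/test names, and the 100/1 row split are made up to mirror one of our splits):

```r
library(parsnip)
library(dplyr)

set.seed(42)
sim <- tibble(f1 = rnorm(101))
sim <- mutate(sim, y = 0.9 * f1 + rnorm(101, sd = 0.1))

train <- sim[1:100, ]  # plays the role of analysis()
test  <- sim[101, ]    # plays the role of assessment()

fit_lm <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ f1, data = train)

# Bind the one-row prediction next to the one actual value
result <- test %>%
  select(y) %>%
  bind_cols(predict(fit_lm, new_data = test))

result
```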
That worked well, but now let’s head to a more complex model and use the ranger package as an engine for a random forest analysis.
To set up the ranger random forest model in parsnip, we first use rand_forest(mode = "regression", mtry = 3, trees = 100) to create the specification, set_engine("ranger") to set the engine as the ranger package, and fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = analysis(rolling_origin_spy_2013_2017$splits[[1]])) to fit the five-factor Fama French model to the 100-day sample in our first split.
# Need to load the package to be used as the random forest engine
library(ranger)

rand_forest(mode = "regression", mtry = 3, trees = 100) %>%
  set_engine("ranger") %>%
  fit(daily_returns ~ MKT + SMB + HML + RMW + CMA,
      data = analysis(rolling_origin_spy_2013_2017$splits[[1]]))
parsnip model object

Ranger result

Call:
 ranger::ranger(formula = formula, data = data, mtry = ~3, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))

Type:                             Regression
Number of trees:                  100
Sample size:                      100
Number of independent variables:  5
Mtry:                             3
Target node size:                 5
Variable importance mode:         none
Splitrule:                        variance
OOB prediction error (MSE):       1.514654e-05
R squared (OOB):                  0.6880896
ranger gives us an OOB prediction error (MSE) value as part of its return. parsnip returns to us what the underlying engine returns.
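If we want that out-of-bag number as a value rather than as printed text, one option is to reach into the engine object. The sketch below runs on simulated data (all variables invented) and assumes the ranger object is stored in the fit element of the parsnip result; prediction.error and r.squared are fields of ranger's own return value:

```r
library(parsnip)
library(dplyr)
library(ranger)

set.seed(2019)
sim <- tibble(f1 = rnorm(200), f2 = rnorm(200), f3 = rnorm(200))
sim <- mutate(sim, y = f1 - 0.5 * f2 + rnorm(200, sd = 0.2))

rf_fit <- rand_forest(mode = "regression", mtry = 2, trees = 100) %>%
  set_engine("ranger") %>%
  fit(y ~ f1 + f2 + f3, data = sim)

# Pull the out-of-bag statistics from the underlying ranger object
oob_mse <- rf_fit$fit$prediction.error  # out-of-bag MSE
oob_r2  <- rf_fit$fit$r.squared         # out-of-bag R squared

# The square root of the MSE is on the same scale as the response
c(oob_mse = oob_mse, oob_rmse = sqrt(oob_mse), oob_r2 = oob_r2)
```

Applied to the output above, the square root of 1.514654e-05 is roughly 0.0039, which puts the out-of-bag error on the same scale as a daily return.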
Now, let’s apply that random forest regression to all 1159 of our splits (recall that each split consists of 100 days of training data and one day of test data), so we can get an average RMSE. Warning: this will consume some resources on your machine and some time in your day.
To apply that model to our entire data set, we create a function that takes one split, passes it to our parsnip enabled model, and then uses the predict function to attempt to predict our assessment split. The function also allows us to specify the number of trees and the number of variables randomly sampled at each tree split, which is set with the mtry argument.
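ranger_rf_regress <- function(mtry, trees, split) {

  # 100 days of training data from this split
  analysis_set_rf <- analysis(split)

  model <-
    rand_forest(mode = "regression", mtry = mtry, trees = trees) %>%
    set_engine("ranger") %>%
    fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = analysis_set_rf)

  # one day of test data from this split
  assessment_set_rf <- assessment(split)

  # return the date, the actual return and the predicted return
  assessment_set_rf %>%
    select(date, daily_returns) %>%
    mutate(.pred = unlist(predict(model, new_data = assessment_set_rf))) %>%
    select(date, daily_returns, .pred)
}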
Now we want to pass it our object of 1159 splits, rolling_origin_spy_2013_2017$splits, and we want the function to iterate over each split. For that we turn to map_df() from the purrr package, which allows us to iterate over the data object and return a data frame. map_df() takes the data as an argument and our function as an argument.
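As a rough, self-contained sketch of that iteration pattern, here it is on thirty made-up days of data. The fit_one_split() helper is hypothetical, and a linear regression is swapped in for the random forest purely to keep the example quick:

```r
library(rsample)
library(parsnip)
library(purrr)
library(dplyr)

set.seed(1)
# Invented data: one factor and a return that depends on it.
toy <- tibble(
  date = as.Date("2020-01-01") + 0:29,
  MKT  = rnorm(30, sd = 0.01)
)
toy <- mutate(toy, daily_returns = 0.9 * MKT + rnorm(30, sd = 0.001))

toy_splits <- rolling_origin(toy, initial = 20, assess = 1, cumulative = FALSE)

# Hypothetical helper: fit on one split's analysis set, predict its assessment set.
fit_one_split <- function(split) {
  model <- linear_reg() %>%
    set_engine("lm") %>%
    fit(daily_returns ~ MKT, data = analysis(split))

  assessment(split) %>%
    select(date, daily_returns) %>%
    bind_cols(predict(model, new_data = assessment(split)))
}

# map_df() calls the helper once per split and row-binds the one-row results
toy_results <- map_df(toy_splits$splits, fit_one_split)
nrow(toy_results)  # one prediction per split: 30 - 20 - 1 + 1 = 10
```

With the objects from this post, the analogous call would presumably be ranger_results <- map_df(rolling_origin_spy_2013_2017$splits, ~ ranger_rf_regress(mtry = 3, trees = 100, split = .x)), where the mtry and trees values are assumed to match the specification we used on the first split.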
Here are the results. We now have 1159 predictions.
ranger_results %>% head()
# A tibble: 6 x 4
# Groups:   asset [1]
  asset date       daily_returns    .pred
1 SPY   2013-05-28       0.00597  0.00583
2 SPY   2013-05-29      -0.00652 -0.00403
3 SPY   2013-05-30       0.00369  0.00658
4 SPY   2013-05-31      -0.0145  -0.0114
5 SPY   2013-06-03       0.00549  0.00119
6 SPY   2013-06-04      -0.00482  0.00202
Notice how the date of each prediction is included since we included it in the select() call in our function. That will come in handy for charting later.
Now, we can use the rmse() function from yardstick to calculate the root mean-squared error for each of our predictions (our test sets had only one observation in them because we were testing on one day, so the RMSE is not a complex calculation here, but it would be the same code pattern if we had a larger test set). We can then find the average RMSE by calling summarise(avg_rmse = mean(.estimate)).
library(yardstick)

ranger_results %>%
  group_by(date) %>%
  rmse(daily_returns, .pred) %>%
  summarise(avg_rmse = mean(.estimate))
# A tibble: 1 x 1
  avg_rmse
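It is worth spelling out what that average means when every test set holds one observation. With a single observation, RMSE = sqrt((actual - predicted)^2) = |actual - predicted|, so the average of the per-day RMSEs is the mean absolute error, which is generally not the same number as a single RMSE pooled over all predictions. A small sketch with invented values:

```r
library(yardstick)
library(dplyr)

# Invented actual and predicted returns, one row per date.
toy_preds <- tibble(
  date          = as.Date("2020-01-01") + 0:2,
  daily_returns = c(0.010, -0.020, 0.005),
  .pred         = c(0.007, -0.010, 0.005)
)

per_day <- toy_preds %>%
  group_by(date) %>%
  rmse(daily_returns, .pred)   # one row per date; the value is in .estimate

# Per-day RMSE equals the absolute error when there is one observation per day
all.equal(per_day$.estimate, abs(toy_preds$daily_returns - toy_preds$.pred))

# Average of the per-day RMSEs, i.e. the mean absolute error
avg_rmse <- mean(per_day$.estimate)

# One RMSE pooled over all predictions, for comparison
pooled_rmse <- toy_preds %>%
  rmse(daily_returns, .pred) %>%
  pull(.estimate)

c(avg_rmse = avg_rmse, pooled_rmse = pooled_rmse)
```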
We have the average RMSE; let’s see if the RMSE was stable over time, first with ggplot and then with highcharter.
ranger_results %>%
  group_by(date) %>%
  rmse(daily_returns, .pred) %>%
  ggplot(aes(x = date, y = .estimate)) +
  geom_point(color = "cornflowerblue") +
  labs(y = "rmse", x = "", title = "RMSE over time via Ranger RF")

ranger_results %>%
  group_by(date) %>%
  rmse(daily_returns, .pred) %>%
  hchart(., hcaes(x = date, y = .estimate), type = "point") %>%
  hc_title(text = "RMSE over time via Ranger RF") %>%
  hc_yAxis(title = list(text = "RMSE"))
It looks like our RMSE is relatively stable, except for a period in mid to late 2015.
The amazing power of parsnip is how efficiently we can toggle to another random forest engine. Let’s suppose we wished to use the randomForest package instead of ranger. Here’s how we could reconfigure our previous work to use a different engine.
First, we’ll load up the randomForest package, because we need to load the package in order to use it as our engine. Then, we make one tweak to the original ranger_rf_regress function, by changing set_engine("ranger") to set_engine("randomForest"). That’s all, and we’re now running a random forest model using a different package.
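library(randomForest)

randomForest_rf_regress <- function(mtry, trees, split) {

  # 100 days of training data from this split
  analysis_set_rf <- analysis(split)

  model <-
    rand_forest(mode = "regression", mtry = mtry, trees = trees) %>%
    set_engine("randomForest") %>%
    fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = analysis_set_rf)

  # one day of test data from this split
  assessment_set_rf <- assessment(split)

  # return the date, the actual return and the predicted return
  assessment_set_rf %>%
    select(date, daily_returns) %>%
    mutate(.pred = unlist(predict(model, new_data = assessment_set_rf))) %>%
    select(date, daily_returns, .pred)
}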
We now have a new function called randomForest_rf_regress() that uses randomForest as the engine for our model and can use the same code scaffolding to run that model on our 1159 splits.
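# mtry and trees values are assumed to match the earlier ranger specification
randomForest_results <-
  map_df(rolling_origin_spy_2013_2017$splits,
         ~ randomForest_rf_regress(mtry = 3, trees = 100, split = .x))

randomForest_results %>%
  head()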
# A tibble: 6 x 4
# Groups:   asset [1]
  asset date       daily_returns    .pred
1 SPY   2013-05-28       0.00597  0.00609
2 SPY   2013-05-29      -0.00652 -0.00438
3 SPY   2013-05-30       0.00369  0.00597
4 SPY   2013-05-31      -0.0145  -0.00987
5 SPY   2013-06-03       0.00549  0.00134
6 SPY   2013-06-04      -0.00482  0.00118
And we can use the same yardstick code to extract the average RMSE.
randomForest_results %>%
  group_by(date) %>%
  rmse(daily_returns, .pred) %>%
  summarise(avg_rmse = mean(.estimate))
# A tibble: 1 x 1
  avg_rmse
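Since the only difference between our two functions is the string passed to set_engine(), the engine could also be made an argument. This is a sketch on simulated data, not the code used for the results above; the fit_rf_with() helper and the data are hypothetical, and it needs both ranger and randomForest installed:

```r
library(parsnip)
library(dplyr)
library(ranger)
library(randomForest)

set.seed(7)
sim <- tibble(f1 = rnorm(150), f2 = rnorm(150))
sim <- mutate(sim, y = f1 + 0.5 * f2 + rnorm(150, sd = 0.2))

# Hypothetical helper: same specification, engine chosen by name.
fit_rf_with <- function(engine, data) {
  rand_forest(mode = "regression", mtry = 2, trees = 100) %>%
    set_engine(engine) %>%
    fit(y ~ f1 + f2, data = data)
}

engines <- c("ranger", "randomForest")
fits <- lapply(engines, fit_rf_with, data = sim)
names(fits) <- engines

# Each fit wraps a different engine object, but predict() is called the same way
new_obs <- tibble(f1 = 0.5, f2 = -0.2)
preds <- sapply(fits, function(f) predict(f, new_data = new_obs)$.pred)
preds
```

The two predictions will not be identical, since the engines grow their forests differently, but the surrounding code does not need to know which one is in use.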
There’s a lot more to explore in the parsnip package and the tidymodels collection. See you next time when we’ll get into some classification!
Wait: shameless book plug for those who read to the end: if you like this sort of thing, check out my new book Reproducible Finance with R!