There are many ways to look at the relationships between different aspects of racing data. There are also many assumptions, rules of thumb and outright myths regarding the veracity of some of those pieces of data.
One way to assess the strength of a relationship between two variables is to use a correlation coefficient. This article will not go deeply into the theory of correlation coefficients, as there are many online resources for those interested in reading more. Instead, it looks at the relationship, in terms of correlation coefficients, between finishing position and starting stall. Examples are provided using the R statistical programming language, giving readers a basis for extending the code to their own datasets.
When viewing the results of a correlation test, the coefficient will always lie between -1 and 1. Results lower than -0.5 may be said to show a stronger negative correlation, those greater than 0.5 a stronger positive correlation, and results near 0 little or no linear relationship. Unfortunately in racing, due to the interplay of so many variables and somewhat noisy data, it is difficult to find correlations at either end of the scale.
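To make the coefficient itself less abstract, here is a minimal sketch of how Pearson's r is built up from covariance and standard deviations. The data is simulated purely for illustration (the `draw` and `pos` vectors are hypothetical, not FormBet records), and the sketch simply checks that the hand calculation agrees with R's built-in `cor` function.

```r
# Simulated data only: these are hypothetical values, not FormBet records.
set.seed(42)
n    <- 1000
draw <- sample(1:12, n, replace = TRUE)   # hypothetical stall numbers
# Hypothetical finishing positions, loosely related to the draw and
# clamped to the range 1 to 12
pos  <- pmin(pmax(round(0.3 * draw + rnorm(n, mean = 5, sd = 3)), 1), 12)

# Pearson's r is the covariance scaled by both standard deviations
manualR <- cov(draw, pos) / (sd(draw) * sd(pos))

# The built-in function should give the same figure
builtinR <- cor(draw, pos)

print(c(manual = manualR, builtin = builtinR))
```

Because the covariance is divided by both standard deviations, the result cannot fall outside -1 to 1, whatever the units of the two variables.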
To begin, we'll pull some data from the FormBet historic database covering January 1st 2014 to January 30th 2016 and load it into an R variable called fbHistoric. We then run a correlation test with the built-in R cor.test function, using finishing position and Betfair starting price. The output from this can be used as a baseline for future analysis.
    # Load the data
    fbHistoric <- readRDS("fbHistoricResults.rds")

    # Use dplyr to convert to a tibble (as_tibble supersedes the deprecated tbl_df)
    fbHistoric <- dplyr::as_tibble(fbHistoric)

    # Set relevant fields to numeric
    fbHistoric$POS <- as.numeric(fbHistoric$POS)
    fbHistoric$BFOddsWin <- as.numeric(fbHistoric$BFOddsWin)

    # Use cor.test for correlation coefficients
    cor.test(fbHistoric$POS, fbHistoric$BFOddsWin)

    # Output of correlation test

        Pearson's product-moment correlation

    data:  fbHistoric$POS and fbHistoric$BFOddsWin
    t = 154.02, df = 216450, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.3104851 0.3180784
    sample estimates:
          cor
    0.3142868
The test results here show 216,450 degrees of freedom (that is, 216,452 paired records), a correlation coefficient of 0.31, a 95% confidence interval of 0.310 to 0.318 and a t statistic of 154.02. In short, the 95% confidence interval is very narrow, which means the estimate is precise, and the t statistic is large, which together with the tiny p-value means the true correlation is very unlikely to be zero. The correlation coefficient itself is not particularly high or low, but as noted earlier, this is now our baseline. Any variable with a correlation higher than 0.31, or indeed below -0.31, has a stronger correlation with finishing position than the starting price does.
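As a rough illustration of where those figures live in R, the sketch below runs cor.test on simulated data (the `x` and `y` vectors are hypothetical) and pulls each number out of the returned object. It also shows that the reported df is the number of paired observations minus two, and that the t statistic can be recovered from the coefficient and the df.

```r
# Simulated data only: x and y are hypothetical stand-ins for two racing variables.
set.seed(1)
x <- rnorm(500)
y <- 0.3 * x + rnorm(500)

res <- cor.test(x, y)

r     <- unname(res$estimate)     # the "cor" figure
ci    <- as.numeric(res$conf.int) # lower and upper 95% bounds
tStat <- unname(res$statistic)    # the "t" figure
df    <- unname(res$parameter)    # the "df" figure

# For Pearson's test, df = number of pairs - 2
nPairs <- df + 2

# The t statistic follows directly from r and df
tManual <- r * sqrt(df) / sqrt(1 - r^2)

print(c(r = r, lower = ci[1], upper = ci[2], t = tStat, tManual = tManual, n = nPairs))
```

Storing the result in a variable, rather than only printing it, makes it easy to collect coefficients from many tests into one table for comparison.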
As the title of this article suggests, the aim is to look at the correlation between finishing position and starting stall. To begin, we look at the output for the horse's stall draw and finishing position across all British and Irish flat and all-weather courses.
    # Subset the data for just flat and all weather races
    flatHistoric <- dplyr::filter(fbHistoric,
                                  grepl("FLT", RTA) | grepl("AWFP", RTA) |
                                  grepl("AWFF", RTA) | grepl("AWFT", RTA))

    # Set the relevant columns to numeric
    flatHistoric$POS <- as.numeric(flatHistoric$POS)
    flatHistoric$DRAW <- as.numeric(flatHistoric$DRAW)

    # Perform the correlation test
    cor.test(flatHistoric$POS, flatHistoric$DRAW)

    # Output of correlation test

        Pearson's product-moment correlation

    data:  flatHistoric$POS and flatHistoric$DRAW
    t = 100.63, df = 134540, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.2595998 0.2695385
    sample estimates:
          cor
    0.2645761
The key number is the cor figure, which in this case is 0.2646. Therefore, the correlation between starting stall and finishing position is weaker than that between starting price and finishing position.
In the above example, we've looked at all courses and all distances. It is also worth investigating whether there are stronger or weaker correlations at some courses and specific distances.
Looking at the All Weather races at Lingfield, the overall figures are:
    # Subset for Lingfield only
    lingfieldHistoric <- dplyr::filter(fbHistoric, grepl("Lingfield", COURSE))
    lingfieldHistoric <- dplyr::filter(lingfieldHistoric, grepl("AWFP", RTA))

    # Ensure DRAW is numeric (earlier it was only converted on flatHistoric)
    lingfieldHistoric$DRAW <- as.numeric(lingfieldHistoric$DRAW)

    # Lingfield all distances correlation test
    cor.test(lingfieldHistoric$POS, lingfieldHistoric$DRAW)

    # Output of correlation test

        Pearson's product-moment correlation

    data:  lingfieldHistoric$POS and lingfieldHistoric$DRAW
    t = 22.165, df = 9917, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.1984293 0.2359320
    sample estimates:
          cor
    0.2172609
Therefore, the correlation between starting stall and finishing position is weaker at Lingfield (0.22) than across all courses (0.26). However, the strength of the correlation varies by distance.
    # 7f at Lingfield
    lingfieldHistoric7f <- dplyr::filter(lingfieldHistoric, DISTANCE == 7.0)

    # Lingfield 7f correlation test
    cor.test(lingfieldHistoric7f$POS, lingfieldHistoric7f$DRAW)

    # Output of correlation test

        Pearson's product-moment correlation

    data:  lingfieldHistoric7f$POS and lingfieldHistoric7f$DRAW
    t = 10.712, df = 1814, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.2001492 0.2866842
    sample estimates:
          cor
    0.2439021
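After testing all distances, it was found that the correlation between starting stall and finishing position at Lingfield is strongest over 7 furlongs, at 0.24. That is higher than the Lingfield all-distances figure of 0.22, though still below the all-courses figure of 0.26.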
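Testing every distance by hand is repetitive, so one possible approach is to split the data by distance and run the same test on each piece. The sketch below uses simulated data in a hypothetical `races` data frame; the column names POS, DRAW and DISTANCE mirror those used above, but the values are invented and the results say nothing about any real course.

```r
# Simulated data only: 'races' is a hypothetical stand-in for one course's records.
set.seed(7)
races <- data.frame(
  DISTANCE = sample(c(5, 6, 7, 8, 10), 2000, replace = TRUE),
  DRAW     = sample(1:12, 2000, replace = TRUE)
)
# Hypothetical finishing positions, clamped to the range 1 to 12
races$POS <- pmin(pmax(round(0.25 * races$DRAW + rnorm(2000, mean = 5, sd = 3)), 1), 12)

# Run cor.test once per distance and keep the key figures
byDistance <- lapply(split(races, races$DISTANCE), function(d) {
  res <- cor.test(d$POS, d$DRAW)
  data.frame(
    DISTANCE = d$DISTANCE[1],
    n        = nrow(d),
    cor      = unname(res$estimate),
    lower    = res$conf.int[1],
    upper    = res$conf.int[2]
  )
})
byDistance <- do.call(rbind, byDistance)

# Strongest correlation first
byDistance <- byDistance[order(-byDistance$cor), ]
print(byDistance)
```

Keeping the sample size and confidence bounds alongside each coefficient matters here, since a high figure from a small group of runs deserves less weight than the same figure from a large one.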
Chester is an obvious candidate for this test: a well-known tight circuit, with a strongly perceived advantage for some starting stalls over some distances. For Chester overall:
    # Subset for Chester only
    chesterHistoric <- dplyr::filter(fbHistoric, grepl("Chester", COURSE))
    chesterHistoric <- dplyr::filter(chesterHistoric, grepl("FLT", RTA))

    # Ensure DRAW is numeric (earlier it was only converted on flatHistoric)
    chesterHistoric$DRAW <- as.numeric(chesterHistoric$DRAW)

    # Chester all distances correlation test
    cor.test(chesterHistoric$POS, chesterHistoric$DRAW)

    # Output of correlation test

        Pearson's product-moment correlation

    data:  chesterHistoric$POS and chesterHistoric$DRAW
    t = 16.649, df = 1711, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.3318963 0.4134386
    sample estimates:
          cor
    0.3733885
Now we have some interesting data to look at. The correlation coefficient for stalls and finishing position at Chester overall is higher than across all courses in Britain and Ireland, and also higher than the correlation between starting price and finishing position.
Some distances at Chester have an even stronger correlation for these two variables.
    # Chester 5 furlongs only
    chesterHistoric5f <- dplyr::filter(chesterHistoric, DISTANCE == 5.1 | DISTANCE == 5.5)

    # Chester 5f correlation test
    cor.test(chesterHistoric5f$POS, chesterHistoric5f$DRAW)

    # Output of correlation test

        Pearson's product-moment correlation

    data:  chesterHistoric5f$POS and chesterHistoric5f$DRAW
    t = 9.0162, df = 301, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.3675528 0.5454687
    sample estimates:
          cor
    0.461133
The correlation coefficient for starting stall and finishing position at Chester, over 5 furlongs, is approaching the 0.50 mark. However, before becoming too excited, it's worth looking at the rest of the test output. The sample is now much smaller, at 303 runs (301 degrees of freedom), the t statistic is reasonable but not particularly high, and the 95% confidence interval has widened to roughly 0.37 to 0.55. Even so, it can now be said that paying attention to the starting stalls at Chester, over 5 furlongs, is probably worthwhile. It is also worth pointing out that this correlation test does not state which starting stalls are better. That analysis is best left to another article.
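To see why a smaller sample widens the interval, the confidence bounds can be rebuilt from the coefficient and the sample size using the Fisher z-transformation, which is the method cor.test uses for Pearson intervals. The sketch below takes the Chester 5 furlong figures reported above; `fisherCI` is a hypothetical helper written for this illustration, not part of any package.

```r
# Figures taken from the Chester 5f output above
r  <- 0.461133   # reported correlation coefficient
df <- 301        # reported degrees of freedom
n  <- df + 2     # number of paired observations

# Hypothetical helper: confidence interval for Pearson's r via Fisher's z
fisherCI <- function(r, n, level = 0.95) {
  z    <- atanh(r)                     # transform r to the z scale
  se   <- 1 / sqrt(n - 3)              # standard error shrinks as n grows
  crit <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  tanh(z + c(-1, 1) * crit * se)       # transform the bounds back to the r scale
}

# Should land very close to the reported 0.3675528 and 0.5454687
ciChester <- fisherCI(r, n)

# The same coefficient with the all-courses sample size (df 134540 + 2)
ciSameRLargerN <- fisherCI(r, 134542)

print(ciChester)
print(ciSameRLargerN)
```

The only term that depends on the sample is the standard error, 1 / sqrt(n - 3), so going from hundreds of runs to over a hundred thousand shrinks the interval dramatically even if the coefficient stays the same.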
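The correlation coefficient test can be extended to many different variables, as long as they are all numeric. For example, if one wished to examine the correlation of jockeys with finishing position, each jockey would need to be represented by a numeric value. An arbitrary ID code carries no order, so a coefficient computed on it would not be meaningful; a numeric measure such as each jockey's strike rate would be a more suitable stand-in.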
A variable other than finishing position could also be used. It could be interesting to substitute points return. For example, is there a correlation between jockeys and the points returned from a 1 unit bet?
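As a minimal sketch of the points return idea, the example below derives a return column from a 1 unit win bet and correlates it with the stall draw. The `bets` data frame and its values are hypothetical, the RETURN column name is invented for this illustration, and the calculation assumes BFOddsWin holds decimal odds and ignores commission.

```r
# Hypothetical records: values are invented for illustration only.
bets <- data.frame(
  POS       = c(1, 4, 2, 1, 7, 3, 1, 5),
  DRAW      = c(2, 9, 5, 1, 11, 6, 3, 8),
  BFOddsWin = c(3.5, 12.0, 6.0, 2.8, 21.0, 8.0, 5.0, 15.0)
)

# Assumption: decimal odds, 1 unit win stake, no commission.
# A winner returns (odds - 1) units of profit; a loser costs the 1 unit stake.
bets$RETURN <- ifelse(bets$POS == 1, bets$BFOddsWin - 1, -1)

# Overall profit or loss across all bets
totalReturn <- sum(bets$RETURN)

# Correlation between stall draw and points return
res <- cor.test(bets$DRAW, bets$RETURN)

print(totalReturn)
print(res$estimate)
```

With only eight invented rows the coefficient here means nothing in itself; the point is the shape of the calculation, which could be applied to a real dataset in the same way.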
Future articles may explore some of these factors further.
FormBet subscribers are able to download historic data sets from the FormBet Race Database (FRED) and perform this type of data analysis themselves.