Direct demand model: Validation

Introduction

This note provides a technical review of the accuracy of the direct demand model.

To assess the predictive accuracy of the direct demand model, three factors should be borne in mind:

Pedestrian demand is affected by many factors, and data do not exist for some of them. For example, there is no comprehensive, up-to-date dataset of the footpath network in Queensland, so it is not possible to incorporate footpath extent, or indeed quality, in a model at state level.
The data for the predictors that do exist are of varying quality. For example, there is no state-level dataset of as-built land uses. In lieu of such data, the crowd-sourced OpenStreetMap dataset was used. This dataset depends on users mapping features such as shopping centres and hotels, and assigning them the tags by which they are extracted for the model.
The dependent variable (i.e. the pedestrian count) used in the regression is usually a single-day count, and the counts were obtained at different times of year and in different years. The day-to-day variability of pedestrian demand is such that the true average daily count may differ substantially from the measured value, potentially being twice or half of it.

As such, both the dependent variable and predictors contain large errors. This compounds the challenges of estimating models that are both statistically significant and practically useful.

Finally, the model is entirely dependent on the quality and extent of the data. There is very little data from sites in extremely busy locations such as the Brisbane CBD, and in very quiet locations such as rural areas. When the model is used to extrapolate to these areas the results may be counterintuitive or wildly inaccurate.

Prediction error

In this section the ability of the model to fit the observed data is assessed. That is, how well does the model predict demand at the sites to which it was fitted? It is recognised that this approach is biased towards a favourable result, and that ideally a fraction of the data would be “held out” from the regression to provide an independent validation set. However, given the limited data and large variability, it was determined most appropriate to retain all data in the estimation.
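For reference, the hold-out approach mentioned above could be sketched as follows. This is a minimal illustration only and was not part of the analysis reported here; the function name `split_holdout`, the 20% fraction, the random seed and the integer site identifiers are all assumptions made for the example.

```python
import random


def split_holdout(site_ids, holdout_fraction=0.2, seed=0):
    """Randomly partition site identifiers into estimation and validation sets."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = list(site_ids)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    validation = shuffled[:n_holdout]   # never seen by the regression
    estimation = shuffled[n_holdout:]   # used to fit the model
    return estimation, validation


# Hypothetical identifiers 0..424 standing in for the count sites.
estimation, validation = split_holdout(range(425), holdout_fraction=0.2)
print(len(estimation), len(validation))
```

The model would then be fitted on the estimation sites only, and the error measures computed on the validation sites. The cost is that fewer sites inform the coefficients, which is the trade-off noted above.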

Residuals

The residual is the difference between the dependent variable (pedestrian count) and the modelled (predicted) count. The residual distribution by location is shown below; there are many sites with modest absolute residuals (under 100 pedestrians/day) but at the tails of the distribution some extreme residuals where the predictions are in error by 1,000 pedestrians/day or greater.
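The residual calculation described above can be sketched as follows. The sign convention (observed minus predicted) is an assumption, and the counts are made-up values for five hypothetical sites rather than survey data.

```python
def residuals(observed, predicted):
    """Return observed minus predicted for each site, in input order."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return [obs - pred for obs, pred in zip(observed, predicted)]


# Made-up daily pedestrian counts for five hypothetical sites.
observed = [120, 850, 40, 3200, 610]
predicted = [95, 1010, 75, 1900, 640]

res = residuals(observed, predicted)

# Count sites in the two bands discussed in the text.
modest = sum(1 for r in res if abs(r) < 100)      # under 100 pedestrians/day
extreme = sum(1 for r in res if abs(r) >= 1000)   # 1,000 pedestrians/day or more
print(res, modest, extreme)
```

Under this convention a positive residual means the model underpredicted the count at that site.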

The cumulative distribution of the absolute error is shown below. Half of the observations have an absolute error of less than 94.1 pedestrians per day, and 90% have an absolute error of less than 490 pedestrians per day.
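The cumulative distribution can be read as "what share of sites has an absolute error below a given threshold". A minimal sketch of that calculation is given here; the function name `share_below` and the ten error values are invented for illustration and are not the model's results.

```python
def share_below(abs_errors, threshold):
    """Fraction of sites whose absolute error is strictly below the threshold."""
    if not abs_errors:
        raise ValueError("abs_errors must not be empty")
    return sum(1 for e in abs_errors if e < threshold) / len(abs_errors)


# Made-up absolute errors (pedestrians/day) for ten hypothetical sites.
abs_errors = [12, 30, 45, 80, 90, 110, 150, 300, 420, 1300]

share_under_100 = share_below(abs_errors, 100)
share_under_500 = share_below(abs_errors, 500)
print(share_under_100, share_under_500)
```

Evaluating the function over a range of thresholds traces out the empirical cumulative distribution curve.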

Standardised residuals

The residuals are standardised by dividing by the standard deviation and are shown below. That relatively few observations have standardised residuals outside \(\pm 2\) reflects the wide variation in the underlying counts.
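The standardisation step can be sketched as below. The text does not state which standard deviation is used; this sketch assumes a single sample standard deviation taken across all residuals, which may differ from the approach used in the analysis (for example, observation-specific standard errors). The residual values are made up.

```python
import statistics


def standardise(residuals):
    """Divide each residual by the sample standard deviation of all residuals."""
    sd = statistics.stdev(residuals)  # requires at least two values
    if sd == 0:
        raise ValueError("residuals have zero spread and cannot be standardised")
    return [r / sd for r in residuals]


# Made-up residuals (pedestrians/day) for five hypothetical sites.
example_residuals = [25, -160, -35, 1300, -30]

z = standardise(example_residuals)
# Sites whose standardised residual lies outside +/- 2.
outside = [value for value in z if abs(value) > 2]
print(z, len(outside))
```

Because one extreme residual inflates the standard deviation, the remaining sites are pulled towards zero, which is the effect described above.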

There is no clear bias in the model towards over- or underprediction, as illustrated below by the density distribution of the standardised residuals.

Rank order

For some purposes it may not be necessary to have an accurate forecast of demand. Instead, it may be sufficient to have confidence that site A has more demand than site B, for example where projects are being prioritised for funding. The differences in rank between the observed and modelled counts are shown below. At the tails some sites are in significant error (e.g. a number of sites on the Gold Coast are forecast as having much higher demand than was observed, and hence are ranked much higher). The mean absolute rank error is 73 ranks and the median is 54 ranks, over 425 sites.
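The rank comparison can be sketched as follows. The ranking convention (rank 1 for the busiest site, ties broken by input order) and the sign of the rank error (modelled rank minus observed rank) are assumptions, and the four sites and their counts are hypothetical.

```python
import statistics


def ranks(counts):
    """Rank sites from 1 (highest count) downward; ties keep input order."""
    order = sorted(counts, key=counts.get, reverse=True)
    return {site: position for position, site in enumerate(order, start=1)}


def rank_errors(observed, predicted):
    """Modelled rank minus observed rank for every site."""
    obs_rank = ranks(observed)
    pred_rank = ranks(predicted)
    return {site: pred_rank[site] - obs_rank[site] for site in observed}


# Made-up daily counts for four hypothetical sites.
observed = {"A": 500, "B": 300, "C": 100, "D": 50}
predicted = {"A": 200, "B": 400, "C": 450, "D": 40}

errors = rank_errors(observed, predicted)
abs_errors = [abs(e) for e in errors.values()]
mean_abs = statistics.mean(abs_errors)
median_abs = statistics.median(abs_errors)
print(errors, mean_abs, median_abs)
```

With this sign convention a negative rank error means the model places the site higher in the priority order than the observed count would.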

Conclusion

Whether the model is good enough depends on its intended use; the test is fitness for purpose. While pedestrian demand is clearly associated with a number of factors (as demonstrated by the statistically significant model coefficients and the improvement in log-likelihood compared to a constant-only model), the model is at best only a modest predictor of pedestrian demand (as demonstrated by the residuals).

The key findings from this analysis are that:

the forecasts may be fair in suburban locations, but not in extreme locations such as central business districts
much better data is required, especially multiday pedestrian counts, to be able to significantly improve upon these models.