Walking in Wellington - insights from modelling

data science
walkability
transport
new zealand
network analysis
modelling
statistics
bayesian
Part 4: A Bayesian model of walkability.
Published

December 8, 2020

Technical details

| Technical challenge | Covered in |
|---|---|
| Geoprocessing to get accessibility within a suburb | Jupyter Notebook |
| Extracting accessibility distributions by suburb | Blog Post |
| Build a Bayesian model for an individual suburb | Blog Post |
| Model average (\(\mu\)) and heterogeneity (\(\sigma\)) on two levels: (1) per suburb and (2) across all suburbs | Blog Post |
| Stan models for Bayesian analysis | Code files |
| Use average and heterogeneity, relative to the Wellington average, to classify the accessibility characteristic for a given suburb | Blog Post |

Datasets

| Dataset | Format | Link |
|---|---|---|
| WCC playground locations | .zip | Wellington City Council |
| WCC suburb boundaries | .gdb | Wellington City Council |
| StatsNZ Area Unit boundaries | .gdb | Stats NZ |
| StatsNZ 2019 meshblock boundaries | .gdb | Stats NZ |
| LINZ residential polygons | .gdb | LINZ |
| Wellington street network without elevation | - | OpenStreetMap via osmnx |
| Wellington street network with elevation | - | OpenStreetMap + Google Elevation API via osmnx |

Accessibility by Wellington suburb

In the previous posts (and series), we've only looked at accessibility across all of Wellington. Here, we show that the accessibility data can be filtered by a spatial boundary of our choice, thanks to some clever geoprocessing enabled by geopandas.

The steps are pretty simple:

- Extract node coordinates and accessibility from the pandana network.
- Convert the node coordinates to a GeoSeries.
- Tag nodes within a suburb boundary using the `contains` spatial join operation.

```python
import pandas as pd
import geopandas

# Extract node coordinates and accessibility from the pandana network
orig_nodes = network_hills.nodes_df
df_joined = pd.merge(orig_nodes.reset_index(),
                     total_hills_1.reset_index(),
                     how='inner')
df_joined.columns = ['node_id', 'lon', 'lat', 'accessibility']

# Convert lat and lon to a geoseries (dp is this project's helper module)
df_joined_coords = dp.coords_df_to_geopandas_points(df_joined,
                                                    crs={'init': 'epsg:4167'})

# Tag each node with its suburb via a spatial join
playground_df = geopandas.sjoin(wcc_suburbs,
                                df_joined_coords,
                                op='contains').drop('index_right', axis=1)
```

Visualising accessibility within suburb boundaries

With a dataframe containing both the accessibility information and the suburb, we can filter and plot the accessibility for specific suburbs. We only need the accessibility data for the suburb and the boundary information to overlay the two datasets.

```python
karori_accessibility = playground_df[playground_df['suburb'] == 'Karori']
karori_boundary = wcc_suburbs[wcc_suburbs['suburb'] == 'Karori']
```
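As a minimal plotting sketch (assuming matplotlib; the `lon`/`lat` columns come from the joined frame built above, and the styling is illustrative):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))

# Suburb outline
karori_boundary.plot(ax=ax, facecolor='none', edgecolor='black')

# Node-level accessibility, coloured by total travel time
pts = ax.scatter(karori_accessibility['lon'], karori_accessibility['lat'],
                 c=karori_accessibility['accessibility'], s=5, cmap='viridis')
fig.colorbar(pts, ax=ax, label='accessibility (minutes)')
```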

Extracting accessibility distributions by suburb

The same filters used to plot the accessibility heatmap within a suburb can be used to extract the accessibility values alone without any spatial information. This raw accessibility data is the basis for our statistical model.
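For instance, a short sketch using the `playground_df` frame built earlier:

```python
# Raw accessibility values per suburb, with the spatial information dropped
suburb_values = {suburb: group['accessibility'].values
                 for suburb, group in playground_df.groupby('suburb')}
karori_values = suburb_values['Karori']
```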

Modelling of accessibility in a single suburb

Once we have the accessibility values by suburb, we can start thinking about how to model them. From the previous figure, the suburban accessibility distributions have some distinctive features:

- values are all positive
- some are heavily skewed
- most have a single mode
- some look like a normal distribution

```python
# Compile (or load cached) Stan models via the project's helper module
uni_norm_model = su.load_or_generate_stan_model('stan', 'univariate_normal')
lower_trunc_norm_model = su.load_or_generate_stan_model('stan', 'lower_truncated_univariate_normal')
```

Normal Model

Given the unimodal nature of the majority of the accessibility distributions, a normal model is a natural starting point. We won't go into the details here, but it's easy to write a Stan model that infers the \(\mu\) and \(\sigma\) of a univariate normal distribution; a minimal sketch follows, with results from the Stan fit after it. I've also computed the posterior predictive (the `y_pred` variable).
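Here is a hedged sketch of such a model, written as a Stan program embedded in a PyStan 2 call. The actual models live in the project's `stan/` code files, so the program text and the `karori_values` input below are illustrative assumptions:

```python
import pystan

normal_code = """
data {
  int<lower=1> N;
  vector[N] y;               // accessibility values for one suburb
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
generated quantities {
  real y_pred = normal_rng(mu, sigma);   // posterior predictive draw
}
"""

sm = pystan.StanModel(model_code=normal_code)
fit = sm.sampling(data={'N': len(karori_values), 'y': karori_values},
                  iter=2000, chains=4)
print(fit)
```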

```
Inference for Stan model: anon_model_cc3fc1beb21cbbe7b94ad66105c98210.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu      22.99  6.3e-3   0.35   22.3  22.75  22.99  23.23  23.67   3113    1.0
sigma   14.28  4.3e-3   0.26  13.79   14.1  14.27  14.45  14.79   3630    1.0
y_pred  22.71    0.23   14.3  -5.62  12.89  22.82  32.59  50.67   4000    1.0
lp__    -5086    0.02    1.0  -5089  -5087  -5086  -5085  -5085   1809    1.0

Samples were drawn using NUTS at Mon Mar 18 15:07:35 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

Truncated Normal model for better fit

The posterior predictive is a useful diagnostic as it generates synthetic data from the fitted model. Checking this generated data against the original data is a very useful way to assess the suitability of a model.

The normal approximation for Karori is shown in the left-hand figure below, while the right-hand figure shows the raw accessibility values for Karori. It's quite clear that the normal approximation fails because it generates negative accessibilities.
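A check like this takes only a few lines; a minimal sketch, assuming the `fit` and `karori_values` objects from above:

```python
import matplotlib.pyplot as plt

# Overlay posterior predictive draws on the observed data
y_pred = fit.extract()['y_pred']
plt.hist(karori_values, bins=40, density=True, alpha=0.5, label='observed')
plt.hist(y_pred, bins=40, density=True, alpha=0.5, label='posterior predictive')
plt.xlabel('accessibility (minutes)')
plt.legend()
```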

A truncated Normal distribution can easily address this issue. Implementing the truncated Normal variation in Stan is easy: we just need to specify the appropriate bounds (lower, upper or both) for the Normal distribution. The only aspect that needed some research was generating posterior predictive samples, but helpfully others have encountered this before and there is a useful discussion on the Stan forums.
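A hedged sketch of the lower-truncated variant, using the inverse-CDF trick from that forum discussion to generate truncated posterior predictive draws (again illustrative, not the project's exact Stan file):

```python
import pystan

trunc_normal_code = """
data {
  int<lower=1> N;
  vector<lower=0>[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[0,];     // lower-truncated at zero
}
generated quantities {
  real y_pred;
  {
    // Inverse-CDF sampling restricted to the mass above zero
    real p_lb = normal_cdf(0, mu, sigma);
    real u = uniform_rng(p_lb, 1);
    y_pred = mu + sigma * inv_Phi(u);
  }
}
"""

trunc_sm = pystan.StanModel(model_code=trunc_normal_code)
```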

The figure below shows the posterior predictive samples generated from the Normal and truncated Normal models respectively. The truncated Normal can be judged a good fit, as it matches the distinctive features of the accessibility data.

*Posterior predictive samples: Normal model (left) and truncated Normal model (right).*

A collective Bayesian model

Modelling every suburb on its own is doable, but more useful is a succinct model that fits each suburb while also modelling summary statistics across all suburbs. This type of model is commonly known as a hierarchical model: the hierarchy comes from the intuitive ordering of the models. Here, we have an overall city-level model and many suburb-level models.

Going through the details of a hierarchical model is beyond the scope of this post, but there is a great introduction with Python and PyStan by Chris Fonnesbeck. Note that 'multi-level model' is synonymous with 'hierarchical model'.

We get two types of posterior distributions from the hierarchical model (sketched below):

- \(\mu_{s}\) and \(\sigma_{s}\) for each suburb
- a single \(\mu_{c}\) and \(\sigma_{c}\) across all suburbs (i.e. the city level)
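For concreteness, here is a hedged sketch of one possible parameterisation (the names `mu_c`, `sigma_c`, `mu_s` and `sigma_s` are illustrative, not the project's exact Stan file):

```python
hierarchical_code = """
data {
  int<lower=1> N;                    // total number of nodes
  int<lower=1> S;                    // number of suburbs
  int<lower=1, upper=S> suburb[N];   // suburb index for each node
  vector<lower=0>[N] y;              // accessibility values
}
parameters {
  real mu_c;                         // city-level mean
  real<lower=0> sigma_c;             // city-level heterogeneity
  vector[S] mu_s;                    // suburb-level means
  vector<lower=0>[S] sigma_s;        // suburb-level heterogeneity
}
model {
  mu_s ~ normal(mu_c, sigma_c);      // suburb means drawn from the city level
  for (n in 1:N)
    y[n] ~ normal(mu_s[suburb[n]], sigma_s[suburb[n]]) T[0,];
}
"""
```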

It is worth noting that only the top 80% of suburbs (by node count) have been included in the model. This was mainly to get the model to run reasonably quickly.

Results for \(\mu\) and \(\sigma\)

The posterior distributions output by the hierarchical model can be visualised in a convenient plot known as a forest plot. The mean values of the suburb-level posterior distributions are plotted as circles, with the relative value used to order and colour-code them. The 95% credible interval of each suburban \(\mu\) or \(\sigma\) is also plotted as a horizontal bar, though these are usually smaller than the circle itself. The grey band represents the 95% credible interval for the city-level average.
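A minimal matplotlib sketch of that layout (the input arrays are assumptions; a plotting library's forest-plot helper would do the same job):

```python
import numpy as np
import matplotlib.pyplot as plt

def forest_plot(names, means, ci_low, ci_high, city_ci):
    """One row per suburb: posterior mean as a circle, 95% CI as a
    horizontal bar, and a grey band for the city-level 95% interval."""
    order = np.argsort(means)
    y = np.arange(len(names))
    fig, ax = plt.subplots(figsize=(6, 10))
    ax.axvspan(city_ci[0], city_ci[1], color='grey', alpha=0.3)
    ax.hlines(y, np.asarray(ci_low)[order], np.asarray(ci_high)[order])
    ax.scatter(np.asarray(means)[order], y,
               c=np.asarray(means)[order], cmap='viridis', zorder=3)
    ax.set_yticks(y)
    ax.set_yticklabels(np.asarray(names)[order])
    return ax
```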

*Forest plots: suburb-level \(\mu\) (left) and \(\sigma\) (right).*

A couple of general points stand out clearly:

- Wellington city averages 12 - 16 minutes total travel time to a playground.
- However, the variability in accessibility within suburbs is quite high.

Quadrant visualisation

To pick out suburban character (in terms of playground accessibility), we need a visualisation that (1) focuses on the suburbs that fall outside the ‘average band’ and, (2) can consider \(\mu\) and \(\sigma\) at the same time. A simple 2D plot that can satisfy these criteria is the quadrant matrix.

We can modify the standard layout slightly to get both the quadrants and the average bands by (1) normalising the suburban means (for both \(\mu\) and \(\sigma\)) by the city means and (2) plotting the city-level credible intervals as a central cross. The resulting quadrants now represent combinations of suburban \(\mu\) and \(\sigma\) relative to the city (a minimal sketch follows).
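A hedged sketch of this construction, assuming a dataframe of suburb-level posterior means and city-level summaries (all names here are illustrative):

```python
import matplotlib.pyplot as plt

def quadrant_plot(suburb_df, mu_city, sigma_city, mu_city_ci, sigma_city_ci):
    """suburb_df holds one row per suburb with 'mu' and 'sigma' posterior
    means; the city-level credible intervals form the central cross."""
    mu_norm = suburb_df['mu'] / mu_city
    sigma_norm = suburb_df['sigma'] / sigma_city
    fig, ax = plt.subplots()
    # Central cross: normalised city-level 95% credible intervals
    ax.axvspan(mu_city_ci[0] / mu_city, mu_city_ci[1] / mu_city,
               color='grey', alpha=0.3)
    ax.axhspan(sigma_city_ci[0] / sigma_city, sigma_city_ci[1] / sigma_city,
               color='grey', alpha=0.3)
    ax.scatter(mu_norm, sigma_norm)
    ax.set_xlabel(r'$\mu_{norm}$')
    ax.set_ylabel(r'$\sigma_{norm}$')
    return ax
```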

The quadrants represent accessibility characteristics with a simple interpretation. (Note that lower \(\mu\) means shorter travel times, i.e. better accessibility.)

| | High \(\mu_{norm}\) | Low \(\mu_{norm}\) |
|---|---|---|
| Low \(\sigma_{norm}\) | Consistent but poor accessibility | Consistently good accessibility |
| High \(\sigma_{norm}\) | Poor accessibility for most areas | Good accessibility for some areas |

We can get the suburbs that lie in the four quadrants listed above with some simple data filters (a sketch follows the table). The list goes from the best suburbs to the worst in terms of consistent accessibility to playgrounds.

| Suburb | Quadrant | Characteristic |
|---|---|---|
| Te Aro | Low \(\sigma\) and \(\mu\) | Consistently good accessibility |
| Newtown | Low \(\sigma\) and \(\mu\) | Consistently good accessibility |
| Pipitea | Low \(\sigma\); high \(\mu\) | Consistent but poor accessibility |
| Hataitai | Low \(\sigma\); high \(\mu\) | Consistent but poor accessibility |
| Newlands | High \(\sigma\); low \(\mu\) | Good accessibility for some areas |
| Tawa | High \(\sigma\); low \(\mu\) | Good accessibility for some areas |
| Brooklyn | High \(\sigma\); low \(\mu\) | Good accessibility for some areas |
| Khandallah | High \(\sigma\) and \(\mu\) | Poor accessibility for most areas |
| Karori | High \(\sigma\) and \(\mu\) | Poor accessibility for most areas |
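The filters themselves can be simple boolean masks against the normalised city band; a sketch with hypothetical bounds and column names:

```python
# Hypothetical normalised bounds of the city-level credible band
mu_lo, mu_hi = 0.9, 1.1
sigma_lo, sigma_hi = 0.9, 1.1

quadrants = {
    'Consistently good accessibility':
        suburb_df[(suburb_df['mu_norm'] < mu_lo) & (suburb_df['sigma_norm'] < sigma_lo)],
    'Consistent but poor accessibility':
        suburb_df[(suburb_df['mu_norm'] > mu_hi) & (suburb_df['sigma_norm'] < sigma_lo)],
    'Good accessibility for some areas':
        suburb_df[(suburb_df['mu_norm'] < mu_lo) & (suburb_df['sigma_norm'] > sigma_hi)],
    'Poor accessibility for most areas':
        suburb_df[(suburb_df['mu_norm'] > mu_hi) & (suburb_df['sigma_norm'] > sigma_hi)],
}
```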

Suburbs that don’t fit the model

Part of the reason that Karori performs so poorly is that our model is a poor fit to its accessibility data. In the figure below, we see that the raw accessibility values appear to have three modes, a feature that is reduced to a single average mode in our model. This reduction causes both the \(\mu\) and \(\sigma\) values to inflate.

Once again, heterogeneity in suburban character is the reason the model is a poor fit for Rongotai. Rongotai is a suburb with both residential and industrial areas: the residential areas border Kilbirnie, while the industrial areas contain Wellington Airport and the large shopping area near Lyall Bay. The dual nature of the suburb is reflected in two distinct modes in the raw accessibility values.

Making the model better

One way to better model Karori is to break the suburb into smaller units: perhaps Area Units (now called Statistical Area 2, SA2). SA2 / area unit boundaries are available from Stats NZ. Unfortunately, these units aren't engineered to fit perfectly within suburbs, since they're primarily designed as statistical units for the census. We see below that the SA2 units for Karori don't extend as far as the suburb boundary. However, this can be managed with a convenient overlay of the SA2 units within the suburb boundary, and this new overlay geometry can be used to subset the accessibility data.
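A minimal sketch of that overlay with geopandas (assuming an `area_units` GeoDataFrame loaded from the Stats NZ file):

```python
import geopandas

# Clip the SA2 / area unit polygons to the suburb boundary
karori_units = geopandas.overlay(area_units, karori_boundary,
                                 how='intersection')

# Re-use the earlier spatial join to subset accessibility to the clipped units
karori_subset = geopandas.sjoin(karori_units, df_joined_coords, op='contains')
```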

An even better spatial filter uses the LINZ residential polygons to model only the accessibilities that fall within the residential areas of a suburb. This should mitigate some of the high \(\sigma\) issues we've seen. In the table below, the blue regions are the residential polygons overlaid on the accessibility data for Karori and Rongotai. The polygons cover the highest-density parts of each suburb and make for a much more intuitive filter than area units.

| Suburb | Area Unit filter | LINZ residential filter |
|---|---|---|
| Karori | *(figure)* | *(figure)* |
| Rongotai | *(figure)* | *(figure)* |

Conclusions

We've done the hard yards in extracting some relevant insights about suburban walkability. The insights have a human-understandable frame and can be used to evaluate suburbs from a citizen's point of view. However, there are some obvious improvements to be made, the main one being the removal of accessibility data for non-residential areas. This is quite an important update, so we'll cover it in detail in the next post.