Nearest Neighbor Methods

Spring 2023

Author

Serge Rey

Published

February 16, 2023

Introduction

Now that we have been introduced to the different statistical models than are used to represent point processes, we turn to the methods that are used to link observed point patterns back to the process that generated the pattern.

More specifically, the challenge that we face is as follows. Given an observed point pattern, we wish to make inferences about the process that generated the observed pattern.

The general approach that is used is to construct measures that characterise the observed point pattern, and then compare these against the proporties of the theoretical process models we explored previously.

For example, if we assume that the underlying process is CSR, we know what kinds of properties the empirical patterns from such a process should exhibit. The critical thing to keep in mind is that we never actually see the underlying process - we only see outcomes of the process (i.e., the pattern).

This raises a number of challenges that we will need to address later on, but for now we are going to build up an inituition of the general strategy for analyzing point patterns.

Example Patterns

To begin we are going to create two different point patterns, one from a CSR process and one from a clustered process. We will use these two patterns to introduce the different statistical methods used to analyze the patterns. Here we are in the rare circumstance in which we actually know what process generated the pattern.

CSR n=60

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(12345)
n = 60
xy = np.random.rand(60,2)
df = pd.DataFrame(data=xy, columns=['x', 'y'])
sns.scatterplot(x='x', y='y', data=df);

import pointpats as pp
<string>:1: UserWarning:

Shapely 2.0 is installed, but because PyGEOS is also installed, GeoPandas will still use PyGEOS by default for now. To force to use and test Shapely 2.0, you have to set the environment variable USE_PYGEOS=0. You can do this before starting the Python process, or in your code before importing geopandas:

import os
os.environ['USE_PYGEOS'] = '0'
import geopandas

In a future release, GeoPandas will switch to using Shapely by default. If you are using PyGEOS directly (calling PyGEOS functions on geometries from GeoPandas), this will then stop working and you are encouraged to migrate from PyGEOS to Shapely 2.0 (https://shapely.readthedocs.io/en/latest/migration_pygeos.html).
csr = pp.PointPattern(xy)
/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:1492: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.

/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:1208: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.
csr.summary()
Point Pattern
60 points
Bounding rectangle [(0.00838829794155349,0.024676210429265266), (0.9940145858999619,0.9613067360728214)]
Area of window: 0.9231676681785911
Intensity estimate for window: 64.99361066054225
          x         y
0  0.929616  0.316376
1  0.183919  0.204560
2  0.567725  0.595545
3  0.964515  0.653177
4  0.748907  0.653570
w = pp.Window([(0,0), (0,1), (1,1), (1,0), (0,0)])
draw = pp.PoissonClusterPointProcess(w, n, 2, 0.05, 1, asPP=True, conditioning=False)
draw.realizations[0].plot(window=True, title='Contagion Point Process (2 parents)')
/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:1492: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.

/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:1208: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.

/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:1923: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.

/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:103: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.

clustered = draw.realizations[0]
clustered.summary()
Point Pattern
60 points
Bounding rectangle [(0.47331760265312733,0.023178703349462502), (0.9696584457277277,0.6150208352748628)]
Area of window: 1.0
Intensity estimate for window: 60.0
          x         y
0  0.513060  0.541971
1  0.473318  0.578385
2  0.508373  0.536200
3  0.881716  0.060328
4  0.894221  0.059273

Quadrat Statistics

import pointpats.quadrat_statistics as qs
csr_qr = qs.QStatistic(csr, shape='rectangle', nx=3, ny=3)
csr_qr.plot()

csr_qr.chi2
10.8
csr_qr.chi2_pvalue
0.21329101843394052
clustered_qr = qs.QStatistic(clustered, shape='rectangle', nx=3, ny=3)
clustered_qr.plot()

clustered_qr.chi2
209.99999999999994
clustered_qr.chi2_pvalue
4.976940117448032e-41

Nearest Neighbor Distances

plt.scatter(csr.points.x, csr.points.y);

import networkx as nx
G = nx.DiGraph()
for idx, point in enumerate(csr.points.values):
    G.add_node(idx, pos=point)
    
pos = nx.get_node_attributes(G, 'pos')
nx.draw(G, pos, node_size=2)

nidx, nnd = csr.knn(1) # here we have the indices of the nearest neighbors (nidx) and the distances (nnd)
for idx, neighbor in enumerate(nidx):
    edge = (idx, neighbor[0])
    G.add_edges_from([edge])
    
pos = nx.get_node_attributes(G, 'pos')
nx.draw(G, pos, node_size=2)

Here we draw an arrow towards the nearest neighbor for a given observation.

In some cases, an observation is also the nearest neighbor to its nearest neighbor, or so-called “mutual nearest neighbors”. These pairs would appear at the the end of a segment with arrows on both ends.

If we look in the extreme northwest three points, we see one pair of mutual nearest neighbors, while the third point is not a mutual nearest neighbor.

It is also possible for a point to be a nearest neighbor to more than a single point, as is also seen in this case.

The nearest neighbor distances are the lengths of these segments.

Mean Nearest Neighbor Distance

Our first distance based statistic was suggested by Clark and Evans (1954) as the average nearest neighbor distances:

\[\bar{d}_{min} = \frac{1}{n} \sum_{i} d_{i, min} \]

where \(d_{i, min}\) is the nearest neighbor distance for observation \(i\), and \(n\) is the number of observations.

Under a CSR process, the expected value of this statistic is:

\[E[\bar{d}_{min}] = \frac{1}{2 \sqrt{\lambda}}\]

The logic of the statistic is to compare the observed mean nearest neighbor distance to this expectation forming their ratio:

\[R = \frac{\bar{d}_{min}}{\frac{1}{2 \sqrt{\lambda}}} = 2 \bar{d}_{min} \sqrt{\lambda}\]

Values of \(R<1\) are indicative of a tendancy towards clustering since the observed nearest neighbor distances are smaller than expected under CSR.

Values of \(R>1\) are indicative of a uniform or dispersed pattern.

nnd.mean() # the mean nearest neighbor distance
0.07360281110243255
csr.lambda_window # the intensity using the window for the point pattern
64.99361066054225
dmin = nnd.mean()
lam = csr.lambda_window
R = 2 * dmin * lam**(1/2)
R
1.1867513365512292

Let’s compare this to the same statistic based on the mean nearest neighbor distance for the clustered pattern:

nidx, nnd = clustered.knn(1) # here we have the indices of the nearest neighbors (nidx) and the distances (nnd)
dmin = nnd.mean()
lam = clustered.lambda_window
R = 2 * dmin * lam**(1/2)
R
0.13087134840769868

So we see that the \(R\) value for the clustered pattern is much below 1, while the R value for the CSR pattern is slightly over 1.

What we would like to know is if these values are significantly different from what we would expect if the underlying process that generated the patterns was CSR?

One approach is to use theoretical results on the distribution for the \(R\) statistic from Petrere (1985). The expected value of \(R\) is \(E[R]=1\). The variance of the \(R\) statistic is: \[ \sigma^2_R = \frac{0.2732}{n}\]

import scipy.stats
def R_test(pattern):
    nidx, nnd = pattern.knn(1) # here we have the indices of the nearest neighbors (nidx) and the distances (nnd)
    lam = pattern.lambda_window
    R = 2 * nnd.mean() * lam**(1/2)
    n = nnd.shape[0]
    var = 0.2732 / n
    se = var**(1/2)
    stat = (R - 1 )/ se
    p_value = scipy.stats.norm.sf(abs(stat)) * 2
    return R, stat, p_value
R_test(csr)
(1.1867513365512292, 2.7675724348891184, 0.005647549379017388)
R_test(clustered)
(0.13087134840769868, -12.880103258909552, 5.825756575218341e-38)

Inference via simulation

import pointpats
import numpy
samples = pointpats.PoissonPointProcess(csr.window, n, 99, asPP=True)

r_tests = np.array([R_test(samples.realizations[k]) for k in samples.realizations])

r_tests
/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:1923: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.

/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:103: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.
array([[ 9.81105197e-01, -2.80012645e-01,  7.79467803e-01],
       [ 9.28415089e-01, -1.06085681e+00,  2.88754981e-01],
       [ 1.05435206e+00,  8.05473519e-01,  4.20546481e-01],
       [ 1.11821360e+00,  1.75187335e+00,  7.97955886e-02],
       [ 1.12387424e+00,  1.83576157e+00,  6.63929272e-02],
       [ 1.16724422e+00,  2.47848561e+00,  1.31941433e-02],
       [ 1.10700813e+00,  1.58581325e+00,  1.12781678e-01],
       [ 1.09108797e+00,  1.34988348e+00,  1.77053361e-01],
       [ 9.69002352e-01, -4.59371472e-01,  6.45967431e-01],
       [ 1.03760497e+00,  5.57289124e-01,  5.77329905e-01],
       [ 1.02704882e+00,  4.00851546e-01,  6.88529426e-01],
       [ 1.10508460e+00,  1.55730742e+00,  1.19397514e-01],
       [ 1.15793416e+00,  2.34051463e+00,  1.92571837e-02],
       [ 1.04250723e+00,  6.29938355e-01,  5.28734918e-01],
       [ 1.10758728e+00,  1.59439606e+00,  1.10847354e-01],
       [ 1.07177650e+00,  1.06369612e+00,  2.87466383e-01],
       [ 1.02551521e+00,  3.78124168e-01,  7.05338356e-01],
       [ 9.46840660e-01, -7.87797970e-01,  4.30814888e-01],
       [ 1.03757582e+00,  5.56857151e-01,  5.77625033e-01],
       [ 9.90592969e-01, -1.39408044e-01,  8.89127716e-01],
       [ 1.04856233e+00,  7.19672330e-01,  4.71726767e-01],
       [ 1.10811705e+00,  1.60224699e+00,  1.09101003e-01],
       [ 9.54688187e-01, -6.71501077e-01,  5.01901374e-01],
       [ 1.10995667e+00,  1.62950930e+00,  1.03205247e-01],
       [ 1.15162238e+00,  2.24697675e+00,  2.46415129e-02],
       [ 9.31182120e-01, -1.01985062e+00,  3.07799310e-01],
       [ 9.71814911e-01, -4.17690588e-01,  6.76173355e-01],
       [ 1.14149435e+00,  2.09688392e+00,  3.60038522e-02],
       [ 1.07165594e+00,  1.06190948e+00,  2.88276781e-01],
       [ 1.14437709e+00,  2.13960485e+00,  3.23867145e-02],
       [ 9.26536025e-01, -1.08870370e+00,  2.76284570e-01],
       [ 1.00659687e+00,  9.77627322e-02,  9.22120701e-01],
       [ 1.19240160e+00,  2.85130688e+00,  4.35399261e-03],
       [ 1.02219381e+00,  3.28902433e-01,  7.42229435e-01],
       [ 1.17405296e+00,  2.57938804e+00,  9.89755372e-03],
       [ 1.15649472e+00,  2.31918276e+00,  2.03851289e-02],
       [ 9.45617539e-01, -8.05924083e-01,  4.20286624e-01],
       [ 1.05893614e+00,  8.73407565e-01,  3.82440969e-01],
       [ 1.09368529e+00,  1.38837464e+00,  1.65022993e-01],
       [ 1.07631454e+00,  1.13094776e+00,  2.58077078e-01],
       [ 9.78530574e-01, -3.18167420e-01,  7.50357945e-01],
       [ 1.07064731e+00,  1.04696203e+00,  2.95117090e-01],
       [ 1.08860485e+00,  1.31308486e+00,  1.89154355e-01],
       [ 1.19551229e+00,  2.89740597e+00,  3.76262510e-03],
       [ 1.07442246e+00,  1.10290799e+00,  2.70067126e-01],
       [ 1.09651586e+00,  1.43032247e+00,  1.52624489e-01],
       [ 1.10573862e+00,  1.56699976e+00,  1.17114749e-01],
       [ 9.18296882e-01, -1.21080418e+00,  2.25970464e-01],
       [ 9.75959250e-01, -3.56273307e-01,  7.21635897e-01],
       [ 1.12270070e+00,  1.81837021e+00,  6.90075678e-02],
       [ 1.16124927e+00,  2.38964311e+00,  1.68647518e-02],
       [ 1.14288622e+00,  2.11751071e+00,  3.42165274e-02],
       [ 8.95782250e-01, -1.54446109e+00,  1.22476670e-01],
       [ 1.05289923e+00,  7.83943286e-01,  4.33073389e-01],
       [ 1.21757563e+00,  3.22437488e+00,  1.26248010e-03],
       [ 1.03138599e+00,  4.65126525e-01,  6.41840852e-01],
       [ 1.07305973e+00,  1.08271294e+00,  2.78935859e-01],
       [ 1.11578890e+00,  1.71594046e+00,  8.61729410e-02],
       [ 9.58111837e-01, -6.20764093e-01,  5.34754852e-01],
       [ 9.37476396e-01, -9.26572231e-01,  3.54148678e-01],
       [ 9.63173031e-01, -5.45759439e-01,  5.85231308e-01],
       [ 1.03085430e+00,  4.57247129e-01,  6.47493427e-01],
       [ 1.12588430e+00,  1.86554984e+00,  6.21043732e-02],
       [ 1.04886527e+00,  7.24161732e-01,  4.68966450e-01],
       [ 1.11417108e+00,  1.69196511e+00,  9.06526264e-02],
       [ 9.09443001e-01, -1.34201478e+00,  1.79591204e-01],
       [ 1.11892848e+00,  1.76246763e+00,  7.79903206e-02],
       [ 1.14728332e+00,  2.18267389e+00,  2.90598342e-02],
       [ 9.66649872e-01, -4.94234183e-01,  6.21140802e-01],
       [ 1.01975820e+00,  2.92807770e-01,  7.69669089e-01],
       [ 1.06807716e+00,  1.00887343e+00,  3.13035340e-01],
       [ 8.87332323e-01, -1.66968527e+00,  9.49816494e-02],
       [ 1.06116818e+00,  9.06485480e-01,  3.64678947e-01],
       [ 9.73221131e-01, -3.96851022e-01,  6.91477323e-01],
       [ 1.10831264e+00,  1.60514552e+00,  1.08461783e-01],
       [ 1.15121609e+00,  2.24095569e+00,  2.50289451e-02],
       [ 1.06442269e+00,  9.54715861e-01,  3.39721407e-01],
       [ 1.09045697e+00,  1.34053244e+00,  1.80072304e-01],
       [ 1.21321146e+00,  3.15969983e+00,  1.57931759e-03],
       [ 1.03753327e+00,  5.56226553e-01,  5.78055989e-01],
       [ 9.92381824e-01, -1.12898015e-01,  9.10111410e-01],
       [ 9.89597771e-01, -1.54156441e-01,  8.77486386e-01],
       [ 8.69471915e-01, -1.93436865e+00,  5.30678183e-02],
       [ 1.07456759e+00,  1.10505883e+00,  2.69134096e-01],
       [ 1.01501782e+00,  2.22557427e-01,  8.23879974e-01],
       [ 1.06339883e+00,  9.39542766e-01,  3.47452146e-01],
       [ 1.04251892e+00,  6.30111598e-01,  5.28621572e-01],
       [ 1.01595310e+00,  2.36417893e-01,  8.13108413e-01],
       [ 9.76030325e-01, -3.55220016e-01,  7.22424770e-01],
       [ 1.02700974e+00,  4.00272457e-01,  6.88955852e-01],
       [ 1.05167838e+00,  7.65850808e-01,  4.43765079e-01],
       [ 1.03268084e+00,  4.84315702e-01,  6.28161834e-01],
       [ 1.07237329e+00,  1.07254022e+00,  2.83477458e-01],
       [ 1.17864464e+00,  2.64743472e+00,  8.11050176e-03],
       [ 1.00495856e+00,  7.34836527e-02,  9.41421252e-01],
       [ 9.57599581e-01, -6.28355512e-01,  5.29771074e-01],
       [ 1.01309195e+00,  1.94016950e-01,  8.46162610e-01],
       [ 1.06834012e+00,  1.01277043e+00,  3.11169828e-01],
       [ 1.04123161e+00,  6.11034274e-01,  5.41176890e-01]])
R_csr = R_test(csr)
R_csr[0]
1.1867513365512292
(r_tests[:,0] >= R_csr[0]).sum()
4
R_clustered = R_test(clustered)
import pandas

import seaborn as sns
df = pandas.DataFrame(data=r_tests, columns=['R', 'z', 'p'])

sns.displot(df, kind='kde', x="R")
plt.axvline(R_csr[0], 0, 0.1, color='g');
plt.axvline(R_clustered[0], 0, 0.1, color='r');

df.describe()
R z p
count 99.000000 99.000000 99.000000
mean 1.052286 0.774857 0.357230
std 0.078631 1.165280 0.282182
min 0.869472 -1.934369 0.001262
25% 0.991487 -0.126153 0.099093
50% 1.054352 0.805474 0.295117
75% 1.108215 1.603696 0.581644
max 1.217576 3.224375 0.941421

samples = pointpats.PoissonPointProcess(csr.window, n, 999, asPP=True)

r_tests = np.array([R_test(samples.realizations[k]) for k in samples.realizations])

r_tests
/home/serge/miniconda3/envs/classes/lib/python3.10/site-packages/libpysal/cg/shapes.py:103: FutureWarning:

Objects based on the `Geometry` class will deprecated and removed in a future version of libpysal.
array([[ 1.04108274,  0.60882811,  0.54263838],
       [ 1.08726785,  1.2932711 ,  0.19591731],
       [ 0.96240692, -0.5571128 ,  0.57745036],
       ...,
       [ 1.18467435,  2.73679237,  0.00620414],
       [ 1.05998867,  0.88900566,  0.37400004],
       [ 0.92024492, -1.18193507,  0.23723146]])
df = pandas.DataFrame(data=r_tests, columns=['R', 'z', 'p'])

sns.displot(df, kind='kde', x="R")
plt.axvline(R_csr[0], 0, 0.1, color='g');
plt.axvline(R_clustered[0], 0, 0.1, color='r');

df.describe()
R z p
count 999.000000 999.000000 999.000000
mean 1.057947 0.858756 0.357294
std 0.077832 1.153438 0.296165
min 0.805327 -2.884969 0.000001
25% 1.004814 0.071334 0.092286
50% 1.061867 0.916842 0.283196
75% 1.111235 1.648461 0.577714
max 1.328334 4.865765 0.998563