Distance Based Methods

Introduction

This notebook explores tests for point patterns that are based on various nearest neighbor and point-location, point-point, distances.

Specifically we examine the following statistics:

Ripley’s $G$
Ripley’s $F$
Ripley’s $K$

Data

from pointpats import (
    distance_statistics,
    QStatistic,
    random,
    PointPattern,
)

We will be examining the Tokyo Flickr data set

import pandas as pd
db = pd.read_csv("~/data/385/tokyo_clean.csv")
db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   user_id               10000 non-null  object 
 1   longitude             10000 non-null  float64
 2   latitude              10000 non-null  float64
 3   date_taken            10000 non-null  object 
 4   photo/video_page_url  10000 non-null  object 
 5   x                     10000 non-null  float64
 6   y                     10000 non-null  float64
dtypes: float64(4), object(3)
memory usage: 547.0+ KB

import seaborn
import contextily
joint_axes = seaborn.jointplot(
    x="longitude", y="latitude", data=db, s=0.5
)
contextily.add_basemap(
    joint_axes.ax_joint,
    crs="EPSG:4326",
    source=contextily.providers.CartoDB.PositronNoLabels,
);

We are going to focus on the most prolific user and subset our pattern to their photos:

user = db.query('user_id == "95795770@N00"')
coordinates = user[["x", "y"]].values

import libpysal
from pointpats import centrography
import geopandas

alpha_shape, alpha, circs = libpysal.cg.alpha_shape_auto(
    coordinates, return_circles=True
)
convex_hull_vertices = centrography.hull(coordinates)

import matplotlib.pyplot as plt
f, ax = plt.subplots(1, 1, figsize=(9, 9))

# Plot a green alpha shape
geopandas.GeoSeries(
    [alpha_shape]
).plot(
    ax=ax,
    edgecolor="green",
    facecolor="green",
    alpha=0.2,
    label="Tightest single alpha shape",
)

# Include the points for our prolific user in black
ax.scatter(
    *coordinates.T, color="k", marker=".", label="Source Points"
)

# plot the circles forming the boundary of the alpha shape
for i, circle in enumerate(circs):
    # only label the first circle of its kind
    if i == 0:
        label = "Bounding Circles"
    else:
        label = None
        ax.add_patch(
            plt.Circle(
                circle,
                radius=alpha,
                facecolor="none",
                edgecolor="r",
                label=label,
            )
        )
        
# add a blue convex hull
ax.add_patch(
    plt.Polygon(
        convex_hull_vertices,
        closed=True,
        edgecolor="blue",
        facecolor="none",
        linestyle=":",
        linewidth=2,
        label="Convex Hull",
    )
)

# Add basemap
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Positron
)
plt.legend();

/tmp/ipykernel_435535/1608907436.py:54: UserWarning: Legend does not support handles for PatchCollection instances.
See: https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html#implementing-a-custom-legend-handler
  plt.legend();

And we can contrast this against a CSR pattern in the same window:

random_pattern = random.poisson(coordinates, size=len(coordinates))

f, ax = plt.subplots(1, figsize=(9, 9))
plt.scatter(
    *coordinates.T,
    color="k",
    marker=".",
    label="Observed photographs"
)
plt.scatter(*random_pattern.T, color="r", marker="x", label="Random")
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Positron
)
ax.legend(ncol=1, loc="center left")
plt.show()

random_pattern_ashape = random.poisson(
    alpha_shape, size=len(coordinates)
)

f, ax = plt.subplots(1, figsize=(9, 9))
plt.scatter(*coordinates.T, color="k", marker=".", label="Observed")
plt.scatter(
    *random_pattern_ashape.T, color="r", marker="x", label="Random"
)
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Positron
)
ax.legend(ncol=1, loc="center left")
plt.show()

Quadrat Analysis

We revisit the use of quadrat analysis to test for departures for CSR, setting the stage for the distance based methods:

qstat = QStatistic(coordinates)
qstat.plot()

print(qstat.chi2_pvalue)

0.0

So the pattern of photo points is a significant departure for CSR.

And for our random pattern in the same window we have:

qstat_null = QStatistic(random_pattern)
qstat_null.plot()

print(qstat_null.chi2_pvalue)

0.16149772674569046

qstat_null_ashape = QStatistic(random_pattern_ashape)
qstat_null_ashape.plot()

print(qstat_null_ashape.chi2_pvalue)

5.63164842380419e-29

Ripley’s $G$

We now are ready to consider the distances between each point and its nearest neighbor again. For the Clark-Evans statistic we focused on the mean nearest neighbor distance. This throws away a lot of potential information. After all, the mean is the average of a set of $n$ nearest neighbor distances. Here we introduce Ripley’s $G$ statistic which considers the cumulative distribution of the nearest neighbor distances.

f, ax = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
ax[0].scatter(*random_pattern.T, color="red")
ax[1].scatter(
    *random_pattern.T,
    color="red",
    zorder=100,
    marker=".",
    label="Points"
)
nn_ixs, nn_ds = PointPattern(random_pattern).knn(1)
first = True
for coord, nn_ix, nn_d in zip(random_pattern, nn_ixs, nn_ds):
    dx, dy = random_pattern[nn_ix].squeeze() - coord
    arrow = ax[1].arrow(
        *coord,
        dx,
        dy,
        length_includes_head=True,
        overhang=0,
        head_length=300 * 3,
        head_width=300 * 3,
        width=50 * 3,
        linewidth=0,
        facecolor="k",
        head_starts_at_zero=False
    )
    if first:
        plt.plot(
            (1e100, 1e101),
            (0, 1),
            color="k",
            marker="<",
            markersize=10,
            label="Nearest Neighbor to Point",
        )
    first = False

ax[0].axis([1.554e7, 1.556e7, 4240000, 4260000])
ax[0].set_xticklabels([])
ax[0].set_yticklabels([])
ax[0].set_xticks([])
ax[0].set_yticks([])
f.tight_layout()
ax[1].legend(bbox_to_anchor=(0.5, -0.06), fontsize=16)
plt.show()

As our observed $G(d)$ value is everywhere above the simulation envelopes, we can conclude that the pattern is a departure from CSR in the direction of clustering. The simulation envelopes report the range of the $G(d)$ values obtained for 99 simulated point patterns under the null at each value of $d$.

g_test = distance_statistics.g_test(
    coordinates, support=40, keep_simulations=True
)

import numpy
f, ax = plt.subplots(
   1, 2, figsize=(9, 3), gridspec_kw=dict(width_ratios=(6, 3))
)
# plot all the simulations with very fine lines
ax[0].plot(
    g_test.support, g_test.simulations.T, color="k", alpha=0.01
)
# and show the average of simulations
ax[0].plot(
    g_test.support,
    numpy.median(g_test.simulations, axis=0),
    color="cyan",
    label="median simulation",
)


# and the observed pattern's G function
ax[0].plot(
    g_test.support, g_test.statistic, label="observed", color="red"
)

# clean up labels and axes
ax[0].set_xlabel("distance")
ax[0].set_ylabel("% of nearest neighbor\ndistances shorter")
ax[0].legend()
ax[0].set_xlim(0, 2000)
ax[0].set_title(r"Ripley's $G(d)$ function")

# plot the pattern itself on the next frame
ax[1].scatter(*coordinates.T)

# and clean up labels and axes there, too
ax[1].set_xticks([])
ax[1].set_yticks([])
ax[1].set_xticklabels([])
ax[1].set_yticklabels([])
ax[1].set_title("Pattern")
f.tight_layout()
plt.show()

Ripley’s $F$

While Ripley’s $G$ takes into account more information from the nearest neighbor distribution, relative to the Clark-Evans statistic, Ripley’s $F$ statistic goes a bit further. Also referred to as the “empty space” function, the $F$ function is also a cumulative “nearest distance” distribution. However, here it records the minimum distance from a random set of points in the window to their nearest event point.

The advantage of $F$ over $G$ is that $G$ is limited by the number of event points, while we are free to simulate as many random points as we choose when calculating $F$:

f_test = distance_statistics.f_test(
    coordinates, support=40, keep_simulations=True
)

f, ax = plt.subplots(
    1, 2, figsize=(9, 3), gridspec_kw=dict(width_ratios=(6, 3))
)

# plot all the simulations with very fine lines
ax[0].plot(
    f_test.support, f_test.simulations.T, color="k", alpha=0.01
)
# and show the average of simulations
ax[0].plot(
    f_test.support,
    numpy.median(f_test.simulations, axis=0),
    color="cyan",
    label="median simulation",
)


# and the observed pattern's F function
ax[0].plot(
    f_test.support, f_test.statistic, label="observed", color="red"
)

# clean up labels and axes
ax[0].set_xlabel("distance")
ax[0].set_ylabel("% of nearest point in pattern\ndistances shorter")
ax[0].legend()
ax[0].set_xlim(0, 2000)
ax[0].set_title(r"Ripley's $F(d)$ function")

# plot the pattern itself on the next frame
ax[1].scatter(*coordinates.T)

# and clean up labels and axes there, too
ax[1].set_xticks([])
ax[1].set_yticks([])
ax[1].set_xticklabels([])
ax[1].set_yticklabels([])
ax[1].set_title("Pattern")
f.tight_layout()
plt.show()

Ripley’s $K$

While the $G$ and $F$ statistics make more use of the nearest-neighbor distance distribution, relative to Clark-Evans, they ignore the other $n-2$ distances between each point and its set of non-nearest neighbor points. Ripley’s $K$ function considers the cumulative distance distribution for all point-point pair distances.

k_test = distance_statistics.k_test(
    coordinates, support=40, keep_simulations=True
)

f, ax = plt.subplots(
    1, 2, figsize=(9, 3), gridspec_kw=dict(width_ratios=(6, 3))
)

# plot all the simulations with very fine lines
ax[0].plot(
    k_test.support, k_test.simulations.T, color="k", alpha=0.01
)
# and show the average of simulations
ax[0].plot(
    k_test.support,
    numpy.median(k_test.simulations, axis=0),
    color="cyan",
    label="median simulation",
)


# and the observed pattern's k function
ax[0].plot(
    k_test.support, k_test.statistic, label="observed", color="red"
)

# clean up labels and axes
ax[0].set_xlabel("distance")
ax[0].set_ylabel("Number of pairs in pattern\n separated by distance")
ax[0].legend()
ax[0].set_xlim(0, 2000)
ax[0].set_title(r"Ripley's $K(d)$ function")

# plot the pattern itself on the next frame
ax[1].scatter(*coordinates.T)

# and clean up labels and axes there, too
ax[1].set_xticks([])
ax[1].set_yticks([])
ax[1].set_xticklabels([])
ax[1].set_yticklabels([])
ax[1].set_title("Pattern")
f.tight_layout()
plt.show()

Similar to what we found for our other distance statistics, the observed value for Ripley’s $K$ is a significant departure from CSR, again pointing to clustering.

--- title: Distance Based Methods execute: cache: true echo: true format: html: code-fold: false jupyter: python3 --- ## Introduction This notebook explores tests for point patterns that are based on various nearest neighbor and point-location, point-point, distances. Specifically we examine the following statistics: - Ripley's $G$ - Ripley's $F$ - Ripley's $K$ ## Data ```{python} from pointpats import ( distance_statistics, QStatistic, random, PointPattern, ) ``` We will be examining the Tokyo Flickr data set ```{python} import pandas as pd db = pd.read_csv("~/data/385/tokyo_clean.csv") db.info() ``` ```{python} import seaborn import contextily joint_axes = seaborn.jointplot( x="longitude", y="latitude", data=db, s=0.5 ) contextily.add_basemap( joint_axes.ax_joint, crs="EPSG:4326", source=contextily.providers.CartoDB.PositronNoLabels, ); ``` We are going to focus on the most prolific user and subset our pattern to their photos: ```{python} user = db.query('user_id == "95795770@N00"') coordinates = user[["x", "y"]].values ``` ```{python} import libpysal from pointpats import centrography import geopandas ``` ```{python} alpha_shape, alpha, circs = libpysal.cg.alpha_shape_auto( coordinates, return_circles=True ) convex_hull_vertices = centrography.hull(coordinates) ``` ```{python} import matplotlib.pyplot as plt f, ax = plt.subplots(1, 1, figsize=(9, 9)) # Plot a green alpha shape geopandas.GeoSeries( [alpha_shape] ).plot( ax=ax, edgecolor="green", facecolor="green", alpha=0.2, label="Tightest single alpha shape", ) # Include the points for our prolific user in black ax.scatter( *coordinates.T, color="k", marker=".", label="Source Points" ) # plot the circles forming the boundary of the alpha shape for i, circle in enumerate(circs): # only label the first circle of its kind if i == 0: label = "Bounding Circles" else: label = None ax.add_patch( plt.Circle( circle, radius=alpha, facecolor="none", edgecolor="r", label=label, ) ) # add a blue convex hull ax.add_patch( plt.Polygon( convex_hull_vertices, closed=True, edgecolor="blue", facecolor="none", linestyle=":", linewidth=2, label="Convex Hull", ) ) # Add basemap contextily.add_basemap( ax, source=contextily.providers.CartoDB.Positron ) plt.legend(); ``` And we can contrast this against a CSR pattern in the same window: ```{python} random_pattern = random.poisson(coordinates, size=len(coordinates)) ``` ```{python} f, ax = plt.subplots(1, figsize=(9, 9)) plt.scatter( *coordinates.T, color="k", marker=".", label="Observed photographs" ) plt.scatter(*random_pattern.T, color="r", marker="x", label="Random") contextily.add_basemap( ax, source=contextily.providers.CartoDB.Positron ) ax.legend(ncol=1, loc="center left") plt.show() ``` ```{python} random_pattern_ashape = random.poisson( alpha_shape, size=len(coordinates) ) ``` ```{python} f, ax = plt.subplots(1, figsize=(9, 9)) plt.scatter(*coordinates.T, color="k", marker=".", label="Observed") plt.scatter( *random_pattern_ashape.T, color="r", marker="x", label="Random" ) contextily.add_basemap( ax, source=contextily.providers.CartoDB.Positron ) ax.legend(ncol=1, loc="center left") plt.show() ``` ## Quadrat Analysis We revisit the use of quadrat analysis to test for departures for CSR, setting the stage for the distance based methods: ```{python} qstat = QStatistic(coordinates) qstat.plot() ``` ```{python} print(qstat.chi2_pvalue) ``` So the pattern of photo points is a significant departure for CSR. And for our random pattern in the same window we have: ```{python} qstat_null = QStatistic(random_pattern) qstat_null.plot() ``` ```{python} print(qstat_null.chi2_pvalue) ``` ```{python} qstat_null_ashape = QStatistic(random_pattern_ashape) qstat_null_ashape.plot() ``` ```{python} print(qstat_null_ashape.chi2_pvalue) ``` ## Ripley's $G$ We now are ready to consider the distances between each point and its nearest neighbor again. For the Clark-Evans statistic we focused on the mean nearest neighbor distance. This throws away a lot of potential information. After all, the mean is the average of a set of $n$ nearest neighbor distances. Here we introduce Ripley's $G$ statistic which considers the cumulative distribution of the nearest neighbor distances. ```{python} f, ax = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True) ax[0].scatter(*random_pattern.T, color="red") ax[1].scatter( *random_pattern.T, color="red", zorder=100, marker=".", label="Points" ) nn_ixs, nn_ds = PointPattern(random_pattern).knn(1) first = True for coord, nn_ix, nn_d in zip(random_pattern, nn_ixs, nn_ds): dx, dy = random_pattern[nn_ix].squeeze() - coord arrow = ax[1].arrow( *coord, dx, dy, length_includes_head=True, overhang=0, head_length=300 * 3, head_width=300 * 3, width=50 * 3, linewidth=0, facecolor="k", head_starts_at_zero=False ) if first: plt.plot( (1e100, 1e101), (0, 1), color="k", marker="<", markersize=10, label="Nearest Neighbor to Point", ) first = False ax[0].axis([1.554e7, 1.556e7, 4240000, 4260000]) ax[0].set_xticklabels([]) ax[0].set_yticklabels([]) ax[0].set_xticks([]) ax[0].set_yticks([]) f.tight_layout() ax[1].legend(bbox_to_anchor=(0.5, -0.06), fontsize=16) plt.show() ``` As our observed $G(d)$ value is everywhere above the simulation envelopes, we can conclude that the pattern is a departure from CSR in the direction of clustering. The simulation envelopes report the range of the $G(d)$ values obtained for 99 simulated point patterns under the null at each value of $d$. ```{python} g_test = distance_statistics.g_test( coordinates, support=40, keep_simulations=True ) ``` ```{python} import numpy f, ax = plt.subplots( 1, 2, figsize=(9, 3), gridspec_kw=dict(width_ratios=(6, 3)) ) # plot all the simulations with very fine lines ax[0].plot( g_test.support, g_test.simulations.T, color="k", alpha=0.01 ) # and show the average of simulations ax[0].plot( g_test.support, numpy.median(g_test.simulations, axis=0), color="cyan", label="median simulation", ) # and the observed pattern's G function ax[0].plot( g_test.support, g_test.statistic, label="observed", color="red" ) # clean up labels and axes ax[0].set_xlabel("distance") ax[0].set_ylabel("% of nearest neighbor\ndistances shorter") ax[0].legend() ax[0].set_xlim(0, 2000) ax[0].set_title(r"Ripley's $G(d)$ function") # plot the pattern itself on the next frame ax[1].scatter(*coordinates.T) # and clean up labels and axes there, too ax[1].set_xticks([]) ax[1].set_yticks([]) ax[1].set_xticklabels([]) ax[1].set_yticklabels([]) ax[1].set_title("Pattern") f.tight_layout() plt.show() ``` ## Ripley's $F$ While Ripley's $G$ takes into account more information from the nearest neighbor distribution, relative to the Clark-Evans statistic, Ripley's $F$ statistic goes a bit further. Also referred to as the "empty space" function, the $F$ function is also a cumulative "nearest distance" distribution. However, here it records the minimum distance from a random set of points in the window to their nearest event point. The advantage of $F$ over $G$ is that $G$ is limited by the number of event points, while we are free to simulate as many random points as we choose when calculating $F$: ```{python} f_test = distance_statistics.f_test( coordinates, support=40, keep_simulations=True ) ``` ```{python} f, ax = plt.subplots( 1, 2, figsize=(9, 3), gridspec_kw=dict(width_ratios=(6, 3)) ) # plot all the simulations with very fine lines ax[0].plot( f_test.support, f_test.simulations.T, color="k", alpha=0.01 ) # and show the average of simulations ax[0].plot( f_test.support, numpy.median(f_test.simulations, axis=0), color="cyan", label="median simulation", ) # and the observed pattern's F function ax[0].plot( f_test.support, f_test.statistic, label="observed", color="red" ) # clean up labels and axes ax[0].set_xlabel("distance") ax[0].set_ylabel("% of nearest point in pattern\ndistances shorter") ax[0].legend() ax[0].set_xlim(0, 2000) ax[0].set_title(r"Ripley's $F(d)$ function") # plot the pattern itself on the next frame ax[1].scatter(*coordinates.T) # and clean up labels and axes there, too ax[1].set_xticks([]) ax[1].set_yticks([]) ax[1].set_xticklabels([]) ax[1].set_yticklabels([]) ax[1].set_title("Pattern") f.tight_layout() plt.show() ``` ## Ripley's $K$ While the $G$ and $F$ statistics make more use of the nearest-neighbor distance distribution, relative to Clark-Evans, they ignore the other $n-2$ distances between each point and its set of non-nearest neighbor points. Ripley's $K$ function considers the cumulative distance distribution for all point-point pair distances. ```{python} k_test = distance_statistics.k_test( coordinates, support=40, keep_simulations=True ) ``` ```{python} f, ax = plt.subplots( 1, 2, figsize=(9, 3), gridspec_kw=dict(width_ratios=(6, 3)) ) # plot all the simulations with very fine lines ax[0].plot( k_test.support, k_test.simulations.T, color="k", alpha=0.01 ) # and show the average of simulations ax[0].plot( k_test.support, numpy.median(k_test.simulations, axis=0), color="cyan", label="median simulation", ) # and the observed pattern's k function ax[0].plot( k_test.support, k_test.statistic, label="observed", color="red" ) # clean up labels and axes ax[0].set_xlabel("distance") ax[0].set_ylabel("Number of pairs in pattern\n separated by distance") ax[0].legend() ax[0].set_xlim(0, 2000) ax[0].set_title(r"Ripley's $K(d)$ function") # plot the pattern itself on the next frame ax[1].scatter(*coordinates.T) # and clean up labels and axes there, too ax[1].set_xticks([]) ax[1].set_yticks([]) ax[1].set_xticklabels([]) ax[1].set_yticklabels([]) ax[1].set_title("Pattern") f.tight_layout() plt.show() ``` Similar to what we found for our other distance statistics, the observed value for Ripley's $K$ is a significant departure from CSR, again pointing to clustering.

Distance Based Methods

Introduction

Data

Quadrat Analysis

Ripley’s \(G\)

Ripley’s \(F\)

Ripley’s \(K\)