SpatialLazyFrame

SpatialLazyFrame is an immutable plan builder. Operations can be declared in any order; the optimizer reorders, fuses, and pushes them down before execution. Nothing runs until an executing method such as .collect() is called.

SpatialGroupBy is returned by .group_by() and holds the keys for a fused aggregate-join.

pycanopy.SpatialLazyFrame

Builds a spatial query plan declaratively. Declaration order is not execution order.

Plan-building methods return a new SpatialLazyFrame with the node appended; the original frame is never mutated. Join and kNN nodes act as barriers and are never reordered by the cost sort.

Parameters:

Name Type Description Default
sf SpatialFrame

The SpatialFrame that owns the Engine and DataFrame.

required
plan Plan

Current list of plan nodes (do not mutate directly).

required
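
The builder behaviour described above (append without mutation, cost-based reordering, barriers that keep their position) can be illustrated with a small plain-Python model. This is only a sketch of the idea, not pycanopy's implementation: the Node and LazyPlan classes, the cost field, and the sorting rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str              # e.g. "range", "filter", "knn_join"
    cost: int              # hypothetical cost estimate; cheaper nodes run first
    barrier: bool = False  # barriers are never moved by the cost sort


@dataclass(frozen=True)
class LazyPlan:
    nodes: tuple = ()

    def append(self, node):
        # Return a new plan; the receiver is left untouched.
        return LazyPlan(self.nodes + (node,))

    def optimised(self):
        # Sort each run of non-barrier nodes by cost; barriers stay in place.
        out, run = [], []
        for node in self.nodes:
            if node.barrier:
                out.extend(sorted(run, key=lambda n: n.cost))
                out.append(node)
                run = []
            else:
                run.append(node)
        out.extend(sorted(run, key=lambda n: n.cost))
        return tuple(out)


base = LazyPlan()
plan = (
    base.append(Node("filter", cost=5))
    .append(Node("range", cost=1))
    .append(Node("knn_join", cost=9, barrier=True))
    .append(Node("filter", cost=2))
)
order = [n.kind for n in plan.optimised()]
```

In this model the cheap range node moves ahead of the first filter, while the filter declared after the join stays behind it, and base still holds an empty plan.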

collect(batch_size=None)

Optimise (SpatialOptimizer) and execute (SpatialExecutor) the plan.

A plan ending in a large-probe spatial join streams the probe side in morsels and concatenates the per-morsel results, which bounds the size of the intermediate result. Indexing follows the frame's mode.

Parameters:

Name Type Description Default
batch_size int | None

Probe rows per morsel for streamed joins. Defaults to MORSEL_ROWS. Ignored for plans without a join.

None

Returns:

Type Description
DataFrame

The executed result as a Polars DataFrame.

collect_all(frames) staticmethod

Collect multiple SpatialLazyFrames, caching any shared plan prefix.

The plan prefix shared by frames branched from the same base is emitted once and cached; each branch's suffix is then built from the cached prefix.

Parameters:

Name Type Description Default
frames list[SpatialLazyFrame]

SpatialLazyFrames to collect. Must share a SpatialFrame.

required

Returns:

Type Description
list[DataFrame]

List of DataFrames in the same order as frames.

Raises:

Type Description
ValueError

If frames is empty or frames belong to different SpatialFrames.
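
The shared-prefix idea can be sketched in plain Python. The model below is an assumption made for illustration, not the library's code: plans are tuples of functions, and collect_all_sketch, common_prefix_len, and run_nodes are hypothetical names.

```python
def common_prefix_len(plans):
    """Length of the longest node prefix shared by every plan."""
    n = 0
    for column in zip(*plans):
        if any(node != column[0] for node in column):
            break
        n += 1
    return n


def run_nodes(nodes, rows):
    # Apply each plan node (a function from rows to rows) in turn.
    for node in nodes:
        rows = node(rows)
    return rows


def collect_all_sketch(plans, source):
    if not plans:
        raise ValueError("frames is empty")
    n = common_prefix_len(plans)
    cached = run_nodes(plans[0][:n], source)  # shared prefix runs once
    return [run_nodes(plan[n:], cached) for plan in plans]


calls = []  # records how often the shared node actually executes


def keep_even(rows):
    calls.append("keep_even")
    return [r for r in rows if r % 2 == 0]


def double(rows):
    return [r * 2 for r in rows]


def negate(rows):
    return [-r for r in rows]


results = collect_all_sketch(
    [(keep_even, double), (keep_even, negate)], [1, 2, 3, 4]
)
```

Both branches start from the same filtered rows, yet keep_even is recorded only once in calls, and results keeps the order of the input plans.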

collect_batched(batch_size=None)

Execute the plan and yield the result one morsel-frame at a time.

A join plan yields the result one joined morsel at a time so the full result never materialises. Plans without a join yield one frame.

Parameters:

Name Type Description Default
batch_size int | None

Probe rows per morsel. Defaults to MORSEL_ROWS.

None

Returns:

Type Description
Iterator[DataFrame]

An iterator of DataFrames, one per probe morsel.
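
Morsel streaming can be pictured as a generator that joins one slice of the probe at a time. The sketch below uses plain lists and an equality join as a stand-in for the spatial join; morsels, join_morsel, and collect_batched_sketch are hypothetical names, not part of pycanopy.

```python
from itertools import islice


def morsels(rows, batch_size):
    """Yield successive chunks of at most batch_size probe rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def join_morsel(probe_chunk, build):
    # Stand-in for the spatial join: match probe and build rows that are equal.
    return [(p, b) for p in probe_chunk for b in build if p == b]


def collect_batched_sketch(probe, build, batch_size):
    # Only one joined morsel exists at a time; the caller decides what to keep.
    for chunk in morsels(probe, batch_size):
        yield join_morsel(chunk, build)


batches = list(collect_batched_sketch([1, 2, 3, 4, 5], [2, 4, 5], batch_size=2))

# Concatenating the batches gives what a single collect-style call would return.
collected = [pair for batch in batches for pair in batch]
```

Consuming the generator lazily, rather than wrapping it in list() as done here for demonstration, is what keeps the full result from materialising.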

contains(x, y)

Add a point-in-polygon filter (polygon dataset only).

Parameters:

Name Type Description Default
x float

X coordinate of the query point.

required
y float

Y coordinate of the query point.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the contains node appended.
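
For readers unfamiliar with point-in-polygon tests, the following is a minimal even-odd ray-casting sketch in plain Python. The source does not say which algorithm or index the engine uses, so this is only one common way to answer the question; point_in_polygon and the sample polygons are hypothetical, and behaviour for points exactly on an edge is not defined here.

```python
def point_in_polygon(x, y, ring):
    """Even-odd ray casting. `ring` is a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # X coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


polygons = {
    "a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "b": [(5, 5), (7, 5), (6, 8)],
}

# Keep the polygons that contain the query point (1.0, 1.0).
hits = [name for name, ring in polygons.items() if point_in_polygon(1.0, 1.0, ring)]
```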

explain()

Return a human-readable description of the computed query plan.

Shows the optimised plan that collect() will execute (reordered operations, fused predicates, chosen EXPR or IO path) rather than the declaration order.

Returns:

Type Description
str

Multi-line string describing the plan. Print it for readable output.

filter(expr)

Add a scalar Polars expression filter.

Parameters:

Name Type Description Default
expr Expr

Any Polars expression that evaluates to a boolean column.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the scalar node appended.

group_by(*keys)

Begin a grouped aggregation, reduced over the streamed join.

Parameters:

Name Type Description Default
keys str | list[str] | tuple[str, ...]

Group-by key columns, as varargs or a single list/tuple.

()

Returns:

Type Description
SpatialGroupBy

A SpatialGroupBy builder. Call .agg() to run the aggregation.

intersects_pairs()

Find all intersecting polygon pairs with overlap area and IoU (polygon dataset).

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the intersects self-join node appended.

knn(x, y, k)

Add a k-nearest-neighbour lookup.

Parameters:

Name Type Description Default
x float

X coordinate of the query point.

required
y float

Y coordinate of the query point.

required
k int

Number of neighbours to return.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the knn node appended.
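
As a reference for what a k-nearest-neighbour lookup returns, here is a brute-force sketch in plain Python. It assumes Euclidean distance and scans every point; an engine with a spatial index would avoid the full scan. The function name knn_indices and the sample points are hypothetical.

```python
import heapq
import math


def knn_indices(points, x, y, k):
    """Return the indices of the k points closest to (x, y), nearest first."""
    return heapq.nsmallest(
        k,
        range(len(points)),
        key=lambda i: math.hypot(points[i][0] - x, points[i][1] - y),
    )


points = [(0, 0), (1, 1), (5, 5), (2, 0)]
nearest = knn_indices(points, 0.9, 0.9, k=2)
```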

knn_join(query_df, x_col, y_col, k)

Spatial join: for each row in query_df, find its k nearest neighbours in this Engine's dataset.

Result columns are query_df's followed by the Engine df's (conflicting right-side columns are prefixed with 'right_').

Parameters:

Name Type Description Default
query_df DataFrame

DataFrame of query points.

required
x_col str

Column in query_df holding x coordinates.

required
y_col str

Column in query_df holding y coordinates.

required
k int

Number of neighbours per query row.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the knn join node appended.
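
The column layout described above (query columns first, then Engine columns, with conflicts prefixed by 'right_') can be shown with rows modelled as dictionaries. This is a sketch under assumptions: knn_join_rows is a hypothetical helper, the Engine rows are assumed to carry their coordinates in columns named "x" and "y", and distance is assumed to be Euclidean.

```python
import heapq
import math


def knn_join_rows(query_rows, x_col, y_col, engine_rows, k):
    out = []
    for q in query_rows:
        # k Engine rows closest to this query point (brute force).
        nearest = heapq.nsmallest(
            k,
            engine_rows,
            key=lambda e: math.hypot(e["x"] - q[x_col], e["y"] - q[y_col]),
        )
        for e in nearest:
            row = dict(q)  # query columns come first
            for name, value in e.items():
                # Engine columns that clash with a query column get a prefix.
                row["right_" + name if name in q else name] = value
            out.append(row)
    return out


query = [{"id": 1, "x": 0.0, "y": 0.0}]
engine = [
    {"id": 10, "x": 1.0, "y": 0.0, "name": "A"},
    {"id": 11, "x": 3.0, "y": 0.0, "name": "B"},
]
joined = knn_join_rows(query, "x", "y", engine, k=1)
columns = list(joined[0])
```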

lazy_source(batch_size=None)

Expose the plan's streamed output as a native Polars LazyFrame source.

The plan runs morsel by morsel as a Polars IO source, so downstream ops (sort, sink_parquet) fuse with the join into one out-of-core pipeline. A one-row probe runs first.

Parameters:

Name Type Description Default
batch_size int | None

Probe rows per morsel. Defaults to MORSEL_ROWS.

None

Returns:

Type Description
LazyFrame

A Polars LazyFrame that streams this plan's output.

points_within_distance_of_polygon(polygon, distance)

Keep points within distance of a single query polygon (point dataset).

Distance is measured to the polygon boundary and is zero when the point is inside the polygon. The result is a subset of this frame's rows, as with a spatial filter.

Parameters:

Name Type Description Default
polygon

A single shapely Polygon (interior holes supported).

required
distance float

Maximum point-to-polygon distance for a row to be kept.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the points-within-distance node appended.
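
The distance rule above (zero inside, otherwise distance to the nearest boundary segment) can be sketched without any geometry library. The polygon is modelled as a list of rings, exterior first and holes after; point_polygon_distance and its helpers are hypothetical, and treating a point at exactly the maximum distance as kept is an assumption.

```python
import math


def _edges(rings):
    # Yield every boundary segment of every ring (exterior and holes).
    for ring in rings:
        for i in range(len(ring)):
            yield ring[i], ring[(i + 1) % len(ring)]


def _segment_distance(px, py, a, b):
    (ax, ay), (bx, by) = a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    # Project the point onto the segment and clamp to its endpoints.
    t = 0.0
    if length_sq > 0:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_polygon_distance(px, py, rings):
    # Even-odd crossing count over all rings, so holes count as outside.
    inside = False
    for (x1, y1), (x2, y2) in _edges(rings):
        if (y1 > py) != (y2 > py) and px < x1 + (py - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    if inside:
        return 0.0
    return min(_segment_distance(px, py, a, b) for a, b in _edges(rings))


square = [[(0, 0), (4, 0), (4, 4), (0, 4)]]
points = [(2, 2), (5, 2), (7, 8), (4.5, 4)]

# Keep points whose distance to the polygon is at most 1.0.
kept = [p for p in points if point_polygon_distance(p[0], p[1], square) <= 1.0]
```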

polygon_knn_join(query_df, x_col, y_col, k, sorted_output=False)

Spatial join: for each point in query_df find its k nearest Engine polygons.

Ranking is by exact point-to-polygon distance and a 'distance_to_polygon' column is appended.

Parameters:

Name Type Description Default
query_df DataFrame

DataFrame of query points.

required
x_col str

Column in query_df holding x coordinates.

required
y_col str

Column in query_df holding y coordinates.

required
k int

Number of nearest polygons per query point.

required
sorted_output bool

If True, all pairs are sorted by (distance_to_polygon ASC, target_idx ASC) inside Rust via rayon before returning. The full result materialises in RAM, so morsel streaming is bypassed. This matches a SQL ORDER BY distance_to_building, b_buildingkey clause without a separate Polars sort step.

False

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the polygon kNN join node appended.
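
The ordering produced by sorted_output=True can be pictured as a two-key sort in which target_idx breaks ties between equal distances. The rows below are made-up sample data, and query_idx is a hypothetical column used only for illustration.

```python
pairs = [
    {"query_idx": 0, "target_idx": 7, "distance_to_polygon": 2.5},
    {"query_idx": 1, "target_idx": 3, "distance_to_polygon": 0.0},
    {"query_idx": 0, "target_idx": 2, "distance_to_polygon": 2.5},
    {"query_idx": 1, "target_idx": 9, "distance_to_polygon": 1.0},
]

# Sort by distance first, then by target index, both ascending.
ordered = sorted(pairs, key=lambda p: (p["distance_to_polygon"], p["target_idx"]))
target_order = [p["target_idx"] for p in ordered]
```

Because the sort needs every pair at once, a global ordering like this cannot be produced morsel by morsel, which is consistent with streaming being bypassed.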

polygon_within_distance_join(query_df, x_col, y_col, distance)

Spatial join: for each point in query_df find Engine polygons within distance.

Distance is measured to the polygon boundary and is zero when the point is inside. Result columns are query_df's followed by the Engine df's (conflicting right-side columns are prefixed with 'right_').

Parameters:

Name Type Description Default
query_df DataFrame

DataFrame of query points.

required
x_col str

Column in query_df holding x coordinates.

required
y_col str

Column in query_df holding y coordinates.

required
distance float

Maximum point-to-polygon distance for a match.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the polygon within-distance join node appended.

range_query(min_x, min_y, max_x, max_y)

Add a bounding-box spatial filter.

Parameters:

Name Type Description Default
min_x float

Left edge of the query rectangle.

required
min_y float

Bottom edge of the query rectangle.

required
max_x float

Right edge of the query rectangle.

required
max_y float

Top edge of the query rectangle.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the range node appended.
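
For a point dataset, a bounding-box filter reduces to two interval checks per row. The sketch below assumes the rectangle edges are inclusive, which the source does not state; in_range and the sample points are hypothetical.

```python
def in_range(px, py, min_x, min_y, max_x, max_y):
    # Assumption: points lying exactly on an edge count as inside.
    return min_x <= px <= max_x and min_y <= py <= max_y


points = [(0.5, 0.5), (2.0, 1.0), (1.0, 3.0), (1.0, 1.0)]

# Keep the points that fall in the unit square.
kept = [p for p in points if in_range(p[0], p[1], 0.0, 0.0, 1.0, 1.0)]
```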

select(*columns)

Restrict the collected output to these columns. When the plan contains a join, the selection is pushed into the join's gather step.

Parameters:

Name Type Description Default
columns str | list[str] | tuple[str, ...]

Output column names to keep, as varargs or a single list/tuple.

()

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the terminal select node appended.

sink_parquet(path, batch_size=None)

Execute the plan and stream its result to a Parquet file in bounded memory.

Parameters:

Name Type Description Default
path str | Path

Destination Parquet file path.

required
batch_size int | None

Probe rows per morsel. Defaults to MORSEL_ROWS.

None

within_distance_join(query_df, x_col, y_col, distance)

Spatial join: for each point in query_df find Engine points within distance.

Result columns are query_df's followed by the Engine df's (conflicting right-side columns are prefixed with 'right_').

Parameters:

Name Type Description Default
query_df DataFrame

DataFrame of query points.

required
x_col str

Column in query_df holding x coordinates.

required
y_col str

Column in query_df holding y coordinates.

required
distance float

Maximum Euclidean distance for a match.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the within-distance join node appended.
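
A within-distance join pairs every query point with each Engine point no farther than the given Euclidean distance. The brute-force sketch below shows which pairs qualify; an indexed engine would not compare every combination. within_distance_pairs is a hypothetical name, and the inclusive comparison is an assumption.

```python
import math


def within_distance_pairs(query_pts, engine_pts, distance):
    """Return (query_index, engine_index) for every pair within distance."""
    return [
        (qi, ei)
        for qi, (qx, qy) in enumerate(query_pts)
        for ei, (ex, ey) in enumerate(engine_pts)
        if math.hypot(ex - qx, ey - qy) <= distance
    ]


query_pts = [(0, 0), (10, 10)]
engine_pts = [(3, 4), (0, 1), (10, 12)]
pairs = within_distance_pairs(query_pts, engine_pts, 5.5)
```

Unlike a kNN join, the number of matches per query point is not fixed: the first query point matches two Engine points here and the second matches one.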

within_join(query_df, x_col, y_col)

Spatial join: for each point in query_df find the Engine polygons that contain it.

Engine must be a polygon dataset. Result columns are query_df's then the Engine df's (conflicting right-side columns are prefixed with 'right_').

Parameters:

Name Type Description Default
query_df DataFrame

DataFrame of query points.

required
x_col str

Column in query_df holding x coordinates.

required
y_col str

Column in query_df holding y coordinates.

required

Returns:

Type Description
SpatialLazyFrame

New SpatialLazyFrame with the within join node appended.

pycanopy.SpatialGroupBy

Pending grouped aggregation over a SpatialLazyFrame. Created by .group_by().

Parameters:

Name Type Description Default
slf SpatialLazyFrame

The SpatialLazyFrame to aggregate.

required
keys list[str]

Group-by key columns.

required

agg(**named_aggs)

Run the grouped aggregation, reducing each join morsel into per-group partials.

Parameters:

Name Type Description Default
named_aggs AggSpec

Output column name to aggregation spec (pycanopy.agg.count, sum, etc).

{}

Returns:

Type Description
DataFrame

One row per group with the named aggregate columns.

Raises:

Type Description
ValueError

If no aggregations are given.
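
Reducing each join morsel into per-group partials, then merging the partials, can be sketched in plain Python for a count and a sum. This only illustrates the general technique; the helper names, the sample rows, and the output column names n and total_area are hypothetical.

```python
from collections import defaultdict


def reduce_morsel(rows, key, value):
    """Fold one morsel into {group: [count, sum]} partials."""
    partial = defaultdict(lambda: [0, 0.0])
    for row in rows:
        acc = partial[row[key]]
        acc[0] += 1
        acc[1] += row[value]
    return partial


def merge(total, partial):
    # Counts and sums are both additive, so partials merge by addition.
    for group, (count, summed) in partial.items():
        total[group][0] += count
        total[group][1] += summed


joined_morsels = [
    [{"zone": "a", "area": 1.0}, {"zone": "b", "area": 2.0}],
    [{"zone": "a", "area": 3.0}],
]

total = defaultdict(lambda: [0, 0.0])
for morsel in joined_morsels:
    merge(total, reduce_morsel(morsel, "zone", "area"))

# One output row per group.
result = {g: {"n": c, "total_area": s} for g, (c, s) in sorted(total.items())}
```

Only the small per-group accumulators persist between morsels, so memory use depends on the number of groups rather than on the size of the join.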