SpatialLazyFrame
SpatialLazyFrame is an immutable plan builder. Operations can be declared in any order: the optimizer reorders, fuses, and pushes them down before execution. Nothing runs until .collect() is called.
SpatialGroupBy is returned by .group_by() and holds the keys for a fused aggregate-join.
pycanopy.SpatialLazyFrame
Builds a spatial query plan declaratively. Declaration order is not execution order.
All methods return a new SpatialLazyFrame with the node appended; the original frame is never mutated. Join and kNN nodes act as barriers and are never reordered by the cost sort.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sf | SpatialFrame | The SpatialFrame that owns the Engine and DataFrame. | required |
| plan | Plan | Current list of plan nodes (do not mutate directly). | required |
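
The append-without-mutation behaviour described above can be pictured with a small stand-in. This is a minimal sketch, not pycanopy's implementation: `ToyLazyFrame` and its node tuples are hypothetical, and only the method names `filter` and `range_query` come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLazyFrame:
    """Hypothetical stand-in: holds an immutable tuple of plan nodes."""

    plan: tuple = ()

    def _append(self, node):
        # Build a new frame; the receiver's plan is left untouched.
        return ToyLazyFrame(self.plan + (node,))

    def filter(self, predicate_name):
        return self._append(("scalar", predicate_name))

    def range_query(self, min_x, min_y, max_x, max_y):
        return self._append(("range", (min_x, min_y, max_x, max_y)))


base = ToyLazyFrame()
filtered = base.filter("is_open")
boxed = filtered.range_query(0.0, 0.0, 10.0, 10.0)
```

Because every call returns a fresh object, `base` and `filtered` can both be reused as starting points for other branches.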
collect(batch_size=None)
Optimise the plan with SpatialOptimizer, then execute it with SpatialExecutor.
A plan ending in a large-probe spatial join streams the probe in morsels and concatenates, bounding the intermediate. Indexing follows the frame's mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int \| None | Probe rows per morsel for streamed joins. Defaults to MORSEL_ROWS. Ignored for plans without a join. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | The executed result as a Polars DataFrame. |
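
The idea of streaming the probe side in morsels and concatenating the pieces can be sketched in plain Python. Everything here is illustrative: the `MORSEL_ROWS` value, the toy equi-join, and the helper names are assumptions, not pycanopy internals.

```python
from itertools import islice

MORSEL_ROWS = 4  # illustrative value only; the real default is not stated here


def morsels(rows, batch_size=None):
    """Yield successive lists of at most batch_size rows."""
    size = MORSEL_ROWS if batch_size is None else batch_size
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def join_morsel(chunk, build):
    """Toy equi-join of one probe morsel against a small build table."""
    return [(key, build[key]) for key in chunk if key in build]


def collect(probe, build, batch_size=None):
    # Join one morsel at a time, then concatenate the per-morsel results.
    out = []
    for chunk in morsels(probe, batch_size):
        out.extend(join_morsel(chunk, build))
    return out


build = {1: "a", 3: "c", 8: "h"}
result = collect(range(10), build, batch_size=3)
```

Only one morsel's worth of intermediate join output exists at a time, which is what bounds the intermediate.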
collect_all(frames)
staticmethod
Collect multiple SpatialLazyFrames, caching any shared plan prefix.
Caches the plan prefix shared by frames branched from the same base, emitting it once and building each branch's suffix from it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| frames | list[SpatialLazyFrame] | SpatialLazyFrames to collect. Must share a SpatialFrame. | required |
Returns:
| Type | Description |
|---|---|
| list[DataFrame] | List of DataFrames in the same order as frames. |
Raises:
| Type | Description |
|---|---|
| ValueError | If frames is empty or frames belong to different SpatialFrames. |
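
Shared-prefix caching can be illustrated with plans represented as tuples of nodes. This is a sketch under assumptions: the node format, `run_node`, and the list-of-numbers data are hypothetical, and the real optimizer may detect shared work differently.

```python
def common_prefix(plans):
    """Return the longest run of leading nodes shared by every plan."""
    prefix = []
    for nodes in zip(*plans):
        if any(node != nodes[0] for node in nodes):
            break
        prefix.append(nodes[0])
    return tuple(prefix)


def collect_all(plans, run_node, source):
    if not plans:
        raise ValueError("frames is empty")
    prefix = common_prefix(plans)
    # Run the shared prefix once and reuse its output for every branch.
    cached = source
    for node in prefix:
        cached = run_node(node, cached)
    results = []
    for plan in plans:
        data = cached
        for node in plan[len(prefix):]:
            data = run_node(node, data)
        results.append(data)
    return results


calls = []


def run_node(node, rows):
    calls.append(node)  # record executions so the reuse is visible
    kind, arg = node
    if kind == "min":
        return [r for r in rows if r >= arg]
    return [r for r in rows if r <= arg]


base = (("min", 2),)
plans = [base + (("max", 5),), base + (("max", 8),)]
results = collect_all(plans, run_node, list(range(10)))
```

The `calls` log shows the shared `("min", 2)` node executing once even though both branches depend on it.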
collect_batched(batch_size=None)
Execute the plan and yield the result one morsel-frame at a time.
A join plan yields the result one joined morsel at a time so the full result never materialises. Plans without a join yield one frame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int \| None | Probe rows per morsel. Defaults to MORSEL_ROWS. | None |
Returns:
| Type | Description |
|---|---|
| Iterator[DataFrame] | An iterator of DataFrames, one per probe morsel. |
contains(x, y)
Add a point-in-polygon filter (polygon dataset only).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | float | X coordinate of the query point. | required |
| y | float | Y coordinate of the query point. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the contains node appended. |
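
For readers unfamiliar with point-in-polygon tests, the predicate can be sketched with even-odd ray casting. This is a generic textbook technique, not necessarily what the Engine uses; the polygon names are made up, and points exactly on an edge are not treated specially here.

```python
def point_in_polygon(x, y, ring):
    """Even-odd ray casting over a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
polygons = {"a": square, "b": [(10.0, 10.0), (12.0, 10.0), (11.0, 12.0)]}
# Keep only the polygons that contain the query point (1, 1).
hits = [name for name, ring in polygons.items() if point_in_polygon(1.0, 1.0, ring)]
```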
explain()
Return a human-readable description of the computed query plan.
Shows the optimised plan that collect() will execute (reordered operations, fused predicates, chosen EXPR or IO path) rather than the declaration order.
Returns:
| Type | Description |
|---|---|
| str | Multi-line string describing the plan. Print it for readable output. |
filter(expr)
Add a scalar Polars expression filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expr | Expr | Any Polars expression that evaluates to a boolean column. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the scalar node appended. |
group_by(*keys)
Begin a grouped aggregation, reduced over the streamed join.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| keys | str \| list[str] \| tuple[str, ...] | Group-by key columns, as varargs or a single list/tuple. | () |
Returns:
| Type | Description |
|---|---|
| SpatialGroupBy | A SpatialGroupBy builder. Call .agg() to run the aggregation. |
intersects_pairs()
Find all intersecting polygon pairs with overlap area and IoU (polygon dataset).
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the intersects self-join node appended. |
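
Overlap area and IoU (intersection over union) are easiest to see on axis-aligned rectangles. The sketch below makes that simplification deliberately; general polygon intersection needs a clipping algorithm, and how the Engine computes it is not described here.

```python
def box_area(box):
    min_x, min_y, max_x, max_y = box
    # Negative extents mean the box is empty, so clamp them to zero.
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)


def overlap_and_iou(a, b):
    """Return (overlap_area, iou) for two (min_x, min_y, max_x, max_y) boxes."""
    inter = (
        max(a[0], b[0]), max(a[1], b[1]),
        min(a[2], b[2]), min(a[3], b[3]),
    )
    overlap = box_area(inter)
    union = box_area(a) + box_area(b) - overlap
    return overlap, (overlap / union if union > 0 else 0.0)


overlap, iou = overlap_and_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Here the two 2x2 squares share a unit square, so the overlap is 1 and the union is 4 + 4 - 1 = 7.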
knn(x, y, k)
Add a k-nearest-neighbour lookup.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | float | X coordinate of the query point. | required |
| y | float | Y coordinate of the query point. | required |
| k | int | Number of neighbours to return. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the knn node appended. |
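
As a reference for what a k-nearest-neighbour lookup returns, here is a brute-force version over a list of points. It is a sketch for intuition only; an indexed engine would avoid scanning every point, and the Euclidean metric is an assumption.

```python
import heapq
import math


def knn(points, x, y, k):
    """Return the indices of the k points closest to (x, y), nearest first."""
    return heapq.nsmallest(
        k,
        range(len(points)),
        key=lambda i: math.dist(points[i], (x, y)),
    )


points = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 0.0)]
nearest = knn(points, 0.2, 0.1, k=2)
```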
knn_join(query_df, x_col, y_col, k)
Spatial join: for each row in query_df, find its k nearest neighbours in this Engine's dataset.
Result columns are query_df's followed by the Engine df's (conflicting right-side columns are prefixed with 'right_').
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_df | DataFrame | DataFrame of query points. | required |
| x_col | str | Column in query_df holding x coordinates. | required |
| y_col | str | Column in query_df holding y coordinates. | required |
| k | int | Number of neighbours per query row. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the knn join node appended. |
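
The column-naming rule for join results (query columns first, clashing Engine columns prefixed with 'right_') can be sketched on plain dictionaries. The helper and the sample rows are hypothetical; only the prefix and the left-then-right ordering come from the description above.

```python
def merge_columns(left_row, right_row, prefix="right_"):
    """Combine two row dicts, left columns first, renaming right-side clashes."""
    merged = dict(left_row)
    for name, value in right_row.items():
        out_name = prefix + name if name in left_row else name
        merged[out_name] = value
    return merged


query_row = {"id": 7, "x": 1.0, "y": 2.0}
engine_row = {"id": 42, "name": "depot", "x": 1.5}
row = merge_columns(query_row, engine_row)
```

In this example `id` and `x` exist on both sides, so the Engine's values land in `right_id` and `right_x` while `name` keeps its original name.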
lazy_source(batch_size=None)
Expose the plan's streamed output as a native Polars LazyFrame source.
The plan runs morsel by morsel as a Polars IO source, so downstream ops (sort, sink_parquet) fuse with the join into one out-of-core pipeline. A one-row probe runs first.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int \| None | Probe rows per morsel. Defaults to MORSEL_ROWS. | None |
Returns:
| Type | Description |
|---|---|
| LazyFrame | A Polars LazyFrame that streams this plan's output. |
points_within_distance_of_polygon(polygon, distance)
Keep points within distance of a single query polygon (point dataset).
Distance is measured to the polygon boundary and is zero when the point is inside. The result is a subset of this frame's rows, as with a spatial filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| polygon | | A single shapely Polygon (interior holes supported). | required |
| distance | float | Maximum point-to-polygon distance for a row to be kept. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the points-within-distance node appended. |
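
The distance rule (zero inside, otherwise distance to the boundary) can be written out for a simple ring of vertices. This sketch ignores interior holes and uses plain tuples instead of a shapely Polygon, so it illustrates the definition rather than the library's code path; treating the threshold as inclusive is also an assumption.

```python
import math


def _edges(ring):
    # Pair each vertex with the next one, wrapping around to close the ring.
    return zip(ring, ring[1:] + ring[:1])


def _inside(x, y, ring):
    inside = False
    for (x1, y1), (x2, y2) in _edges(ring):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def _segment_distance(x, y, x1, y1, x2, y2):
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        # Clamp the projection of the point onto the segment to [0, 1].
        t = max(0.0, min(1.0, ((x - x1) * dx + (y - y1) * dy) / length_sq))
    return math.hypot(x - (x1 + t * dx), y - (y1 + t * dy))


def distance_to_polygon(x, y, ring):
    """Zero inside the ring, otherwise the distance to the nearest edge."""
    if _inside(x, y, ring):
        return 0.0
    return min(
        _segment_distance(x, y, x1, y1, x2, y2)
        for (x1, y1), (x2, y2) in _edges(ring)
    )


square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
points = [(5.0, 5.0), (12.0, 5.0), (13.0, 14.0), (30.0, 30.0)]
kept = [p for p in points if distance_to_polygon(p[0], p[1], square) <= 5.5]
```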
polygon_knn_join(query_df, x_col, y_col, k, sorted_output=False)
Spatial join: for each point in query_df find its k nearest Engine polygons.
Ranking is by exact point-to-polygon distance and a 'distance_to_polygon' column is appended.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_df | DataFrame | DataFrame of query points. | required |
| x_col | str | Column in query_df holding x coordinates. | required |
| y_col | str | Column in query_df holding y coordinates. | required |
| k | int | Number of nearest polygons per query point. | required |
| sorted_output | bool | If True, all pairs are sorted by (distance_to_polygon ASC, target_idx ASC) inside Rust via rayon before returning. The full result materialises in RAM, so morsel streaming is bypassed. Matches ORDER BY distance_to_building, b_buildingkey without a Polars sort step. | False |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the polygon kNN join node appended. |
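
The ordering that sorted_output=True describes is a two-key ascending sort. A small sketch on dictionaries shows the tie-breaking; the `query_idx` field and the sample values are invented for illustration, and the real sort runs in Rust rather than Python.

```python
pairs = [
    {"query_idx": 0, "target_idx": 5, "distance_to_polygon": 2.5},
    {"query_idx": 1, "target_idx": 2, "distance_to_polygon": 0.0},
    {"query_idx": 0, "target_idx": 3, "distance_to_polygon": 2.5},
    {"query_idx": 1, "target_idx": 9, "distance_to_polygon": 1.0},
]

# Sort by distance first; equal distances fall back to target_idx.
ordered = sorted(pairs, key=lambda p: (p["distance_to_polygon"], p["target_idx"]))
target_order = [p["target_idx"] for p in ordered]
```

The two pairs at distance 2.5 come out with target 3 before target 5, which is the tie-break the second key provides.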
polygon_within_distance_join(query_df, x_col, y_col, distance)
Spatial join: for each point in query_df find Engine polygons within distance.
Distance is to the polygon boundary (zero when the point is inside). Result columns are query_df's then the Engine df's (conflicting right-side columns prefixed 'right_').
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_df | DataFrame | DataFrame of query points. | required |
| x_col | str | Column in query_df holding x coordinates. | required |
| y_col | str | Column in query_df holding y coordinates. | required |
| distance | float | Maximum point-to-polygon distance for a match. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the polygon within-distance join node appended. |
range_query(min_x, min_y, max_x, max_y)
Add a bounding-box spatial filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| min_x | float | Left edge of the query rectangle. | required |
| min_y | float | Bottom edge of the query rectangle. | required |
| max_x | float | Right edge of the query rectangle. | required |
| max_y | float | Top edge of the query rectangle. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the range node appended. |
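
A bounding-box filter reduces to two interval checks per point. In this sketch the rectangle edges are treated as inclusive, which is an assumption; the page does not say how points exactly on an edge are handled.

```python
def in_box(x, y, min_x, min_y, max_x, max_y):
    # Edges are treated as inclusive here; that is an assumption.
    return min_x <= x <= max_x and min_y <= y <= max_y


points = [(1.0, 1.0), (5.0, 5.0), (10.0, 2.0), (11.0, 2.0)]
kept = [p for p in points if in_box(p[0], p[1], 0.0, 0.0, 10.0, 4.0)]
```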
select(*columns)
Restrict the collected output to these columns. When the plan contains a join, the selection is pushed into the join gather.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | str \| list[str] \| tuple[str, ...] | Output column names to keep, as varargs or a single list/tuple. | () |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the terminal select node appended. |
sink_parquet(path, batch_size=None)
Execute the plan and stream its result to a Parquet file in bounded memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination Parquet file path. | required |
| batch_size | int \| None | Probe rows per morsel. Defaults to MORSEL_ROWS. | None |
within_distance_join(query_df, x_col, y_col, distance)
Spatial join: for each point in query_df find Engine points within distance.
Result columns are query_df's followed by the Engine df's (conflicting right-side columns are prefixed with 'right_').
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_df | DataFrame | DataFrame of query points. | required |
| x_col | str | Column in query_df holding x coordinates. | required |
| y_col | str | Column in query_df holding y coordinates. | required |
| distance | float | Maximum Euclidean distance for a match. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the within-distance join node appended. |
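
The matching rule of a within-distance join can be stated as a brute-force nested loop over both point sets. This is only a reference for the semantics: the function below is a hypothetical stand-in that returns index pairs, it treats the threshold as inclusive (an assumption), and a real engine would use a spatial index instead of comparing every pair.

```python
import math


def within_distance_pairs(query_pts, engine_pts, distance):
    """Return (query_index, engine_index) pairs at most `distance` apart."""
    return [
        (qi, ei)
        for qi, q in enumerate(query_pts)
        for ei, e in enumerate(engine_pts)
        if math.dist(q, e) <= distance
    ]


query_pts = [(0.0, 0.0), (10.0, 10.0)]
engine_pts = [(3.0, 4.0), (0.0, 1.0), (10.0, 12.0)]
pairs = within_distance_pairs(query_pts, engine_pts, 5.5)
```

A query point can match zero, one, or many Engine points, so the output row count is not tied to the size of query_df.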
within_join(query_df, x_col, y_col)
Spatial join: for each point in query_df find the Engine polygons that contain it.
Engine must be a polygon dataset. Result columns are query_df's then the Engine df's (conflicting right-side columns are prefixed with 'right_').
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_df | DataFrame | DataFrame of query points. | required |
| x_col | str | Column in query_df holding x coordinates. | required |
| y_col | str | Column in query_df holding y coordinates. | required |
Returns:
| Type | Description |
|---|---|
| SpatialLazyFrame | New SpatialLazyFrame with the within join node appended. |
pycanopy.SpatialGroupBy
Pending grouped aggregation over a SpatialLazyFrame. Created by .group_by().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| slf | SpatialLazyFrame | The SpatialLazyFrame to aggregate. | required |
| keys | list[str] | Group-by key columns. | required |
agg(**named_aggs)
Run the grouped aggregation, reducing each join morsel into per-group partials.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| named_aggs | AggSpec | Output column name to aggregation spec (pycanopy.agg.count, sum, etc). | {} |
Returns:
| Type | Description |
|---|---|
| DataFrame | One row per group with the named aggregate columns. |
Raises:
| Type | Description |
|---|---|
| ValueError | If no aggregations are given. |
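
Reducing each morsel into per-group partials and then merging them can be sketched for count and sum. The data layout, the output names `n` and `total`, and both helpers are hypothetical; the point is only that partials for these aggregates combine by addition, so no morsel needs to be kept after it is reduced.

```python
from collections import defaultdict


def partial_agg(morsel):
    """Reduce one morsel of (key, value) rows to {key: [count, sum]}."""
    acc = defaultdict(lambda: [0, 0.0])
    for key, value in morsel:
        acc[key][0] += 1
        acc[key][1] += value
    return acc


def merge_partials(partials):
    """Combine per-morsel partials into one row per group."""
    total = defaultdict(lambda: [0, 0.0])
    for part in partials:
        for key, (count, value_sum) in part.items():
            total[key][0] += count
            total[key][1] += value_sum
    return {key: {"n": c, "total": s} for key, (c, s) in sorted(total.items())}


morsels = [
    [("north", 2.0), ("south", 1.0)],
    [("north", 3.0)],
    [("south", 4.0), ("south", 5.0)],
]
# Each morsel is reduced on its own, then the small partials are merged.
result = merge_partials(partial_agg(m) for m in morsels)
```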