Enricher¶
In this notebook we’ll sprinkle some extra magic onto your urban layers. Let’s add, for example, the average number of building floors to a layer and see it sparkle!
Data source used:
- PLUTO data from NYC Open Data. https://www.nyc.gov/content/planning/pages/resources/datasets/mappluto-pluto-change
Let’s jazz things up! 🏙️
import urban_mapper as um
# Start UrbanMapper
mapper = um.UrbanMapper()
Loading Data and Creating a Layer¶
First, let’s grab some PLUTO data and set up a street intersections layer for Downtown Brooklyn.
Note that:
- Loader example can be seen in examples/Basics/loader.ipynb
- Urban Layer example can be seen in examples/Basics/urban_layer.ipynb
- Imputer example can be seen in examples/Basics/imputer.ipynb
# Load data
# Note: For the documentation interactive mode, we only query 5000 records from the dataset. Feel free to remove this limit for a more realistic analysis.
data = (
mapper
.loader # From the loader module
.from_huggingface("oscur/pluto", number_of_rows=5000, streaming=True) # From the oscur/pluto dataset on the HuggingFace OSCUR hub
.with_columns("longitude", "latitude") # With the `longitude` and `latitude` columns (or only a `geometry` column)
.load()
)
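If you want to sanity-check what came back, a quick peek works well (a minimal check, assuming load() hands back a pandas/GeoPandas-style DataFrame, as in the other Basics notebooks):
# Quick look at the loaded PLUTO sample
print(data.shape) # Number of records and columns loaded
print(data[["longitude", "latitude", "numfloors"]].head()) # Columns we rely on later in this notebook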
# Create urban layer
layer = (
mapper
.urban_layer # From the urban_layer module
.with_type("streets_intersections") # With the type streets_intersections
.from_place("Downtown Brooklyn, New York City, USA") # From place
.build()
)
# Impute your data if they contain missing values
data = (
mapper
.imputer # From the imputer module
.with_type("SimpleGeoImputer") # With the type SimpleGeoImputer
.on_columns(longitude_column="longitude", latitude_column="latitude") # On the columns longitude and latitude
.transform(data, layer) # All imputers require access to the urban layer in case they need to extract information from it.
)
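To double-check that the imputer did its job, you can count how many coordinates are still missing (a tiny sanity check, assuming data behaves like a pandas DataFrame):
# Remaining missing coordinates after imputation -- both counts should now be 0
print(data[["longitude", "latitude"]].isna().sum())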
Enriching the Layer with Debug Enabled¶
Now that we've gathered the ingredients, let's enrich our urban layer, e.g. with the average number of floors per intersection. We’ll map the data onto the layer, set up the enricher with the debug feature enabled, and apply it.
For further reading, feel free to explore our Figma system workflow at: https://www.figma.com/board/0uaU4vJiwyZJSntljJDKWf/Developer-Experience-Flow-Diagram---Snippet-Code?node-id=0-1&t=mESZ52qU1D2lfzvH-1
# Map data to the nearest layer
# Here the point is to say which intersection of the city maps to which record(s) in your data
# so that we can take that into account when enriching.
_, mapped_data = layer.map_nearest_layer(
data,
longitude_column="longitude", latitude_column="latitude",
# geometry_column="<geometry_column_name>", # Replace <geometry_column_name> with the name of your geometry column if you use one instead of the latitude and longitude columns.
output_column="nearest_intersection", # Will create this column in the data, so that we can re-use that throughout the enriching process below.
)
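Curious about what the mapping step produced? mapped_data now carries the new nearest_intersection column alongside your original attributes (a small peek, assuming it is a DataFrame-like object):
# Each record now knows which street intersection it was matched to
print(mapped_data[["longitude", "latitude", "nearest_intersection"]].head())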
# Set up and apply enricher with debug enabled
enricher = (
mapper
.enricher # From the enricher module
.with_data(
group_by="nearest_intersection", values_from="numfloors"
) # Reading: With data grouped by the nearest intersection, and the values from the attribute numfloors
.aggregate_by(
method="mean", output_column="avg_floors"
) # Reading: Aggregate by using the mean and output the computation into the avg_floors new attribute of the urban layer
.with_debug() # Enable debug to add DEBUG_avg_floors column which will contain the list of indices from the input data used for each enrichment
.build()
)
enriched_layer = enricher.enrich(
mapped_data, layer
) # Data to use, Urban Layer to Enrich.
Inspecting the Enriched Layer with Debug Information¶
Let’s take a look at the enriched layer, which now includes the avg_floors
column and the DEBUG_avg_floors
column with the list of indices from the input data used for each enrichment.
# Preview the enriched layer with debug information
print(enriched_layer.layer[['avg_floors', 'DEBUG_avg_floors']].head(50))
    avg_floors                            DEBUG_avg_floors
0     0.000000                                          []
1     0.000000                                          []
2     0.000000                                          []
3     0.000000                                          []
4     0.000000                                          []
5     0.000000                                          []
6     0.000000                                          []
7     0.000000                                          []
8     0.000000                                          []
9     0.000000                                          []
10    0.000000                                          []
11    0.000000                                          []
12    0.000000                                          []
13    0.000000                                          []
14    0.000000                                          []
15    0.000000                                          []
16    0.000000                                          []
17    0.000000                                          []
18    7.500000                        [735, 754, 757, 758]
19    0.000000                                          []
20    0.000000                                          []
21    0.000000                                          []
22    0.000000                                          []
23    0.000000                                          []
24    0.000000                                          []
25    3.333333   [932, 1639, 1645, 1670, 1930, 2523, 4480]
26    0.000000                                          []
27    0.000000                                          []
28    0.000000                                          []
29    0.000000                                          []
30    0.000000                                          []
31    0.000000                                          []
32    0.000000                                          []
33    0.000000                                          []
34    0.000000                                          []
35    0.000000                                          []
36    0.000000                                          []
37    0.000000                                          []
38    0.000000                                          []
39    0.000000                                          []
40    0.000000                                          []
41    0.000000                                          []
42    3.428571  [3076, 3081, 3204, 3360, 3363, 3365, 3369]
43    0.000000                                          []
44    0.000000                                          []
45    0.000000                                          []
46    0.000000                                          []
47    0.000000                                          []
48    0.000000                                          []
49    0.000000                                          []
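The DEBUG_avg_floors indices point back into the data you fed to the enricher, so any aggregated value can be traced to its source records. For instance, for intersection 18 above (a small sketch, assuming mapped_data kept the loader's original index):
# Trace the four PLUTO records behind avg_floors == 7.5 at intersection 18
source_rows = mapped_data.loc[enriched_layer.layer.loc[18, "DEBUG_avg_floors"]]
print(source_rows[["numfloors", "nearest_intersection"]])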
Previewing Your Enricher¶
Fancy a peek at your enricher? Use preview()
to see the setup—great for when you’re digging into someone else’s work!
# Preview enricher
print(enricher.preview())
Enricher Workflow:
├── Step 1: Data Input
│   ├── Group By: nearest_intersection
│   └── Values From: numfloors
├── Step 2: Action
│   ├── Type: Aggregate
│   ├── Aggregator: SimpleAggregator
│   ├── Method: mean
│   └── Output Column: avg_floors
└── Step 3: Enricher
    ├── Type: SingleAggregatorEnricher
    └── Status: Ready
Providing Many Different Datasets to the Same Enricher¶
You can load several datasets and feed the enricher with a dictionary. All the provided datasets should have the columns referenced in with_data, aggregate_by, etc.
Use the data_id argument to specify which dataset from the dictionary should be enriched.
The output is an enriched layer with the requested columns, plus an additional data_id column that identifies the origin of each row based on the input dictionary keys.
# Load PLUTO data
data1 = (
mapper
.loader
.from_huggingface("oscur/pluto", number_of_rows=1000, streaming=True)
.with_columns("longitude", "latitude")
# .with_columns(geometry_column="<geometry_column_name>") # Replace <geometry_column_name> with the name of your geometry column if you use one instead of the latitude and longitude columns.
.load()
# From the loader module, load the dataset with the `longitude` and `latitude` columns (or only a `geometry` column)
)
# Load taxi data
data2 = (
mapper
.loader
.from_huggingface("oscur/taxisvis1M", number_of_rows=1000, streaming=True) # To update with your own path
.with_columns("pickup_longitude", "pickup_latitude") # Inform your long and lat columns
# .with_columns(geometry_column=<geometry_column_name>") # Replace <geometry_column_name> with the actual name of your geometry column instead of latitude and longitude columns.
.with_map({"pickup_longitude": "longitude", "pickup_latitude": "latitude"}) ## Routines like layer.map_nearest_layer needs datasets with the same longitude_column and latitude_column
.load()
)
data = {
"pluto_data": data1,
"taxi_data": data2,
}
# Create a new urban layer to map the data onto
layer = (
mapper
.urban_layer # From the urban_layer module
.with_type("streets_intersections") # With the type streets_intersections
.from_place("Downtown Brooklyn, New York City, USA") # From place
.build()
)
# Map datasets to the nearest layer
# Here the point is to say which intersection of the city maps to which record(s) in each of your datasets
# so that we can take that into account when enriching.
_, mapped_data = layer.map_nearest_layer(
data,
longitude_column="longitude", latitude_column="latitude",
# geometry_column="<geometry_column_name>", # Replace <geometry_column_name> with the name of your geometry column if you use one instead of the latitude and longitude columns.
output_column="nearest_intersection", # Will create this column in the data, so that we can re-use that throughout the enriching process below.
)
# Set up and apply enricher with debug enabled
enricher = (
mapper
.enricher # From the enricher module
.with_data(
group_by="nearest_intersection", values_from="numfloors", data_id="pluto_data"
) # Reading: With data grouped by the nearest intersection, and the values from the attribute numfloors
# Both datasets should have the same group_by and values_from columns
.aggregate_by(
method="mean", output_column="avg_floors"
) # Reading: Aggregate by using the mean and output the computation into the avg_floors new attribute of the urban layer
.with_debug() # Enable debug to add DEBUG_avg_floors column which will contain the list of indices from the input data used for each enrichment
.build()
)
enriched_layer = enricher.enrich(
mapped_data, layer
) # Data to use, Urban Layer to Enrich.
# Present only the layer rows that have a data_id
layer = enriched_layer.layer[~enriched_layer.layer.data_id.isna()]
layer.head()
| | osmid | y | x | highway | street_count | railway | geometry | data_id | avg_floors | DEBUG_avg_floors |
|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 42480593 | 40.688537 | -73.981526 | NaN | 1 | NaN | POINT (-73.98153 40.68854) | pluto_data | 7.5 | [735, 754, 757, 758] |
| 25 | 42491554 | 40.695966 | -73.984590 | NaN | 4 | NaN | POINT (-73.98459 40.69597) | pluto_data | 0.0 | [932] |
| 72 | 1258875373 | 40.691334 | -73.982077 | traffic_signals | 3 | NaN | POINT (-73.98208 40.69133) | pluto_data | 15.0 | [156] |
| 74 | 1258875710 | 40.692054 | -73.982438 | traffic_signals | 4 | NaN | POINT (-73.98244 40.69205) | pluto_data | 15.0 | [157] |
| 94 | 3550741800 | 40.692464 | -73.982639 | NaN | 3 | NaN | POINT (-73.98264 40.69246) | pluto_data | 9.0 | [152] |
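For a quick breakdown of how many intersections were enriched from each input dataset, a simple value count on data_id does the trick (assuming the filtered layer is a regular GeoDataFrame):
# Count enriched intersections per input dataset
print(layer["data_id"].value_counts())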
More Enricher / Aggregator Primitives?¶
Yes! We also deliver count_by instead of aggregate_by, which simply counts the number of records rather than aggregating a value; a rough sketch follows below. More is shown in further examples outside Basics.
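A rough sketch of how that reads, in the same fluent style as above (the method and argument names here are assumptions; check the examples outside Basics for the exact signature):
# Count how many records map to each intersection instead of aggregating a value
count_enricher = (
mapper
.enricher
.with_data(group_by="nearest_intersection") # No values_from needed when we only count records
.count_by(output_column="record_count") # Assumed name/signature of the counting primitive
.build()
)
# Then apply it exactly like above: count_enricher.enrich(<mapped data>, <urban layer>)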
Want more? Come shout it out on https://github.com/VIDA-NYU/UrbanMapper/issues/11
Wrapping Up¶
Smashing work! 🎉 Your layer’s now enriched with average floors and includes debug information to trace back to the original data. Try visualising it next with visualiser.