Imputer¶
In this notebook, we’re tackling the Imputer module, your best mate for sorting out missing geospatial data. Let’s see it in action with some sample data!
Data source used:
- PLUTO data from NYC Open Data. https://www.nyc.gov/content/planning/pages/resources/datasets/mappluto-pluto-change
import urban_mapper as um
# Fire up UrbanMapper
mapper = um.UrbanMapper()
Loading Sample Data¶
First, let’s grab some sample PLUTO data from the OSCUR HuggingFace datasets hub. It might have a few gaps in the coordinates, but we’ll sort that out in a jiffy!
Note that:
- A loader example, including how to load your own data, can be found in examples/Basics/loader.ipynb.
# Load data
# Note: For the interactive documentation, we only query 20000 records from the dataset. Feel free to remove the limit for a more realistic analysis.
data = (
    mapper
    .loader
    .from_huggingface("oscur/pluto", number_of_rows=20000, streaming=True)  # From the OSCUR HuggingFace datasets hub
    .with_columns("longitude", "latitude")  # With the `longitude` and `latitude` columns
    # .with_columns(geometry_column="<geometry_column_name>")  # Use this instead if you have a geometry column: replace <geometry_column_name> with its actual name.
    .load()
)
Applying the Imputer¶
Now, let’s bring in the SimpleGeoImputer to deal with any missing longitude or latitude values. We’ll tell it which columns to focus on.
SimpleGeoImputer takes the naive route: it drops any row where either the longitude or the latitude is missing, as its preview output confirms.
Other imputers are available; see further in the documentation.
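To make that behaviour concrete, here is a minimal pandas-only sketch of what "drop rows with a missing coordinate" looks like. It is not UrbanMapper's implementation, just an approximation based on the action reported in the preview output ("Drop rows with missing 'latitude' or 'longitude'"). The toy `lot` values and coordinates are made up for illustration.

```python
import pandas as pd

# Toy data (illustrative values): two of the four rows lack a coordinate.
df = pd.DataFrame(
    {
        "lot": [1, 13, 6, 58],
        "longitude": [-74.030598, None, -74.030490, -74.029704],
        "latitude": [40.638298, 40.638575, None, 40.638142],
    }
)

# Drop every row where either coordinate is missing.
cleaned = df.dropna(subset=["longitude", "latitude"])

# Same style of check as in the notebook cells: count the remaining gaps.
missing_after = int(cleaned[["longitude", "latitude"]].isnull().sum().sum())
print(f"{len(df)} rows before, {len(cleaned)} rows after, {missing_after} missing values left")
```

Note that this approach shrinks the dataset rather than filling in values, which is why the row count can drop after the imputer runs.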
# Create an urban layer (needed for the imputer)
# See the urban_layer example at examples/Basics/urban_layer.ipynb
layer = (
    mapper
    .urban_layer  # From the urban layer module
    .with_type("streets_intersections")  # With the type streets_intersections
    .from_place("Downtown Brooklyn, New York City, USA")  # From place
    .build()
)
print(f"[Before Impute] Number of missing values in the longitude column: {data['longitude'].isnull().sum()}")
print(f"[Before Impute] Number of missing values in the latitude column: {data['latitude'].isnull().sum()}")
# Apply the imputer
imputed_data = (
    mapper
    .imputer  # From the imputer module
    .with_type("SimpleGeoImputer")  # With the type SimpleGeoImputer
    .on_columns(longitude_column="longitude", latitude_column="latitude")  # On the columns longitude and latitude
    # .on_columns(geometry_column="<geometry_column_name>")  # Use this instead if you have a geometry column: replace <geometry_column_name> with its actual name.
    .transform(data, layer)  # All imputers require access to the urban layer in case they need to extract information from it.
)
print(f"[After Impute] Number of missing values in the longitude column: {imputed_data['longitude'].isnull().sum()}")
print(f"[After Impute] Number of missing values in the latitude column: {imputed_data['latitude'].isnull().sum()}")
imputed_data
[Before Impute] Number of missing values in the longitude column: 1
[Before Impute] Number of missing values in the latitude column: 1
[After Impute] Number of missing values in the longitude column: 0
[After Impute] Number of missing values in the latitude column: 0
| | borough | block | lot | cd | bct2020 | bctcb2020 | ct2010 | cb2010 | schooldist | council | ... | appdate | plutomapid | firm07_flag | pfirm15_flag | version | dcpedited | latitude | longitude | notes | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BK | 5852 | 1 | 310.0 | 3003000.0 | 3.003000e+10 | 30.0 | 2000.0 | 20.0 | 47.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.638298 | -74.030598 | None | POINT (-74.0306 40.6383) |
1 | BK | 5852 | 13 | 310.0 | 3003000.0 | 3.003000e+10 | 30.0 | 2000.0 | 20.0 | 47.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.638575 | -74.030126 | None | POINT (-74.03013 40.63858) |
2 | BK | 5852 | 6 | 310.0 | 3003000.0 | 3.003000e+10 | 30.0 | 2000.0 | 20.0 | 47.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.638567 | -74.030490 | None | POINT (-74.03049 40.63857) |
3 | BK | 5852 | 58 | 310.0 | 3003000.0 | 3.003000e+10 | 30.0 | 2000.0 | 20.0 | 47.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.638142 | -74.029704 | None | POINT (-74.0297 40.63814) |
4 | BK | 5848 | 77 | 310.0 | 3003000.0 | 3.003000e+10 | 30.0 | 1007.0 | 20.0 | 47.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.639039 | -74.030115 | None | POINT (-74.03012 40.63904) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19995 | BK | 6165 | 18 | 310.0 | 3020800.0 | 3.020800e+10 | 208.0 | 1001.0 | 20.0 | 38.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.625171 | -74.008296 | None | POINT (-74.0083 40.62517) |
19996 | BK | 6154 | 11 | 310.0 | 3020800.0 | 3.020800e+10 | 208.0 | 1000.0 | 20.0 | 38.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.625917 | -74.008026 | None | POINT (-74.00803 40.62592) |
19997 | BK | 6176 | 69 | 310.0 | 3020800.0 | 3.020800e+10 | 208.0 | 2002.0 | 20.0 | 38.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.624498 | -74.009287 | None | POINT (-74.00929 40.6245) |
19998 | BK | 5898 | 46 | 310.0 | 3020800.0 | 3.020800e+10 | 208.0 | 2000.0 | 20.0 | 38.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.625566 | -74.009536 | None | POINT (-74.00954 40.62557) |
19999 | BK | 5898 | 44 | 310.0 | 3020800.0 | 3.020800e+10 | 208.0 | 2000.0 | 20.0 | 38.0 | ... | None | 1 | NaN | NaN | 25v1 | None | 40.625497 | -74.009424 | None | POINT (-74.00942 40.6255) |
19999 rows × 93 columns
Previewing Your Imputer Instance¶
You can also preview your imputer instance to see which columns you've specified and which imputer type you've used. This is pretty useful when you load an urban analysis shared by someone else.
mapper.imputer.preview()  # preview() prints the summary itself and returns None, so wrapping it in print() only adds a stray "None"
Imputer: SimpleGeoImputer Action: Drop rows with missing 'latitude' or 'longitude'
Provide many different datasets to the same imputer¶
You can load several datasets and feed them to the imputer as a dictionary. In that case, the output will also be a dictionary. See the simple example below.
To apply the imputer to one specific dataset in the dictionary, add .with_data(data_id=...) to the imputer.
# Load the PLUTO data
data1 = (
    mapper
    .loader
    .from_huggingface("oscur/pluto", number_of_rows=1000, streaming=True)  # From the OSCUR HuggingFace datasets hub
    .with_columns("longitude", "latitude")  # With the `longitude` and `latitude` columns
    # .with_columns(geometry_column="<geometry_column_name>")  # Use this instead if you have a geometry column: replace <geometry_column_name> with its actual name.
    .load()
)
# Load the taxi data
data2 = (
    mapper
    .loader
    .from_huggingface("oscur/taxisvis1M", number_of_rows=1000, streaming=True)  # Replace with your own dataset if needed
    .with_columns("pickup_longitude", "pickup_latitude")  # Specify your longitude and latitude columns
    # .with_columns(geometry_column="<geometry_column_name>")  # Use this instead if you have a geometry column: replace <geometry_column_name> with its actual name.
    .load()
)
data = {
    "pluto_data": data1,
    "taxi_data": data2,
}
# Apply the imputer.
# If the same imputer is applied to all datasets and the longitude/latitude columns are named differently in each dataset,
# you can use loader.with_map to map the columns and standardize their names.
imputed_data = (
    mapper
    .imputer  # From the imputer module
    .with_type("SimpleGeoImputer")  # With the type SimpleGeoImputer
    .on_columns(longitude_column="longitude", latitude_column="latitude")  # On the columns longitude and latitude
    # .on_columns(geometry_column="<geometry_column_name>")  # Use this instead if you have a geometry column: replace <geometry_column_name> with its actual name.
    .with_data(data_id="pluto_data")  # On one specific dataset from the dictionary
    .transform(data, layer)  # All imputers require access to the urban layer in case they need to extract information from it.
)
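The dictionary-in, dictionary-out pattern can be sketched in plain Python. This is only an illustration of the idea, not UrbanMapper's code: `drop_incomplete` and `apply_to_datasets` are hypothetical helpers, the rows are made up, and the sketch assumes that datasets not targeted by `data_id` are passed through untouched.

```python
from typing import Callable, Optional

Rows = list[dict]


def drop_incomplete(rows: Rows) -> Rows:
    """Keep only the rows that have both a longitude and a latitude."""
    return [
        row for row in rows
        if row.get("longitude") is not None and row.get("latitude") is not None
    ]


def apply_to_datasets(
    datasets: dict[str, Rows],
    step: Callable[[Rows], Rows],
    data_id: Optional[str] = None,
) -> dict[str, Rows]:
    """Apply `step` to every dataset, or only to `data_id` when it is given."""
    return {
        name: step(rows) if (data_id is None or name == data_id) else rows
        for name, rows in datasets.items()
    }


# Toy stand-ins for the two loaded datasets (illustrative values only).
datasets = {
    "pluto_data": [
        {"lot": 1, "longitude": -74.03, "latitude": 40.64},
        {"lot": 13, "longitude": None, "latitude": 40.64},
    ],
    "taxi_data": [
        {"pickup_longitude": -73.99, "pickup_latitude": None},
    ],
}

result = apply_to_datasets(datasets, drop_incomplete, data_id="pluto_data")
print({name: len(rows) for name, rows in result.items()})
```

Targeting a single key matters here because the taxi rows use different column names, so a step written for `longitude` and `latitude` would not fit them as-is.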
More Geo Imputer Primitives?¶
Yes! We also provide AddressGeoImputer, which fills in missing coordinates by geocoding a given address attribute in your dataset.
Want more? Come and shout about it at https://github.com/VIDA-NYU/UrbanMapper/issues/4
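For intuition, address-based imputation boils down to the following: wherever a coordinate is missing, look the address up and fill in the result. The sketch below is a rough illustration under stated assumptions, not AddressGeoImputer's actual code. `FAKE_GEOCODER`, `geocode`, and `fill_missing_coordinates` are hypothetical names, and the lookup table stands in for a real geocoding service.

```python
from typing import Optional

# Hypothetical lookup table standing in for a real geocoding service.
FAKE_GEOCODER = {
    "1 Example St": (40.70, -74.00),
}


def geocode(address: str) -> Optional[tuple[float, float]]:
    """Return (latitude, longitude) for a known address, else None."""
    return FAKE_GEOCODER.get(address)


def fill_missing_coordinates(rows: list[dict]) -> list[dict]:
    """Return copies of the rows, geocoding those that lack a coordinate."""
    filled = []
    for row in rows:
        row = dict(row)  # Copy so the input rows are left untouched.
        if row.get("latitude") is None or row.get("longitude") is None:
            hit = geocode(row.get("address", ""))
            if hit is not None:
                row["latitude"], row["longitude"] = hit
        filled.append(row)
    return filled


rows = [
    {"address": "1 Example St", "latitude": None, "longitude": None},
    {"address": "2 Example St", "latitude": 40.71, "longitude": -74.01},
    {"address": "Unknown Rd", "latitude": None, "longitude": None},
]

filled = fill_missing_coordinates(rows)
for row in filled:
    print(row)
```

Rows whose address cannot be resolved keep their gaps in this sketch; how a real imputer treats them is a design decision worth checking in the documentation.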
Wrapping Up¶
Brilliant! 🎉 You’ve patched up those missing coordinates like a champ. Your data’s looking spick and span!