![Image 1 — Multi-city OSM extraction pipeline: 332k features across 4 cities, manifest-driven architecture, Overpass-only — Cities Skylines 2 mapping use case but the GIS bits are generic [OC]](https://preview.redd.it/wbq7414nlz1h1.png?width=1887&format=png&auto=webp&s=f4083b86ba93933fd31f789fa1c0fd947f78509b)
![Image 2 — Multi-city OSM extraction pipeline: 332k features across 4 cities, manifest-driven architecture, Overpass-only — Cities Skylines 2 mapping use case but the GIS bits are generic [OC]](https://preview.redd.it/2s5rfkqrlz1h1.png?width=1901&format=png&auto=webp&s=245752996f6bf821d6b77b2f8ee4bc3131a8d64d)
Multi-city OSM extraction pipeline: 332k features across 4 cities, manifest-driven architecture, Overpass-only — Cities Skylines 2 mapping use case but the GIS bits are generic [OC]
Sharing v3.3 of an open-source project that pulls landuse/building/service polygons from OpenStreetMap via Overpass and classifies them into 11 zoning categories. The use case is reference data for a city-building game (Cities: Skylines 2), but the pipeline is generic for urban analysis at scale.
Just refactored from single-city (Minneapolis) to multi-city. 4 cities now in the same pipeline. Sharing some of the architecture decisions in case they're useful for others doing similar OSM-at-scale work.
**Architecture changes from single-city → multi-city:**
**City registry as single source of truth.** A root `cities.json` declares each city's slug, name, bbox, country, and which modules it supports. The pipeline reads this; no more bbox constants scattered across extract scripts. Adding a city = adding 1 entry + running 3 extract scripts.
**Per-city manifest with content hashes.** Each city has `visualizer/cities/<slug>/manifest.json` listing every prebuilt module + sha256 of the JS file. The web viewer fetches the manifest, only injects `<script>` tags for modules that exist, and uses the sha256 as a cache-busting query string. Means a zoning-only city like Amsterdam doesn't try to load a roads layer.
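A rough sketch of how a manifest like this could be generated, hashing each file in 8 KB chunks so memory use does not depend on file size. The manifest keys (`slug`, `modules`, `file`, `sha256`) and the function names are assumptions, not the repo's actual format:

```python
import hashlib
import json
from pathlib import Path

CHUNK_SIZE = 8 * 1024  # 8 KB per read


def sha256_of_file(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file incrementally instead of loading it all into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(city_dir: Path) -> dict:
    """List every prebuilt .js module in a city directory with its hash."""
    modules = {}
    for js_file in sorted(city_dir.glob("*.js")):
        modules[js_file.stem] = {
            "file": js_file.name,
            "sha256": sha256_of_file(js_file),
        }
    # Hypothetical layout: the directory name doubles as the city slug.
    return {"slug": city_dir.name, "modules": modules}


def write_manifest(city_dir: Path) -> Path:
    """Write manifest.json next to the prebuilt modules and return its path."""
    out = city_dir / "manifest.json"
    out.write_text(json.dumps(build_manifest(city_dir), indent=2), encoding="utf-8")
    return out
```

A viewer could then request something like `zoning.js?v=<sha256>` (the `v` parameter name is made up here), so the URL changes only when the file content changes.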
**Killed the live-Overpass fallback.** v3.2 had ~300 lines of code that would re-query Overpass at view time if the prebuilt JSON was missing. Sounded great, broke half the time — Overpass community endpoints return 200 with empty `{"elements": []}` under load instead of 504, so the fallback's retry logic was useless. v3.3 is prebuilt-only. More reliable, ~30% less code.
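To illustrate the failure mode: a check keyed only on the HTTP status lets an empty 200 through. One way an extract script could guard against it, assuming the bbox is known to contain data, is to treat zero elements as a failure too. The names `parse_overpass` and `EmptyOverpassResult` are hypothetical, and this is a sketch, not the project's code:

```python
import json


class EmptyOverpassResult(RuntimeError):
    """Raised when a response parses fine but carries no elements."""


def parse_overpass(status: int, body: str, expect_nonempty: bool = True) -> list:
    """Return the 'elements' list, rejecting the 'successful but empty' case."""
    if status != 200:
        raise RuntimeError(f"Overpass returned HTTP {status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise RuntimeError("Overpass returned a non-JSON body") from exc
    elements = payload.get("elements") if isinstance(payload, dict) else None
    if not isinstance(elements, list):
        raise RuntimeError("Overpass response has no 'elements' list")
    # An overloaded endpoint can answer 200 with {"elements": []}.
    if expect_nonempty and not elements:
        raise EmptyOverpassResult("HTTP 200 but zero elements")
    return elements
```

This only makes sense at build time, where an empty result can be retried later or flagged for a human. For queries that can legitimately match nothing, pass `expect_nonempty=False`.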
**Overpass-specific gotchas worth flagging:**
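- **Spatial joins with named sets are non-obvious.** For mixed-use detection (commercial POIs inside apartment building polygons), the naive `(around:5)` filter returns almost nothing. The fix is explicit named sets:
```overpass
// Step 1: collect commercial POIs into a named set (.comm).
// (bbox) is a placeholder for the city's bounding box.
(
  node["shop"](bbox);
  node["amenity"~"^(restaurant|cafe|bar|pub|fast_food)$"](bbox);
)->.comm;
// Step 2: select residential buildings within 5 m of any POI in .comm.
(
  way["building"="apartments"](around.comm:5);
  way["building"="residential"](around.comm:5);
);
out body geom;
```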
This returns 123 polygons in Minneapolis. Naive query returns 3.
- **Multi-endpoint rotation matters at scale.** I rotate across `overpass-api.de`, `kumi.systems`, `openstreetmap.ru`, `maps.mail.ru` with 3s/6s/12s backoff. For 332k features across 4 bboxes, hitting just one endpoint = guaranteed throttle.
- **Cross-country tagging variance.** Amsterdam's Dutch OSM contributors tag at parcel resolution (89k features for a small bbox — densest per km² of the four cities). US cities are much sparser — Charleston was 14.5k for a similar-sized historic district. The classifier intentionally degrades gracefully — anything unclassifiable gets dropped, not mis-classified — but raw count varies by contributor density, not just urban form.
- **Streaming sha256 for prebuilt manifest.** With 100 MB of prebuilts across 4 cities, computing hashes by reading each file fully into memory was wasteful. Switched to chunked streaming (8 KB chunks), so memory use stays flat regardless of file size.
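The endpoint rotation described above could be sketched roughly like this. The policy is an assumption: one attempt per endpoint in order, with 3s/6s/12s sleeps between attempts, and an empty result counted as a failure. `fetch` is a caller-supplied function (hypothetical signature: endpoint and query in, list of elements out), which keeps the sketch free of any particular HTTP library or URL layout:

```python
import time
from typing import Callable, Sequence


def query_with_rotation(
    query: str,
    endpoints: Sequence[str],
    fetch: Callable[[str, str], list],
    delays: Sequence[float] = (3.0, 6.0, 12.0),
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Try endpoints round-robin, backing off between failed attempts."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error: Exception | None = None
    # len(delays) + 1 attempts: with 4 endpoints and 3 delays, each is tried once.
    for attempt in range(len(delays) + 1):
        endpoint = endpoints[attempt % len(endpoints)]
        try:
            elements = fetch(endpoint, query)
            if elements:
                return elements
            # Empty results are treated as a throttled or overloaded endpoint.
            last_error = RuntimeError(f"{endpoint}: empty result")
        except Exception as exc:  # broad on purpose: any failure means "rotate"
            last_error = exc
        if attempt < len(delays):
            sleep(delays[attempt])
    raise RuntimeError("all Overpass endpoints failed") from last_error
```

Injecting `sleep` and `fetch` also makes the retry policy testable without network access or real waiting.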
**Stats per city (zoning only, except Minneapolis, whose count also includes roads + services):**
- Minneapolis: 192k features (zoning + 108k roads + 2.3k services)
- Amsterdam: 89k features (densest per km², fine-grained OSM tagging at parcel resolution)
- Madison: 37k features
- Charleston: 14.5k features (added via community-request workflow, 30 min from issue to deployed — validates the per-request bbox→prebuilt pipeline at human scale)
**Stack:**
- Python 3.11+, Overpass API, no PostGIS, no QGIS, no geopandas
- Leaflet.js + Canvas renderer for the visualizer
- 171 pytest tests passing
- MIT license
**Repo:** https://github.com/Osyanne/CitiesSkylines2-osm-toolkit
**Hosted viewer:** https://osyanne.github.io/CitiesSkylines2-osm-toolkit/
**Methodology doc** (in repo): walks through every classification decision + the spatial-join gotcha + the multi-endpoint retry logic
---
## Contributing / requesting cities
Adding a new city is a single GitHub Issue: bbox + slug + name. Pipeline runs on my end in ~30 min, no Python or Overpass knowledge required on the requester side. Useful for folks who want to see the classifier output for a specific area without setting up the toolchain locally.
**Template:** [city-request issue](https://github.com/Osyanne/CitiesSkylines2-osm-toolkit/issues/new?template=city-request.yml)
---
Feedback welcome. The contributor-density variance between Amsterdam (89k features tagged at parcel resolution) and US cities (sparser, gappier) was the most interesting bit — if you've worked on normalizing OSM extraction across regions with different contributor cultures, I'd love to compare notes.