OSM data hacking for fun and profit - Part 1
Introduction
There are many visulizations online that rely on maps, and there are many mapping solutions available. For simple large area maps such as those for countries/states, d3 with topojson/geojson will generally suffice. For other applications requiring more detailed street level mapping, OSM (Open Street Map) is pretty much the only game in town.
In this series of articles, we will be looking at how to extract features from OSM and displaying them on a map with Leaflet. In particular, we will be looking at extracting railways around San Francisco and the associated stations.
Data Sources
There are several ways of getting the exact data we want (especially since the Planet data is over 617GB uncompressed). If you just want a porton of a city, then the export tab on the main OSM website will probably do the trick. In my (limited) experience, the export tab is not very reliable, and frequently you cannot get the complete data needed (the zoom level often gets too large and the site gives you an “area too large” error).
A better way of getting city level data is to use Metro Extract. It is a service run by Mapzen that gives us a weekly extract of all major metropolitan areas in various formats for quick and painless download. For a big area like California, metro level download is your best bet, since the OSM XML bzipped file for the bay area is ~165MB (or about 2.5G uncompressed), and a similar file for California is ~900MB (which gets to a point of being unwieldy).
The next level up from city is state/country/region, which can all be easily downloaded from the Geofabrik-maintained distribution. They have a nice interface which tells you how large the files are before you download.
To see an updated list of OSM data sources, please refer to the Download Data section of OSM Wiki.
Prerequsites
Before moving on to data loading and processing. It is good to talk about the things you will need in order to get started. For the purposes of this article, we will be using Postgresql + Postgis. (It is also very useful to install pgrouting extension for Postgresql as well).
For data loading, you will also need to install osm2postgresql (optionally osm2pgrouting)
On a Mac, it should be as simple as running the following commands (using Homebrew):
brew install postgresql
brew install postgis
brew install osm2postgresql
brew install pgrouting
brew install osm2pgrouting
Optionally, I also recommend a good graphical SQL client. I personally use Postico, which is an execellent native Mac client that does the basics very well (executing SQL queries and seeing the results without lag). The demo version should work well until you need the full version features.
Your experience may vary. Google is probably your best bet in finding solutions to any issues that may arise.
Data Loading
Once you have gotten the data you want for the project you are doing, let’s see how we go about loading the data into Postgresql.
First, we need to create the database that we will be using for this guide, and then go into the database:
createdb osmfun
psql -d osmfun
Then we will run a few commands to install the extensions we will be using in this database:
create extension postgis;
At this point, the databases should be all set up and ready to go.
Next up, we need to get some data from the afromentioned sources. Note, if you are using a mac, .pbf
(compresssed) format support is not compiled into the tool at the time of the writing. You will want to download either .osm
or where available .osm.bz2
.
Once you have the .osm
file and ready to go, it is a simple matter of running
osm2pgsql -s -d osmfun san-francisco-bay_california.osm
After it finishes running (~500s on a late 2013 MBP), you should get the following set of tables:
psql -d osmfun -c "select table_name, table_type from information_schema.tables where table_schema = 'public'"
table_name | table_type
--------------------+------------
geography_columns | VIEW
geometry_columns | VIEW
spatial_ref_sys | BASE TABLE
raster_columns | VIEW
raster_overviews | VIEW
planet_osm_line | BASE TABLE
planet_osm_polygon | BASE TABLE
planet_osm_nodes | BASE TABLE
planet_osm_ways | BASE TABLE
planet_osm_rels | BASE TABLE
planet_osm_point | BASE TABLE
planet_osm_roads | BASE TABLE
(12 rows)
Data Exploration
Now that we have our data in a easily queryable format, it is time to have some fun and see what we can get out of it.
One of the things I was interested in was how to get a map of public transportation.
To get started, I generally first head to the official osm map, and click around in the relevant area. In this case, I am after a station on the BART line (since it is easier to search for stations than a stretch of track). After looking around, I settled on Powell Street station:
Let’s see how we would go about finding this particular station in the database. If we execute the following query:
select osm_id, name, operator, railway, construction
from planet_osm_point
where name = 'Powell Street'
We should get the following results:
osm_id | name | operator | railway |
---|---|---|---|
2208932563 | Powell Street | ||
1723739029 | Powell Street | Muni | tram_stop |
2150077204 | Powell Street | BART;Muni | station |
2208932567 | Powell Street |
On a side note, if you tried to select *
, then you will see that the OSM table is in fact quite sparse. There are a lot of fields that are null
and only a few are filled in. This is one of the reasons why a graphical SQL client is useful.
Another reason why the official osm map is helpful is because it gives you all the attributes that are set for this particular object. It gives you a sense of what can be queried if you were trying to figure out how to query for something.
Referring to the table above, we see that for the actual BART station, there is one useful attribute railway
. As a result, if we query for where railway = 'station'
, we can get a list of all the station that is in the planet_osm_point
table.
OSM table structure
At this point, it is useful to know a bit about how the data in OSM is organized. The core tables that we will need to know are the following:
planet_osm_line
Mostly line features, we will be extracting rail lines from this table
planet_osm_point
As the name suggests, this one contains features that can be described by a point. However, because some stations are bigger and/or more detailed than other stations (aka not fully described by a point), just selecting from this table is not enough as bigger stations will be in planet_osm_polygon
below.
planet_osm_polygon
This table contains shape features, such as buildings, larger stations, etc.
planet_osm_roads
This table is not used for our purposes. As far as I can tell, it contains all the roads in the region selected.
Other tables
If you will notice, we did not discuss other tables such as planet_osm_nodes
, planet_osm_ways
or planet_osm_rels
. According to the wiki, these tables are intermediate tables used in the construction of the tables listed above. From my experimentation, it seems that they do contain the same data, albeit in a more concise format (less sparse). It’s fun to take a look if you have time.
To Be Continued…
That’s it for today! Join me next week for part 2, where we will talk about how to extract the data into GeoJSON format and put it on a map.