OSM data hacking for fun and profit - Part 1

Introduction

There are many visulizations online that rely on maps, and there are many mapping solutions available. For simple large area maps such as those for countries/states, d3 with topojson/geojson will generally suffice. For other applications requiring more detailed street level mapping, OSM (Open Street Map) is pretty much the only game in town.

In this series of articles, we will be looking at how to extract features from OSM and displaying them on a map with Leaflet. In particular, we will be looking at extracting railways around San Francisco and the associated stations.

Data Sources

There are several ways of getting the exact data we want (especially since the Planet data is over 617GB uncompressed). If you just want a porton of a city, then the export tab on the main OSM website will probably do the trick. In my (limited) experience, the export tab is not very reliable, and frequently you cannot get the complete data needed (the zoom level often gets too large and the site gives you an “area too large” error).

A better way of getting city level data is to use Metro Extract. It is a service run by Mapzen that gives us a weekly extract of all major metropolitan areas in various formats for quick and painless download. For a big area like California, metro level download is your best bet, since the OSM XML bzipped file for the bay area is ~165MB (or about 2.5G uncompressed), and a similar file for California is ~900MB (which gets to a point of being unwieldy).

San Francisco Metro Area

The next level up from city is state/country/region, which can all be easily downloaded from the Geofabrik-maintained distribution. They have a nice interface which tells you how large the files are before you download.

To see an updated list of OSM data sources, please refer to the Download Data section of OSM Wiki.

Prerequsites

Before moving on to data loading and processing. It is good to talk about the things you will need in order to get started. For the purposes of this article, we will be using Postgresql + Postgis. (It is also very useful to install pgrouting extension for Postgresql as well).

For data loading, you will also need to install osm2postgresql (optionally osm2pgrouting)

On a Mac, it should be as simple as running the following commands (using Homebrew):

brew install postgresql
brew install postgis
brew install osm2postgresql
brew install pgrouting
brew install osm2pgrouting

Optionally, I also recommend a good graphical SQL client. I personally use Postico, which is an execellent native Mac client that does the basics very well (executing SQL queries and seeing the results without lag). The demo version should work well until you need the full version features.

Your experience may vary. Google is probably your best bet in finding solutions to any issues that may arise.

Data Loading

Once you have gotten the data you want for the project you are doing, let’s see how we go about loading the data into Postgresql.

First, we need to create the database that we will be using for this guide, and then go into the database:

createdb osmfun
psql -d osmfun

Then we will run a few commands to install the extensions we will be using in this database:

create extension postgis;

At this point, the databases should be all set up and ready to go.

Next up, we need to get some data from the afromentioned sources. Note, if you are using a mac, .pbf (compresssed) format support is not compiled into the tool at the time of the writing. You will want to download either .osm or where available .osm.bz2.

Once you have the .osm file and ready to go, it is a simple matter of running

osm2pgsql -s -d osmfun san-francisco-bay_california.osm

After it finishes running (~500s on a late 2013 MBP), you should get the following set of tables:

psql -d osmfun -c "select table_name, table_type from information_schema.tables where table_schema = 'public'"

     table_name     | table_type
--------------------+------------
 geography_columns  | VIEW
 geometry_columns   | VIEW
 spatial_ref_sys    | BASE TABLE
 raster_columns     | VIEW
 raster_overviews   | VIEW
 planet_osm_line    | BASE TABLE
 planet_osm_polygon | BASE TABLE
 planet_osm_nodes   | BASE TABLE
 planet_osm_ways    | BASE TABLE
 planet_osm_rels    | BASE TABLE
 planet_osm_point   | BASE TABLE
 planet_osm_roads   | BASE TABLE
 (12 rows)

Data Exploration

Now that we have our data in a easily queryable format, it is time to have some fun and see what we can get out of it.

One of the things I was interested in was how to get a map of public transportation.

To get started, I generally first head to the official osm map, and click around in the relevant area. In this case, I am after a station on the BART line (since it is easier to search for stations than a stretch of track). After looking around, I settled on Powell Street station:

Powell Station Map

Let’s see how we would go about finding this particular station in the database. If we execute the following query:

select osm_id, name, operator, railway, construction
from planet_osm_point
where name = 'Powell Street'

We should get the following results:

osm_idnameoperatorrailway
2208932563Powell Street  
1723739029Powell StreetMunitram_stop
2150077204Powell StreetBART;Munistation
2208932567Powell Street  

On a side note, if you tried to select *, then you will see that the OSM table is in fact quite sparse. There are a lot of fields that are null and only a few are filled in. This is one of the reasons why a graphical SQL client is useful.

Another reason why the official osm map is helpful is because it gives you all the attributes that are set for this particular object. It gives you a sense of what can be queried if you were trying to figure out how to query for something.

Referring to the table above, we see that for the actual BART station, there is one useful attribute railway. As a result, if we query for where railway = 'station', we can get a list of all the station that is in the planet_osm_point table.

OSM table structure

At this point, it is useful to know a bit about how the data in OSM is organized. The core tables that we will need to know are the following:

planet_osm_line

Mostly line features, we will be extracting rail lines from this table

planet_osm_point

As the name suggests, this one contains features that can be described by a point. However, because some stations are bigger and/or more detailed than other stations (aka not fully described by a point), just selecting from this table is not enough as bigger stations will be in planet_osm_polygon below.

planet_osm_polygon

This table contains shape features, such as buildings, larger stations, etc.

planet_osm_roads

This table is not used for our purposes. As far as I can tell, it contains all the roads in the region selected.

Other tables

If you will notice, we did not discuss other tables such as planet_osm_nodes, planet_osm_ways or planet_osm_rels. According to the wiki, these tables are intermediate tables used in the construction of the tables listed above. From my experimentation, it seems that they do contain the same data, albeit in a more concise format (less sparse). It’s fun to take a look if you have time.

To Be Continued…

That’s it for today! Join me next week for part 2, where we will talk about how to extract the data into GeoJSON format and put it on a map.

Back