OpenStreetMap

RFC: wikidata->osm lookup table

Posted by mcld on 22 March 2017 in English.

OpenStreetMap has a wikidata tag which lets us connect OSM objects to their corresponding Wikidata items.

(Technical note: it’s a “same as” relationship - i.e. the tag asserts that the two items in different systems refer to the same entity. However, sometimes things in OSM are split into multiple objects; and sometimes one object in OSM actually refers to multiple items in Wikidata. So it’s actually a “many-to-many” matching, not “one-to-one”: a single OSM object sometimes has multiple semicolon-separated Wikidata identifiers, and multiple OSM objects sometimes have the same Wikidata identifier.)
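A minimal sketch of what the many-to-many matching means in practice: splitting semicolon-separated tag values and inverting the mapping so it is keyed by Wikidata identifier. The function names and the sample identifiers here are made up for illustration and are not taken from any existing tool.

```python
from collections import defaultdict


def split_wikidata_tag(value):
    """Split a 'wikidata' tag value into individual Q-identifiers."""
    return [part.strip() for part in value.split(";") if part.strip()]


def invert(osm_to_wikidata):
    """Turn {osm_ref: tag_value} into {qid: [osm_ref, ...]}."""
    table = defaultdict(list)
    for osm_ref, tag_value in osm_to_wikidata.items():
        for qid in split_wikidata_tag(tag_value):
            table[qid].append(osm_ref)
    return dict(table)


# Hypothetical sample data (invented IDs, for illustration only).
sample = {
    "node/1": "Q100",
    "way/2": "Q100",             # two OSM objects share one Wikidata item
    "relation/3": "Q200;Q300",   # one OSM object refers to two Wikidata items
}

lookup = invert(sample)
```

Each Q-identifier maps to a list rather than a single value, which is what lets both directions of the many-to-many relationship survive the inversion.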

There are over 600,000 OSM objects with the “wikidata” tag. OK great, job done? I mean, nothing’s ever “complete” in these big open-ended crowdsourced projects, but if we have more than half a million crosslinks between the systems, that’s really good going.

BUT THERE’S A PROBLEM!

Using the tag to jump from OSM to Wikidata works fine. But from Wikidata to OSM? Well, there’s no persistent way to link from wkd->osm, simply because OSM’s identifiers are impermanent - they’re not guaranteed to continue existing, or to continue referring to the same thing. So it’s not particularly sensible to store OSM identifiers in Wikidata. Instead, an Overpass lookup is required.

For example, on the OSM Wikidata page I found this friendly Wikidata interface called “Reasonator” - all very nice, but instead of cross-linking immediately to the OSM object, it offers a little “Overpass” link which you can click to do a dynamic lookup.

The effect is that it makes Wikidata->OSM connections indirect, obscured, only-for-those-who-know-they-want-it. If a Wikidata coder says “OK great how do I jump to the item in OSM?” you first have to teach them what Overpass is and how it relates to OSM, then how to use its query language, how many queries a day you’re allowed to do on Overpass… bleh.
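To give a feel for what that dynamic lookup involves, here is a rough sketch that only builds the request URL for an Overpass query (it does not send it). The endpoint shown is one public Overpass instance and is an assumption here, as is the helper name `overpass_url`; usage limits vary by instance.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint: one public Overpass instance; others exist.
OVERPASS_ENDPOINT = "https://overpass-api.de/api/interpreter"


def overpass_url(qid):
    """Build a URL asking Overpass for the IDs of objects tagged with this Q-id."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    # Exact-match filter: this will NOT find objects whose tag holds several
    # semicolon-separated identifiers.
    query = (
        "[out:json];"
        "("
        f'node["wikidata"="{qid}"];'
        f'way["wikidata"="{qid}"];'
        f'relation["wikidata"="{qid}"];'
        ");"
        "out ids;"
    )
    return OVERPASS_ENDPOINT + "?" + urlencode({"data": query})


url = overpass_url("Q1002133")
```

Even this small case needs three filters, a union, and an output statement, which is roughly the learning curve described above.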

PROPOSED SOLUTION

Pretty simple proposal, then: a script that produces a Wikidata->OSM lookup table. This could be run as a weekly cron job perhaps (or something monitoring minutely diffs for any changed wikidata tag? dunno) and it could produce a lookup table that is easy for non-OSM users to consume. For example, it could produce a big CSV file like this:

 Q1002133,node/29541385
 Q1002826,node/20919015
 Q1002845,node/241795518
 Q1004173,way/38387732
 Q1004824,node/29164070
 Q1026205,node/410291638,relation/1061137
 Q1005234,relation/2797450
 ...

and a JSON file like this:

 {
 "Q1002133": [["node",29541385]],
 "Q1002826": [["node",20919015]],
 "Q1002845": [["node",241795518]],
 "Q1004173": [["way",38387732]],
 "Q1004824": [["node",29164070]],
 "Q1026205": [["node",410291638], ["relation",1061137]],
 "Q1005234": [["relation",2797450]],
 ...
 }

These files could then be published at a stable location, for other programmers to make use of dynamically. The intention is to make it easy for someone with no OSM knowledge and no GIS knowledge to hook OSM into their open data ecosystem.
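As a sketch of how the two formats might be written and then consumed, assuming an in-memory table shaped like the JSON sample above: the helper names (`to_csv`, `to_json`, `osm_urls`) are hypothetical, and the two rows are copied from the sample output.

```python
import csv
import io
import json

# Two rows copied from the sample output above.
table = {
    "Q1002133": [("node", 29541385)],
    "Q1026205": [("node", 410291638), ("relation", 1061137)],
}


def to_csv(table):
    """One row per Q-id, followed by one 'type/id' column per OSM object."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for qid, refs in table.items():
        writer.writerow([qid] + [f"{kind}/{osm_id}" for kind, osm_id in refs])
    return buf.getvalue()


def to_json(table):
    """Q-id mapped to a list of [type, id] pairs."""
    return json.dumps({qid: [list(ref) for ref in refs] for qid, refs in table.items()})


def osm_urls(json_text, qid):
    """Consumer side: look up a Q-id and build browsable OSM URLs."""
    data = json.loads(json_text)
    return [
        f"https://www.openstreetmap.org/{kind}/{osm_id}"
        for kind, osm_id in data.get(qid, [])
    ]


csv_text = to_csv(table)
urls = osm_urls(to_json(table), "Q1026205")
```

Note that the CSV rows have a variable number of columns, so a consumer should treat everything after the first column as a list of OSM references.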

I wrote a Python script that makes these lookup tables. On my home desktop, it takes about 2 minutes to scan the UK extract; for the whole planet file, it takes a lot longer… 90 minutes! Oof. (The CSV and JSON files produced are 14 MB and 19 MB in size, respectively.)
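The script itself is not shown here, so the following is not it. It is only a sketch of one way such a scan could work, assuming the input is uncompressed OSM XML and using nothing beyond the Python standard library. (Planet files are commonly distributed as PBF, which would need a third-party reader.) The function name `build_lookup` and the tiny inline sample are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

OSM_TYPES = {"node", "way", "relation"}


def build_lookup(source):
    """Stream through OSM XML and collect {qid: [(type, id), ...]}."""
    table = defaultdict(list)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in OSM_TYPES:
            continue  # <tag>, <nd>, <member> are handled via their parent
        for tag in elem.findall("tag"):
            if tag.get("k") != "wikidata":
                continue
            for qid in tag.get("v", "").split(";"):
                qid = qid.strip()
                if qid:
                    table[qid].append((elem.tag, int(elem.get("id"))))
        # Drop this element's children and attributes. The root still keeps
        # the emptied shells, so a planet-scale run would need more care.
        elem.clear()
    return dict(table)


# Invented sample input, for illustration only.
SAMPLE = b"""<osm version="0.6">
  <node id="1" lat="0" lon="0"><tag k="wikidata" v="Q100"/></node>
  <way id="2"><nd ref="1"/><tag k="wikidata" v="Q100;Q200"/></way>
  <relation id="3"><member type="way" ref="2" role=""/><tag k="name" v="x"/></relation>
</osm>"""

lookup = build_lookup(io.BytesIO(SAMPLE))
```

Streaming the file element by element, rather than loading it whole, is what makes a scan of a large extract feasible at all on a desktop machine.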

Your thoughts?

Discussion

Comment from d1g on 23 March 2017 at 09:15

All we need is a property to contain “node/410291638” OR “relation/1061137”.

The discussion about the P402 RFD is rather twisted: some vote to remove the inaccurate property, some vote to keep the data, and some vote to keep “at least single way to do it”.

Comment from PlaneMad on 23 March 2017 at 09:16

This could be super handy to validate names and coordinates in Wikidata against OSM translations!

Comment from mcld on 23 March 2017 at 09:34

@d1g

So there’s Wikidata Property P402 which means “OSM relation” - and already this is a bit weird cos it’s only relations not nodes or ways. And there’s a Request for Deletion (RFD) being discussed to get rid of it. The debate over there is indeed a bit convoluted.

But if I understand you right @d1g, you’re saying that the Wikidata community might after all be interested in storing wkd->osm references, even though they’re not guaranteed stable?

Comment from d1g on 23 March 2017 at 10:49

@mcld your solution adds a third component into the equation, when the mappings could be integrated into Wikidata itself. If any solution works, why not host it at WD?

Wikidata has many references, while OSM is huge if you count nodes. Only a limited subset of OSM is meaningful.

Comment from sabas88 on 24 March 2017 at 11:22

The list could be generated from the WMF side; I suggested this, for example, in

https://phabricator.wikimedia.org/T159205

Comment from Nizil on 27 March 2017 at 20:17

Is it possible to write a script or bot which adds the coordinates of an OSM node to Wikidata? This way, even if the ID of the node changes, Wikidata can continue to point at the same place.
