
Dirty datasets

Posted by Cascafico on 14 November 2018 in Italian (Italiano). Last updated on 30 September 2020.

Mini abstract

I found a Bed & Breakfast dataset of 600+ rows, published as open data by RAFVG, with no geographic coordinates. Since housenumbers were recently imported from a RAFVG dataset, I decided to geocode it. To get reliable coordinates I used the csvgeocode script with the Nominatim geocoding service. Nominatim requires almost perfect (standardized) odonyms, so I used OpenRefine together with a reconciliation service that ships as a separate jar. The reconciliation service needs a CSV of authoritative names, which I obtained from overpass-turbo plus some filtering.

Dataset

Bed and Breakfast is a rather new dataset (Oct 17) with more than 600 POIs. It has many useful fields, such as:

  • name and operator
  • phone
  • email
  • site
  • opening hours
  • category (standard, comfort, superior)

Cleaning data

This task was done with OpenRefine and the Reconcile plugin, connected as a reconciliation service.

To standardize the messy B&B addresses (entered by the B&B operators themselves), I had to provide Reconcile with an authoritative set of highway names, which I got from overpass-turbo (see the Strade d’Italia diary entry).
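To illustrate what reconciling a messy street name against an authoritative list means, here is a minimal Python sketch using the standard-library `difflib`. It is a stand-in for illustration only, not the matching logic of OpenRefine or the Reconcile service; the function name `reconcile`, the sample street names and the 0.8 cutoff are all assumptions.

```python
import difflib

def reconcile(messy, authoritative, cutoff=0.8):
    """Return the closest authoritative name, or None if nothing is similar enough."""
    # Compare case-insensitively, but return the authoritative spelling.
    lookup = {name.lower(): name for name in authoritative}
    matches = difflib.get_close_matches(
        messy.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

# Hypothetical sample data, not taken from the RAFVG dataset.
streets = ["Via Roma", "Via Giuseppe Garibaldi", "Piazza della Libertà"]

print(reconcile("Via Giusepe Garibaldi", streets))  # misspelled operator input
print(reconcile("Viale Venezia", streets))          # no sufficiently close name
```

A name that falls below the cutoff returns `None`, which mirrors the rows that still need manual review after reconciliation.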

Geocoder

As in other projects, I chose csvgeocode, which is pretty simple to use.

Here is a run using the Mapbox service:

$ csvgeocode input.csv output.csv --handler mapbox --delay 1000 --verbose --url "http://api.tiles.mapbox.com/v4/geocode/mapbox.places/{{INDIRIZZO}},{{CAP}} {{COMUNE}}.json?access_token="

And here is a run using Nominatim instead:

$ csvgeocode input.csv output.csv --handler osm --delay 1000 --verbose --url "http://nominatim.openstreetmap.org/search?q={{INDIRIZZO 1}}, {{COMUNE}}&format=json"
Rows geocoded: 468
Rows failed: 114
Time elapsed: 879.4 seconds
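To make the `{{COLUMN}}` placeholders in the URL concrete, the following Python sketch fills a template from one CSV row. It only illustrates the idea and is not csvgeocode's own code; in particular, whether csvgeocode URL-encodes the values this way is an assumption, and the helper `fill_template` and the sample row are hypothetical.

```python
import re
from urllib.parse import quote

def fill_template(template, row):
    """Replace each {{COLUMN}} placeholder with the URL-encoded value from row."""
    def substitute(match):
        column = match.group(1)
        # safe="" also encodes "/" so a value cannot alter the URL path.
        return quote(str(row[column]), safe="")
    return re.sub(r"\{\{(.+?)\}\}", substitute, template)

template = (
    "http://nominatim.openstreetmap.org/search"
    "?q={{INDIRIZZO 1}}, {{COMUNE}}&format=json"
)
row = {"INDIRIZZO 1": "Via Roma 1", "COMUNE": "Udine"}  # hypothetical row

print(fill_template(template, row))
```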

The 114 rows that were not geocoded expose a geocoder problem with apostrophes in the city field. Since the apostrophe cannot be escaped, the workaround is either to remove it (e.g. Farra d’Isonzo » Farra disonzo) or to use the postcode instead. The address field has the same problem, and there removal is the only option (e.g. San Francesco d’Assisi » San Francesco dassisi). Of course, these edits are for the sake of geocoding only.

Besides, some of the "successfully" geocoded rows may have been matched even though the housenumber is missing, which yields the coordinates of the highway centroid. To limit these false positives, I had to check which municipalities had been imported and facet out the rows belonging to the municipalities in red.
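The apostrophe workaround can be sketched in Python as follows. It writes the stripped values into extra columns so the original names stay untouched, since the edit is for geocoding only. The `_GEO` suffix, the helper names and the sample row are assumptions for illustration; unlike the examples in the text, this sketch does not change letter case.

```python
import csv
import io

# Straight apostrophe, curly right/left quotes, and backtick.
APOSTROPHES = "'\u2019\u2018`"

def strip_apostrophes(text):
    """Remove every apostrophe-like character from text."""
    return text.translate({ord(c): None for c in APOSTROPHES})

def add_geocoding_columns(rows, columns, suffix="_GEO"):
    """Yield rows with an extra apostrophe-free copy of each listed column."""
    for row in rows:
        out = dict(row)
        for col in columns:
            out[col + suffix] = strip_apostrophes(row[col])
        yield out

# Hypothetical sample row using column names from the commands in the text.
sample = "INDIRIZZO,COMUNE\nVia San Francesco d'Assisi 3,Farra d\u2019Isonzo\n"
reader = csv.DictReader(io.StringIO(sample))
for row in add_geocoding_columns(reader, ["INDIRIZZO", "COMUNE"]):
    print(row["INDIRIZZO_GEO"], "|", row["COMUNE_GEO"])
```

The geocoder URL template would then reference the `_GEO` columns, while the original columns remain available for the import itself.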

Conflating

Conflation matched just 11 POIs, so if you want to collaborate on the B&B import, here you can find the audit map to review the mostly new nodes.
