fititnt's Diary

Recent diary entries

Generalization of extraction of example codes, tabular data and Infoboxes from MediaWikis such as OSM.wiki

Posted by fititnt on 24 July 2023 in English.

This text is a continuation of my previous diary and do what the title says. The draft already existed 6 months ago, but just today I’m publishing this diary. Anyway, there’s comment I head about this on the @Wikimaps telegram

“You seem to be doing what we were doing 10+ years ago before Wikidata existed” – Maarten Dammers opinion on what this approach doing

Well, he’s right… but there’s a reason for that. This diary have 4 parts, the examples are on 3.

1. Preface

This extension could be perceived as one approach to make general proposed data extraction from OpenStreetMap Wiki, which , in a ABox vs TBox dicotomy, is the closest of a TBox for OpenStreetMap (*).

*: if we ignore id-tagging-schema and, obviously, other custom strategies to explain the meaning of OpenStreetMap data, which could include cartocss used to explain how to render as image. I do have a rudimentary draft of try to make sense of all of her encodings here, but not ready for today.

1.1 Wikibase is not consensus even between what would be ontologists on OpenStreetMap

Tip: for those wanting to view/review some past discussions, check https://wiki.openstreetmap.org/wiki/User:Minh_Nguyen/Wikidata_discussions#Wikidata_link_in_wiki_infoboxes. The same page from Minh Nguyen has other links, such as discussions to remove the entire Wikibase extension from OSM.wiki at https://github.com/openstreetmap/operations/issues/764.

On the surface, it may appear that the partial opposition for Wikibase was because of some minor user interfaces issues. But reading old discussions (for those not alware, I’m new to OpenStreetmap, joined October 2022) this would be insufficient to understand why a stricter, overcentralized approach is rejected. I suspect that part of the complains (which are reflected on very early criticisms, including from the Taginfo developer on the mail list but even himself years before was criticizes when attempted to improve standardization (at least on parsing the wiki, which is very relevant here); I’m seeing a trend here: innovators today, conservatives tomorrow) seems to be that attempting to encode in a single storage a strictly logically consistent T-Box would not be feasible: some definitions might contradict each other.

One fact is that OpenStreetMap dada is used successfully in production, and there’s significant number of tools focused on its data. Wikidata may be more known as a community contributed linked data repository, however OpenStreetMap, while RDF format is less standardized today, is known to be used in production with few to none transformation than data repacking. In other words, mass rewrites on OSM data can easily break lot of applications. Note my focus here: “production use” means the developers which also consume the data are focused on making it usable, not wanting to break things unless there’s a valid reason for it. One impact is proposals wanting to refactor already used tagging on data likely will be refused.

However, similar to how Wikidata has proposals for properties, OpenStreetMap does have a formal process for tagging (in addition. To simply be “de facto” or “in use”). This alone is proof that, while some might not call themselves ontologists, and defend the idea of Any tags you like, they actually have a role of ontologists. The mere fact they don’t call themselves this, or not use popular strategy to encode ontologies, e.g. RDF, don’t make their criticism invalid, because they can be simply not complaining about the standards (or even the Wikibase itself) but the idea on how these are used to solve problems on OpenStreetMap.

I’m trying to keep this topic short, but my current hypothesis is the reason TBox on OpenStreetMap cannot be fully centralized is because while developers might have several points in common and would be wishing to integrate in their software (both id-tagging-schema and editor-layer-index are examples of it), they have technical reasons to not agree on 100%, so strategies to make easier makes sense. For example, either some tags can contradict each other (which even on semantic reasoning is a blocker; because the tag cannot “be fixed” if it is realistic with implementation) or their definition might be too complex to production implementation.

On this aspect, the current deliverable of this diary might seem a step backwards to how Wikibase works, but in addition to trying to further formalize and help to do data mining on OSM.wiki infoboxes, it starts the idea of getting even more data from wiki pages. And yes, in future it could be used by other tools to help synchronize OSM infoboxes with a Wikibase instance such as Data Items again, even if it means detect differences so humans could act. Even knowing to be impossible to reach 100%, we could try work on baseline which could help others consume not just OpenStreetMap data, ABox, but also part of it’s tagging, which is part (but not full), TBox, but in this journey, far before might be necessary help understand inconsistencies.

1.2 Wikibase is not even the only approach using MediaWiki for structured content

Wikibase, while it powers Wikidata, is not the only extension which can be used with MediaWiki. Likely a good link for a general overview is this one https://www.mediawiki.org/wiki/Manual:Managing_data_in_MediaWiki. These MediaWiki extensions focused on structured data are server side, a centralized approach (which assumes others agree with how to implement it from start). Since a text field with wikitext of all pages on MediaWiki database wouldn’t be queryable, these extensions actually use MediaWiki as permanent, versioned storage, but they take the responsibility of synchronize such data with some more specialized database engine (or at least use same database, but with additional tables). Even Wikibase still relies on external RDF triplestore to allow running SPARQL; it’s user interface (the one humans edit on sites like Wikidata) are an abstraction to store the data like a page in the MediaWiki (Wikibase extension actually uses undocumented JSON, not Wikitext).

One (in the author’s knowledge) unique feature of the implementation this diary presents to you is the following: it doesn’t require installation on the MediaWiki server. One side effect is it can also, out of the box, parse data from multiple MediaWiki wikis, and I’m not only talking about mix OSM.wiki and the OpenStreetMap Foundation wiki, but could extract data from wikipedias. You are free to decide which pages in the selected wiki should contain the data, without any specific URL pattern (like prefixes with Qs or Ps), and this aspect is more similar to other MediaWiki alternatives to Wikibase.

1.3 Then, what means decentralized, without particular database, approach

I’m very sure, especially for ontologists (the ones less aware of diverse ecosystem of data costumers on OpenStreetMap), that the very idea of not optimizing for a centralized storage would be perceived as anti-pattern. However, while requiring more work, those interested could still ingest the data on a single database. The command line implementation does not dictate how the data should be consumed, because it has other priorities.

“Make each program do one thing well.” – (part of) The UNIX Philosophy

Quote Make each program do one thing well.

All MediaWiki extensions have in common parse Wikitext (Wikibase is JSON), and this one does this specific part.For sake of make easier for the user (and make wiki admins less likely to have incentive to block data mining with this extension) it actually caches the data locally in a SQLite database, so if somewhat make friendly for repeated use (maybe even offline/backup if you setup higher expiration date). But unless you work directly with its default outputs (explained in next section) if you want a full solution, you will still need to customize which storage to save the data optimized for your use cases. So, this implementation could help synchronize OSM Infoboxes with the OSM Data Items, but its use actually is an abstraction for generic use cases. In the OpenStreetMap world, there’s TagInfo is know to parse the data, and also the Nominarim uses the Wiki to extract some information.

2. The “data model” (…of exported data handcrafted in the Wiki)

With all the explanation on the preface, the implementation of the result of data mining optimizes for a dump-like interoperable file format.

I do have some experience generating and to document group of files in the Humanitarian eXchange Language Standard, so at least for extracted for tabular data, if there's sufficient, instead of custom JSON-LD, could be some of these packaging standards highly optimized for traditional SQL databases instead of what could be archived if data inside this top level JSON-LD could be directly usable as RDF. But let's focus for now.

Some technical decisions at the moment of the generic approach:

The exported data is JSON where individual parts of the page are inside an list/array of the top level “data” key. This is a popular convention on REST APIs, another would be to use the top level “error” key.
The alternative is JSON-seq (RFC 7464), which would make it friendly to work with continuously streaming or merge different datasets by… Just concatenating the file. This approach also could in future be highly optimized for massive datasets with low memory use.
The fields are documented in JSON-LD and JSON Schema and everything else (from standards to tooling) able to work with this. The working draft is available at https://wtxt.etica.ai/
As one alternative, the implementation also allows to materialize individual items extracted from the pages as files both with global (unique if merging different wikis) and optional customized file names. Output is a Zip file with a previsible default directory structure.

One known limitation of the overgeneralization is that only the top level of JSON-LD and @types are by default strictly documented. Sorry .

Would it be possible to allow customization of the internal parts in future? “Yes”. However, (and this is from someone who already made CLI tools with a massive number of options) doesn’t seem to be a good usability idea to make way too many command line configurations instead of simulating them as some kind of file (which could potentially be extracted itself from wikis). But for those thinking on it, let me say upfront that to fill this gap, MediaWiki templates (aka the Infoboxes) and the tabular data could have at least per Wiki profiles for what become consensuses. And tables and subset of relevant for reuse syntaxhighlight codes (or some kinds of Templates with SPARQL / OverpassQL which are also example codes), could have additional hidden comments to give hints now they’re exported, at minimum their suggested file names. To maximize such approach, every MediaWiki would require a some sort of global dictionary (for things which already are global meaning, not varying by context), to give hints of how to convert, as example , {{yes}} to something machine readable like true. Another missing point would be have conversion tables which might depend on context (such as “inuse” -> “in use” on OSM Infoboxes), so as much as possible, the generated templates avoid humans to rewrite of 100s pages with misspellings or synonyms as long as some page on wiki can centralized these profiles.

3. Pratical part

With all this said, let’s cite the examples not already on the README.md and --help option of the tool.

3.0. Requirements for installation of `wiki_as_base` cli tool

pip install wiki_as_base==0.5.10
# latest, use
# pip install wiki_as_base --upgrade

3.1. JSON-LD version of a single page on OSM.wiki

By default the tool assumes you want parse OpenStreetMap Wiki and would be ok with a cache of 23 hours (which would be similar if already would be parsing the wiki dump)

The following example will download OSM.wiki Tag:highway=residential

wiki_as_base --titles 'Tag:highway=residential'

3.2. “Just give me the code example files” of a single page on OSM.wiki

The parser tries its best to detect what’s on the Wikitext without any customization. For example, if the wikitext is using the right syntaxhighlight codes, it tries to use that as a suggestion for which file extension that code would have.

Let’s use this example page, the User:EmericusPetro/sandbox/Wiki-as-base. A different parameter will export a zip file instead of JSON-LD

# JSON-LD output
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base'

# Files (inside a zip)
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-zip-file sandbox-Wiki-as-base.zip

Wikitext parsing (the one done by this implementation) can benefit from receiving more explicit suggestions of preferred exported filename. So for pages focused as technical guide, a proxy of this could allow a download link with a tutorial with predictable filenames, while others wiki contributors could still improve it over time.

3.3. Download all parseable information of pages by an small category on OSM.wiki

Let’s say you want to fetch OSM.wiki Category:OSM_best_practice, not merely an article of it, like the Relations_are_not_categories, but all pages with the respective category.

# JSON-LD output
wiki_as_base --input-autodetect 'Category:OSM_best_practice'

# Files (inside a zip)
wiki_as_base --input-autodetect 'Category:OSM_best_practice' --output-zip-file Category:OSM_best_practice.zip

Trivia: this request is done with only 2 background fetches: one to know pages for the category and one for all pages.

3.4. Download all parseable information of pages by an well used category on OSM.wiki

Let’s say you want a Category:References. Now, the cli tool will behave differently as it assumes it can take at maximum 50 pages in one step (the default most MediaWikis for non-admin/bots can ask). This means it will paginate and save on local cache, and ultimately just output the final result.

# The --verbose argument will output more information,
# in this case hints about looping, if have cache, etc.
# It will take 50 seconds plus server delay plus internal time to compute the result
wiki_as_base --input-autodetect 'Category:References' --verbose --output-zip-file Category:References.zip
# (print to stderr)
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]

# Now, lets run it again. However, since the raw requests are cached by 23 hours
# it will reuse the cache.
wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
# (print to stderr)
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]


# Current directory
ls -lh | awk '{print $5, $9}'
# 
#    668K Category:References.jsonld
#    315K Category:References.zip
#    540K wikiasbase.sqlite

3.4.1 Controling the delay for pagination requests

By default, not only does the tool do caching, but the cli will intentionally add a delay 10 times slower if you don’t customize the user agent hint and it detects must paginate more background requests. Currently, 10 times means 10 x 1 second (only sequential, not parallel requests), but if this get heavier usage it could be increased.

The logic for the cli tool to delay more non customized user agents is to have less users not changing contact information. Here the behavior if you detect you customized its contact information then point to the developer of the tool.

## change the contact information on the next line
# export WIKI_AS_BASE_BOT_CONTACT='https://github.com/fititnt/wiki_as_base-py; generic@example.org'
export WIKI_AS_BASE_BOT_CONTACT='https://wiki.openstreetmap.org/wiki/User:MyUsername; mycontact@gmail.com'

# time will output real time to finish the command. On this case, 5 x 1 are artificial delay,
# but 10s both download time and (which is not instantaneous), internal computation
time wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#
#    real	0m15,170s
#    user	0m1,518s
#    sys	0m0,041s

However, if you do want to identify yourself, but believe 1 second additional delay in sequential request is too low, (which might be a case for a bot without human supervision), the next example will use 30 seconds.


export WIKI_AS_BASE_BOT_CONTACT='https://wiki.openstreetmap.org/wiki/User:MyUsername; mycontact@gmail.com'
export WIKI_AS_BASE_BOT_CUSTOM_DELAY='30'
time wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#
#    real	2m40,390s
#    user	0m1,565s
#    sys	0m0,036s

3.5. Download all parseable information of know exact list of Wiki Pages on OSM.wiki

The initial command used to fetch a single page actually accept multiples ones: just divide them with |.

In this example we’re already using another parameter, --pageids, not by name.

## Uses curl, tr and jq <https://jqlang.github.io> as one example on how to get some examples of pages.
# curl --silent 'https://wiki.openstreetmap.org/w/api.php?action=query&cmtitle=Category:Overpass_API&list=categorymembers&cmlimit=500&format=json' | jq '.query.categorymembers | .[] | .pageid' |  tr -s "\n" "|"


# Manually setup the pageids, without use of categories
wiki_as_base --pageids '35322|253043|104140|100013|156642|96046|141055|101307|72215|98438|89410|250961|133391|242270|85360|97208|181541|90307|150883|98210|254719|137435|99030|163708|241349|305815|74105|104139|162633|170198|160054|150897|106651|180544|92605|78244|187965|187964|105268' --verbose > My-custom-list.jsonld

If the number of explicitly individual pages is greater than the pagination (which is 50) then the cli, similar to how it deals with wiki pages from large categories, will paginate.

3.6. “Just give me the example files” of a single page on a different wiki than OSM.wiki

Note: for wikimedia-related websites, the prefix used uses the logic from the database naming dumps on https://dumps.wikimedia.org/backup-index.html. e.g. wikidata.org = wikidatawiki.

This is the same as 2, however the same content is both as https://wiki.openstreetmap.org/wiki/User:EmericusPetro/sandbox/Wiki-as-base and https://www.wikidata.org/wiki/User:EmericusPetro/sandbox/Wiki-as-base.

The idea here is to explain how to customize a different Wiki. This is done with the two environment variables.

wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-osm.jsonld


# If you just want change environment variables for single command without affecting next commands, then prepared the option on that single line
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-osmwiki.jsonld
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-wikidatawiki.jsonld



# If your focus is a single wiki, but the default being OpenStreetMap Wiki make longer commands, then define as environment variable
export WIKI_NS='wikidatawiki'
export WWIKI_API='https://www.wikidata.org/w/api.php'

wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-wikidatawiki.jsonl

Note the irony: Using Wikidata (wiki) but parsing wikitext of generic Wiki Pages, not Wikibase 🙃! Anyway, could you guess what wiki_as_base --titles 'Item:Q5043|Property:P12' returns on osmwiki?

3.7. Merge content of several pages in different Wikis and the `--output-streaming`

Here things start to get interesting, and might explain why all unique filenames are namespaced by wikiprefix: you might at some point want to store them in the same folder, maybe also match same kind of content on different wikis.

Also, this is the time the exported file is not JSON-LD with individual items inside key data on the top level of the object, but JSON text sequences, where each individual item is in its own line. this format allow user simpler tools to merge the files.

#### merging the files at creation time

echo "" > merged-same-file-before.jsonl

# the ">" create file and replacey any previous content, if existed
# the ">>" only append content at the end of file, but create if not exist
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming >> merged-same-file-before.jsonl
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming >> merged-same-file-before.jsonl

#### dumping file by file, but then merge files at the end
mkdir temp-output/
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming > temp-output/osmwiki-page-1.jsonl
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming > temp-output/wikidatawiki-page-1.jsonl
cat temp-output/*.jsonl > merged-cat-after.jsonl

And that was the last practical example. A place with other MediaWikis (which may or not be up to date, which means this tool cannot undestand their API version) are listed on https://wikiindex.org/.

4. What’s next: feedback (in special of OSM.wiki editors) are welcomed in next months

That’s it. This approach is a very niche, so likely the ones which may be interested are heavy wiki editors, in special early adopters who could benefit from move some of their software options for contributors use the Wiki to make changes and do not have yet a parser strategy like Nominarim/Taginfo have.

In the case of OpenStreetMap, since the infoboxes for tags and tag-values are very important, I’m especially interested in suggestions on how we could use the wiki page itself to at least give hints on expected values, maybe further hints on how to normalize the values. Currently, the implementation does not have option to initialize with such extra information (still a bit hardcoded), but even if each Wiki could have some page where most people would agree how to define the parser, I believe the tool should allow user customize it (so someone could use customized version from own it’s user namespace, potentially “proving” how the final result would work). This might explain I took some time to reply ChrisMap when asked to “design an example”, but to not delay further I just posted this diary today with how to at least do the basics of extract data. I already had a draft for this since January 2023, and asked some parts of it on the Talk Wiki, but after today this part is somewhat officially released for feedback either here in the comments, in the Wiki or in the GitHub issues.

Dismistifying Wikidata and standards compliant semantic approach on tags on OpenStreetMap to make tooling smarter on medium to long term

Posted by fititnt on 12 November 2022 in English.

This is my first attempt on the subject of the title divided in 6 topics. Sorry for the long text (but could be far longer).

Disclaimer: low experience as OSM mapper!

While I do have prior advanced experience in other areas, as you can see from my account, I’m so new to the project that as a newbie user of iD left after the tutorial in India I got scared that if someone touches something, after that validators will assume that person is responsible for errors in that something. In my case it was “Mapbox: Fictional mapping” from OSMCha.

So assume that this text is written by someone who one day ignored iD warnings for something I touched, still not sure how to fix the changeset 127073124 😐

Some parts of this post, such as reference to notability (from this discussion here https://wiki.openstreetmap.org/wiki/Talk:Wiki#Use_Wikibase_to_document_OSM_software) and gives some hints of unexplored potential which not even current OpenStreetMap Data items are doing (from this discussion here Remove Wikibase extension from all OSM wikis #764) are the reason for the dismistifing part of the title.

1. Differences in notability of Wikidata, Wikipedia, and Commons make what is acceptable different in each project

I tried to find how OpenStreetMap defines notability, but the closest I found was this:

https://wiki.openstreetmap.org/wiki/Welcome_to_Wikipedia_users#We_don’t_have_a_notability_rule

For sake of this post:

Commons Notability: https://commons.wikimedia.org/wiki/Commons:Notability
Wikipedia (EN) Notability: https://en.wikipedia.org/wiki/Wikipedia:Notability
Wikidata Notability: https://www.wikidata.org/wiki/Wikidata:Notability

What I discovered is that Commons already is used as a suggested place to host for example images, in particular what would go on the OpenStreetMap Wiki.

Wikipedia is likely to be far more well known than Wikidata and (I suppose) people know that Wikipedias tend to be quite strict on what goes there.

And Wikidata? Well, without explaining too much, it is more flexible than Wikipedia’s notability, however (and this is important) is not as flexible as the Notability Rule on OpenStreetMap if we assume that there’s not explicitly one.

In other words: as flexible as Wikidata is, there’s things that do exist in the real world (let’s say, an individual tree in someone’s backyard) that are notable to be on OpenStreetMap, but are not to be on Wikidata.. And, unless there is some attachment (something worth to put on Commons, like 3D file) I would assume uploading low level data of micromapping of some building (creating huge amounts of unique Wikidata Qs) might be considered vandalism there.

1.1 When to use Wikidata?

I think I will agree with what others said sometimes about preferring to keep concepts that are worth being on Wikidata, on Wikidata.

But with this in mind, it is still relevant to have Listeria (which is a bot, not a installable extension) on OpenStreetMap Wiki. Might not be a short time priority, but Wikidata already have relevant information related to OpenStretMap.

2. Differences in how data is structured makes hard for RDF triplestores (like Wikidata) to store less structured content

In an ideal world, I would summarize how the RDF data store works. RDF is quite simple after someone understands the basics like sum + and subtraction - operations in RDF, the problem is often users will jump not only to multiplication, but differential equations. SPARQL is more powerful than SQL, and the principles of Wikidata have existed for over 2 decades. However most people will use someone else’s example ready to run.

Without getting into low level details of data storage, it might be better to just cite as an example that Wikidata recommends storing administrative boundaries as files on Commons. For example this is the one for the country of Brazil (Q155) links to https://commons.wikimedia.org/wiki/Data:Brazil.map. OpenStreetMap doesn’t require Commons for this (because store all information and can still very efficient), however RDF even with extensions such as geoSPARQL, does not provide low level access for things such as what would be a node in OpenStreetMap (at least the nodes without any extra metadata, which only exist because are part of something else).

Question against RDF: if the RDF triple store is so flexible and powerful, why not make it able to store EVERY detail, so it becomes a 1 for 1 to OpenStreetMap? Well, it is possible, however storing such data info RDF triplestore would take more disk space. Sophox already avoid some types of content

One way able to use SPARQL would, in fact, be an abstraction to another storage with R2ML and an implementation such as ONTOP VKG to rewrite SPARQL queries to SQL queries, so in worst case scenario at least it could always be using up to date data. But this is the focus of this post.

In other words: is overkill to store low level details on RDF triplestores even if we could do it if could afford the hardware. They’re not a replacement for OpenStreetMap.

3. Advantage of RDF triplestores (Wikidata, Wikibase,…): be welcoming concepts without geographic reference

Something OpenStreetMap cannot compete with Wikidata: relationship between things, and storage of things without geographic reference. Actually, most, if not all, tools that deal with OpenStreetMap data don’t know how to deal with an abstraction concept which cannot be plotted in the map. This is not an exclusive issue, because it happens with most GIS tools. They will break.

In my journey to understand OpenStreetMap with an Wikidata school of thought, after some questions in my local Telegram group about how to map back OpenStreetMap to Wikidata, I received this link:

https://wiki.openstreetmap.org/wiki/Relations_are_not_categories

Truth to be told, I loved this explanation! But without making this post overly long to make analogy with both Wikidata vs OpenStreetMap:

OpenStreetMap can store reference to something such as individual buildings for firefighter’s stations of a province Province AA in a country CountryA
Wikidata can store the abstract concept that represents the organization that coordinates all firefighting stations in ProvinceAA, and also that this is part of the Civil Defense organization in the CountryA. Both concepts might be even notable enough to have dedicated pages on Wikipedia, and photos on Commons.

This is where, without off-the-wire agreements or custom protocols, the tools which handle OpenStreetMap data are not designed to handle concepts that explains things which OpenStreetMap happily will store from its users. Someone can plot a building for an organization, but not the structural need of what that organization is that such building is part of.

Truth to be told, such uses of Wikidata concepts are already being used in the wild. However, it seems this is very rudimentar, mostly to allow translations and images such as for brandings used by Name Suggestion Index in tools such as the iD editor, not what these brands represent. But everything already tagged with Wikidata Qs or Ps already is viable to download this extra meaning.

The discussions about API changes (such as https://wiki.openstreetmap.org/wiki/API_v1.0) are sort of more low level. What today is on the database schema https://wiki.openstreetmap.org/wiki/Rails_port/Database_schema doesn’t need to change (it’s quite efficient already, and previous point admitted the limitations of RDF Triplestores for low level of details).

In the best case scenario, this might help understand existing data, and make stronger validations because could make easier to find patterns, but does not require change underlining database, but the validation rules become sort of cross platform. For things simpler (like know if something is acceptable or not) no semantic reasoning is need, could be done automated rule generation in SHACL (https://en.wikipedia.org/wiki/SHACL), so if today someone is doing import of several items, but some of then classes with existing ones, could be simple to the person just click “ignore the errors for me” and SHACL could only allow the things that will validate.

But this SHACL could take years. I mean, if some countries would want to make very strict rules, could be possible that in that region, these things become enforced.

4. RDF/OWL allow state of the art semantic reasoning (and shared public identifiers from Wikidata are a good thing)

In an ideal world and with enough time, behind the idea of ontology engineering, I would introduce mereology, the idea of Universals vs Particulars, and that when designing reusable ontologies, the best practices are not mere translation of words people use, but underlying concepts that may not even have a formal name, so giving numbers make things simpler.

The foundations for mimicking human thinking from rules is far older than RDF.

RDF provides sums and subtractions, it’s very simple, but an early attempt RDFS (RDF Schema), was insufficient for developers to implement semantic reasoning. The OWL1, sort of a inspired in one DARPA project (DAML, later DAML+OL), aimed to allow such semantic reasoning, however computability was accidentally not on scope. This means that, by design, a computation could run forever without being feasible now upfront, so it failed. Then after all this saga, OWL2 was designed from the ground to avoid mistakes from OWL1 to allow it to stay in the realm of computability (not just be a project to call attention from others, but actually be possible to implement by tools). So today, a user, without resort to command line, can use Protégé and know upfront if the triplestore doesn’t have logical errors. However, since semantic reasoning can be computationally expensive, often is not enabled by default in public endpoints (think: Wikidata and Sophox), but anyone could download all required data (e.g instead of .osm file, some flavor of .rdf file, or convert .osm to RDF after download it) and turn the thing on.

Example of inference

For example, when 2 rules are created, <CityAAA "located_in" ProviceAA>, <ProvinceAA "located_in" CountryA>, the way “located_in” is encoded could say that the inverse is “location_of” so the reasoner could infer that <CountryA "location_of" CityAAA> is true. At minimum, even without semantic reasoner turned on (it is not on Wikidata; this is why the interface warns user to be more explicit), is possible validate errors, with very primitive rules, but it also means that dumps of OSM data for regions (or worldwide, but subset of features) if converted to RDF and loaded in memory with reasoning turned on, allow deduce things very fast.

This example of “located_in” / “location_of” is simplistic, however with or without a reasoner turned on, RDF makes data interoperable in other domains even if individual rules are simple. Also, rules can depend on other rules, so there is a viable chain effect. It is possible to teach machines not mere “part_of” or “subclass_of” most people learn in diagrams used only for business, but cause and effect. And the language used to encode these meanings already is an standard.

One major reason to consider using Wikidata is to have well defined, uniquely identified, abstract concepts notable enough to be there. At minimum (like is used today) it helps with having labels in up to 200 languages, however the tendency would be that both Wikidata contributors and OpenStreetMap contributors on taxonomy help each other.

Trivia: tools such as Apache Jena even allow running via command lines (such as SPARQL queries you would ask for Sophos) from an static dump file locally or in a pre-processed file remote server.

5. Relevance to Overpass Turbo, Normatim, and creators of data validators

As explained before, the OpenStreetMap data model doesn’t handle structural concepts that couldn’t be plotted in a map. The way the so called semantic web works, could be possible to either A) rely full on Wikidata (even for internal properties; this is what OpenStreetMap Wikibase do with Data Items; but this is not the discussion today) or B) just for things that are notable enough to be there and interlink from some RDF triplestores on OpenStreetMap.

Such abstract concepts, even if they could be added as tags on things OpenStreetMap can plot on map, would take too much space. If someone has a less powerful tool (that really needs explicit tags, think like some JavaScript rendering library) then semantic reasoners can expand, missing on the fly, that implicit knowledge and tools use this version.

Something such as Overpass turbo doesn’t need to also allow SPARQL as additional flavor of query (but maybe with ONTOP, it could and with live data, but this is not the discussion here), but the advantage a more well defined ontological definition means the overpass turbo can get more smarter: an user could search for an abstract concept, that could represent a group of different tags (and this tags vary per region) and Overpass Turbo could preprocess/rewrite such advanced queries in more low level queries it know today that work today without user need to care about this.

Existing tools can understand the concept of “near me” (physical distance) but they can’t cope with something’s that are not an obvious tag. Actually, current version of Normatim seems not aware if asked by a category (let’s say, “hospital”) so it relies too much on the name of the feature, because even if is trivial to have translations of “hospital” (Q16917, full RDF link: http://www.wikidata.org/wiki/Special:EntityData/Q16917.ttl) from Wikidata, tools such as Normatim don’t know what the meaning of hospital. In this text, I’m arguing that semantic reasoning would allow the user asking from a generic category to return the abstract concept such as 911 (or whatever is the numbers for police and etc in your region) in addition to the objects in the map. OpenStreetMap Relations are the closest from this (but I think it would be better if such abstracts do not need to be on the same database; the closest to this are Data Items Qs).

And what advantage for current strategies to validate/review existing data? Well, while the idea of making Normatim aware of text by categories is very specific to a use case, the abstract concepts would allow searching things by abstract meaning and (like Overpass already allow) recursion. An unique opaque (e.g. numeric, not resembling real tags) identifier can by itself contain the meaning (like be alias for several tagging patterns, both old and new, and even varying by region of the world) so the questions become simpler.

6. On the OpenStreetMap Data Items (Wikibase extension on OpenStreetMap Wiki) and SPARQL access to data

Like I said in the start, I’m new to OpenStreetMap, and despite knowing other areas, my opinion might evolve after this text is written in face of more evidence.

6.1. I support (meaning: willing to help with tooling) the idea of have OWL-like approach to encode taxonomy and consider multilingualism important

I do like the idea of a place to centralize more semantic versions of OpenStreetMap metadata. The Data items do use Wikibase (which is used by Wikidata), so they’re one way to do it. It has fewer user gadgets than Wikidata, but the basics are there.

However, as long as it works, the way to edit the rules could be even editing files by hand. Most ontologies people do this way (sometimes with Protege). However, OpenStreetMap has a massive user base and the translations to data items already have far more than the Wiki pages for the same tags.

Even if the rules could be broken into some centralized GitHub repository (like is today with Name Suggestion Index, but there is less Pull Request, because is mostly the semantic rules) without some user interface like Wikibase allows, it would be very hard to allow collaboration that already was happening on the translations.

6.2. I don’t think criticism against customization of Wikibase Q or complain about not be able to use full text as identifiers makes sense

There’s some criticism about the Wikibase interface and those might even be trivial to deal with. But the idea of persistent identifiers being as opaque as possible, to disencourage users’ desire to change then in the future is a good practice. This actually is the only one I really disagree with.

DOIs and ARKs have a whole discussion on this. DOIs for example, despite being designed to persist like a century, the major reason people break systems was the customized prefixes. So as much as someone would like a custom prefix instead of Q124 be OSM123 this unlikely would persist more than one decade or two.

Also, the idea of allowing full customizable IDs, such as instead of Q123 use addr:street is even more prone to lead to inconsistencies either misleading users or braking systems because users didn’t like the older name. So Q123, as ugly as it may seem, is likely to only be deprecated by serious errors rather than the naming choosing by itself.

Note that I’m not arguing against the addr:street tag, this obviously is a property (and such property itself needs to be defined). *The argument is that structural codes should be as opaque as possible to only change in worst cases. If tag addr:street is (inside OpenStreetMap) notable enough, it can receive a code such as Q123. Then OWL semantics could even deal with depreciated, have two tags as aliases for each other etc, because it was designed from the ground to help with this. That’s the logic behind opaque codes.

If someone doesn’t know what Q123 means, we add contextual information about it on the interfaces.

6.3. Wiki Infoxes issues

I guess more than one tool already does data mining from OpenStreetMap Infoboxes. Whatever would be some strategy to synchronize semantic version of taxonomy, is important it be done to keep running if the users already not doing there directly. From time to time, things may break (like a bot refusing to override human edit) then relevant reports of what is failing.

I don’t have an opinion on this, just that out-of sync Information is bad.

6.4. Interest in get realist opinions from Names Suggestion Index, Taginfo, Geofabrik (e.g it’s data dictionary), and open source initiatives with heavy use on taxonomy

Despite my bias to “make things semantic” just to say here (not need to write in the comments, just to make public my view) I’m genuinely interested in knowing why the Data Items was not used to its full potential. I might not agree, but that doesn’t mean I’m not interested to hear.

Wikidata is heavily used by major companies (Google, Facebook, Apple, Microsoft,…) because it is useful, so I’m a bit surprised OpenStreetMap Data Items is less well known.

If the problem is how to export data into other formats, I could document such queries. Also, for things which are public IDs (such as Geofabrik numeric codes on http://download.geofabrik.de/osm-data-in-gis-formats-free.pdf) similar to how Wikidata allows external identities, would make sense if the Data Items have such properties. The more people are already making use of it, the more likely it is to be well cared for.

6.5 Strategies to allow run SPARQL against up to date data

While I’m mostly interested in having some place always in real time with translations and semantic relationships of taxonomic concepts, at minimum I’m personally interested in having some way to convert data dumps to RDF/OWL. But for clients that already export slices from OpenStreetMap data (such as overpass-turbo) it is feasible to export RDF triples as an additional format. Is hard to understand RDF or SPARQL, but it is far easier to export it.

However, running a full public SPARQL service with data for the entire world (while maybe not worse than what already is OpenStreetMap API and overpass-turbo) is CPU intensive. But if it becomes relevant enough (for example, for people to find potential errors with more advanced queries) then any public server ideally should have significant no lag. This is something I would personally like to help. One alternative to R2RML+ONTOP could be (after first global import) to have some strategy to convert differences from live services from the last state, then these differences instead of SQL, be UPDATE / DELETE SPARQL queries.

I’m open to opinions of how important it is to others to have some public endpoint with small lag. Might take some time to know more about OSM software stack, but scripts to synchronize from main repository data seems a win-win to create and let it public for anyone to use.

That’s it for my long post!

My Question for OpenStreetMap Foundation Candidates in 2022

Posted by fititnt on 9 November 2022 in English.

This was my original question on the Wiki :

The OpenStreetMap Foundation ("OSMF") already had discussions and even a committee on takeover mitigation and this question focuses on this topic. The Humanitarian OpenStreetMap Team United States Inc ("HOTUSI"), which it's grow up over 100x OSMF budget (using 2020 as year, 26,562,141 USD vs 226,273 GBP), on its board minutes date 2022-01-24 (archived version here) already admitted interest on trademark agreement "with clear, irrevocable rights to the name" as option to "Ensure that the HOT Brand name is not in danger and is formerly in HOT’s hands", however this explicitly require OpenStreetMap Foundation approval at least once in its history. Already before this election, the new discourse community, which is public know have receive support from HOTUSI, had a paid HOTUSI employee closing a discussion about HOTUSI which also asked why the site redesign still being delayed to a point of know to not happens before the OpenStreetMap Foundation election, even if this already was asked on OSMF mail lists, and the incident sparked a discussion on handling conflict of interest on moderation channels. At this very moment of the history of OpenStreetMap, majority of candidates in this election do have links with HOTUSI, so it is viable that the result will allow a single corporation to make decisions in self interest against OSMF, in which you hopefully will win as a candidate. So the question to you is: how will you handle conflicts of interest in the OpenStreetMap Foundation board itself under this challenging context?

Regardless of this, I’m actually very okay with the set of official questions proposed for candidates to be asked to answer, since common themes were grouped. And the fact to point to the Trademark Policy was better than the ones I used to contextualize. Fantastic!

New absurd events

Because the last date to send questions was 2022-11-01, sadly I was not able to cite a real world example (like one sentence more with links on my original question) about how absurd things can get when an organization has so much money that can simply focusing on whoever is willing to be bought to disrupt regional groups without remorse.

As a sort of public response for a complaint by Mario that I wasn’t aware about the #communities:latam on 2022-11-02T17:48 happened at 2022-11-05T02:31: a moderator of the forum, despite use of euphemisms, actually wrote in the that the Humanitarian OpenStreetMap Team United States Inc was willing to pay money for projects in the LATAM region to get more support. While I wasn’t expecting much based on what happened in Philiphines, no discussion at all of the bigger issue Mario was discussing in Spanish in several threads. Just this.

Let me repeat: the response to perceived conflicts of interest by moderators in a subforum of community.opensteetmap.org was one of those moderators going to the same subforum and offering money from the very same organization while in the role of moderator.

“Das ist nicht nur nicht richtig; es ist nicht einmal falsch!” – Wolfgang Pauli

What happens if let others keep sponsoring against OpenStreetMap?

Posted by fititnt on 21 October 2022 in English.

The original of this post is on Discourse https://community.openstreetmap.org/t/what-happens-if-let-others-keep-sponsoring-against-openstreetmap/4343?u=fititnt .

The ad (circa 2017)

Source

https://twitter.com/sp8962/status/838676848301260800

Finances (2020, around 100x difference without need to do any core function)

OpenStreetMap Foundation (226,273 GBP at 2020, https://wiki.osmfoundation.org/w/images/d/df/OSMF_directors_report_and_unaudited_financial_statements_year_ended_20201231.PDF
Humanitarian OpenStreetMap Team United States Inc (26,562,141 USD at 2020, https://github.com/hotosm/hotosm-website/blob/gh-pages/downloads/2020-Form-990.pdf)

Location: Historic District, Porto Alegre, Região Geográfica Imediata de Porto Alegre, Metropolitan Region of Porto Alegre, Região Geográfica Intermediária de Porto Alegre, Rio Grande do Sul, South Region, Brazil