Organic Maps GSoC 2023 Final Report

2023-08-17 (Last edited 2023-08-25)
software gsoc

This summer I took part in Google’s Summer of Code (GSoC) program, working with the open-source project Organic Maps and mentored by Alexander Borsuk.

Organic Maps is a mapping app available on Android and iOS, similar to Google/Apple Maps, but it uses data from OpenStreetMap (OSM), a community-maintained map of the world that anyone can edit, à la Wikipedia. It also works without an internet connection and doesn’t send your location to any servers.

Many map locations in OSM link to Wikipedia articles that describe them. For example, the Eiffel Tower in OSM links to the French Tour Eiffel article on Wikipedia and the Q243 item on Wikidata, a sort of database that connects items in different Wikimedia projects. Organic Maps includes the text of those linked articles in the downloaded map files so that you can see a description of a location offline, and read about it further if you choose. Unfortunately, the method used by Organic Maps to download the Wikipedia articles was unreliable, so the articles available in the app were last updated in 2022.

I created a different method of getting this article text that involved downloading a copy of all the Wikipedia articles and then searching for the ones needed by the app, instead of requesting each article from Wikipedia individually.
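To illustrate the idea, here is a minimal sketch of the "scan the dump once and keep what the app needs" approach. It is not the actual wikiparser code: the `Article` struct, its fields, and the `select_wanted` function are hypothetical names made up for this example, and a real dump is far too large to collect into a `Vec` like this.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified dump record (not the real dump schema).
struct Article {
    /// Wikidata QID such as "Q243", if the dump entry carries one.
    qid: Option<String>,
    /// Article title, e.g. "Tour Eiffel".
    title: String,
    /// Article body as HTML.
    html: String,
}

/// Walk the dump once and keep only the articles referenced by the map data,
/// matched either by Wikidata QID or by article title.
fn select_wanted<I>(
    dump: I,
    wanted_qids: &HashSet<String>,
    wanted_titles: &HashSet<String>,
) -> Vec<Article>
where
    I: IntoIterator<Item = Article>,
{
    dump.into_iter()
        .filter(|a| {
            let qid_match = a.qid.as_ref().map_or(false, |q| wanted_qids.contains(q));
            qid_match || wanted_titles.contains(&a.title)
        })
        .collect()
}
```

The point of the design is that the cost is one pass over the dump with cheap set lookups, rather than one network request per article.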

For more background you can read my proposal and the issue page on GitHub.

Changes from Proposal

After discussing the options with Alexander and clarifying the dumps’ “experimental” status with the Wikimedia Foundation, we decided to use the “Enterprise HTML” dumps instead of the XML dumps. The files are larger, and the lack of a standard parallelized gzip decompressor makes reading them slower, but having the full HTML text instead of working with Wikipedia’s special wikitext format makes the implementation much simpler.

What I Did

All of my work is captured in a new organicmaps/wikiparser repository.

In summary, I wrote a new program (in the Rust programming language) that runs on the map server and handles:

Getting the wikidata and wikipedia tags from an OSM dump file.
Getting article text that matches the OSM tags from Wikipedia Enterprise HTML dump files.
Simplifying the article HTML content to reduce size.
Writing the simplified articles to disk for the map generator to use.
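As a rough illustration of the first step, the sketch below parses the two kinds of tag values mentioned earlier: a wikidata value like "Q243" and a wikipedia value like "fr:Tour Eiffel". The function names `parse_qid` and `parse_wikipedia_tag` are hypothetical and not taken from the wikiparser repository, and the sketch assumes well-formed single values; real OSM tags can be messier (full URLs, multiple values, stray whitespace).

```rust
/// Parse a `wikidata` tag value such as "Q243" into its numeric id.
/// Returns `None` for anything that is not "Q" followed by ASCII digits.
fn parse_qid(value: &str) -> Option<u64> {
    let digits = value.trim().strip_prefix('Q')?;
    // `u64::from_str` would accept a leading '+', so check for digits explicitly.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Split a `wikipedia` tag value such as "fr:Tour Eiffel" into
/// a (language code, article title) pair.
fn parse_wikipedia_tag(value: &str) -> Option<(&str, &str)> {
    let (lang, title) = value.trim().split_once(':')?;
    let (lang, title) = (lang.trim(), title.trim());
    if lang.is_empty() || title.is_empty() {
        return None;
    }
    Some((lang, title))
}
```

Returning `Option` rather than panicking lets the caller skip or log malformed tags and keep going.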

This work was done in several stages:

A new Wikipedia article processing pipeline based on the HTML Dumps that handles deduplication (#6).
Article content is stripped down to reduce size, in a way equivalent to the original process, which used the Extracts API, but with the ability to add links, pictures, videos, and more later (#26).
Tag extraction is separated from the maps generator, so descriptions can be updated out-of-band (#23).
A shell script for coordinating the OSM tag data and extracting multiple dumps in parallel (#21).
A shell script for automated downloads (#22).
I also created an issue and fix for a bug I discovered in the ego-tree crate used by the scraper HTML parsing crate.

Current State

At this point the dump extractor has been added to the maps build pipeline, and the articles from it were rolled out to the app starting in late August (the 2023.08.18-8-android tag). Extracting the ~4 million articles in 5 languages took around 2 hours on the maps build server.

All of the primary goals of the project have been accomplished!

Of course, there’s still more work that can be done:

Resolving missing Wikidata links in the dumps
Contributing fixes for broken wikidata/wikipedia tags to OSM
Enabling images in the downloaded articles
Matching articles on additional wikipedia tags (https://wiki.openstreetmap.org/wiki/Key:wikipedia)
Coordinating with the map generator to use “secondary language” tags.
At this point gzip/tar is the I/O bottleneck, so finding a faster gzip decompressor is the next easy performance improvement.