This summer I took part in Google’s Summer of Code (GSoC) program, working with the open-source project Organic Maps and mentored by Alexander Borsuk.
Organic Maps is a mapping app available on Android and iOS, similar to Google or Apple Maps, but it uses data from OpenStreetMap (OSM), a community-maintained map of the world that anyone can edit, à la Wikipedia. It also works without an internet connection and doesn’t send your location to any servers.
Many map locations in OSM link to Wikipedia articles that describe them. For example, the Eiffel Tower in OSM links to the French Tour Eiffel article on Wikipedia and the Q243 item on Wikidata, a sort of database that connects items in different Wikimedia projects. Organic Maps includes the text of those linked articles in the downloaded map files so that you can see a description of a location offline, and read about it further if you choose. Unfortunately, the method used by Organic Maps to download the Wikipedia articles was unreliable, so the articles available in the app were last updated in 2022.
I created a different method of getting this article text: instead of requesting each article from Wikipedia individually, it downloads a copy of all the Wikipedia articles and then searches that copy for the ones the app needs.
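The core idea can be sketched in a few lines of Rust: build a set of the article titles that OSM objects reference, then stream through the dump and keep only the matches. The function name and the `(title, html)` pair representation here are illustrative, not the wikiparser's actual API.

```rust
use std::collections::HashSet;

/// Keep only the dump articles whose titles appear in OSM `wikipedia` tags.
/// Illustrative sketch; the real program streams from dump files on disk.
fn filter_articles<'a>(
    wanted: &HashSet<&str>,
    dump: impl Iterator<Item = (&'a str, &'a str)>,
) -> Vec<(&'a str, &'a str)> {
    dump.filter(|(title, _)| wanted.contains(*title)).collect()
}

fn main() {
    let wanted = HashSet::from(["Tour Eiffel"]);
    let dump = vec![("Tour Eiffel", "<p>…</p>"), ("Paris", "<p>…</p>")];
    assert_eq!(
        filter_articles(&wanted, dump.into_iter()),
        vec![("Tour Eiffel", "<p>…</p>")]
    );
}
```

With a hash set, each title lookup is constant-time, so one linear pass over the dump suffices no matter how many articles the app needs.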
For more background you can read my proposal and the issue page on GitHub.
Changes from Proposal
After discussing it with Alexander and clarifying the “experimental” status with the Wikimedia Foundation, we decided to use the “Enterprise HTML” dumps instead of the XML dumps. While the file size is larger and the lack of a standard parallelized gzip decompressor means that reading the data is slower, having the full HTML text instead of working with Wikipedia’s special wikitext format makes the implementation much simpler.
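Each Enterprise dump file is, roughly, an archive of newline-delimited JSON where every line describes one article. As a toy sketch (the `name` field is an assumption about the dump schema, and the string-search "parser" is a stand-in for a real JSON library such as serde_json), pulling a field out of one line could look like this:

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Pull the `name` field out of one dump line with plain string search.
/// Toy parser only: it breaks on escaped quotes; real code should use a
/// proper JSON library.
fn article_name(line: &str) -> Option<&str> {
    let key = "\"name\":\"";
    let start = line.find(key)? + key.len();
    let rest = &line[start..];
    Some(&rest[..rest.find('"')?])
}

fn main() {
    let data = r#"{"name":"Tour Eiffel","article_body":{"html":"<p>…</p>"}}"#;
    // Dumps are read line by line; a Cursor stands in for the real file.
    for line in BufReader::new(Cursor::new(data)).lines() {
        assert_eq!(article_name(&line.unwrap()), Some("Tour Eiffel"));
    }
}
```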
What I Did
All of my work is captured in a new organicmaps/wikiparser repository.
In summary, I wrote a new program (in the Rust programming language) that runs on the map server and handles:
- Getting the `wikipedia` tags from an OSM dump file.
- Getting article text that matches the OSM tags from Wikipedia Enterprise HTML dump files.
- Simplifying the article HTML content to reduce size.
- Writing them to disk for the map generator to use.
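For instance, OSM `wikipedia` tag values take the form `lang:Article title`, so the tag-extraction step has to split each value into a language code and a title. A minimal sketch, ignoring URL decoding and the other edge cases the real code must handle:

```rust
/// Split an OSM `wikipedia` tag value ("lang:Article title") into a
/// language code and an article title. Simplified: real values may also
/// need URL decoding and validation of the language code.
fn parse_wikipedia_tag(value: &str) -> Option<(&str, &str)> {
    let (lang, title) = value.split_once(':')?;
    if lang.is_empty() || title.is_empty() {
        return None;
    }
    Some((lang, title))
}

fn main() {
    assert_eq!(
        parse_wikipedia_tag("fr:Tour Eiffel"),
        Some(("fr", "Tour Eiffel"))
    );
    assert_eq!(parse_wikipedia_tag("no-colon"), None);
}
```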
This work was done in several stages:
- A new Wikipedia article processing pipeline based on the HTML Dumps that handles deduplication (#6).
- Article content is stripped down to reduce size, matching the original process (which used the Extracts API), while keeping the ability to add links, pictures, videos, and more later (#26).
- Tag extraction is separated from the maps generator, so descriptions can be updated out-of-band (#23).
- A shell script for coordinating the OSM tag data and extracting multiple dumps in parallel (#21).
- A shell script for automated downloads (#22).
- I also created an issue and fix for a bug I discovered in the `ego-tree` crate used by the `scraper` HTML parsing crate.
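To illustrate the deduplication handled in the first stage: many OSM objects can link to the same article, so keying the output by a single identifier (shown here with Wikidata QIDs, as an assumption about one workable key) stores each article exactly once. Illustrative only; the real pipeline writes files to disk.

```rust
use std::collections::HashMap;

/// Several OSM objects (e.g. multiple entrances of one landmark) can link
/// to the same article; keying by QID keeps one copy (the first one wins).
fn deduplicate<'a>(
    articles: impl Iterator<Item = (&'a str, &'a str)>,
) -> HashMap<&'a str, &'a str> {
    let mut unique = HashMap::new();
    for (qid, html) in articles {
        unique.entry(qid).or_insert(html);
    }
    unique
}

fn main() {
    let articles = vec![
        ("Q243", "<p>fr</p>"),
        ("Q243", "<p>fr</p>"),
        ("Q90", "<p>…</p>"),
    ];
    assert_eq!(deduplicate(articles.into_iter()).len(), 2);
}
```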
At this point the dump extractor has been added to the maps build pipeline, and the articles from it were rolled out to the app starting in late August.
Extracting the ~4 million articles in 5 languages took around 2 hours on the maps build server.
All of the primary goals of the project have been accomplished!
Of course, there’s still more work that can be done:
- Resolving missing Wikidata links in the dumps
- Contributing fixes for broken `wikipedia` tags to OSM
- Enabling images in the downloaded articles
- Matching articles on additional
- Coordinating with the map generator to use “secondary language” tags
- At this point gzip/tar is the I/O bottleneck, so finding a faster gzip decompressor is the next easy performance improvement
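Until a faster decompressor turns up, one standard-library mitigation is to overlap decompression with parsing: run the decoder on its own thread and hand chunks to the parser over a bounded channel. This is a sketch, not the wikiparser's current design; `pipelined_total` just counts bytes where real code would parse articles, and the fake chunks stand in for a real streaming gzip decoder (e.g. the flate2 crate).

```rust
use std::sync::mpsc;
use std::thread;

/// Run the (stand-in) decompressor on its own thread and stream chunks to
/// the consumer over a bounded channel, so decompression and parsing overlap.
fn pipelined_total(chunks: Vec<Vec<u8>>) -> usize {
    // A bounded channel provides backpressure if parsing falls behind.
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(8);

    let producer = thread::spawn(move || {
        for chunk in chunks {
            if tx.send(chunk).is_err() {
                break; // consumer hung up
            }
        }
    });

    // Stand-in for "parse each chunk": just count the bytes received.
    let total: usize = rx.iter().map(|chunk| chunk.len()).sum();
    producer.join().unwrap();
    total
}

fn main() {
    // Three fake 4-byte "decompressed" chunks stand in for a gzip stream.
    let chunks = vec![vec![0u8; 4]; 3];
    assert_eq!(pipelined_total(chunks), 12);
}
```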