Organic Maps GSoC 2023 Proposal: Wikipedia Dump Extractor

2023-04-04 (Last edited 2024-06-03)
software gsoc

What follows is my proposal for Google’s Summer of Code program under the Organic Maps organization.

Background

Organic Maps is an offline mapping app that uses data based on Open Street Maps (OSM). Sections of the map are downloaded once by users and updated as needed. OSM location data can link to Wikipedia and Wikidata (a structured data graph, which links back to Wikipedia and other Wikimedia projects).

Organic Maps includes any linked Wikipedia article texts in the downloaded map data, which users can read in the app without an active internet connection - especially useful while traveling or exploring new areas with little or inconsistent internet access.

Problem Description

Currently Organic Maps downloads the Wikipedia article text by crawling Wikipedia webpages with the Wikipedia API. The crawling can be slow and time-consuming, and a full update requires downloading millions of pages. Conforming to Wikipedia’s API client requirements and rate limitations was difficult and resulted in a ban of the crawling script.

This project aims to rework the Wikipedia article downloading using a different method: using the database dumps of Wikidata and Wikipedia articles.

This will involve automating the downloading and processing of tens of gigabytes of compressed files while converting the article data from Wikipedia’s internal wikitext format to HTML, all in a performant and robust manner.

For context, the OSMAnd project (another OSM-powered offline map app) also uses wikipedia/wikidata dumps to provide their offline articles.

Implementation Plan

Organized from most to least unknowns.

MediaWiki Parser Research

Determine which of the available wikitext parsers best meet the performance and quality goals of the project.

Currently Organic Maps stores and renders the Wikipedia descriptions in HTML. The current scraper design downloads HTML directly, but the Wikipedia article dumps only contain the raw wikitext. To remain compatible with the app, the article text will need to be converted to HTMl.

There are Wikimedia Enterprise dumps that contain pre-rendered HTML, but:

it is an “experimental” service with no guarantees of long-term updates
the HTML doesn’t contain any images, which Organic Maps wants to include in a later point Building an extractor based on these dumps would be simpler, but without any reliability guarantees, the rest of this proposal will consider using the article dumps.

Mediawiki WikiText is notoriously difficult to parse (it was even the subject of former GSoC projects). The format includes a templating system that is heavily used by wikipedia, embedded HTML, and overlapping or misformatted text is common. The authoritative parser/converter used by Wikimedia is Parsoid, implemented in PHP. To meet the performance and portability goals of this project, using Parsoid and the PHP runtime is undesirable.

There are many other parsers for the format, but none of them are considered a complete replacement for Parsoid. This wikipedia page lists many of them, but is low on details.

Fortunately, for the goals of this project, not all of wikitext’s formatting is required. The current design removes tables, markup, and whole sections to keep mostly the text content.

Ideally, one of the alternative parsers can:

handle the necessary markup for the project
hide/gracefully ignore any markup it can’t handle
be performant enough to keep run time under several hours on the build server
convert directly to HTML

To determine the capabilities of each, I recommend selecting a sample of articles used by the app, running them through a selection of parsers, and comparing the output against Parsoid and the existing scraped pages as reference.

Top candidates that I would recommend starting with:

Limiting the amount of time spent on this will be important.

An explicit non-goal of this project is writing a new wikitext parser. Implementing custom HTML conversion for a subset of a parser’s output may be necessary.

Initial Dump Extractor

Given supported languages, Wikipedia article titles and Wikidata ids, and Wikipedia/Wikidata dumps, convert Wikidata ids to Wikipedia article titles, then extract article text for all Wikipedia article titles in the supported languages.

To optimize performance, the extractor will likely need to be written in the same language as the selected parser(s) or one that interfaces well with it. Writing all articles in wikitext to disk and then processing them into HTML would add a lot of I/O overhead.

At this point, I think Rust is a good choice. It:

is performant and easily parallelizable
has good hooks to be called from within other language runtimes (making it somewhat parser agnostic)
meets an Organic Maps goal of introducing it to the codebase
is a language I am very comfortable with

But, depending on the results of the parser research, golang or even python might be a reasonable choice.

The current scraper is provided with lists of Wikipedia article titles and Wikidata IDs. The dump extractor will need to:

Convert Wikidata IDs to Wikipedia article titles in supported languages using one of the Wikidata dumps:
- the full json dump (~80GB)
- the full xml dump (~150GB)
- the wb_items_per_site dump (~1.5GB) of only inter-wiki links in a MySQL dump
For each supported language:
1. Use the compiled list of article titles with the dump’s multistream index to create a list of byte offsets and articles to start decompressing at.
2. Seek to each archive, decompress, and search the xml soup for matching article titles.
3. Extract the matching wikitext, parse, convert, simplify and minify it into HTML, and write that to an output file.

Using streaming bzip, xml, and json parsers that only keep the relevant data in memory will be an important aspect for performance.

Redirects

One wrinkle in this is that the Wikipedia titles can be redirects to actual articles, and OSM wiki even recommends linking to a page that redirects instead of a specific section. Since there is no guarantee that a title will be redirected to a title later in the dump, getting all linked articles in a single pass might not be possible.

Helpfully, the Wikipedia server doesn’t support more than one redirect, but that isn’t a guarantee that there won’t be chains of redirects in the dump data.

This needs to be investigated further, but some potential options are:

set a limit on redirect depth/rescans
resolve all redirects in an initial round of passes before processing the articles
unpack the dump into a format better suited for random access, e.g. a database or filesystem (this could significantly increase the I/O and processing time)

Images

A goal of Organic Maps is to include image links in the output, and dynamically block/fetch them in the app depending on the user’s data usage preferences. This should be relatively simple for the extractor to handle as long as the wikitext parser supports it.

Most of the work will be in designing the handling on the app side. At first glance it seems like injecting some Javascript into the webviews that display the articles would work, but coordinating and testing that across the different platform APIs will take time.

Wikipedia Dump Downloader

Automatically download the latest required Wikipedia and Wikidata dumps.

New Wikipedia dumps are created one to two times a month. Currently Organic Maps updates it’s map files roughly every month. To keep this process as low maintenance as possible, a program to download and verify the Wikidata and Wikipedia data will be created.

The dumps are made available through an html webpage with available mirrors. Each dump also provides a json status endpoint and md5 and sha1 hashes for all files.

The downloader should be able to:

determine what the latest completed dump is
determine if the latest files are already downloaded
retry on recoverable errors with some form of backoff
continue interrupted downloads
verify the downloaded files against the public hashes

Organic Maps has some download logic already implemented in python as part of the map generator. This could be extended, or a new program/script can be created. There are many existing libraries and programs for this sort of work, so I expect this will not be too difficult.

Nice to Haves

Once the extractor and downloader are working, there is more work that can be done.

Further Optimization of Dump Extractor

Depending on the bottlenecks discovered when profiling the extractor, there will be several opportunities for improving performance:

avoid processing and overwriting articles that haven’t changed since the last dump, a straightforward method would be to sync the last modified time in the xml with the modified time of the output file
parallelizing the decompression similar to other parallel bzip programs
parallelizing the article conversion
parallelizing the article output
pregrouping the output into map regions for the map generator

While the goal is not for the initial extractor to be slow, optimization can be a rabbit-hole and is best when guided by actual analysis, so I think it is best to leave it for the end, rather than prevent the rest of the goals from being accomplished. Proper design of the extractor will make this a smoother process.

OSM Tag Extractor

Currently the OSM Wikipedia/Wikidata tags are extracted by the map generator. Ideally the Wikipedia generator would not depend on the current generator in any way, so that it can be run independently.

Extracting tags from OSM data is a common task, so there are many available options:

various existing OSM tools like osmconvert
parsing the OSM data with the same language the Wikipedia extractor is written in
parsing with an OSM library like libosmium

Deliverables

report comparing parser outputs
updated mediawiki parser table on wikipedia
Wikipedia/Wikidata dump downloader tool, either integrated with the current map generation tool or standalone
database extractor tool that is a drop-in replacement for the defunct scraper
documentation of the installation, configuration, and design of the above
processed Wikipedia articles ready for use in the app

Timeline

To be upfront, I also plan to take a course in sign language this summer, but beyond that (and the wedding mentioned below) have no commitments. The stipend from GSoC would allow me to devote the required time to this project.

May 4 - 28 (Community Bonding Period)

get access to output of existing scraper
get sample inputs of Wikidata/Wikipedia IDs/titles
download full database dumps (this may take 3 weeks on my internet connection)
set up and test wikitext parsers

May 29 (Begin Coding)

begin extractor
Implement Wikipedia dump -> HTML pipeline

June 14 - 21 (Busy)

I’ll be traveling for a friend’s wedding, I expect to get minimal work done this week and push the working hours to before/after.

June 22 - July 10

Implement multistream index handling
Implement Wikidata parsing
Implement redirect handling

July 10 - 14 (Midterm evaluations)

Begin downloader

July 14 - July 31

Test downloader on production server

August 1 - 21

Finish downloader
Run extractor/downloader on complete data set
Document downloader and extractor
Work on “Nice to haves”

August 21 - 28 (Final week)

Wrap up documentation because everything will be done :)