Real mobile Wikipedia

I’ve seen lots of talk recently about ‘mobile’ versions of Wikipedia. I’m a big fan of Wikipedia and I use it all the time. I’ve looked at all of the “solutions” for so-called mobile versions of Wikipedia and they all suck. Most of them require full-on ’net access or a cell phone, or they only let you load a subset of the data, with limitations. Pffff. What’s the point then? I want everything, all the time, and I want it to be fast. Never mind that the English Wikipedia is huge – an XML dump of just the page contents is 7GB uncompressed.

Hrm. I seem to have these two little handheld computers here, and they run Linux and have 20GB hard disks, you say? Well geez, what are we waiting for! All we need to run MediaWiki – the free software that powers Wikipedia – is Linux (check), Apache (web server), MySQL (database), and PHP (web scripting language), often called LAMP for short. Oh, and we need a dump of Wikipedia, of course – those are generated regularly. Off we go:

Step 1: LAMP on the Pad 3

The Pad 3, with its Fedora Core underpinnings, makes getting a LAMP stack installed ridiculously easy. Open an xterm (Ctrl-Shift-1) and issue this command:

yum install httpd php-mysql mysql-server

Let it resolve all dependencies and install all that stuff. Great.
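Once that finishes, bring the services up and make them start on boot – a minimal sketch, assuming Fedora’s stock init scripts:

service mysqld start
service httpd start
chkconfig mysqld on
chkconfig httpd on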

Step 2: LAMP on the Pad 2

In contrast to its younger brother, the Pad 2 is based on MontaVista Linux CEE and Professional 3.1. This makes a LAMP stack harder to set up, for two reasons:

  • the included apache-dev package is broken six ways from Sunday
  • there are no MySQL or PHP packages included.

I manually patched the busted files from apache-dev: in short, one of the config files in /usr/share/apache-2.0/build/ has to have all of its paths fixed, and the same goes for the apr-config script. I used the generic MySQL 4.1.21 source RPMs to build some halfway decent RPMs for the Pad 2; you still have to manually set up your my.cnf and a startup script, though. PHP I compiled and installed from a source tarball. Let it be known that MediaWiki and PHP should also work with thttpd, which is also provided by MontaVista and might not be so broken.
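For the record, the fix is just search-and-replace on the stale hard-coded paths. A rough sketch – the file name and the old build prefix here are guesses, so grep for what’s actually baked into your copies first:

grep -rl '/opt' /usr/share/apache-2.0/build/ /usr/bin/apr-config
sed -i 's|/opt/mvl/target|/usr|g' /usr/share/apache-2.0/build/config_vars.mk
sed -i 's|/opt/mvl/target|/usr|g' /usr/bin/apr-config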

For MySQL, you will have to set the max_allowed_packet setting to 4M in /etc/my.cnf, otherwise your import process could fail on the larger INSERT statements.
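That is, the relevant stanza of /etc/my.cnf:

[mysqld]
max_allowed_packet = 4M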

Step 3: MediaWiki

Follow the MediaWiki setup instructions. In a nutshell: extract the tarball, rename the directory to something useful (like ‘wikipedia’), move the directory to your webroot (/var/www/html), and run the config page to generate LocalSettings.php. You probably want to disable all of the e-mail options, since you won’t be editing this wiki.
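The nutshell in commands, assuming a generic 1.x tarball (substitute whatever version you actually grabbed):

tar xzf mediawiki-1.x.tar.gz
mv mediawiki-1.x wikipedia
mv wikipedia /var/www/html/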

Now, use mysql and delete all the records from the page, revision, and text tables – the setup process created a default Main Page, and we want the imported data to replace it cleanly.
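Something like this does it, assuming you called the database ‘wikidb’ during setup and used no table prefix:

mysql -u root -p wikidb -e 'DELETE FROM page; DELETE FROM revision; DELETE FROM text;'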

Step 4: import the Wikipedia dump

Wikipedia provides database dumps as compressed XML files, but we need to get that data into the MySQL database. A Java tool called mwdumper is the best way to go about this: it reads the compressed XML and outputs SQL statements, or will even connect directly to a MySQL server and insert them for you. But this will take FOREVER (at least 24 hours), and if you have to interrupt it, you have to start from scratch. Provided you have another Linux box on which to do some preprocessing, though, you can make things easier on yourself. Here’s what to do.

  1. Get a Wikipedia dump file – the one you want is called pages-articles.xml.bz2. Download it to your Linux box. It’ll be big.
  2. Get mwdumper as well.
  3. Get Sun Java 1.4 or 5.0. You can use mwdumper with GNU GCJ, but in my test it’s about 10x slower.
  4. Put both files in a directory, make sure you have about 7GB of free disk space (the split SQL output is about as big as the uncompressed XML), and run java -jar mwdumper.jar --format=sql:1.5 pages-articles.xml.bz2 | sed -e 's/^INSERT/INSERT IGNORE/' | split - wikidump-
  5. This process will chug along for a while and the result will be a bunch of files called wikidump-aa, wikidump-ab, etc. They contain the SQL statements that inject the data, but we’ve split it up into smaller chunks. That way we can interrupt things in the middle. We changed the INSERT command to INSERT IGNORE so that if we do interrupt things and start importing a file again, MySQL won’t complain about duplicate keys.
  6. Feed the data to the MySQL database on your Pepper: mysql -h pepper -u root -p'password' wikidb < wikidump-aa, substituting your Pad’s IP, MySQL database name, username and password (the ones you used in the MediaWiki setup, remember?). Or use a shell script to automate this, printing each filename before it starts loading so that you know where you are – see the sketch after this list. We’re talking about 3.8 million wiki pages here. As of this very moment I’m somewhere in file ‘ah’ and I’ve done 784,000 pages.
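A minimal loop along those lines – same ‘wikidb’ database-name assumption as above, bailing out on the first error so you know where to resume:

for f in wikidump-*; do
    echo "Loading $f"
    mysql -h pepper -u root -p'password' wikidb < "$f" || break
done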
 

4 replies


  1. Hi,
    Why not try converting the MediaWiki website to HTML, and from there to WML, using tools such as html2wml?


  2. You may be interested in checking out http://www.moulinwiki.org. We got the entire French version running on a 554MB CD (standalone browser with search support). An English version is coming soon too.


  3. In case you are not aware already, you can put Wikipedia (images only) on your iPod for mobile Wikipedia access. Not the best user interface, but it’s a start.


  4. You say you have looked at all the mobile solutions – you seem to have missed TomeRaider. Wikipedia works quickly on my HTC device – 1.2 GB on a memory card.