Month: October 2006

  • Morning conversations…

    Yesterday morning:

    RING RING… TekSavvy technical support, Nick speaking…

    Hi, my DSL is flapping up and down yet again, can you run a sync test please?

    Sure thing sir… yes, wow, your stats are terrible – this must be a line problem – would you like me to open a ticket with Bell?

    YES – and please note that this happened for two months in the spring and they did nothing about it, and now it’s back and I WANT A TEMP LINE

    Sure… Bell should call you within 24 hours.

    (grumble grumble bloody hell)

    This morning:

    Maptuit, Victor speaking?

    Hi, this is Chris from the Bell Test Center, your ISP called about a problem with your line?

    Yes, my line goes to crap every time the weather gets cold and wet… in the spring they already determined that all my buried service wire pairs were boned, but then the weather got better and the problem went away – I WANT A TEMP LINE

    Well, it looks like there’s a ground fault on your line sir, shall I send a tech out?

    I KNOW I WANT A TEMP

    Does tomorrow morning work for you?

    YES AND THE TECH BETTER GIVE ME A SODDING TEMP LINE

  • Real mobile Wikipedia

    I’ve seen lots of talk recently about ‘mobile’ versions of Wikipedia. I’m a big fan of Wikipedia and I use it all the time. I’ve looked at all of the “solutions” for so-called mobile versions of Wikipedia and they all suck. First of all, many of them require full-on ‘net access or a cell phone, or they only let you load a subset of the data with limitations. Pffff. What’s the point then? I want everything, all the time, and I want it to be fast. Never mind that the English Wikipedia is huge – an XML dump of just the page contents is 7GB uncompressed.

    Hrm. I seem to have these two little handheld computers here, and they run Linux and have 20GB hard disks you say? Well geez, what are we waiting for! All we need to run MediaWiki – the free software that powers Wikipedia – is Linux (check), Apache (web server), MySQL (database), and PHP (web scripting language) – often called LAMP for short. Oh, and we need a dump of Wikipedia, of course – those are generated regularly. Off we go:

    Step 1

    The Pad 3, with its Fedora Core underpinnings, makes getting a LAMP stack installed ridiculously easy. Open an xterm (Ctrl-Shift-1) and issue this command:

    yum install httpd php-mysql mysql-server
    

    Let it resolve all dependencies and install all that stuff. Great.
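
    That gets you the packages, but Apache and MySQL won’t actually be running yet. A quick sketch, assuming the usual Red Hat service and chkconfig tools and the stock init script names (run as root):

    # start Apache and MySQL now
    service mysqld start
    service httpd start
    # and have them come back after a reboot
    chkconfig mysqld on
    chkconfig httpd on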

    Step 2

    In contrast to its younger brother, the Pad 2 is based on MontaVista Linux CEE and Professional 3.1. This makes a LAMP stack harder to set up, for two reasons:

    • the included apache-dev package is broken six ways from Sunday
    • there are no MySQL or PHP packages included.

    I manually patched the busted files from apache-dev: in short, one of the config files in /usr/share/apache-2.0/build/ has to have all of its paths fixed, and the same for the apr-config script. I used the generic MySQL 4.1.21 source RPMs to build some halfway decent RPMs for the Pad 2: you still have to manually set up your my.cnf and a startup script, though. PHP I compiled and installed from a source tarball. Let it be known that MediaWiki and PHP should also work with thttpd, which is also provided by MontaVista and might not be so broken.
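
    The PHP build itself was the least painful part. Roughly this kind of thing; the apxs path here is just a guess, so point it at wherever the patched apache-dev actually installs apxs on your Pad:

    # build PHP as an Apache 2 module with MySQL support
    ./configure --with-apxs2=/usr/sbin/apxs --with-mysql
    make
    make install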

    For MySQL, you will have to set max_allowed_packet to 4M in /etc/my.cnf, otherwise your import process could fail.
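
    Concretely, that means something like this in /etc/my.cnf, in the standard [mysqld] section:

    [mysqld]
    max_allowed_packet = 4M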

    Step 3

    Follow the MediaWiki setup instructions. In a nutshell: extract the tarball, rename the directory to something useful (like ‘wikipedia’), move it to your webroot (/var/www/html), and run the config page to generate LocalSettings.php. You probably want to disable all of the e-mail options, since you won’t be editing this wiki.
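
    In shell terms it’s roughly this; the tarball name is whatever MediaWiki version you grabbed, and the config URL assumes you’re browsing from the Pad itself:

    tar xzf mediawiki-1.7.1.tar.gz        # your version will vary
    mv mediawiki-1.7.1 /var/www/html/wikipedia
    # hit http://localhost/wikipedia/config/ in a browser, fill in the form,
    # then move the LocalSettings.php it generates up into /var/www/html/wikipedia/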

    Now, use mysql to delete all the records from the page, revision, and text tables.
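
    Something along these lines, assuming you called the database wikidb during the MediaWiki setup:

    mysql -u root -p wikidb
    mysql> DELETE FROM page;
    mysql> DELETE FROM revision;
    mysql> DELETE FROM text;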

    Step 4

    Wikipedia provides database dumps as compressed XML files, but we need to get that data into the MySQL database. A Java tool called mwdumper is the best way to go about this: it reads the compressed XML and outputs SQL statements, or will even connect directly to a MySQL server and insert them for you. But this will take FOREVER (at least 24 hours), and if you have to interrupt it, you have to start from scratch. Provided you have another Linux box on which to do some preprocessing, though, you can make things easier on yourself. Here’s what to do.

    1. Get a Wikipedia dump file – the one you want is called pages-articles.xml.bz2. Download it to your Linux box. It’ll be big.
    2. Get mwdumper as well.
    3. Get Sun Java 1.4 or 5.0. You can use mwdumper with GNU GCJ, but in my test it’s about 10x slower.
    4. Put both files in a directory, make sure you have about 7GB of free disk space, and run java -jar mwdumper.jar --format=sql:1.5 pages-articles.xml.bz2 | sed -e 's/^INSERT/INSERT IGNORE/' | split - wikidump-
    5. This process will chug along for a while and the result will be a bunch of files called wikidump-aa, wikidump-ab, etc. They contain the SQL statements that inject the data, but we’ve split it up into smaller chunks. That way we can interrupt things in the middle. We changed the INSERT command to INSERT IGNORE so that if we do interrupt things and start importing a file again, MySQL won’t complain about duplicate keys.
    6. Feed the data to the MySQL database on your Pepper: mysql -h pepper -u root -p'password' wikidb < wikidump-aa, substituting your Pad’s IP and the MySQL database name, username, and password (the ones you used in the MediaWiki setup, remember?) Or use a shell script to automate this, printing each filename before it starts loading so that you know where you are; a sketch follows below. We’re talking about 3.8 million wiki pages here. As of this very moment I’m somewhere in file ‘ah’ and I’ve done 784,000 pages.
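
    The automation script can be as dumb as this, reusing the same hypothetical hostname, credentials, and wikidb database name as above:

    #!/bin/sh
    # feed each chunk to the Pad in order, announcing progress as we go
    for f in wikidump-*; do
        echo "Loading $f"
        mysql -h pepper -u root -p'password' wikidb < "$f" || break
    done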