Reigniting My Star Wars Data Project

Moving from brittle API scraping and data loss to a sustainable local MediaWiki mirror using dumps, Docker, and Unraid.

Published on Monday, 08 September 2025

I shelved this project for a while after a string of avoidable problems: unstable containers, MongoDB volume mishaps, API overuse worries, and a general sense that I was “hammering” the Star Wars Fandom wiki harder than felt respectful. This post covers how I rebooted the whole effort around a radically simpler (and more ethical) model: pull an official dump, stand up a local MediaWiki, and iterate offline.

Why I Paused the Project

Short version of why it stalled:

  • Heavy API crawling felt disrespectful.
  • Lost data after container / volume mistakes.
  • Mongo added fragility I didn’t need.
  • Small model tweaks forced full re-pulls.

New Approach (Snapshot, Not Scrape)

  1. Download official dump (starwars_pages_current.xml.7z).
  2. Extract once.
  3. Import into a local MediaWiki.
  4. Do all parsing / modeling offline.

Outcome: reproducible, zero API pressure, easy to refresh later.
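
Steps 1 and 2 in shell form, as a sketch: the dump URL below is a placeholder for the real current-pages link (Fandom publishes database dumps from the wiki's Special:Statistics page), and the destination matches the dumps mount used in the compose file below.

# placeholder URL: substitute the real current-pages dump link
DUMP_URL="https://example.invalid/starwars_pages_current.xml.7z"
DEST=/mnt/user/appdata/starwars-mediawiki/dumps

mkdir -p "$DEST"
curl -L "$DUMP_URL" -o "$DEST/starwars_pages_current.xml.7z"
7z x -y "$DEST/starwars_pages_current.xml.7z" -o"$DEST"   # needs p7zip; extract once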

Minimal Compose (Unraid)

services:
  db:
    image: mariadb:11.4
    environment:
      - MARIADB_ROOT_PASSWORD=changeme   # swap out before deploying
      - MARIADB_DATABASE=mediawiki
      - MARIADB_USER=wiki
      - MARIADB_PASSWORD=changeme        # swap out before deploying
    volumes:
      - /mnt/user/appdata/starwars-mediawiki/mysql:/var/lib/mysql
  mediawiki:
    image: mediawiki:1.44
    depends_on:
      - db                               # start the database first
    ports:
      - "8000:80"
    volumes:
      # LocalSettings.php and the extracted dump are mounted read-only
      - /mnt/user/appdata/starwars-mediawiki/LocalSettings.php:/var/www/html/LocalSettings.php:ro
      - /mnt/user/appdata/starwars-mediawiki/dumps:/dumps:ro
networks:
  default:
    external: true
    name: proxynet                       # pre-existing Unraid proxy network
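
Bringing the stack up and sanity-checking it from the Unraid host (this assumes the compose file lives in the appdata folder above and a LocalSettings.php already exists there):

cd /mnt/user/appdata/starwars-mediawiki
docker compose up -d
# the wiki's main page should answer with HTTP 200 once both containers are running
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/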

Import (Inside Container)
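
First, get a shell inside the MediaWiki container (run from the compose directory; the service name comes from the file above):

docker compose exec mediawiki bash   # maintenance scripts live under /var/www/html

Then the import and job run are two maintenance scripts: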

# --report=1000 prints progress every 1,000 pages
php maintenance/importDump.php --report=1000 /dumps/starwars_pages_current.xml
# work through the deferred link/update jobs the import queues
php maintenance/runJobs.php --maxjobs 5000

Time: ~30–45 min for >200k pages on my setup.
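
Optionally, still inside the container, two stock maintenance scripts tidy up after a big import and make the validation below line up:

php maintenance/initSiteStats.php --update      # refresh the cached counts shown on Special:Statistics
php maintenance/rebuildrecentchanges.php        # repopulate recent changes from the imported revisions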

Quick Validation

  • Compare page counts with Special:Statistics (local mirror vs. the source wiki; see the API check below).
  • Spot-check a handful of deep-lore pages at random.
  • Note: the dump contains only current revisions, not full history (that’s fine for now).
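
The count comparison doesn’t have to be eyeballed: the local mirror’s own API exposes the same numbers as Special:Statistics (this assumes api.php sits at the web root, as it does with the official image):

# page/article/edit totals for the local mirror
curl -s "http://localhost:8000/api.php?action=query&meta=siteinfo&siprop=statistics&format=json"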

Why This Is Better

  • Single atomic snapshot.
  • No accidental API hammering.
  • Easily reproducible.
  • Safe space for schema & embedding experiments.

Next Iterations

  • Automate monthly dump refresh (rough sketch after this list).
  • Add diff detection (hash + date) to skip re-imports when nothing changed.
  • Build an entity graph (character → appearances → timeline).
  • Optional embeddings index.
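
A rough shape for the first two items, combined into one refresh script. A sketch only: the dump URL is again a placeholder, the paths match the compose setup above, and the import commands are the same ones used earlier.

#!/bin/bash
# Hypothetical monthly refresh: re-download the dump, re-import only if its hash changed.
set -euo pipefail

DUMP_URL="https://example.invalid/starwars_pages_current.xml.7z"   # placeholder: use the real dump link
APPDATA=/mnt/user/appdata/starwars-mediawiki
DEST="$APPDATA/dumps"

cd "$APPDATA"
curl -sL "$DUMP_URL" -o "$DEST/new.7z"

NEW_HASH=$(sha256sum "$DEST/new.7z" | cut -d' ' -f1)
OLD_HASH=$(cat "$DEST/last.sha256" 2>/dev/null || true)

if [ "$NEW_HASH" != "$OLD_HASH" ]; then
  7z x -y "$DEST/new.7z" -o"$DEST"                   # overwrite the previous extraction
  docker compose exec -T mediawiki \
    php maintenance/importDump.php --report=1000 /dumps/starwars_pages_current.xml
  docker compose exec -T mediawiki php maintenance/runJobs.php --maxjobs 5000
  echo "$NEW_HASH" > "$DEST/last.sha256"             # the "hash" half of hash + date
  date -u +%Y-%m-%dT%H:%M:%SZ > "$DEST/last_refresh" # the "date" half
else
  rm "$DEST/new.7z"                                  # nothing changed upstream; discard
fi

One caveat: importDump.php adds and updates pages but never deletes anything, so articles removed upstream will linger until an occasional from-scratch re-import.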

One-Liners

Simplicity beats clever ETL. Start from dumps if they exist.


That’s it: a concise, stable foundation, re-established.