Final touches cleaning MediaWiki tables after removing spam pages


As a testament to how effective my SEO efforts have been for one of our websites, a wiki residing on the same domain collected 2601 spam pages in 2 days (coincidence? It got listed on SERPs 2 days ago...).

I have locked the wiki down (read only), enabled block lists, CAPTCHAs and so on, and used the Nuke extension to remove all the spam.

Now, that is a remarkable result for a single extension, but it still left leftovers here and there which I'd love to trim out.

Basically, Nuke (which I believe is an official extension) left "orphaned" records in the following tables: pagelinks, searchindex, users.

I have no issue deleting records directly, but I don't want to break the database's relational consistency by randomly pruning things. I'm perfectly able to run SQL queries, Linux command-line scripts and all sorts of advanced stuff.

So, here are some questions for the helpful Stack Overflow readers who know MediaWiki internals:

  • May I freely delete rows from the users table? I only need to keep two rows, so the SQL query itself is easy; I just don't want to cause side effects in whatever other tables might link to them.

  • What can I do to remove the orphaned records in pagelinks? They clearly point to pages that are now gone, yet the default MediaWiki maintenance scripts I have used (first the Nuke extension, then rebuildall.php) don't trim those orphans away. This leads me to believe I might still have garbage somewhere that keeps the scripts from removing the links pointing to it. However, I have triple-checked the pages: only the few pages made by us are left. I have purged the revisions as well. (See the SQL sketch after this list for what I mean by both points.)
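
To make the question concrete, this is the kind of clean-up I have in mind. It is only a rough sketch, assuming a stock MySQL-backed MediaWiki schema with no $wgDBprefix: a default install names the account table user rather than users, and the two IDs to keep below are placeholders.

    -- Take a dump first (e.g. with mysqldump) so this is reversible.
    -- Keep only the two legitimate accounts; 1 and 2 are placeholder IDs.
    DELETE FROM user            WHERE user_id NOT IN (1, 2);
    DELETE FROM user_groups     WHERE ug_user NOT IN (1, 2);
    DELETE FROM user_properties WHERE up_user NOT IN (1, 2);
    DELETE FROM watchlist       WHERE wl_user NOT IN (1, 2);

    -- Drop pagelinks rows whose source page no longer exists.
    DELETE FROM pagelinks
    WHERE pl_from NOT IN (SELECT page_id FROM page);

As far as I know MediaWiki's MySQL schema declares no foreign keys, so nothing would stop these deletes, which is exactly why I want to be sure they don't leave other tables pointing at rows that are gone.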

I have tried running the refreshLinks.php and orphans.php maintenance scripts from the console, but they did nothing relevant.

I am sure the pagelinks table can be trimmed further, because with the dumpLinks.php maintenance script I can easily grep all sorts of "inconvenient" words and links.
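
For example, something along these lines lets me spot leftover spam targets straight from the database instead of grepping a dump; the search terms below are made-up placeholders, not actual findings:

    -- Count remaining link targets that match suspicious words.
    SELECT pl_namespace, pl_title, COUNT(*) AS hits
    FROM pagelinks
    WHERE pl_title LIKE '%viagra%' OR pl_title LIKE '%casino%'
    GROUP BY pl_namespace, pl_title
    ORDER BY hits DESC;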

1 answer

Answer by Collector:

Hopefully you back up your databases at least once a day. In that case, assuming the wiki is rather new, it might have been easiest to simply revert to a non-spammed version of your DB and manually re-apply the changes made during those two days.

Generally, a relational database should have strict relations that won't let you leave it in an inconsistent state, either by raising an error or by cascading your action. I'm not sure how well MediaWiki defines its relations, though.

I've removed rows from the users table and haven't noticed any problems. I'd suggest removing the rows from the pagelinks table and seeing what happens.
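
The same pattern should also cover the orphaned searchindex rows mentioned in the question. A rough sketch, assuming the default schema where si_page holds the page ID; try it on a copy or backup first:

    -- Remove full-text index rows for pages that no longer exist.
    DELETE FROM searchindex
    WHERE si_page NOT IN (SELECT page_id FROM page);

If you would rather not touch searchindex by hand, rebuilding it with the rebuildtextindex.php maintenance script should give the same result.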

You could verify the sanity of your wiki by launching an automated crawler on it and seeing if any errors come up.