Pavlo import project

From WikiApiary, monitoring the MediaWiki universe

Nemo bis mentioned Pavlo's list of wikis on User_talk:Thingles#mutante.27s_wikistats. Rather than dealing with this list's format inside the mutante list, it is easier to work with the file directly.


Pages in Category:Pavlo import: 4,713

Page growth captured after finishing import.
Sat Jan 26 07:54:26 CST 2013
Ran the first 100 lines of the file. Processed: 100, Success: 70, Logos: 56, Fail: 30. See Category:Pavlo import and /Trial run output (first 100 lines). WARNING: Any edits made to these pages will be erased in a subsequent run! The check for SMW info found two successes in this first 100.
Sat Jan 26 14:48:38 CST 2013
Started importing 900 more sites from this file. Will be doing it in chunks of 1,000 and will post the import output when done. The task is running with a 2-second delay between each site, so it should take about 4 hours.
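The chunk-of-1,000 runs with a per-site delay could be sketched roughly like this; `import_site` is a hypothetical stand-in for the bot's actual per-site routine, and the chunk size and 2-second delay come from the notes above:

```python
import time

def chunked(items, size=1000):
    """Split the site list into fixed-size chunks, as the import was run."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_import(lines, import_site, delay=2):
    """Import each site with a fixed delay between sites to limit server load."""
    for chunk in chunked(lines):
        for line in chunk:
            import_site(line)
            time.sleep(delay)
```

At 2 seconds per site, a 7,000-line file works out to roughly four hours per day of running, which matches the pacing described in the log.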
Sat Jan 26 22:57:03 CST 2013
Have imported the first 2000 records, just started the next thousand to run overnight.
Sun Jan 27 08:23:59 CST 2013
Started loading the final two file segments. Pushing things along this morning by loading three files in parallel with no delay. Server CPU is at about 50% doing this, and the wiki is a little slower than typical but holding up fine.
CPU usage during import.
Sun Jan 27 10:29:27 CST 2013
All files processed and imported. Will be uploading the log files shortly.


Given the size of this import, I think it makes sense to add a specific category tag to these pages. Suggesting Category:Pavlo import be placed on all these pages. This would especially help with any Replace Text needs or other follow-up mass actions. Thingles (talk) 03:18, 26 January 2013 (UTC)

I reran it with this category code added to the template. Seems like a good idea. Thingles (talk) 03:28, 26 January 2013 (UTC)
Yes, this should be done. Also it would help to track the source of the content. --[[kgh]] (talk) 10:05, 26 January 2013 (UTC)

This import is going to be messy. As you can see in the debug output, these sites would be imported with Validated false and Active false. It may make sense to set only Active to false: since the name of the wiki is retrieved from the API during this import, the API URL is already validated. Thingles (talk) 03:20, 26 January 2013 (UTC) PS - Note Validated=No in the test runs below.

I would import with Validated=no, since this helps to distinguish whether a human eye has looked at the page or not. Since Active has to be ticked anyway ... There may still be wikis that are valid but not activated (e.g. by Audit Bee, or for other reasons). --[[kgh]] (talk) 10:05, 26 January 2013 (UTC)

Audit Bee will run this import. Right now he is not checking for any preexisting sites. There could be a query to see if any existing sites have the same API URL being reviewed before importing, but it would slow it down dramatically and I suspect would not hit any matches. Either way, if a site is overwritten the revision history would be able to recover the original. Thingles (talk) 03:23, 26 January 2013 (UTC)

Looking at the file a bit more, this may be needed.
% grep wikinews wikilist.txt | wc -l
A check on a preexisting API URL would match for these and could be skipped. Thingles (talk) 03:30, 26 January 2013 (UTC)
A cheap and cheesy alternative would be to check the result of the edit action on the wiki. If it reports that a new page was created, there is no conflict; if it returns anything else, I could flag that page name for review. This seems workable. Thoughts? A new page will return
{'edit': {'pageid': 2541, 'title': 'WKGE-Wiki/Extensions', 'newtimestamp': '2013-01-23T06:17:17Z', 'result': 'Success', 'new': '', 'oldrevid': 0, 'newrevid': 10141}}
Note the 'oldrevid': 0. If the edit is a nochange or the oldrevid is not 0 then further investigation would be warranted. Thingles (talk) 03:48, 26 January 2013 (UTC)
Yeah, this would be the way. --[[kgh]] (talk) 10:05, 26 January 2013 (UTC)
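A minimal sketch of that check, assuming the edit response shape quoted above (a dict with an 'edit' block); `needs_review` is a hypothetical helper name, not the bot's actual code:

```python
def needs_review(edit_response):
    """Flag an edit result that may have touched an existing page.

    A brand-new page comes back with 'oldrevid' == 0 (and a 'new' marker);
    a nonzero oldrevid or a nochange result means the page already
    existed, so it should be set aside for human review.
    """
    edit = edit_response.get('edit', {})
    if 'nochange' in edit:
        return True
    return edit.get('oldrevid', 1) != 0
```

The sample response shown above would pass this check (oldrevid is 0), while any overwrite or no-op edit would be flagged.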

It is really hard to know if a language code should be added to the target page names. Not sure how to deal with that, or if it's even worth trying. Thingles (talk) 03:30, 26 January 2013 (UTC)

This should only be done if there is some kind of language farm. However, in this case the page gets overwritten by Audit Bee over and over, which tells us a second import with language codes added to the page names should be done. --[[kgh]] (talk) 10:05, 26 January 2013 (UTC)
Nice to see this moving! Note that wikistats' mediawikis list also has some wikis not listed here, and some other noise and duplicates removed. You'll have to check for duplicates later, and/or remove from the lists those that were already reported in wikistats' bugzilla component. --Nemo 14:24, 26 January 2013 (UTC)
I'm going to have Template:Website check to see if it has any duplicates in the wiki. And future importers will need to have better logic to see if they are trying to import something that already exists. This list will give a very wide and diverse base, and really everything WikiApiary needs to ensure that all the other related systems will scale up to a larger number of wikis. I'm really excited to run this and fill out the wiki list! Thingles (talk) 15:06, 26 January 2013 (UTC)
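As a rough illustration of the kind of duplicate check discussed here, a sketch that flags API URLs appearing more than once in an imported list; the light normalization (trailing slash, case) is an assumption for the example, not what Template:Website will actually do:

```python
from collections import Counter

def duplicate_api_urls(urls):
    """Return the API URLs that appear more than once, lightly normalized."""
    counts = Counter(u.strip().rstrip('/').lower() for u in urls)
    return sorted(u for u, n in counts.items() if n > 1)
```

Running this over the import list (or over the API URLs already in the wiki) would surface the candidates that need the kind of review Nemo mentions.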

Next steps: What do we see needing to change, if anything, before running the entire list into the wiki? I would add a sleep in the bot and then let it run. Probably take about a day to import the entire file. Thoughts? Thingles (talk) 15:04, 26 January 2013 (UTC)

Now that the upload is completed, is it safe to say that only about 4,500 of 7,300 wikis were still valid? That's quite a drop, since the list is only about 1½ years old, if I remember correctly. --[[kgh]] (talk) 17:19, 27 January 2013 (UTC)

To Do

  1. Add 2nd category for groups of 100 sites in the import. Useful to limit Replace Text actions.
  2. Attempt to determine if Semantic MediaWiki is available and set the flag appropriately.
    • This would include determining if SMW is 1.6+. We actually need a further property that distinguishes between < 1.6 and >= 1.6. The latter should be done anyway. --[[kgh]] (talk) 11:35, 26 January 2013 (UTC)
      • I'm going to do it the easy way and just ask for the SMW stats API method; if it returns, I know stats should be collected. Thingles (talk) 12:43, 26 January 2013 (UTC)
  3. Attempt to identify remote $wgLogo, upload file and set it.
    • This would include renaming the file to the wiki's name, since heaps of files will just be called logo.png. Working on logos will, besides adding the description (I am afraid that nothing can be done about this automatically), take most of the time. --[[kgh]] (talk) 11:32, 26 January 2013 (UTC)
      • I've got it working for a large number of them! :-) And they will be renamed with the wiki name and logo appended. Thingles (talk) 12:43, 26 January 2013 (UTC)
        • This is news! Amazing! --[[kgh]] (talk) 14:59, 26 January 2013 (UTC)
  4. Identify import overwrites for review.
    • Perhaps it could check for existing pages on the wiki, as Replace Text does, rather than checking whether a specific API URL already exists. --[[kgh]] (talk) 14:58, 26 January 2013 (UTC)
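For item 2, the "just ask for the SMW stats API method" approach might look like this sketch. It assumes SMW's action=smwinfo API module, which only exists from SMW 1.6 onward, so a reply carrying an 'info' block implies both that SMW is present and that it is new enough to collect from; the URL handling and parameter choice are illustrative:

```python
import json
import urllib.request

def has_smw_info(data):
    """A wiki running SMW 1.6+ answers action=smwinfo with an 'info' block;
    a wiki without it (or with older SMW) returns an unknown-action error."""
    return isinstance(data, dict) and 'info' in data

def smw_collectable(api_url):
    """Ask the wiki's API for SMW stats; True means stats can be collected."""
    url = api_url + '?action=smwinfo&info=propcount&format=json'
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return has_smw_info(json.load(resp))
    except OSError:
        return False
```

Since the same call answers both "is SMW installed?" and "is it >= 1.6?", this also covers the version distinction kgh raises above, at least for collection purposes.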