WikiApiary talk:Operations/2013/June

Wiki of the Month
WikiAPIary is being showcased as Wiki of the Month on Semantic-MediaWiki.org for the month of June, 2013. --&#91;&#91;kgh&#93;&#93; (talk) 07:50, 1 June 2013 (UTC)


 * [[Image:thumbs-up-smiley.jpg|200px]] 🐝 thingles (talk) 14:14, 1 June 2013 (UTC)

Founded dates
I've seen a number of people adding founded dates to websites. Honestly, way more than I expected. I figured that would be a pain to find by hand. :-) Anyway, once bug #48838 is cleared in Semantic Forms I plan on adding a check in User:Audit Bee that will get the date of revision 1 for every wiki and populate it in founded date. I'll only do this if a date isn't there, so it won't overwrite any existing data. Once I make that change I'm going to flag all websites for a new audit so they get populated universally. Just an FYI that you needn't worry much about getting this data point. 🐝 thingles (talk) 18:56, 2 June 2013 (UTC)

IRC #wikiapiary and WorkerBee
I've setup a channel on Freenode at. Feel free to stop by and camp. I also have setup a bot called WorkerBee. The bot is powered by Willie and is running on my server. I've used Willie in the past to interface with the MediaWiki API and I'm thinking it would be fun to have some commands that WorkerBee can do in IRC to interface with WikiApiary. Additionally I plan on having him share some Bot log information into the IRC channel. If you have suggestions or ideas please share! 🐝 thingles (talk) 19:14, 2 June 2013 (UTC) ''PS: IRC bots are fun. :-)''

Introducing wmbot
You may see some edits from User:Wmbot. This is a bot that I'm writing to do two things:


 * 1) This bot is pulling the Wikimedia configuration files and creating new sites on WikiApiary for any sites that it finds. This will quickly fill out all of the language sites for all the Wikimedia projects.
 * 2) This is intended to be a standalone bot (not using the WikiApiary libraries that User:Bumble Bee and User:Audit Bee use) that can serve as an example of how to write a bot that creates sites automatically. Hopefully someone that may want to write a crawler, or an admin of another wiki farm, can use this code to make a bot that synchronizes their farm as well.

I'm being cautious so I just ran wmbot and had it create 10 Farm:Wikipedia profiles. I'm going to watch it and expand from there. 🐝 thingles (talk) 22:23, 3 June 2013 (UTC)


 * Alright, I cleaned up the code for wmbot.py a bit more. I'll add it to github shortly, it still has a username and password in the source that I have to extract out. If anyone would mind checking out Special:Contributions/Wmbot and looking at the pages that it has made so far that would be great. Assuming they all work out fine, I'll finish the run tomorrow to add all sites it knows how to parse. This will put all but a handful of unique Farm:Wikimedia sites into WikiApiary, and can easily be run weekly or so to catch any new language instances that get created. 🐝 thingles (talk) 03:42, 4 June 2013 (UTC)


 * wmbot has completed a first full run and added a ton of Wikimedia project sites. WikiApiary should now have every language for Farm:Wikipedia, Farm:Wikiquote, Farm:Wikisource, Farm:Wiktionary, Farm:Wikibooks, Farm:Wikinews, Farm:Wikiversity, Farm:Wikimedia and Farm:Wikivoyage. I haven't scheduled this to run regularly yet, but will probably have it run via cron monthly. 🐝 thingles (talk) 12:16, 4 June 2013 (UTC)


 * A more generic version -- one that can operate on any farm -- would be a FarmerBot. -- ☠ MarkAHershberger ☢ (talk) ☣ 16:18, 4 June 2013 (UTC)


 * I have had a look at some of WMBot's edit and found them allrighty. :) --&#91;&#91;kgh&#93;&#93; (talk) 17:00, 4 June 2013 (UTC) PS I would love to see Farm:ShoutWiki and Farm:Wikkii here, too.


 * Thx kgh!

More than 2M property values set
Almost exactly three month after the last big milestone WikiAPIary exeeded the 2,000,000 property values mark and is once again one of the 10 biggest known Semantic MediaWiki installations. --&#91;&#91;kgh&#93;&#93; (talk) 16:32, 4 June 2013 (UTC)

Host lists for other farms?
Now that I got User:Wmbot running and it seems to be fine, I could branch him to deal with some other farms. kgh mentioned Farm:ShoutWiki and Farm:Wikkii. Personally, I'm very curious to start pulling in Farm:Wikia although it will certainly cause pain as the site load increases. Here is where I could use some help though. For any of these I need some sort of list of sites. For Wmbot I'm using the Wikimedia configuration files, of which one contains a list of database names and from that, because they follow good convention, I can infer the wikis from.

I've looked a lot for a list of Wikia sites and cannot find anything. I thought I had found something for Farm:Wikia when I hit this thread on s23.org but the URL's mentioned are no longer. I even tried looking for information in  as well as , but no go. I see a similar thread for a host list from ShoutWiki, but I don't see such a list.

If anyone can find host lists for any of these farms, it is pretty darn easy for me to make something to pull them all in. Please share any information you can find here. 🐝 thingles (talk) 17:17, 4 June 2013 (UTC)

Wikia
Currently I am not too keen to see Wikia here. That's a lot of wikis to munch on. What about this site? There you get at least a list of the domains. Basically that's the product of s23's work. --&#91;&#91;kgh&#93;&#93; (talk) 17:26, 4 June 2013 (UTC)


 * I've poured over the s23/wikistats's stuff but it's not clear to me how often that is updated. Somehow mutante got ahold of a hostlist for those there, I'd like to use the same source host lists so there isn't an added dependency. 🐝 thingles (talk) 17:39, 4 June 2013 (UTC)


 * You could try to pull a list of the top 5k wiki domains off of http://www.wikia.com/WAM . Edit: that is for wiki's that are active, or you could try for a complete ist http://community.wikia.com/api.php?action=query&list=wkdomains&wkfrom=10000&wkto=15000 Simant (talk) 18:26, 4 June 2013 (UTC)


 * Holy crap! That's the magic I need! (Good too, since that WAM list was not workable.) That will work perfect to start looking at that. I am very, very mindful though of User:Kghbln's concerns. I think what I will likely do for Farm:Wikia is pull this file but only add say 50-100 websites on any given run, and do them manually. Gradually just adding small sets. This is a lower priority though. 🐝 thingles (talk) 04:34, 5 June 2013 (UTC)

Wikkii
Wikkii has a list of active wikis and a list of wikis. Shoutwiki is currently down, but this should be a cached list of their wikis.--ete (talk) 20:54, 4 June 2013 (UTC)


 * Excellent! That looks very workable. It seems like if I filtered for links with  I would get the list I would want. Thanks! 🐝 thingles (talk) 04:34, 5 June 2013 (UTC)


 * Hmm, there seems to be a much bigger problem with Farm:Wikkii. Looking at the sites on that list when I attempt to access the API for any of them I get the error:

MediaWiki API is not enabled for this site. Add the following line to your LocalSettings.php $wgEnableAPI=true;


 * That pretty much takes them out entirely. We do currently have the Wikkii Community Wiki tracked, and it has an API point (at an odd URL). However that is the only Wikkii site that is in WikiApiary right now, perhaps their community wiki is not hosted on their main farm. It is worth noting that while they have the API seemingly disabled universally, basic statistics would still be possible by calling Special:Statistics. For example, HG World stats would be usable to get data. We've discussed using Special:Statistics collection before and decided against it, but that was in the context of collecting from old MediaWiki instances. It looks like there would be utility to collect data from sites that have the API disabled as well. Hmmm... By the way, Wikistats does collect data from Wikkii, and it does support collecting using Special:Statistics calls. 🐝 thingles (talk) 19:44, 5 June 2013 (UTC)

Referata
Related, this does not seem to have all the Referata sites, and this is admin only, maybe Yaron could provide us with a list?--ete (talk) 17:26, 5 June 2013 (UTC)


 * I pinged Yaron on #semantic-mediawiki and he mentioned he purposefully hasn't provided a single list of all Farm:Referata sites. I mentioned that if there is some list he would like me to load I'd happily do that. 🐝 thingles (talk) 19:22, 5 June 2013 (UTC)

SMW Registry
While looking for a list for Referata, I found this, which may be handy? A less easy to digest, but a load more semantic sites is always nice. --ete (talk) 17:26, 5 June 2013 (UTC)


 * Yes, I've poked at the registry for Semantic sites and plan to do a sync with WikiApiary for that. 🐝 thingles (talk) 19:22, 5 June 2013 (UTC)

Thinking about wikispam
I've spent some time tonight sketching out what WikiApiary could to do to fight off spam. I'm pretty amazed by the potential honestly. Given the existing API data, along with a couple of new feeds and paired with an (optional) bot account on a remote wiki WikiApiary could be incredibly effective. Even without a remote bot account, I was just thinking through building a reputation score for a given edit. Part of the magic that I don't see us taking advantage of when fighting wikispam is that we don't have to do it in realtime. Similar to email spam. Spam may go in your inbox for a few minutes, and then once some system figures out it is spam, can pull it out before you even saw it. Same with edits. Anyway, just sharing some late night contemplation. Don't mean to start a huge thread, but happy to hear thoughts. I am starting to do a little playing with accessing the logs and recent changes information via API. The total data size for this does start to get much larger. Entirely feasible, but perhaps more than a single VPS at Linode. :-) 🐝 thingles (talk) 05:33, 5 June 2013 (UTC)
 * Hm, you are considering making a Soldier Bee or something of the kind, which would go around reverting/blanking/deleting detected spam edits (and perhaps even blocking the user when fairly certain)? I guess registration on sites which are getting spammed anyway may be possible via bot..


 * Things that come to mind to consider: First task for the bot after registration could be to edit a section onto the main page talk page, its own userpage, and maybe the founder's talkpage explaining what the bot is, a couple of basic spam prevention things (setting up abuse filter to block unconfirmed users from adding links and putting up a non-terrible capacha seems to stop almost all spam for me), not to reply to the bot on their wiki (give a link to a page here that they can ask about it, unless their spam filter blocks new user links already), and requesting bot/sysop powers to be able to delete spam pages and work without cluttering RC. --ete (talk) 10:22, 5 June 2013 (UTC)


 * When I think of WikiApiary using an account on a remote wiki I see that as something that the admin would request. I envision a setting in WikiApiary that says "Request bot credentials" that would then send a notice via email to the person who requested it with a username and password (unique for every site) for an account to create. WikiApiary could then test if it can login with those credentials, and that the user is given bot permission and send an error if the account isn't setup right. There would need to be a number of security considerations taken into account in that flow, but that is what I'm thinking. 🐝 thingles (talk) 15:18, 5 June 2013 (UTC)


 * If it's on request only, not even creating an account before approval.. I feel like virtually all the spam will go untouched. It's really easy, in my experience, to shut down all but a few spambots (the ones which mangle their links) with a couple of edit filters (perhaps due to how little the spambot writers have to try against some wikis). A lot of MW site creators are just extremely lazy and reluctant to give out power to fix things. I'd happily sign up for this, and if it works well it would be kinda helpful, but spam is already a very minor problem for us (three spambots bypassed the filters in the last month (two confused polish spambots, and one before we'd turned the filters back on after updating), and the filters only block unautoconfirmed users (no age limit, and only one edit needed) from adding links). On some wikis, I've known 100s of spam pages being deleted daily by hand, due to lack of preventative measures. It would obviously be a lot more work.. but if you could get funding to run a bot which wiped out a large portion of spam from these kind of wikis, and plopped a clear and easy preventative solution right into the talkpages of the head admins, that would be pretty amazing. As compared to only working with signed up wikis, which would be.. a bit helpful to a much more limited number of wikis. And a lot harder to get right, since the spambots which bypass sane sysadmin measures tend to be less predictable.--ete (talk) 18:06, 5 June 2013 (UTC)


 * Would you be thinking of offering this as a freemium thing, or maybe try to get funding from somewhere else? I imagine owners who can't be bothered to set up basic anti spam (which is terrifyingly many of them) likely won't be interested in paying, and those who could be interested in paying have little enough spam compared to their number of active users that it's not worth their hassle. But if there was a big single source of funding it could work. Would Wikimedia or some other organization perhaps be interested in funding a least server costs? Maybe even Google once you've got proof of concept, they'd probably be very happy for there to be less wikispam polluting the links they get from wikis and the positive publicity, and have plenty of money/server time. --ete (talk) 10:22, 5 June 2013 (UTC)


 * This sounds like we are talking about US healthcare policy now! ;-) Kidding. I think your observation that "those that can, do, and those that can't, won't pay" is an astute one. I sort of look at the universe of all these unprotected wikis as a pre-built honeypot. But, what to do with it and who would find it valuable requires thought. I do think that modern platforms need to think about spam as it reflects negatively on the platform as a whole. A friend of mine recently setup a MediaWiki and didn't put anything on it. It instantly was obliterated and that made him negative about MediaWiki as a platform. Nothing actionable in that, just an observation. With all that said I am definitely looking at a variety of angles where WikiApiary can generate revenue to fund itself and additional services. 🐝 thingles (talk) 15:18, 5 June 2013 (UTC)


 * Agreeing about it reflecting negatively on MediaWiki as a platform, I've found this especially true for users who clear the spam by hand for badly run wikis. I had one user argue strongly against using MediaWiki on one project entirely due to bad experience with spamclearing on a previous wiki. Honestly, this is something which should be solved on MediaWiki's end (include pre-configured basic anti-spam as default, disableable if not needed), but WikiApiary could definitely help with automatic repairs of long term spammed wikis and putting clear information about how to prevent more spam in clear view of not just the owners, but the users who can badger the owners to fix it. Working on catching those spambots which slip past basic safeguards seems like a more involved project (because those bots are less stupid, and their writers are more likely to update if you figure out how to shut down the current generation) with less overall returns (because globally the stupid bots outnumber the smarter ones by a huge margin).--ete (talk) 18:06, 5 June 2013 (UTC)


 * I wonder if, as a bot editing large numbers of wikis, there was a chance of you ending up on some antispam IP blacklists. That would be amusingly ironic, but annoying.


 * Oh, and would it be practical for the bot to look through deletion logs of large numbers of wikis (all those that gave it access after it automatically registered) and edit logs, looking for the words "spam" and "spambot" to learn from existing spam detected by human editors?--ete (talk) 10:22, 5 June 2013 (UTC)


 * There are a number of ways that you can detect positive and negative votes from the existing log activity. This could then be used to build or decrease reputation both for edits and for the user behind it. 🐝 thingles (talk) 15:18, 5 June 2013 (UTC)


 * Right, seems like you've got some ideas :). Perhaps having some form of output so the owners could run the user removal script to eliminate the spambots would be nice as an extra?--ete (talk) 18:06, 5 June 2013 (UTC)

Limiting charts to 3 months
I just hacked in a small change to the PHP data accessors that are used in dygraph to only return data for the last 3 months. Some sites, notably WikiApiary itself were starting to return a mountain of data that made the browser almost unusable. The plan is to have daily and weekly aggregations for these graphs, but for now I just limited the raw graphs to 3 months. The data is all there, so don't worry. Just not accessible right now. 🐝 thingles (talk) 05:56, 5 June 2013 (UTC)