# -*- coding: utf-8 -*-
# Gets the sizes of the lang links and the text and generates a result page with
# their ratios. See explanation at the bottom.
# Requires the stats.txt file as created by getlanglinks.py
# Written by Denny Vrandečić and released 23 June 2012, http://denny.vrandecic.de
# Released under the BSD license

result = open('index.html', 'w')

# NOTE: the exact markup of the original page (table attributes and the script
# that made the headers sortable) was lost in this copy; the HTML below is a
# minimal reconstruction.
result.write("""<html>
<head>
<title>Ratio of language links to full text in Wikipedia</title>
</head>
<body>
<h1>Ratio of language links to full text in Wikipedias</h1>

<p>Read the explanation at the end of the page. Click the table headers to sort.</p>

<table>
 <tr>
  <th>Language</th>
  <th>Pages</th>
  <th>Lang. links</th>
  <th>Double links</th>
  <th>Text size</th>
  <th>Lang. links size</th>
  <th>Ratio</th>
 </tr>
""")

f = open('stats.txt')

pagesum = 0
llcountsum = 0
doublelinkssum = 0
llsizesum = 0
textsizesum = 0
ratiosum = 0
count = 0

for line in reversed(f.readlines()):
    lang, lines, pages, llcount, doublelinks, textsize, llsize, time = line.strip().split()
    if lang == "lang":  # skip the header line of stats.txt
        continue
    lines = int(lines)
    pages = int(pages)
    llcount = int(llcount)
    doublelinks = int(doublelinks)
    textsize = int(textsize)
    llsize = int(llsize)

    pagesum += pages
    llcountsum += llcount
    doublelinkssum += doublelinks
    ratio = (100.0 / textsize) * llsize
    llsizesum += llsize
    textsizesum += textsize
    ratiosum += ratio
    count += 1

    ratio = round(ratio, 1)
    # shade the ratio cell from green (low ratio) towards red (high ratio)
    green = str(max(0, 200 - int(round(ratio * 2))))
    red = str(min(200, int(round(ratio * 2))))

    result.write(' <tr>\n')
    result.write('  <td>' + lang + '</td>\n')
    result.write('  <td>' + str.format("{0:,}", pages) + '</td>\n')
    result.write('  <td>' + str.format("{0:,}", llcount) + '</td>\n')
    if doublelinks > 0:
        # assumed link target: the per-language page written by prettydoublelinks.py
        result.write('  <td><a href="' + lang + '.html">' + str.format("{0:,}", doublelinks) + '</a></td>\n')
    else:
        result.write('  <td>0</td>\n')
    result.write('  <td>' + str.format("{0:,}", textsize) + '</td>\n')
    result.write('  <td>' + str.format("{0:,}", llsize) + '</td>\n')
    result.write('  <td style="background-color:rgb(' + red + ',' + green + ',0)">' + str.format("{0}", ratio) + '%</td>\n')
    result.write(' </tr>\n')
f.close()

# totals over all languages
ratio = round((100.0 / textsizesum) * llsizesum, 1)
green = str(max(0, 200 - int(round(ratio * 2))))
red = str(min(200, int(round(ratio * 2))))
result.write(' <tr>\n')
result.write('  <td>all ∑</td>\n')
result.write('  <td>' + str.format("{0:,}", pagesum) + '</td>\n')
result.write('  <td>' + str.format("{0:,}", llcountsum) + '</td>\n')
result.write('  <td>' + str.format("{0:,}", doublelinkssum) + '</td>\n')
result.write('  <td>' + str.format("{0:,}", textsizesum) + '</td>\n')
result.write('  <td>' + str.format("{0:,}", llsizesum) + '</td>\n')
result.write('  <td style="background-color:rgb(' + red + ',' + green + ',0)">' + str.format("{0}", ratio) + '%</td>\n')
result.write(' </tr>\n')

# averages per language
ratio = round(ratiosum / count, 1)
green = str(max(0, 200 - int(round(ratio * 2))))
red = str(min(200, int(round(ratio * 2))))
result.write(' <tr>\n')
result.write('  <td>all ∅</td>\n')
result.write('  <td>' + str.format("{0:,}", pagesum / count) + '</td>\n')
result.write('  <td>' + str.format("{0:,}", llcountsum / count) + '</td>\n')
result.write('  <td>' + str.format("{0:,}", doublelinkssum / count) + '</td>\n')
result.write('  <td>' + str.format("{0:,}", textsizesum / count) + '</td>\n')
result.write('  <td>' + str.format("{0:,}", llsizesum / count) + '</td>\n')
result.write('  <td style="background-color:rgb(' + red + ',' + green + ',0)">' + str.format("{0}", ratio) + '%</td>\n')
result.write(' </tr>\n')

result.write("""</table>


<h2>Explanation</h2>

<p>Wikipedia exists in more than 280 languages. Often, two articles in two different language editions of Wikipedia are about the same thing: in this case they are connected with something called a language link. These language links are written in the source wikitext of the article. The English article on Berlin has corresponding articles in more than a hundred languages, and if you take a look at the end of the wikitext for Berlin you will find the so-called language links to all these other articles on the other Wikipedias. They make sure that you see the list of languages on the left-hand side of many Wikipedia articles.</p>

<p>If you now go to any of the other language versions of this article, you will find essentially the same list again and again, merely with the link to its own version replaced by a link to the English version. Therefore, there are more than a hundred articles that all include basically the same list of links to each other. To keep these links updated, bots crawl the Wikipedias and try to keep the links synchronized.</p>

<p>On smaller Wikipedias this means that in some stub articles a huge part of the actual content of the article consists of nothing but the language links (see this example).</p>

<p>Here is an explanation of the columns:</p>
<ul>
 <li>Language: the language code of the Wikipedia edition</li>
 <li>Pages: the number of pages in the dump</li>
 <li>Lang. links: the number of language links found</li>
 <li>Double links: the number of duplicated language links, i.e. links given more than once on the same page</li>
 <li>Text size: the total size of the article text</li>
 <li>Lang. links size: the total size taken up by the language links</li>
 <li>Ratio: the size of the language links as a percentage of the text size</li>
</ul>

<p>The page has been created with three Python scripts, all released under the BSD license: getlanglinks.py, createindex.py, and prettydoublelinks.py.</p>

<p>The dumps used were the most recent ones as downloaded on June 22nd, 2012.</p>

<p>The whole idea of this page is to give a feeling for the effect of the first phase of Wikidata, the project I am currently working on. Wikidata aims to centralize most of the language links in one repository and thus remove them from the wikitext of the individual language versions.</p>

<p>Created June 23rd, 2012, Denny Vrandečić.</p>

""");