Multilingual co-editing of Wikipedia

Linguistic Neighbourhoods:

What explains common co-editing interests between language communities on Wikipedia?

There are almost 300 language editions currently available in Wikipedia. Each of them has a distinct community of editors and scope of articles. Even when a certain concept is covered by several language editions, the content of each language version is unique rather than a direct translation from one of the other editions. These differences in the number, selection, and content of articles are not accidental, but illustrate cultural differences between the underlying language communities of editors.

 

Research Questions

Theoretically, each concept covered in Wikipedia could exist in all 300 language editions, but in practice such extensive coverage is very rare, and most concepts are covered in a limited set of editions.

Is this set of languages random? Do certain editions show consistent interest in editing the same concepts? What socio-linguistic features explain common editing interests between language communities on Wikipedia?

 

Results

In this project, we collected editing data for all Wikipedia articles created between 2005 and 2013, and extracted language dyads (pairs of language editions) that tend to co-edit the same topics. Applying methods from network science and from Bayesian and frequentist statistics, we find that:

  • language pairs do not edit articles about the same concepts by chance;

  • co-editing similarity of language communities is best explained by genetic proximity of languages, bilingualism, shared religion, and demographic attraction between communities. Geographic distance is a significant but weak factor;

  • global dominance of English is not observed; instead, local interconnections rooted in socio-linguistic factors come to the forefront.
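
To make the first finding more concrete, the following is a minimal Python sketch of how one might test whether two language editions cover the same concepts more often than chance would predict. It is not the paper's actual null model: it assumes a simple hypergeometric baseline in which each edition covers a fixed number of concepts chosen at random. The function name `coediting_z_scores` and the input layout (a dict mapping a concept id to the set of language codes that edited it) are hypothetical and introduced only for illustration.

```python
from itertools import combinations
from math import sqrt


def coediting_z_scores(article_langs):
    """Return a z-score per language pair for the number of shared concepts.

    article_langs: dict mapping a concept id to the set of language codes
    whose editions contain an article on that concept (assumed layout).

    Assumed null model: each edition covers its observed number of concepts,
    picked uniformly at random, so the overlap of two editions follows a
    hypergeometric distribution.
    """
    n_total = len(article_langs)
    if n_total < 2:
        return {}  # the variance formula needs at least two concepts

    lang_counts = {}  # number of concepts covered by each language
    pair_counts = {}  # number of concepts covered by both languages of a pair
    for langs in article_langs.values():
        for lang in langs:
            lang_counts[lang] = lang_counts.get(lang, 0) + 1
        for pair in combinations(sorted(langs), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

    z_scores = {}
    for a, b in combinations(sorted(lang_counts), 2):
        n_a, n_b = lang_counts[a], lang_counts[b]
        expected = n_a * n_b / n_total
        variance = (n_a * n_b * (n_total - n_a) * (n_total - n_b)
                    / (n_total ** 2 * (n_total - 1)))
        if variance == 0:
            continue  # an edition covering every concept gives no signal
        observed = pair_counts.get((a, b), 0)
        z_scores[(a, b)] = (observed - expected) / sqrt(variance)
    return z_scores
```

Under this assumed baseline, a large positive z-score for a pair means the two editions share more concepts than random coverage would produce. Pairs are keyed as alphabetically sorted tuples, for example `("de", "nl")`.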

Full paper for download (EPJ Data Science)   [.pdf]   [data]

Paper in a nutshell (poster)   [.pdf]

Paper in a nutshell (slides)   [.pdf]

 

How to cite:

@article{Samoilenko-2016,
    author = {Samoilenko, Anna and Karimi, Fariba and Edler, Daniel and Kunegis, J{\'e}r{\^o}me and Strohmaier, Markus},
    year = {2016},
    title = {Linguistic neighbourhoods: explaining cultural borders on {Wikipedia} through multilingual co-editing activity},
    journal = {EPJ Data Science},
    volume = {5:9},
    doi = {10.1140/epjds/s13688-016-0070-8}
}

 

Best Poster Award at the NetSciX 2016 conference in Wrocław, Poland

The network of significant Wikipedia co-editing ties between language pairs. Nodes are coloured according to the clusters found by the Infomap algorithm, and link weights within clusters represent the positive deviation of z-scores from the threshold of randomness; links are significant at the 99% level. For visualisation purposes, we display only 23 clusters and the strongest inter-cluster links in the network. The inter-cluster links show the aggregated z-scores between all nodes of a pair of clusters. The network suggests that local factors such as shared language, linguistic similarity, shared religion, and geographical proximity play a role in the interest similarity of language communities. Notably, English forms a separate cluster, which suggests little interest similarity between English speakers and other communities.
