
Methods

Building the network of co-editing similarity:

  1. Collect all edits made to all articles in the 110 largest language editions of Wikipedia between 2005 and 2014. Edits by bots are removed.

  2. Construct a bipartite co-editing network with two types of nodes: concepts and the language editions that cover them. In the example above, the concept Samovar is covered by several language editions (see the language box); in each language the article has a different title and content.

  3. Flatten the bipartite network into a weighted language–language network. We count the number of edits to every concept in each language and compute the probability that a dyad of languages edits the same concept. We compare this empirical weight with the expected probability of co-editing for the same dyad, where the expected probability assumes that concepts are edited at random. We apply a Bonferroni correction to account for multiple comparisons and size effects.

  4. In the resulting network, links are weighted according to their significance z-score. Heavier links represent larger significant differences between the empirical and expected weights (a simplified sketch of this step is shown below).
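Below is a minimal sketch of this projection in Python. The edit data, the null model (independent random coverage of concepts), and the significance threshold are simplified placeholders; the exact formulation, which also takes the number of edits into account, is given in the paper.

# Sketch of steps 3-4: project the bipartite language-concept data onto a
# language-language network weighted by co-editing significance z-scores.
from itertools import combinations
import numpy as np
from scipy.stats import norm

# edited[lang] = set of concepts with at least one non-bot edit in that edition
edited = {
    "uk": {"Samovar", "Borscht", "Taiga"},
    "ru": {"Samovar", "Borscht", "Taiga", "Kvass"},
    "et": {"Taiga", "Kvass"},
    "pl": {"Samovar", "Kvass"},
}

N = len(set().union(*edited.values()))          # number of distinct concepts
pairs = list(combinations(sorted(edited), 2))
alpha = 0.01 / len(pairs)                       # Bonferroni-corrected threshold

links = {}
for l1, l2 in pairs:
    observed = len(edited[l1] & edited[l2])     # empirical co-editing weight
    p = len(edited[l1]) / N * len(edited[l2]) / N
    expected = N * p                            # expected weight under random editing
    std = np.sqrt(N * p * (1 - p))              # binomial standard deviation
    z = (observed - expected) / std if std > 0 else 0.0
    if norm.sf(z) < alpha:                      # keep only significant links
        links[(l1, l2)] = z                     # link weight = z-score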

 

The exact mathematical formulations are in the paper. More details and the derivation of the method can be found in the paper introducing the method.

 

What does the network mean?

In practice, a stronger interconnection between two nodes means that this pair of languages shares a number of concepts to which both languages show consistent interest. Following the example above, a strong link between Ukrainian and Russian means that an article about some concept in the Ukrainian edition of Wikipedia is also likely to be covered in the Russian edition. This is especially true when the concept exists in a very limited number of languages.

 

In general, our method equalises the differences in activity levels and sizes across editions. This means that very active editions such as English cannot dominate the network purely because of their size. On the contrary, it is easier for smaller editions to form significant links with each other. As a result, we can highlight finer details of the similarities between peripheral languages and observe how local interconnections come to the forefront.

What can explain this network configuration?

Building a pretty network is an interesting task in itself, as it allows a bird's-eye view of the structure of relationships between a large number of language communities. Applying clustering algorithms is one way to analyse this multitude of relationships and to build hypotheses about possible explanations for these interconnections.

 

There are multiple clustering algorithms, each with parameters to tweak, meaning there could be several possible solutions to how communities are extracted from a network. However, there is no way of verifying which of these configurations is true.

 

Instead of focusing on communities, we focus on possible explanations for the weights in the network. We do cluster the network into communities (using the Infomap method), but we only use this clustering to inform our guesses about the causes of language interrelations. We then transform these guesses into quantifiable hypotheses that we can test statistically.
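For illustration, here is a minimal sketch of the clustering step, using the community_infomap routine from python-igraph on a toy weighted edge list (the real input is the z-score-weighted network built above):

import igraph as ig

# toy significant co-editing links: (language, language, z-score)
edges = [("uk", "ru", 4.2), ("uk", "pl", 3.1), ("ru", "pl", 2.8),
         ("et", "fi", 5.0), ("fi", "ru", 1.2)]

g = ig.Graph.TupleList(edges, weights=True)        # vertices are named after the languages
clusters = g.community_infomap(edge_weights="weight")
for community in clusters:
    print([g.vs[i]["name"] for i in community])    # one list of languages per cluster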

 

Each hypothesis is represented as a symmetric matrix. Each cell at the intersection of a row and a column holds a numerical value that corresponds to the belief that this dyad of languages will show significant similarity. The diagonal is empty. The additional datasets that we used for building the hypotheses are available for download.
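As an illustration, a belief matrix for the geographical-distance hypothesis could be built along the following lines; the distances are rough placeholders, not the values used in the paper.

import numpy as np

langs = ["uk", "ru", "et", "pl"]
dist_km = np.array([                 # illustrative pairwise distances, symmetric
    [0,    700, 1200,  700],
    [700,    0,  900, 1200],
    [1200, 900,    0, 1000],
    [700, 1200, 1000,    0],
], dtype=float)

belief = np.zeros_like(dist_km)
off_diag = ~np.eye(len(langs), dtype=bool)
belief[off_diag] = 1.0 / dist_km[off_diag]   # shorter distance -> stronger belief
belief /= belief.sum()                       # normalise; the diagonal stays empty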

Explaining the network configuration quantitatively

We compare the plausibility of the hypotheses using HypTrails, a Bayesian approach based on Markov chain processes. This approach allows us not only to see which hypotheses are statistically significant, but also to rank them according to how much variation in the data they explain. Find out more about HypTrails from its creator (website).
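Conceptually, HypTrails scores each hypothesis by the Bayesian evidence of a Markov chain whose Dirichlet prior is elicited from the hypothesis matrix and scaled by a concentration parameter k. The sketch below is a simplified reimplementation of that idea, not the authors' code; the toy counts and belief matrices are placeholders, and the exact prior-elicitation scheme is described in the HypTrails paper.

import numpy as np
from scipy.special import gammaln

def log_evidence(counts, hypothesis, k):
    """Log marginal likelihood of observed transition counts under a
    Dirichlet prior proportional to the hypothesis matrix, scaled by k."""
    row_sums = hypothesis.sum(axis=1, keepdims=True)
    alpha = 1.0 + k * hypothesis / np.where(row_sums > 0, row_sums, 1.0)
    return np.sum(gammaln(alpha.sum(axis=1)) - gammaln((counts + alpha).sum(axis=1))
                  + np.sum(gammaln(counts + alpha) - gammaln(alpha), axis=1))

# toy data: observed co-editing counts between three languages
counts = np.array([[0, 8, 1], [8, 0, 2], [1, 2, 0]], dtype=float)
geo = np.array([[0, .9, .1], [.9, 0, .3], [.1, .3, 0]])  # hypothetical belief matrix
uniform = np.ones_like(geo) - np.eye(3)                  # baseline hypothesis
for k in (1, 10, 100):
    print(k, log_evidence(counts, geo, k), log_evidence(counts, uniform, k))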


 

We present a generalisable approach to quantifying and explaining similarity, which has several important advantages:

  • it scales well in terms of the number of hypotheses and communities that can be analysed;

  • it does not require understanding of a language;

  • it is applicable to any case of collaborative production of a common good where individual activity is recorded.

 

The approach consists of several easy steps: (1) building the similarity network, (2) building quantifiable hypotheses which might explain the network's composition, (3) testing the hypotheses.

 

 

 


Network of significant co-editing ties between language pairs on Wikipedia. Nodes are coloured according to the clusters found by the Infomap algorithm; link weights within clusters represent significant z-scores, and links are significant at the 99% level. The inter-cluster links show the aggregated z-scores between all nodes of a pair of clusters. For visualisation purposes we display only 23 clusters and the strongest inter-cluster links in the network. The network suggests that local factors such as shared language, linguistic similarity of languages, shared religion, and geographical proximity play a role in the interest similarity of language communities. Notably, English forms a separate cluster, which suggests little interest similarity between English speakers and other communities.

A toy example of how to express a hypothesis through a transition probability matrix. According to each hypothesis, the cells with more likely transitions are coloured in darker shades of blue. In the Uniform hypothesis all transitions are equally likely, i.e. the editions cover random topics; we use this as a baseline for testing the significance of our results. In the Shared religion hypothesis, the pair of languages UK-RU is given higher belief (a darker-coloured cell) because the most common religion of Russian and Ukrainian speakers is the same, while in the cases of UK-ET and UK-PL they differ. Finally, in the Geographical distance hypothesis, the shorter the distance between languages, the stronger the belief in the transition.


How to read this plot?

Each coloured curve represents a hypothesis; the black line is the Uniform hypothesis, our baseline for hypothesis plausibility, and all differences in ranking between hypotheses are decisive. The grey curve on top is the data plotted against itself, which together with the black curve provides a window for judging how well a given hypothesis explains the data. The smaller the distance between a coloured hypothesis curve and the grey curve, the stronger the effect of that hypothesis.

 

The ranking of the curves relative to each other should be compared for the same value of k. k is a parameter related to how much variation is tolerated between the hypothesis and the observed data: higher values of k reflect stricter conditions, where less deviation of the hypothesis from the data is tolerated.

 

The plot shows that all hypotheses are decisive, and the most plausible ones are the same-language-family, bilingual, shared-religion, and gravity-law hypotheses. Overall, it seems that cultural factors such as language and religion play a larger role in explaining Wikipedia co-editing than geographical factors.

 

Frequentist solution to the same problem - MRQAP

Multiple Regression Quadratic Assignment Procedure (MRQAP) assesses the statistical significance of the hypotheses using a frequentist approach. This method has a long-established tradition in social network analysis as a way to sift out spuriously observed correlations, and it is well suited to analysing dyadic data where observations are autocorrelated if they are in the same row or column.
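As a rough sketch of the procedure, the simple variant below regresses the flattened outcome matrix on the flattened hypothesis matrices and builds null distributions by permuting the rows and columns of the outcome matrix with the same node permutation. The matrices here are synthetic placeholders; the paper's analysis uses the established MRQAP implementation rather than this toy version.

import numpy as np

rng = np.random.default_rng(0)
n = 6                                               # toy number of language editions

def vec(m):
    """Flatten the off-diagonal entries of a square matrix."""
    return m[~np.eye(m.shape[0], dtype=bool)]

def sym(scale=1.0):
    a = rng.random((n, n)) * scale
    a = (a + a.T) / 2
    np.fill_diagonal(a, 0)
    return a

language, religion = sym(), sym()                   # toy predictor matrices
y = 0.8 * language + 0.3 * religion + sym(0.5)      # toy outcome (co-editing) matrix

X = np.column_stack([np.ones(n * (n - 1)), vec(language), vec(religion)])
beta = np.linalg.lstsq(X, vec(y), rcond=None)[0]

n_perm, exceed = 2000, np.zeros_like(beta)
for _ in range(n_perm):
    p = rng.permutation(n)
    b = np.linalg.lstsq(X, vec(y[np.ix_(p, p)]), rcond=None)[0]
    exceed += np.abs(b) >= np.abs(beta)             # how often permuted effects are as large

print("coefficients:", beta[1:], "p-values:", (exceed[1:] + 1) / (n_perm + 1))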

 

By including all four hypotheses in the model, we are able to explain 15% of the variation in the data (R-squared = 0.1458). The t-statistic tells how much each of the hypotheses contributes to the total result. The results of the test are in agreement with the hypothesis ranking obtained from applying HypTrails. Linguistic proximity of languages, the number of bilinguals, shared religion, and demographic attraction (in this order of significance) are the factors that contribute significantly to cultural similarity.

 

HypTrails-computed Bayesian evidence for the plausibility of the hypotheses on the shared-editing-interest Wikipedia data. Higher values of the Bayesian evidence denote that a hypothesis fits the data well. The bottom black line represents the hypothesis of random shared interests and the top grey line is the fit of the data on itself; together they form a lower and upper limit for comparing hypotheses. The ranking of hypotheses should be compared for the same k.

MRQAP decomposition of the pairwise correspondence between concept co-occurrence and cultural factors. The results of MRQAP agree with the ranking of hypotheses by the HypTrails algorithm. All statistics except those labelled with * are significant at the 0.05 level.
