Data

The datasets produced for the Linguistic Neighbourhoods project are available for free download. Please cite the paper when using these data:

Co-editing similarity weighted network adjacency matrix (.xlsx)
Cluster membership according to Infomap algorithm (.xlsx)

External datasets used for the analysis:

Territory-language information is based on data from the World Bank, Ethnologue, the CIA Factbook, and other sources, including per-country census data (Unicode's CLDR charts):

Countries where languages i,j are co-spoken (.xlsx)
Countries where languages i,j are co-spoken by the majority of population (.xlsx)

Data on the genetic proximity of languages were collected from the infoboxes of the English Wikipedia articles on each of the 110 languages, such as ‘Hebrew language’. For example, the Arabic language has the following language family tree profile: Afro-Asiatic; Semitic; Central Semitic; Arabic languages; Arabic.

Genetic profiles of languages (.xlsx)
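
A family tree profile like the one above can be compared with another profile by counting how many leading levels the two share. The sketch below is only an illustration of that idea and is not necessarily the proximity measure used in the paper. The function names `parse_profile` and `shared_levels` are hypothetical, and the Hebrew profile is an illustrative assumption, not a row taken from the dataset.

```python
def parse_profile(text):
    """Split a semicolon-separated family tree profile into a list of levels."""
    return [level.strip() for level in text.split(";") if level.strip()]


def shared_levels(profile_a, profile_b):
    """Count the leading family-tree levels that two profiles have in common."""
    count = 0
    for a, b in zip(profile_a, profile_b):
        if a != b:
            break
        count += 1
    return count


# Arabic profile as quoted in the text above.
arabic = parse_profile(
    "Afro-Asiatic; Semitic; Central Semitic; Arabic languages; Arabic"
)
# Illustrative assumption only: this profile is not copied from the dataset.
hebrew = parse_profile(
    "Afro-Asiatic; Semitic; Central Semitic; Northwest Semitic; Canaanite; Hebrew"
)

print(shared_levels(arabic, hebrew))
```

Under these assumed profiles, the two languages share their first three levels and diverge at the fourth.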

Religion and population data are based on the Religious Diversity Index provided by the Pew Research Center:

Most common religion per country in 2010 (.xlsx)
Country population in 2010 (.xlsx)

Distance between countries is computed as the Euclidean distance in kilometers between country capitals, based on the CIA Factbook:

Country distances (.xlsx)
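
One possible reading of a Euclidean distance in kilometers between two capitals is the straight-line (chord) distance between their positions on a spherical Earth. The sketch below follows that reading. It is an assumption, not a description of how the distance file was actually produced. The mean Earth radius of 6371 km, the function names, and the two sample coordinates are all hypothetical choices made for illustration.

```python
import math

# Assumption: a spherical Earth with a mean radius of 6371 km.
EARTH_RADIUS_KM = 6371.0


def to_cartesian(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Convert latitude/longitude in degrees to 3D Cartesian coordinates in km."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        radius * math.cos(lat) * math.cos(lon),
        radius * math.cos(lat) * math.sin(lon),
        radius * math.sin(lat),
    )


def euclidean_km(point_a, point_b):
    """Straight-line distance in km between two (lat, lon) points on the sphere."""
    return math.dist(to_cartesian(*point_a), to_cartesian(*point_b))


# Hypothetical coordinates, not real capitals: two points on the equator,
# 90 degrees of longitude apart.
point_a = (0.0, 0.0)
point_b = (0.0, 90.0)

print(round(euclidean_km(point_a, point_b), 1))
```

A great-circle (haversine) distance would give a longer figure for the same pair of points, so it is worth checking which definition the file uses before combining it with other distance data.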
