Wikipedia’s top-cited scholarly articles — revealed
The most-cited journal articles on Wikipedia include papers on the names of lunar craters and the DNA sequences of human and mouse genes — and many of the most popular works are referenced more times in the online encyclopaedia than they are in the scientific literature.
“It is pretty incredible that almost all the highly cited articles are science articles,” says Matt Miller, a data scientist and librarian based in New York City. Miller analysed citation data released in March by the Wikimedia Foundation, the non-profit organization in San Francisco, California, that runs Wikipedia. The data set, which contains some 15.7 million records, shows how many times sources with formal identifiers such as ISBNs (international standard book numbers) and DOIs (digital object identifiers) are referenced across all of Wikipedia’s nearly 300 language editions. Wikimedia notes that most publications cited by identifiers on Wikipedia are books, but Miller looked specifically at the numbers for publications with DOIs, the most widely used identifier for journal articles, on the English-language version of Wikipedia. His data set contains 1.2 million citations that used DOIs, referencing more than 835,000 unique articles.
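For readers who want to try a similar tally, the kind of count described above can be sketched in a few lines of Python. This is only an illustration, not Miller's actual method: the record layout (language edition, identifier type, identifier), the sample values and the function name `top_cited_dois` are all assumptions made for the example, and the real Wikimedia release may be structured differently.

```python
from collections import Counter

# Hypothetical sample records: (language edition, identifier type, identifier).
# The layout and the values are illustrative assumptions, not real data.
records = [
    ("en", "doi", "10.1000/example.1"),
    ("en", "doi", "10.1000/EXAMPLE.1"),  # same DOI, different letter case
    ("en", "doi", "10.1000/example.2"),
    ("en", "isbn", "978-0-00-000000-2"),
    ("sv", "doi", "10.1000/example.1"),
]


def top_cited_dois(records, language="en", n=10):
    """Rank DOIs by citation count within one language edition.

    Returns (ranking, total DOI citations, number of unique DOIs).
    """
    counts = Counter(
        identifier.lower()  # DOIs are case-insensitive, so normalise them
        for lang, id_type, identifier in records
        if lang == language and id_type == "doi"
    )
    return counts.most_common(n), sum(counts.values()), len(counts)


ranking, total_citations, unique_articles = top_cited_dois(records)
print(ranking)
print(total_citations, unique_articles)
```

On the sample above, the English-language tally has three DOI citations pointing at two unique articles. The ISBN record and the Swedish-edition record are ignored, which mirrors the restriction to DOIs on English Wikipedia.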
The most-referenced paper, with 4,702 citations across English Wikipedia, is a 2002 collection of more than 15,000 sequences of human and mouse genes (see ‘English Wikipedia’). The Wikipedia pages that reference the study are almost exclusively entries about single genes or proteins. “It’s a pleasant surprise,” says Robert Strausberg, a cancer researcher who led the project and is now deputy scientific director at the Ludwig Institute for Cancer Research in New York City.
English Wikipedia: top ten scholarly articles
The ten most-referenced publications with DOIs on English Wikipedia:
4,702 citations: Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences (2002)
3,387 citations: The status, quality, and expansion of the NIH full-length cDNA project: The Mammalian Gene Collection (MGC) (2004)
2,895 citations: Validation of the new Hipparcos reduction (2007)
2,212 citations: Complete sequencing and characterization of 21,243 full-length human cDNAs (2004)
1,452 citations: Report on lunar nomenclature by the Working Group of Commission 17 of the IAU (1971)
1,297 citations: Oligo-capping: a simple method to replace the cap structure of eukaryotic mRNAs with oligoribonucleotides (1994)
1,294 citations: Towards a proteome-scale map of the human protein–protein interaction network (2005)
1,251 citations: Construction and characterization of a full length-enriched and a 5'-end-enriched cDNA library (1997)
931 citations: Absolute magnitudes and slope parameters for 250,000 asteroids observed by Pan-STARRS PS1 — preliminary results (2015)
878 citations: NEOWISE studies of spectrophotometrically classified asteroids: preliminary results (2011)
An expanded version of the gene collection, published in 2004, is the second-most-cited article, with about 3,400 references (by contrast, it has 487 citations in the scientific literature, according to Google Scholar). Daniela Gerhard, a cancer geneticist at the US National Cancer Institute in Bethesda, Maryland, and a co-author of the paper, says that the publications are probably cited so frequently because they provide accessible information about the sequences of expressed genes.
In all, five articles in the top ten are about DNA catalogues, including one study that details a method of generating such collections. A 2005 map of nearly 3,000 human protein interactions also features on the list, at number seven. (Wikimedia’s original post notes: “Unsurprisingly, Wikipedians love reference works.”)
Astronomy articles make up the rest of the list, with four entries. The third-most-referenced paper, cited by nearly 3,000 English Wikipedia pages, is a 2007 study that helped researchers to interpret the results of Hipparcos, the first space mission to measure the positions, distances and brightness of stars.
Other space-science papers on the list cover the size and brightness of asteroids, and the names of lunar craters (in a 1971 publication that has just 16 citations in the scientific literature, according to Google Scholar). These papers are probably highly cited because they are reliable references for the many celestial bodies that have their own Wikipedia pages, says astronomer Floor van Leeuwen at the University of Cambridge, UK, who wrote the Hipparcos study.
Bots’ work
Wikipedia, which launched in 2001, receives about 16 billion page views per month and is currently the world’s fifth-most-visited website. Anybody can create an article or edit an existing one, but the site’s guidelines require writers and editors to attribute quotes and information to published sources such as books or scholarly papers.
A separate analysis of the Wikimedia data dump by Ross Mounce, who directs open-access programmes at the London-based philanthropic foundation Arcadia Fund, reveals the ten most-cited DOI articles across all of the encyclopaedia’s language editions (see ‘All Wikipedia language editions’). Six of the articles are the same, but the first entry is notably different. The top-referenced DOI article is a 2007 paper updating a century-old classification of the global climate, which has a whopping 2.8 million citations but only 169 on English Wikipedia (the second-most-cited source across all editions has just over 21,000 references).
The climate study is so heavily cited because millions of its citations come from pages created by an automated computer program. The bot, developed by physicist Sverker Johansson at Dalarna University in Falun, Sweden, had produced nearly 3 million articles as of July 2014, according to Wikipedia. One-third of the articles are in Swedish and the rest are in Cebuano and Waray, two languages spoken in the Philippines. The bot has produced millions of articles about geographic locations such as towns and islands, and most of those articles include information about the local climate type and reference the climate study, says Johansson. He adds that he has no precise figures for the bot-generated citations of the climate paper, “but 2.8 million is in the right ballpark”.
All Wikipedia language editions: top ten scholarly articles
The ten most-referenced publications with DOIs across all of Wikipedia’s language editions:
2,830,341 citations: Updated world map of the Köppen–Geiger climate classification (2007)
21,350 citations: Prediction of hydrophobic (lipophilic) properties of small organic molecules using fragmental methods: an analysis of ALOGP and CLOGP methods (1998)
20,247 citations: The status, quality, and expansion of the NIH full-length cDNA project: The Mammalian Gene Collection (MGC) (2004)
5,937 citations: Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences (2002)
5,854 citations: The Asiago supernova catalogue — 10 years after (1999)
4,592 citations: Validation of the new Hipparcos reduction (2007)
4,450 citations: The primordial excitation and clearing of the asteroid belt (2001)
3,062 citations: Report on lunar nomenclature by the Working Group of Commission 17 of the IAU (1971)
2,587 citations: Complete sequencing and characterization of 21,243 full-length human cDNAs (2004)
2,525 citations: Classifying solid planetary bodies (2007)
Mounce notes that other articles might be heavily cited on Wikipedia but not formally referenced by their DOIs — and instead referenced by other means, such as their PubMed ID numbers.
Citations are important if people are to trust information, says John Chodacki, director of the University of California Curation Center, who is based in Berkeley. “That’s true for journal articles and also for Wikipedia pages,” he says. But analysing and comparing citation data across scholarly papers has historically been possible using only paywalled services. “One of the most interesting things is that this information is available at all.”