The initial dataset included 1187 papers indexed in the PubMed database (http://www.ncbi.nlm.nih.gov/pubmed) between 1st January 2008 and 31st December 2011, which were retrieved using the key words “mtDNA human populations” and “Y chromosome human populations”. After removing irrelevant studies (e.g. studies not pertinent to human populations, reviews or meta-analyses), a total of 253 mitochondrial and 290 Y-chromosomal datasets were extracted from 508 papers published in 101 different journals (see Table S1 for a brief description of the datasets under scrutiny). The raw data file is available as Table S2.
Procedure used to analyze data sharing in papers regarding human genetic variation.
Datasets were analyzed using the procedure described in the accompanying flowchart. Only datasets reporting full information, which can be analyzed without any form of limitation, were counted as shared. Conversely, datasets that lacked haplotypic information or were incomplete (e.g. those making only part of the raw data fully available or presenting only data-derived statistics) were included in the “withholding” categories (see Table S3 for more details on datasets categorized as withheld). We classified each dataset as shared or withheld according to the information contained in the corresponding paper, trying to recover missing data from databases or repositories only when these were explicitly indicated in the text. As a complement to the examination of published papers (from which we obtained the “immediate sharing” rate), we asked the corresponding authors of withheld datasets (including both authors declaring data availability upon request and those giving no indication) to send the missing information. This was done through three sequential requests e-mailed over a three-week period. In order to avoid any influence on author response, we made no mention of our study of data sharing in these messages (see Text S1).
Procedure used to request data from corresponding authors of withheld datasets.
The shared and withheld datasets were analyzed in relation to: (i) the research field to which the study may be assigned; (ii) the type of editorial policy of the publishing journal; (iii) the impact factor rank of the publishing journal; (iv) the number of citations received; (v) the approximate quantity of resources used to generate the datasets. In all these analyses, we counted as shared both the datasets shared immediately and those obtained after e-mails were sent to authors of papers declaring data availability upon request.
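The counting rule just described can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Dataset` record and its field names are hypothetical labels for the criteria given in the text, not part of the original protocol or of File S1.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # All field names are hypothetical labels for the criteria in the text.
    complete_in_paper: bool       # full data in the paper or in a repository it cites
    available_on_request: bool    # paper declares data availability upon request
    received_after_request: bool  # missing data arrived after the e-mail requests


def shared_immediately(d: Dataset) -> bool:
    """Rule behind the 'immediate sharing' rate: published information only."""
    return d.complete_in_paper


def counted_as_shared(d: Dataset) -> bool:
    """Rule used in the comparative analyses: immediate sharing, or data
    received from authors who had declared availability upon request."""
    return d.complete_in_paper or (
        d.available_on_request and d.received_after_request
    )


def sharing_rate(datasets, rule) -> float:
    """Fraction of datasets that satisfy the given rule."""
    if not datasets:
        raise ValueError("no datasets")
    return sum(1 for d in datasets if rule(d)) / len(datasets)
```

Under this sketch, a dataset obtained by e-mail from authors who had given no indication of availability still counts as withheld in the comparative analyses, which is how we read the rule stated above.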
Datasets were divided into evolutionary, medical and forensic fields. All three research fields study genetic and genomic differences within and among populations, but can be distinguished according to their final objectives. Essentially, we assigned to Human Evolutionary Genetics the papers (and the corresponding datasets) concerned with the evolutionary history of human groups, mainly in terms of demography and adaptation, or with the evolutionary processes acting on the human genome. Papers dealing with the identification of individuals or the testing of parentage relationships for legal purposes were allocated to the Forensic Genetics field. Finally, we assigned to Medical Genetics the publications concerned with the causes and inheritance of genetic disorders, as well as with their diagnosis and management. When a paper could plausibly be assigned to more than one field, or its research aims were ambiguous or not explicit, the ISI category of the scientific journal was used as an additional criterion.
The type of editorial policy was rated using the information provided in the guide to authors of each journal: weak editorial policies are those where the authors are invited to share data, whereas in strong policies, data sharing is indicated as mandatory (see ref. 9 for a more detailed analysis of journal policies). Impact factor ranks were based on impact factor values released by ISI Reuters in June 2009.
We also determined the number of citations received by shared and withheld datasets and estimated the proportion of resources used to generate the data analyzed here. Citations were counted using the Scopus database (http://www.scopus.com). In order to make data comparable, citations were weighted by the number of months elapsed since the publication of the cited paper. Very recent papers (published in the last six months of 2011) and self-citations from all authors were excluded from this analysis. To disentangle the effects of the variables that could potentially influence the number of citations, a multivariate analysis was carried out using a linear regression approach with the impact factor, time since publication and number of authors as covariates. Following Piwowar et al. 2007, the number of citations and impact factor were log transformed.
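A minimal sketch of the citation analysis may help to fix ideas. It is not the code used in the study, and several choices are assumptions made for illustration only: the function names are hypothetical, citations are transformed as log(1 + x) so that uncited papers can be retained, the sharing status enters the model as a 0/1 predictor, and the regression is fitted by ordinary least squares through the normal equations.

```python
import math


def citations_per_month(citations, months_since_publication):
    """Weight a raw citation count by the time the paper has been citable."""
    if months_since_publication <= 0:
        raise ValueError("months_since_publication must be positive")
    return citations / months_since_publication


def log_citations(citations):
    """Assumed transform: log(1 + x), which tolerates zero citations."""
    return math.log1p(citations)


def design_row(shared, impact_factor, months, n_authors):
    """Predictors for one dataset: sharing status (assumed 0/1 coding),
    log impact factor, months since publication, number of authors."""
    return [1.0 if shared else 0.0, math.log(impact_factor),
            float(months), float(n_authors)]


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: one list of predictors per dataset; an intercept is added here.
    Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]
```

In this sketch the response passed to `fit_ols` would be `log_citations(...)` for each dataset and the predictors would come from `design_row(...)`; the coefficient on the sharing indicator then estimates the association between sharing and citations once the covariates are held fixed.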
In order to obtain an approximate estimate of the resources used for the production of shared and withheld datasets, we first defined the parameter “Cost unit” (CU) for each type of mitochondrial and Y-chromosomal polymorphism. Essentially, the adopted CU values are based on the number of sequencer runs needed to generate the corresponding data (Table S4). We considered two different CU values for complete mtDNA sequencing, mtDNA SNP genotyping and Y-chromosome SNP genotyping, since their cost may vary substantially depending on the method used. The approximate cost of each dataset was obtained by multiplying the cost unit(s) of the polymorphism(s) analyzed by the number of individuals actually genotyped for each polymorphism. In these calculations, we assumed that data sharing does not imply any additional cost. In fact, depositing data in most of the online databases for mtDNA and Y-chromosome polymorphisms (e.g. GenBank, YHRD and EMPOP, see below) is completely free. Furthermore, nothing is usually paid to publishers for supplementary online material.
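The cost calculation can be sketched as follows. This is an illustrative reading of the procedure, not the original implementation: the function name is hypothetical, each polymorphism type is assumed to carry a (low, high) pair of CU values (identical when only one value applies), and the polymorphism labels and CU figures in the usage note are placeholders rather than the values of Table S4.

```python
def dataset_cost_range(genotyped, cost_units):
    """Approximate cost of one dataset, in cost units.

    genotyped:  polymorphism type -> number of individuals actually
                genotyped for that polymorphism.
    cost_units: polymorphism type -> (low CU, high CU) per individual.
    Returns (low total, high total); no extra cost is added for sharing.
    """
    low = 0.0
    high = 0.0
    for polymorphism, n_individuals in genotyped.items():
        cu_low, cu_high = cost_units[polymorphism]
        low += cu_low * n_individuals
        high += cu_high * n_individuals
    return low, high
```

For example, with placeholder values `cost_units = {"type_a": (1.0, 1.0), "type_b": (2.0, 5.0)}` and `genotyped = {"type_a": 100, "type_b": 10}`, the function returns `(120.0, 150.0)`. Summing these ranges separately over shared and withheld datasets gives the proportion of resources associated with each category.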
A file (in Access format; File S1) that allows a step-by-step reproduction of our protocol is provided as supplementary material.