🏆 Finalist — NIH Data Sharing Index (“S-Index”) Challenge
Demo corpus. Scores are computed on a select set of biomedical papers and datasets and may be inaccurate for papers outside this corpus, because DataRank relies on network effects that improve with scale. We aim to expand this into a fully open resource pending additional funding.

A map of human genome variation from population-scale sequencing

Nature (2010) · DOI: 10.1038/nature09534 · Source: DataRank Database

A map of human genome variation from population-scale sequencing is a dataset published in Nature (2010). On the S-Index it has a DataRank of 32.4, placing it in the top 0.2% of the data-sharing corpus. It has been cited 8,067 times, with 200 citing works in its 1-hop citation network. Its calibrated FAIR score is 72/100.

DataRank 32.4 · Top 1% (percentile)
Dataset · Open Access · 8,067 citations · base score 9.0
datarank_citation_only_1hop_v6 · scope data_only

Abstract

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Data sources & pipeline
Pipeline: Metadata → Data-paper check → Enrichment → Citation network → Scoring
Enrichment: Pending

FAIR Checklist

Context only (not used in score)
Findable (1/2)
  • Has DOI
Accessible (1/2)
  • Open Access
Interoperable (0/2)
Reusable (1/3)
  • Dataset classification

    FAIR checklist signals are shown for context only and do not affect DataRank scoring.

    FAIR score: 72/100
    Findable: 100
    Accessible: 70
    Interoperable: 50
    Reusable: 67
    Top 2% by FAIR · deterministic · ✓ full text read

    Calibrated FAIR score: a parallel quality metric, independent of the DataRank citation score.

    DataRank Breakdown

    Base Score 4% · Citation Network 96%

    Base Score Contribution: 1.3 (from this paper's citation signal)

    Citation Network Contribution: 31.0 (from 200 citing papers with measurable signal)


    Top 5 citers driving the network score

    Ranked by citation count, the same ordering the engine uses when summing log1p(C_q) over citers.

    1. The Sequence Alignment/Map format and SAMtools
      Bioinformatics · 2009 · 66,179 citations · DataRank 1.7
    2. Fast gapped-read alignment with Bowtie 2
      Nature Methods · 2012 · 59,681 citations · DataRank 1.6
    3. A global reference for human genetic variation
      Nature · 2015 · 19,823 citations · DataRank 11.1 · Top 19%
    4. An integrated encyclopedia of DNA elements in the human genome
      Nature · 2012 · 19,311 citations · DataRank 23.8 · Top 3%
    5. The variant call format and VCFtools
      Bioinformatics · 2011 · 17,436 citations · DataRank 1.5
    Why this DataRank?

    DataRank blends this paper's own citation count with the influence of the papers that cite it. Here, roughly 4% comes from its base citations and 96% from the citation network (200 citing papers contributed measurable signal).
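The 4% / 96% split can be checked from the two displayed contributions. This is a minimal arithmetic sketch, assuming the card values (1.3 and 31.0) are the already-weighted contributions; the function name `share_percent` is illustrative and not part of DataRank. Because the cards are rounded, their sum (32.3) differs slightly from the displayed DataRank of 32.4.

```python
# Displayed, already-weighted contributions from the breakdown cards.
base_contribution = 1.3
network_contribution = 31.0

total = base_contribution + network_contribution


def share_percent(part: float, whole: float) -> int:
    """Return part/whole as a whole-number percentage (illustrative helper)."""
    return round(100 * part / whole)


base_share = share_percent(base_contribution, total)
network_share = share_percent(network_contribution, total)

print(f"total={total:.1f} base={base_share}% network={network_share}%")
```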

    Base score B(p)
    log1p(citation_count) — grows sub-linearly, so a paper with 1,000 citations is not 10× a paper with 100.
    Network N(p)
    Σ over citers q of log1p(C_q) ÷ max(outdegree_q, 1), where C_q is the citer's own citation count and outdegree_q is its number of references. Being cited by a highly-cited paper with few references counts most.
    Damping factor d = 0.85
    DataRank = (1−d)·B(p) + d·N(p) — the two cards above are each already multiplied by their share.
    Self-citations excluded
    Citers sharing any OpenAlex author ID with this paper are filtered out before the network sum.
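The definitions above can be put together in a short sketch. It follows the stated formulas (base score, network sum, damping, self-citation filter), but the class `Citer`, its field names, and the example citers with their outdegrees and author IDs are hypothetical; this is not the engine's actual code.

```python
import math
from dataclasses import dataclass

D = 0.85  # damping factor as stated above


@dataclass(frozen=True)
class Citer:
    cited_by_count: int    # C_q: how often the citing paper is itself cited
    outdegree: int         # number of references the citing paper makes
    author_ids: frozenset  # author IDs (hypothetical field name)


def base_score(citation_count: int) -> float:
    """B(p) = log1p(citation_count)."""
    return math.log1p(citation_count)


def network_score(citers, paper_author_ids) -> float:
    """N(p): sum of log1p(C_q) / max(outdegree_q, 1) over non-self citers."""
    total = 0.0
    for q in citers:
        if q.author_ids & paper_author_ids:
            continue  # shares an author with the paper: treated as self-citation
        total += math.log1p(q.cited_by_count) / max(q.outdegree, 1)
    return total


def datarank(citation_count, citers, paper_author_ids, d=D) -> float:
    """DataRank = (1 - d) * B(p) + d * N(p)."""
    return (1 - d) * base_score(citation_count) + d * network_score(
        citers, paper_author_ids
    )


# Hypothetical inputs: outdegrees and author IDs are made up for illustration.
paper_authors = frozenset({"A1", "A2"})
citers = [
    Citer(66179, 10, frozenset({"A9"})),
    Citer(500, 0, frozenset({"A8"})),   # outdegree 0 is clamped to 1
    Citer(1000, 5, frozenset({"A1"})),  # shares author A1, so it is skipped
]

print(round(base_score(8067), 1))          # base score for 8,067 citations
print(round((1 - D) * base_score(8067), 1))  # weighted base contribution
print(datarank(8067, citers, paper_authors))
```

With 8,067 citations the base score rounds to 9.0 and its weighted contribution to 1.3, matching the values shown on the page. The network term cannot be reproduced here because the real citers' outdegrees are not listed.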

    Citers are pulled from OpenAlex sorted by cited_by_count:desc and capped per paper, so when the cap binds we keep the highest-signal references and the score is reproducible across reruns.
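A sort-then-cap selection of this kind might look like the sketch below. The function `select_citers`, the record layout, the IDs `W1` to `W5`, and the cap of 3 are assumptions for illustration; the tie-break on `id` is also an assumption, added so equal citation counts cannot reorder between runs. The citation counts are the five listed above.

```python
def select_citers(citers, cap):
    """Keep the `cap` most-cited citers in a deterministic order.

    Sorts by cited_by_count descending; ties are broken by id (an assumed
    rule) so the result does not depend on input order.
    """
    ordered = sorted(citers, key=lambda c: (-c["cited_by_count"], c["id"]))
    return ordered[:cap]


# Hypothetical records: IDs are placeholders, counts come from the top-5 list.
citers = [
    {"id": "W3", "cited_by_count": 19823},
    {"id": "W5", "cited_by_count": 17436},
    {"id": "W1", "cited_by_count": 66179},
    {"id": "W4", "cited_by_count": 19311},
    {"id": "W2", "cited_by_count": 59681},
]

kept = select_citers(citers, cap=3)
print([c["id"] for c in kept])
```

Because the ordering is fully determined by the sort key, feeding the same citers in any order yields the same capped list.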


    [Interactive citation network graph. Node colors: Center, Data Paper, Data + Open Access, Non-data, Selected & links. Node size = percentile rank.]

    Authors (324)

    Min Hu, Yuan Chen, Matthew N. Bainbridge, Richard M. Durbin, Si Quang Le