🏆 Finalist — NIH Data Sharing Index (“S-Index”) Challenge
Demo corpus. Scores are computed on a select set of biomedical paper/datasets and may be inaccurate for papers outside this corpus — DataRank relies on network effects that improve with scale. We aim to expand this into a fully open resource pending additional funding.

The Sequence of the Human Genome

Science (2001) · 10.1126/science.1058040 · Source: DataRank Database

The Sequence of the Human Genome is a dataset published in Science (2001). On theSindex it has a DataRank of 18.7, placing it in the top 7.1% of the data-sharing corpus. It has been cited 13,648 times, with 175 citing works in its 1-hop citation network. Its calibrated FAIR score is 59/100.

DataRank: 18.7 · Top 7% (percentile)
Dataset · 13,648 citations · base score 9.5
datarank_citation_only_1hop_v6 · scope data_only · Methodology

Abstract

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

Data sources & pipeline
Pipeline: Metadata → Data-paper check → Enrichment → Citation network → Scoring
Enrichment: Pending

FAIR Checklist

Context only (not used in score)
Findable (2/2)
  • Has DOI
  • Indexed in repositories
Accessible (0/2)
Interoperable (2/2)
  • DataCite relations
  • Linked datasets
Reusable (1/3)
  • Dataset classification

    FAIR checklist signals are shown for context only and do not affect DataRank scoring.

    FAIR score: 59/100
    Findable 100 · Accessible 70 · Interoperable 50 · Reusable 17
    Top 10% by FAIR · deterministic · ⚠ abstract only
    Estimated from the abstract only. The agent couldn't read this paper's full text, so body-dependent criteria (data-availability statement, formats, license) are inferred. For a confident score, upload the PDF or supply full text.

    Calibrated FAIR score: a parallel quality metric, independent of the DataRank citation score.
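The page does not say how the four FAIR sub-scores combine into the overall figure. As an observation only, the displayed values (100, 70, 50, 17) are consistent with an unweighted mean rounded to the nearest integer. The sketch below checks that arithmetic; the unweighted-mean aggregation is an assumption, not a documented formula.

```python
from statistics import mean

# Sub-scores exactly as displayed on the page.
fair_subscores = {
    "Findable": 100,
    "Accessible": 70,
    "Interoperable": 50,
    "Reusable": 17,
}

# ASSUMPTION: unweighted mean; the page does not document the aggregation.
overall = mean(fair_subscores.values())
print(overall, round(overall))  # 59.25 59
```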

    DataRank Breakdown

    Base Score 8% · Citation Network 92%

    Base Score Contribution: 1.4 (from this paper's citation signal)
    Citation Network Contribution: 17.3 (from 175 citing papers with measurable signal)
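The split between the two contributions can be reproduced from numbers shown on this page. The sketch below uses the citation count (13,648), the displayed DataRank (18.7), and the damping factor d = 0.85 and base formula log1p(citation_count) given under "Why this DataRank?". The network part is derived here by subtraction from the rounded total, which is an assumption made for illustration; the engine presumably computes it from the citers directly.

```python
import math

D = 0.85                   # damping factor stated on the page
CITATIONS = 13_648         # citation count shown on the page
DISPLAYED_DATARANK = 18.7  # rounded total shown on the page

# Base contribution: (1 - d) * log1p(citation_count)
base_part = (1 - D) * math.log1p(CITATIONS)

# ASSUMPTION: network contribution implied by the rounded total.
network_part = DISPLAYED_DATARANK - base_part

base_share = base_part / DISPLAYED_DATARANK
network_share = network_part / DISPLAYED_DATARANK

print(f"base {base_part:.1f} ({base_share:.0%}), "
      f"network {network_part:.1f} ({network_share:.0%})")
# base 1.4 (8%), network 17.3 (92%)
```

Note that dividing the rounded card value 1.4 by 18.7 gives about 7.5%, so the 8% share shown on the page only reproduces when the unrounded base contribution is used.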


    Top 5 citers driving the network score

    Ranked by citation count, the same ordering the engine uses when summing log1p(C_q) over citers.

    1. Basic local alignment search tool
      Journal of Molecular Biology · 1990 · 93,553 citations · DataRank 1.7
    2. DNA sequencing with chain-terminating inhibitors
      Proceedings of the National Academy of Sciences · 1977 · 69,231 citations · DataRank 1.7
    3. Emergence of Scaling in Random Networks
      Science · 1999 · 36,177 citations · DataRank 1.6
    4. Initial sequencing and comparative analysis of the mouse genome
      Nature · 2002 · 7,236 citations · DataRank 16.2 · Top 10%
    5. A haplotype map of the human genome
      Nature · 2005 · 5,917 citations · DataRank 29.2 · Top 1%

    Why this DataRank?

    DataRank blends this paper's own citation count with the influence of the papers that cite it. Here, roughly 8% comes from its base citations and 92% from the citation network (175 citing papers contributed measurable signal).

    Base score B(p)
    log1p(citation_count) — grows sub-linearly, so a paper with 1,000 citations is not 10× a paper with 100.
    Network N(p)
    Σ over citers q of log1p(C_q) ÷ max(outdegree_q, 1). Being cited by a highly-cited paper with few references counts most.
    Damping factor d = 0.85
    DataRank = (1−d)·B(p) + d·N(p) — the two cards above are each already multiplied by their share.
    Self-citations excluded
    Citers sharing any OpenAlex author ID with this paper are filtered out before the network sum.
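The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's code: the function names and the `(cited_by_count, outdegree)` tuple shape are hypothetical, and the citer values in the usage lines are made up.

```python
import math

DAMPING = 0.85  # d, as stated on the page


def base_score(citation_count: int) -> float:
    """B(p) = log1p(citation_count)."""
    return math.log1p(citation_count)


def network_score(citers) -> float:
    """N(p) = sum over citers q of log1p(C_q) / max(outdegree_q, 1).

    `citers` is an iterable of (cited_by_count, outdegree) pairs;
    this input shape is an assumption made for the sketch.
    """
    return sum(math.log1p(c_q) / max(outdegree_q, 1)
               for c_q, outdegree_q in citers)


def datarank(citation_count: int, citers, d: float = DAMPING):
    """Return (base contribution, network contribution, total)."""
    base_part = (1 - d) * base_score(citation_count)
    network_part = d * network_score(citers)
    return base_part, network_part, base_part + network_part


# Made-up citers: (cited_by_count, outdegree). An outdegree of 0 is
# clamped to 1 by max(..., 1), so it never divides by zero.
toy_citers = [(5000, 40), (120, 0), (30, 25)]
base_part, network_part, total = datarank(13_648, toy_citers)
print(f"base={base_part:.2f} network={network_part:.2f} total={total:.2f}")
```

Because both terms pass through `log1p`, a tenfold increase in citations adds a roughly constant amount to the score instead of multiplying it, which matches the sub-linear behaviour described for B(p).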

    Citers are pulled from OpenAlex sorted by cited_by_count:desc and capped per paper, so when the cap binds we keep the highest-signal references and the score is reproducible across reruns.
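A rough sketch of the selection step, working on in-memory records instead of calling OpenAlex. The `Citer` class, the `select_citers` helper, and the IDs are hypothetical. Two choices here are assumptions the page does not confirm: the cap is applied before the self-citation filter, and ties in `cited_by_count` are broken by work ID so that reruns return the same set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citer:
    work_id: str           # hypothetical identifier
    cited_by_count: int
    author_ids: frozenset  # author IDs on the citing work


def select_citers(citers, paper_author_ids, cap):
    """Cap the citer list deterministically, then drop self-citations."""
    # Highest cited_by_count first; work_id tie-break is an ASSUMPTION
    # added so equal counts always sort the same way.
    ranked = sorted(citers, key=lambda c: (-c.cited_by_count, c.work_id))
    capped = ranked[:cap]

    # Remove any citer sharing at least one author ID with the paper.
    paper_authors = frozenset(paper_author_ids)
    return [c for c in capped if not (c.author_ids & paper_authors)]


pool = [
    Citer("W3", 10, frozenset({"A9"})),
    Citer("W1", 500, frozenset({"A1", "A7"})),  # shares A1: self-citation
    Citer("W2", 10, frozenset({"A8"})),
    Citer("W4", 90, frozenset({"A5"})),
]
kept = select_citers(pool, {"A1", "A2"}, cap=3)
print([c.work_id for c in kept])  # ['W4', 'W2']
```

Sorting on a total order (count, then ID) is what makes the capped set stable: without a tie-break, two citers with equal counts could swap places at the cap boundary between runs.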



    Authors (345)

    Mark D. Adams, Eugene W. Myers, Peter W. Li, Richard J. Mural, Granger G. Sutton

    Related Papers (10)

    Science (2022) · co-cited, same journal · 10.1126/science.abj6987
    The Protein Data Bank · Nucleic Acids Research (2000) · co-cited · 10.1093/nar/28.1.235 · DataRank 32.3 · Top 1%
    Science (2009) · co-cited, same journal · 10.1126/science.1162986
    Basic local alignment search tool · Journal of Molecular Biology (1990) · co-cited · 10.1016/s0022-2836(05)80360-2 · DataRank 1.7 · unranked