Pilot corpus: NIH-funded biomedical datasets. Any DOI can be scored on demand — network effects sharpen as coverage grows.

UniProt: the universal protein knowledgebase in 2021

Nucleic Acids Research (2020) · DOI: 10.1093/nar/gkaa1100 · Source: DataRank Database

UniProt: the universal protein knowledgebase in 2021 is a dataset published in Nucleic Acids Research (2020). On the Sindex it has a DataRank of 10.8, placing it in the top 1% of the data-sharing corpus. It has been cited 7,078 times, with 100 citing works in its 1-hop citation network. Its calibrated FAIR score is 88/100.

Percentile: Top 1%

DataRank: 10.8

Ranks in the top 1% for downstream scientific impact

Data paper · Open Access · 7,078 citations · 100 citing works in its network


Linked data & code

gen chebi uniprot

DataRank reads this dataset's downstream impact straight off the citation graph: no black box, no proprietary weighting.

Methodology & internals

datarank_citation_only_1hop_v6 · scope: data_only

Pipeline: Metadata → Data-paper check → Enrichment → Citation network → Scoring

Enrichment: Funding (20 grants) · Impact (FWCI 381.28) · OA: gold · Topics (60) · IDs (PubMed, PMC, OpenAlex) · Annotations (3) · SDGs (2)

FAIR Checklist

Context only (not used in score)

Findable (1/2)

Has DOI

Accessible (1/2)

Open Access

Interoperable (0/2)

Reusable (1/3)

Dataset classification

FAIR checklist signals are shown for context only and do not affect DataRank scoring.

FAIR score: 88

Full FAIR picture · advisory

F Findable · A Accessible · I Interoperable (100) · R Reusable

Top 1% by FAIR · LLM-assessed · ✓ full text read

How the score was derived

The headline score is computed from the scored criteria — the fact-shaped checks (a repository, an accession, a licence) that two independent models agree on. The advisory criteria below are real FAIR guidance but rest on judgment calls that models read differently, so they inform without moving the number.

F Findable · all criteria · 61

llm · Persistent identifier for the data · partial · 50

“UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/ .”

The only identifier given for the data is a web URL (https://www.uniprot.org/), which is not a persistent identifier scheme. [majority verdict 'partial' (4/5 passes agreed)]

RDA-F1-01D — FAIR Data Maturity Model: 'Data is identified by a persistent identifier' (priorit · RDA-F1-02D — FAIR Data Maturity Model: 'Data is identified by a globally unique identifier' · FsF-F1-02D — F-UJI/FAIRsFAIR: 'Data is assigned a persistent identifier'

llm · Named repository · yes · 100

“UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/ .”

UniProt is named as the holder, and it is a known data repository listed in re3data.org. [majority verdict 'yes' (4/5 passes agreed)]

RDA-F4-01M — FAIR Data Maturity Model: metadata is offered so it can be harvested and indexed ( · NIH DMS Policy Element 4 (NOT-OD-21-014) — name the repository where data will be archived · NSTC Desirable Characteristics of Data Repositories (2022) — 'Long-Term Sustainability', 'Reten

llm · Dataset formally cited · partial · 50

“UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/ .”

The dataset's identifier (the URL) appears only in body text, not in the reference list. [majority verdict 'partial' (4/5 passes agreed)]

FORCE11 Joint Declaration of Data Citation Principles (2014) — data should be cited as a first- · RDA-F3-01M — metadata clearly and explicitly includes the identifier of the data it describes · FsF-F3-01M — F-UJI: 'Metadata includes the identifier of the data it describes'

Advisory · not in the published score

llm · Data-availability statement · yes · 100

“Due to the ever-increasing number of sequence records UniProt is processing with every release cycle, as of release 2020_01 (26 February 2020), UniProt releases are now published every eight weeks. This gives our production team the time required to complete data import, proteome redundancy removal, data checking, integration of external data and automatic annotation of unreviewed records prior to starting the release process. In addition to providing customizable views and downloads in a range of formats via the website, and file sets at the FTP site ( www.uniprot.org/downloads ), UniProt supplies users with a number of different options for computational access to the data ( www.uniprot.org/help/programmatic_access ).”

The Data Availability statement points to the UniProt website and FTP site, which is a repository record. [majority verdict 'yes' (4/5 passes agreed)]

Colavizza, Hrynaszkiewicz, Staden, Whitaker & McGillivray (2020), 'The citation advantage of li · Springer Nature research data policy — Data Availability Statements: standard statement templat · RDA-F3-01M — metadata clearly and explicitly includes the identifier of the data it describes

llm · Description of the dataset as an object · no · 0

“UniProt release 2020_04 contains over 189 million sequence records (Figure 1 ), with >292 000 proteomes” (not found in the paper; verdict downgraded)

The dataset's extent is described in running prose, not in an itemised inventory or table. [downgraded to 'no' — no verifiable quote from the paper] [majority verdict 'no' (3/5 passes agreed)]

RDA-F2-01M — 'Rich metadata is provided to allow discovery' (priority Essential) · FsF-F2-01M — F-UJI: 'Metadata includes descriptive core elements to support data findability' · FsF-R1-01MD — F-UJI: 'Metadata specifies the content of the data'

A Accessible · all criteria · 75

llm · Access route free of preconditions · yes · 100

“UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/ .”

The data are stated to be freely available under an open license with no precondition. [majority verdict 'yes' (4/5 passes agreed)]

RDA-A1.1-01D — 'Data is accessible through a free access protocol' · FsF-A1-01M — F-UJI: 'Metadata contains access level and access conditions of the data' · NSTC Desirable Characteristics of Data Repositories (2022) — 'Free and Easy Access'

Advisory · not in the published score

llm · Access level labelled · yes · 100

“UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/ .”

The paper explicitly labels the data as 'freely accessible' and under a CC-BY (4.0) license, which are open access labels.

FsF-A1-01M — F-UJI: 'Metadata contains access level and access conditions of the data' · RDA-A1-01M — metadata contains information to enable the user to get access to the data · COAR Controlled Vocabularies — Access Rights v1.0 (open / embargoed / restricted / metadata-onl

llm · Gatekeeper for sensitive data · no · 0

The data are protein sequences and not human-subject or sensitive data; no gatekeeper is named because none is needed.

NIH Genomic Data Sharing Policy (NOT-OD-14-124) — controlled-access via a Data Access Committee · RDA-A1.2-01D — 'Data is accessible through an access protocol that supports authentication and · NIH DMS Policy Element 5 (NOT-OD-21-014) — Access, Distribution, or Reuse Considerations (conse

llm · Availability timing & retention · no · 0

No sentence in the paper states how long the data will remain available or gives a retention period. [majority verdict 'no' (3/5 passes agreed)]

NIH DMS Plan Element 4 (NOT-OD-21-014) — Data Preservation, Access, and Associated Timelines · NSTC Desirable Characteristics (2022), Organizational Infrastructure: 'Retention Policy' · RDA-A2-01M — 'Metadata is guaranteed to remain available after data is no longer available'

I Interoperable · all criteria · 100

llm · Open file format · yes · 100

“making data available in a number of community recognised formats, such as text, XML and RDF”

The paper lists open standard formats including XML, RDF, and FASTA for the data.

FsF-R1.3-02D — F-UJI: 'Data is available in a file format recommended by the target research co · RDA-R1.3-02D — data is expressed in a machine-understandable community standard · RDA-I1-01D — data uses a knowledge representation expressed in a standardised format

Advisory · not in the published score

llm · Community standard / vocabulary · yes · 100

“such as the Gene Ontology (GO)”

The paper names the Gene Ontology, a community data standard, as applied to the data. [majority verdict 'yes' (4/5 passes agreed)]

RDA-R1.3-01M — 'Metadata complies with a community standard' (priority Essential) · RDA-R1.3-01D — 'Data complies with a community standard' · RDA-I2-01M — '(Meta)data use vocabularies that follow FAIR principles'

llm · Identifiers for the resources the data depend on · yes · 100

“accession number MN908947”

The paper includes an identifier (INSDC accession MN908947) for a resource other than its own dataset.

RDA-I3-01M — '(meta)data include references to other (meta)data' · RDA-I3-03M — 'metadata includes qualified references to other metadata' · FsF-I3-01M — F-UJI: 'Metadata includes links between the data and its related entities'

R Reusable · all criteria · 83

llm · Reuse licence · yes · 100

“UniProt resources are available under a CC-BY (4.0) license”

The data is licensed under CC-BY (4.0), an open standard license.

RDA-R1.1-01M — 'Metadata includes information about the licence under which the data can be reu · RDA-R1.1-02M — 'Metadata refers to a standard reuse licence' · RDA-R1.1-03M — 'Metadata refers to a machine-understandable reuse licence'

llm · Snapshot identified · yes · 100

“UniProt release 2020_04 contains over 189 million sequence records”

The paper identifies the specific data snapshot by release version (2020_04).

DataCite Metadata Schema 4.6 — the 'Version' property · RDA-R1.2-01M — provenance information (which version was used is provenance) · NSTC Desirable Characteristics of Data Repositories (2022) — 'Provenance', 'Retention Policy'

llm · Analysis code available · yes · 100

“UniFIRE is an open-source Java-based framework and tool developed to apply the UniProt annotation rules on given protein sequences and provided by UniProt to share our knowledge in computational annotation and our rule-based systems ( https://gitlab.ebi.ac.uk/uniprot-public/unifire ).”

The paper provides a machine-resolvable URL (GitLab) for the study's own code. [majority verdict 'yes' (4/5 passes agreed)]

NIH DMS Policy Element 2 (NOT-OD-21-014) — 'Related Tools, Software and/or Code' · FAIR4RS Principles v1.0 (Chue Hong et al., 2022; RDA/FORCE11/ReSA) — FAIR Principles for Resear · FORCE11 Software Citation Principles (Smith, Katz & Niemeyer, 2016, PeerJ CS 2:e86)

llm · Funder and award number · yes · 100

“National Institutes of Health [U24HG007822]”

The paper lists specific award/grant numbers (e.g., U24HG007822) for the funding.

DataCite Metadata Schema 4.6 — 'FundingReference' property (funderName, funderIdentifier, award · Crossref Funder Registry — canonical funder identifiers for funding metadata · RDA-F2-01M — rich metadata provided to allow discovery (funding is part of the descriptive reco

Advisory · not in the published score

llm · Provenance of the data · yes · 100

“We have adopted the MMseqs2 algorithm to improve the speed of UniRef production”

The paper names specific tools and algorithms (e.g., MMseqs2, BUSCO v3, ARBA) used to produce the data.

RDA-R1.2-01M — 'Metadata includes provenance information according to community- specific standa · FsF-R1.2-01M — F-UJI: 'Metadata includes provenance information about data creation or generati · W3C PROV-O (W3C Recommendation, 2013) — the entity/activity/agent model of provenance

llm · Documentation / codebook · no · 0

No documentation object (README, codebook, data dictionary) is named as travelling with the data, and no variable-definition table is provided. [majority verdict 'no' (4/5 passes agreed)]

RDA-R1-01M — '(Meta)data are richly described with a plurality of accurate and relevant attribu · FsF-R1-01MD — F-UJI: 'Metadata specifies the content of the data' · NIH DMS Policy Element 3 (NOT-OD-21-014) — Standards (documentation and metadata to accompany t
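Several criteria above report a label such as "majority verdict 'partial' (4/5 passes agreed)", and each verdict is shown next to a value of 100 (yes), 50 (partial), or 0 (no). The sketch below shows one way such a label could be produced from repeated passes. It is an illustration only: the function name, the example pass outputs, and the tie-breaking behaviour are assumptions, and the page does not describe how the passes are actually run or combined.

```python
from collections import Counter

# Verdict-to-points mapping as displayed next to each criterion on this page.
VERDICT_POINTS = {"yes": 100, "partial": 50, "no": 0}


def majority_verdict(passes):
    """Return (verdict, votes, total, points) for the most common verdict.

    Hypothetical helper: ties are resolved in favour of the verdict seen
    first, which is an assumption rather than documented behaviour.
    """
    if not passes:
        raise ValueError("at least one pass is required")
    verdict, votes = Counter(passes).most_common(1)[0]
    return verdict, votes, len(passes), VERDICT_POINTS[verdict]


# Hypothetical outputs of five passes over one criterion.
example_passes = ["partial", "partial", "yes", "partial", "partial"]
verdict, votes, total, points = majority_verdict(example_passes)
print(f"majority verdict '{verdict}' ({votes}/{total} passes agreed), value {points}")
```

A plain majority keeps the label easy to audit: the agreement count travels with the verdict, so a reader can see how contested each judgment was.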

Calibrated FAIR score: a parallel quality metric, independent of the DataRank citation score.

DataRank Breakdown

Base Score 12% · Citation Network 88%

Base Score Contribution

1.3

From this paper's citation signal

Citation Network Contribution

9.5

From 100 citing papers with measurable signal
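The figures in this breakdown are consistent with the headline DataRank being the sum of the two contributions, with the percentages being each part's share of that sum. The short check below works under that assumption; the page does not state the combination rule, and the variable names are illustrative.

```python
# Values copied from the breakdown above.
base_contribution = 1.3      # "Base Score Contribution"
network_contribution = 9.5   # "Citation Network Contribution"

# Assumption: DataRank = base + network, rounded to one decimal place.
datarank = round(base_contribution + network_contribution, 1)

# Each part's share of the total, as a whole-number percentage.
base_share = round(100 * base_contribution / datarank)
network_share = round(100 * network_contribution / datarank)

print(f"DataRank {datarank}: base {base_share}%, network {network_share}%")
```

Under this reading the numbers reproduce the displayed DataRank of 10.8 and the 12% / 88% split.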


Top 5 citers driving the network score

Ranked by each citer's contribution to N(p): log1p(C_q), where C_q is the citer's own citation count, divided by the citer's reference count. Shown are the top 5 of 100 citers.

Protein structure predictions to atomic accuracy with AlphaFold
Nature Methods · 2022 · 293 citations · 14 references · Contributes 0.406
AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models
Nucleic Acids Research · 2021 · 8,367 citations · 23 references · Contributes 0.393
IBS 2.0: an upgraded illustrator for the visualization of biological sequences
Nucleic Acids Research · 2022 · 225 citations · 16 references · Contributes 0.339
eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale
Molecular Biology and Evolution · 2021 · 4,762 citations · 25 references · Contributes 0.339
AlphaFold heralds a data-driven revolution in biology and medicine
Nature Medicine · 2021 · 197 citations · 17 references · Contributes 0.311
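The per-citer contribution described above, log1p(C_q) divided by the citer's reference count, can be recomputed from the citation and reference counts in this list. The sketch below does so for the five listed citers. The function and variable names are illustrative, and how the individual contributions are aggregated into N(p) and the network score is not stated on this page, so that step is left out.

```python
import math


def citer_contribution(citations: int, references: int) -> float:
    """Contribution of one citing paper q: log1p(C_q) / reference count."""
    if references <= 0:
        raise ValueError("reference count must be positive")
    return math.log1p(citations) / references


# (title, citation count C_q, reference count), copied from the list above.
TOP_CITERS = [
    ("Protein structure predictions to atomic accuracy with AlphaFold", 293, 14),
    ("AlphaFold Protein Structure Database", 8367, 23),
    ("IBS 2.0", 225, 16),
    ("eggNOG-mapper v2", 4762, 25),
    ("AlphaFold heralds a data-driven revolution in biology and medicine", 197, 17),
]

# Rounded to three decimals to match the precision displayed on the page.
contributions = [round(citer_contribution(c, r), 3) for _, c, r in TOP_CITERS]

for (title, _, _), value in zip(TOP_CITERS, contributions):
    print(f"{value:.3f}  {title}")
```

The recomputed values match the displayed contributions (0.406, 0.393, 0.339, 0.339, 0.311). Dividing by the reference count means a citer with a short reference list passes on more of its own weight to each work it cites.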

Authors (99)

María Martin (ORCID), Sandra Orchard (ORCID), Michele Magrane (ORCID), Rahat Agivetova, Shadab Ahmad (ORCID)

Live enrichment

PubMed: 33237286 · PMC: PMC7778908 · OpenAlex: https://openalex.org/W3112376646

National Institute of General Medical Sciences · Grant: P20GM103446
NHGRI NIH HHS · Grant: U41 HG002273
Biotechnology and Biological Sciences Research Council · Grant: BB/M011674/1
Biotechnology and Biological Sciences Research Council · Grant: BB/T010541/1 · 18-BBSRC-NSF/BIO: CIBR: Implementing an explicit phylogenetic framework for large-scale protein sequence annotation
National Institute of General Medical Sciences · Grant: R01GM080646
National Institutes of Health · Grant: U24HG007822
Biotechnology and Biological Sciences Research Council · Grant: BB/S01781X/1
British Heart Foundation · Grant: RG/13/5/30112
National Institutes of Health · Grant: 5U24HG007822-12 · UniProt: A Protein Sequence and Function Resource for Biomedical Science
National Institutes of Health · Grant: 5R01GM080646-12 · PRO: A Protein Ontology in OBO Foundry for Scalable Integration of Biomedical Knowledge
National Institutes of Health · Grant: 4U41HG002273-16 · Gene Ontology Consortium
National Institutes of Health · Grant: 5P20GM103446-15 · Bioinformatics Core
Open Targets
National Cancer Institute
National Institute of Allergy and Infectious Diseases
European Molecular Biology Laboratory
National Eye Institute
National Heart, Lung, and Blood Institute
Swiss Federal Government
National Institute of Diabetes and Digestive and Kidney Diseases

FWCI

381.28

Citation Percentile

1.0%

Citation Trend (chart of citations per year, 2018 to 2026; values not recoverable)

OA: gold · License: cc-by

journal: https://academic.oup.com/nar/article-pdf/49/D1/D480/35364103/gkaa1100.pdf

publisher: https://academic.oup.com/nar/article-pdf/49/D1/D480/35364103/gkaa1100.pdf

journal: https://doi.org/10.1093/nar/gkaa1100

repository: https://pubmed.ncbi.nlm.nih.gov/33237286

repository: https://archive-ouverte.unige.ch/unige:159643

Fields of Study

Genomics and Phylogenetic Studies · Advanced Proteomics Techniques and Applications · Machine Learning in Bioinformatics · 0301 basic medicine · 03 medical and health sciences · 0303 health sciences

MeSH Terms

COVID-19 · SARS-CoV-2 · Humans · User-Computer Interface · Viral Proteins · Computational Biology · Internet · Proteome · Databases, Protein · Proteomics · Knowledge Bases · Pandemics · Molecular Sequence Annotation · Data Curation

Keywords

UniProt · Biology · Human proteome project · Proteome · Ensembl · Computer science · World Wide Web · Computational biology · Bioinformatics · Genome · Genomics · Proteomics · Genetics · COVID-19 / virology · Computational Biology / methods · Molecular Sequence Annotation / methods · Knowledge Bases · Proteomics / methods · SARS-CoV-2 / physiology · COVID-19 / epidemiology · User-Computer Interface · Viral Proteins · SARS-CoV-2 / genetics · Data Curation / methods · Viral Proteins / genetics · Database Issue · Humans · SARS-CoV-2 / metabolism · Databases, Protein · Pandemics · Data Curation · Proteome / metabolism · Proteome / genetics · Internet · 616.0757 · SARS-CoV-2 · COVID-19 / prevention & control · COVID-19 · Molecular Sequence Annotation · Viral Proteins / metabolism

Sustainable Development Goals

SDG 3: 3. Good health · SDG 0: Good health and well-being