Demo corpus. Scores are computed on a select set of biomedical paper/datasets and may be inaccurate for papers outside this corpus — DataRank relies on network effects that improve with scale. We aim to expand this into a fully open resource pending additional funding.

Comparison of automated and human assignment of MeSH terms on publicly-available molecular datasets

Journal of Biomedical Informatics (2011) · DOI: 10.1016/j.jbi.2011.03.007 · Source: DataRank Database

Comparison of automated and human assignment of MeSH terms on publicly-available molecular datasets is a dataset published in Journal of Biomedical Informatics (2011). On the S index it has a DataRank of 0.768, placing it in the top 44.3% of the data-sharing corpus. It has been cited 14 times, with 13 citing works in its 1-hop citation network. Its calibrated FAIR score is 41/100.

DataRank: 0.768 · Top 44% (percentile)

Dataset · Open Access · 14 citations · base score 2.7


datarank_citation_only_1hop_v6 · scope data_only · Methodology

Abstract

Publicly available molecular datasets can be used for independent verification or investigative repurposing, but depends on the presence, consistency and quality of descriptive annotations. Annotation and indexing of molecular datasets using well-defined controlled vocabularies or ontologies enables accurate and systematic data discovery, yet the majority of molecular datasets available through public data repositories lack such annotations. A number of automated annotation methods have been developed; however few systematic evaluations of the quality of annotations supplied by application of these methods have been performed using annotations from standing public data repositories. Here, we compared manually-assigned Medical Subject Heading (MeSH) annotations associated with experiments by data submitters in the PRoteomics IDEntification (PRIDE) proteomics data repository to automated MeSH annotations derived through the National Center for Biomedical Ontology Annotator and National Library of Medicine MetaMap programs. These programs were applied to free-text annotations for experiments in PRIDE. As many submitted datasets were referenced in publications, we used the manually curated MeSH annotations of those linked publications in MEDLINE as "gold standard". Annotator and MetaMap exhibited recall performance 3-fold greater than that of the manual annotations. We connected PRIDE experiments in a network topology according to shared MeSH annotations and found 373 distinct clusters, many of which were found to be biologically coherent by network analysis. The results of this study suggest that both Annotator and MetaMap are capable of annotating public molecular datasets with a quality comparable, and often exceeding, that of the actual data submitters, highlighting a continuous need to improve and apply automated methods to molecular datasets in public data repositories to maximize their value and utility.

Data sources & pipeline

Pipeline: Metadata → Data-paper check → Enrichment → Citation network → Scoring

Enrichment: Pending

FAIR Checklist

Context only (not used in score)

Findable (1/2)

Has DOI

Accessible (1/2)

Open Access

Interoperable (0/2)

Reusable (1/3)

Dataset classification

FAIR checklist signals are shown for context only and do not affect DataRank scoring.

FAIR score: 41

F Findable

A Accessible

I Interoperable

R Reusable

Top 79% by FAIR · LLM-assessed · ⚠ abstract only

Estimated from the abstract only. The agent couldn't read this paper's full text, so body-dependent criteria (data-availability statement, formats, license) are inferred. For a confident score, upload the PDF or supply full text →

Calibrated FAIR score — a parallel quality metric, independent of the DataRank citation score. See the full evaluation →

DataRank Breakdown

Base Score 53% · Citation Network 47%

Base Score Contribution

0.406

From this paper's citation signal

Citation Network Contribution

0.362

From 8 citing papers with measurable signal
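
The 53% / 47% split follows from the two contribution figures listed above. A minimal sketch of that arithmetic, assuming the shares are simply each contribution divided by their sum (the variable names are illustrative and not taken from the DataRank engine):

```python
# Values copied from the two contribution cards on this page.
base_contribution = 0.406     # "Base Score Contribution"
network_contribution = 0.362  # "Citation Network Contribution"

# The DataRank shown on the page is the sum of the two contributions.
total = base_contribution + network_contribution

# Each share is the contribution's fraction of that total.
base_share = base_contribution / total
network_share = network_contribution / total

print(f"DataRank = {total:.3f}")
print(f"base share = {base_share:.0%}, network share = {network_share:.0%}")
```

Under that assumption the sum reproduces the reported DataRank of 0.768, and the rounded shares match the 53% and 47% shown in the breakdown.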


Top 5 citers driving the network score

Ranked by citation count — the same ordering the engine uses when summing log1p(C_q) over citers.

Cell type–specific gene expression differences in complex tissues
Nature Methods · 2010 · 546 citations · DataRank 0.946

Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks
BMC Bioinformatics · 2005 · 437 citations · DataRank 0.912

Creation and implications of a phenome-genome network
Nature Biotechnology · 2006 · 207 citations · DataRank 12.7 · Top 16%

Ontology-driven indexing of public datasets for translational bioinformatics
BMC Bioinformatics · 2009 · 121 citations · DataRank 8.4 · Top 24%

Disease signatures are robust across tissues and experiments
Molecular Systems Biology · 2009 · 112 citations · DataRank 4.5

Why this DataRank?

DataRank blends this paper's own citation count with the influence of the papers that cite it. Here, roughly 53% comes from its base citations and 47% from the citation network (8 citing papers contributed measurable signal).

Base score B(p): log1p(citation_count) — grows sub-linearly, so a paper with 1,000 citations is not 10× a paper with 100.
Network N(p): Σ over citers of log1p(C_q) ÷ max(outdegree_q, 1). Being cited by a highly-cited paper with few references counts most.
Damping factor d = 0.85: DataRank = (1−d)·B(p) + d·N(p) — the two cards above are each already multiplied by their share.
Self-citations excluded: Citers sharing any OpenAlex author ID with this paper are filtered out before the network sum.
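
The four rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DataRank engine's code: the dictionary keys (`cited_by_count`, `outdegree`, `author_ids`) and the function names are hypothetical, and any normalisation or capping the real pipeline applies is left out.

```python
import math

D = 0.85  # damping factor d, as stated in the text


def base_score(citation_count):
    """B(p) = log1p(citation_count)."""
    return math.log1p(citation_count)


def network_score(citers, paper_author_ids):
    """N(p): sum over citers q of log1p(C_q) / max(outdegree_q, 1).

    Citers sharing any author ID with the paper are skipped
    (self-citation exclusion). The dict keys are hypothetical.
    """
    total = 0.0
    for q in citers:
        if paper_author_ids & set(q["author_ids"]):
            continue  # self-citation: excluded before the sum
        total += math.log1p(q["cited_by_count"]) / max(q["outdegree"], 1)
    return total


def datarank(citation_count, citers, paper_author_ids):
    """DataRank = (1 - d) * B(p) + d * N(p)."""
    b = base_score(citation_count)
    n = network_score(citers, paper_author_ids)
    return (1 - D) * b + D * n


# Check against this page: 14 citations give the listed base score
# and base contribution.
b = base_score(14)
print(round(b, 1), round((1 - D) * b, 3))  # 2.7 0.406
```

With 14 citations the sketch reproduces the base score of 2.7 and the base contribution of 0.406 reported here. The network term cannot be reproduced from this page alone, because the outdegree of each citer is not listed.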

Citers are pulled from OpenAlex sorted by cited_by_count:desc and capped per paper, so when the cap binds we keep the highest-signal references and the score is reproducible across reruns.
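
A rough sketch of the sort-and-cap step, applied to citer records already held in memory. The cap value is not stated on this page, so it is passed in as a parameter. The function name, the `id` key, and the tie-break on `id` are assumptions added here so that the ordering is fully deterministic; the text specifies only the sort on `cited_by_count` in descending order.

```python
def select_citers(citers, cap):
    """Keep the `cap` most-cited citers.

    Sorts by cited_by_count descending. Ties are broken by the
    (hypothetical) `id` key so reruns always pick the same citers.
    """
    ranked = sorted(citers, key=lambda q: (-q["cited_by_count"], q["id"]))
    return ranked[:cap]


# Illustrative records only; these IDs and counts are made up.
example = [
    {"id": "W3", "cited_by_count": 10},
    {"id": "W1", "cited_by_count": 50},
    {"id": "W2", "cited_by_count": 10},
]
print([q["id"] for q in select_citers(example, cap=2)])  # ['W1', 'W2']
```

Without a tie-break, two citers with equal counts at the cap boundary would be kept or dropped according to input order, so some fixed secondary key is needed for the result to be stable across reruns.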

