Methodology
DataRank v6.0 Citation-Only 1-Hop
DataRank uses a citation-only 1-hop approximation. FAIR/repository and DataCite reuse metadata remain visible for context and auditing, but do not enter the score.
What's a data paper?
Most scientific papers describe a new finding. A data paper is different: its main contribution is the data itself, whether a database, cohort, atlas, benchmark, or genomic resource, that other scientists reuse in their own work. GTEx, TCGA, UK Biobank, the Framingham Heart Study, and the Protein Data Bank are familiar examples.
This distinction matters because DataRank only ranks data papers. A method paper or a review can rack up citations for reasons that have nothing to do with sharing data: it is cited because someone ran the algorithm or summarised the field. To measure data sharing we first have to isolate the papers whose citations actually signal data reuse.
How we identify them: DrPaper
DrPaper is the AI classifier we built for the job. Under the hood it is SciBERT, a 110M-parameter language model pre-trained on scientific text, fine-tuned on roughly 4,000 hand-labeled papers. On validation it reaches an F1 of 0.9153.
We then stress-tested it on 38 known edge cases spanning 15 categories, including consortium genomics, brain atlases, cohort profiles, database updates, benchmarks, clinical databases, and genome resources, and DrPaper got all 38 right. That includes landmark resources like GTEx, TCGA, the Framingham Heart Study, UK Biobank, and the RCSB Protein Data Bank, all correctly flagged as data papers because their citations come primarily from data reuse. It also correctly rejected influential methods papers (AlphaFold, DESeq2, and BWA), which are cited heavily for their algorithms, not their data.
The model, the labeled training set, and the fine-tuning notebook are all open. See Resources for direct links to the HuggingFace model, the Kaggle notebook, and the labeled dataset.
Where this shows up in the product
- Every paper detail page shows a DrPaper confidence score and a data-paper / not-a-data-paper label.
- The leaderboards default to data papers only. You can switch to non-data papers from the "Paper type" filter.
- Author and institution scores sum DataRank over their data papers, so the ranking rewards data sharing rather than total output.
DOIphin: federated metadata aggregator
Before any paper can be ranked, its metadata has to be assembled, and no single source is complete. DOIphin is the engine we built for that job. Given a DOI, it queries 14+ scholarly APIs in parallel and cross-walks the results into one unified record: bibliographic metadata, open-access status, author and institution identifiers, funder/award links, and any datasets the paper deposited or reused.
Crucially, DOIphin also resolves the citation and link graph (the edges between a data paper and the works that cite it), which is the substrate DataRank scores. FAIR and DataCite reuse signals are aggregated here too, shown alongside each paper for context but deliberately kept out of the score (their coverage is too uneven across repositories to rank on).
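As a rough illustration of the "query in parallel, then cross-walk" pattern described above, here is a minimal sketch. It is not DOIphin's code: the three stub sources, the field names, the placeholder DOI, and the "first non-empty value wins" merge rule are all assumptions made for the example, and no real API is called.

```python
import asyncio

# Hypothetical stand-ins for real API clients; each returns partial metadata.
async def fetch_source_a(doi: str) -> dict:
    return {"title": "Example dataset paper", "is_open_access": None}

async def fetch_source_b(doi: str) -> dict:
    return {"title": "", "is_open_access": True, "funders": ["Example Funder"]}

async def fetch_source_down(doi: str) -> dict:
    raise ConnectionError("source unavailable")

def crosswalk(records: list) -> dict:
    """Merge records in priority order: the first non-empty value per field wins."""
    merged = {}
    for record in records:
        for field, value in record.items():
            if field not in merged and value not in (None, "", [], {}):
                merged[field] = value
    return merged

async def aggregate(doi: str, sources: list) -> dict:
    # Run every source concurrently; a failing source must not sink the others.
    results = await asyncio.gather(
        *(source(doi) for source in sources), return_exceptions=True
    )
    usable = [r for r in results if isinstance(r, dict)]
    merged = crosswalk(usable)
    merged["doi"] = doi
    return merged

# "10.1234/example" is a placeholder DOI, not a real record.
record = asyncio.run(
    aggregate("10.1234/example", [fetch_source_a, fetch_source_b, fetch_source_down])
)
```

Because `asyncio.gather` preserves the order of its arguments, the order of the source list doubles as the merge priority in this sketch.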
Sources cross-walked
DOIphin is open source. See Resources for the repository and API.
v6.0: Citation-Only 1-Hop DataRank
Current (May 2026)
Core equations
B(p) is the paper's own citation count on a log scale. N(p) sums the same log-scaled count over the papers that cite it (citers), each weighted by 1 / their reference count, so a citer that cites thousands of papers contributes less per edge than one that cites a handful. Self-citations are removed by matching OpenAlex author IDs. FAIR and DataCite signals are visible alongside the score for context but do not enter it, because their coverage is too uneven across repositories to use as a ranking signal.
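The equations themselves are not reproduced in this copy of the page. One reading that is consistent with the description above, and with the offline fallback stated in the technical reference, is DataRank(p) = (1 − d) · B(p) + d · N(p). The sketch below implements that reading. The natural-log `log1p` form of B, the `Citer` record, and all function names are assumptions for illustration, not the project's actual code.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

@dataclass
class Citer:
    cited_by_count: int    # citations the citing paper itself has received
    reference_count: int   # number of works the citing paper cites

def base_score(cited_by_count: int) -> float:
    """B(p): log-scaled citation count (log1p is an assumed form)."""
    return math.log1p(max(cited_by_count, 0))

def network_score(citers: list[Citer]) -> float:
    """N(p): sum over citers of B(citer) / reference_count(citer)."""
    total = 0.0
    for citer in citers:
        if citer.reference_count > 0:  # skip citers with no recorded references
            total += base_score(citer.cited_by_count) / citer.reference_count
    return total

def datarank(cited_by_count: int, citers: list[Citer], d: float = 0.85) -> float:
    """Assumed combination: (1 - d) * B(p) + d * N(p)."""
    return (1.0 - d) * base_score(cited_by_count) + d * network_score(citers)
```

With an empty citer list, `network_score` returns 0 and the result reduces to (1 − d) · B(p), which matches the documented offline behaviour.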
Scope & defaults
- Damping d = 0.85 (network term weight; the same value used by Google's original PageRank).
- Up to 200 citers fetched per paper, sorted by their own citation count so the most influential are always included.
- Percentile ranking covers data papers only (those identified by DrPaper). Non-data papers receive a DataRank value but no percentile.
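The citer cap above can be pictured as a sort followed by a slice. This is a minimal sketch, assuming each citer is a dict with `id` and `cited_by_count` keys; breaking ties on `id` is an assumption added here to keep the order deterministic, and the function name is hypothetical.

```python
from __future__ import annotations

def select_top_citers(citers: list[dict], max_citers: int = 200) -> list[dict]:
    """Keep the most-cited citers, up to max_citers."""
    # Sort by citation count descending; ties fall back to id (assumed rule).
    ordered = sorted(citers, key=lambda c: (-c["cited_by_count"], c["id"]))
    return ordered[:max_citers]

sample = [
    {"id": "W3", "cited_by_count": 12},
    {"id": "W1", "cited_by_count": 450},
    {"id": "W2", "cited_by_count": 12},
]
top_two = select_top_citers(sample, max_citers=2)
```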
Strengths
- Simple and auditable: score depends only on citation counts and graph structure
- No dependence on DataCite reuse coverage, which is patchy across repositories
- Stable interpretation across sources: no metadata-weight tuning in scoring
- 1-hop model is fast enough for live DOI streaming
Limitations
- Depends on citation coverage and latency in external indexes
- 1-hop approximation omits multi-hop citation propagation
- Scores remain corpus-relative within the data-paper ranking scope
Percentile Ranking
Papers are sorted by DataRank and mapped to a percentile in [0, 100] within the data-paper corpus. Tied scores share the same percentile. The 99th percentile is the top 1%; the lowest-ranked paper sits at 0. Percentiles are refreshed on each full corpus recompute.
Percentile(p) = 100 · |{q : DataRank(q) < DataRank(p)}| / (N − 1), where N is the number of data papers in the corpus. Counting strictly-lower scores in the numerator (rather than less-or-equal) means tied papers share both a rank and a percentile.
Researcher Score
Each researcher's score is the sum of DataRank scores across their data papers indexed in the corpus. This reflects the cumulative data-sharing impact of an author's contributions to the data-paper corpus.
Researcher score = DataRank(p1) + DataRank(p2) + … + DataRank(pk), where p1, p2, …, pk are the author's data papers (those classified as data papers) indexed in the corpus.
Scope
- Author scores reflect only papers indexed in this database, not the author's full publication record
- Co-authors on the same paper currently receive the same paper-level DataRank contribution
- Click on any author to see which papers contributed to their score
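The researcher score described above amounts to a per-author sum in which every co-author receives the paper's full DataRank. A minimal sketch, assuming papers are dicts with `datarank`, `is_data_paper`, and `author_ids` keys (a hypothetical shape, not the database schema):

```python
from __future__ import annotations

from collections import defaultdict

def researcher_scores(papers: list[dict]) -> dict[str, float]:
    """Sum DataRank over each author's data papers."""
    totals: dict[str, float] = defaultdict(float)
    for paper in papers:
        if not paper.get("is_data_paper"):
            continue  # non-data papers do not contribute
        # set() guards against an author being listed twice on one paper.
        for author_id in set(paper.get("author_ids", [])):
            totals[author_id] += paper["datarank"]  # full credit per co-author
    return dict(totals)

scores = researcher_scores([
    {"datarank": 2.0, "is_data_paper": True, "author_ids": ["A1", "A2"]},
    {"datarank": 1.5, "is_data_paper": True, "author_ids": ["A1"]},
    {"datarank": 9.0, "is_data_paper": False, "author_ids": ["A2"]},
])
```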
Computation Audit
Every recompute writes a snapshot row recording the algorithm version (current: datarank_citation_only_1hop_v6), the damping factor used, the corpus size, and the score distribution. The timestamp of the most recent recompute is published in /api/v1/stats as last_computation; full snapshot history lives in the database for audit and is available on request.
Technical reference (for implementers and reviewers)
Tie-aware rank & percentile
Both batch and live paths use the same helper (compute_rank_percentile_from_counts): rank is 1 + |{q : DataRank(q) > DataRank(p)}|, and percentile is 100 · |strictly lower| / (N − 1). Single-item corpora map to percentile 100.0. Tied scores share both rank and percentile.
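The rule above can be sketched with two binary searches over a sorted score list. This is an illustration of the stated formula only; the function name and the sorted-list input are assumptions and do not reflect the signature of compute_rank_percentile_from_counts.

```python
from __future__ import annotations

from bisect import bisect_left, bisect_right

def rank_and_percentile(score: float, sorted_scores: list[float]) -> tuple[int, float]:
    """sorted_scores: all corpus scores in ascending order, including `score`."""
    n = len(sorted_scores)
    strictly_lower = bisect_left(sorted_scores, score)
    strictly_higher = n - bisect_right(sorted_scores, score)
    rank = 1 + strictly_higher
    # A single-item corpus maps to 100.0, which also avoids dividing by zero.
    percentile = 100.0 if n <= 1 else 100.0 * strictly_lower / (n - 1)
    return rank, percentile

corpus = [1.0, 2.0, 2.0, 5.0]
top = rank_and_percentile(5.0, corpus)      # highest score
tied = rank_and_percentile(2.0, corpus)     # both 2.0 papers get this result
bottom = rank_and_percentile(1.0, corpus)   # lowest score
```

Because both counts are strict, the two papers scoring 2.0 get the same rank and the same percentile.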
Self-citation filter
Citers are deduplicated against the seed by OpenAlex author-ID set overlap (app/engine/openalex_graph.py::filter_self_citations). A citer is dropped if it shares any author with the seed.
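A minimal sketch of the overlap rule follows. It assumes each citer is a dict carrying an `author_ids` list of OpenAlex author IDs. The function name and data shape are hypothetical and are not the signature of filter_self_citations.

```python
from __future__ import annotations

def drop_self_citations(seed_author_ids: set[str], citers: list[dict]) -> list[dict]:
    """Keep only citers that share no author with the seed paper."""
    kept = []
    for citer in citers:
        citer_ids = set(citer.get("author_ids", []))
        if seed_author_ids.isdisjoint(citer_ids):  # no shared author at all
            kept.append(citer)
    return kept

seed_authors = {"A1", "A2"}
independent = drop_self_citations(seed_authors, [
    {"id": "W1", "author_ids": ["A2", "A9"]},  # shares A2, so it is dropped
    {"id": "W2", "author_ids": ["A7"]},
    {"id": "W3"},                              # no author data: kept in this sketch
])
```

How citers with missing author data are treated is not stated in the source; keeping them is a choice made for this example.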
Offline fallback
If fetch_citer_neighbourhood returns nothing (network failure, OpenAlex 5xx exhaustion, or a paper outside the data-only batch fetch scope), the engine sets N(p) = 0 and DataRank(p) = (1 − d) · B(p). This is the contract every code path preserves.
Configuration knobs
- DATARANK_DAMPING (default 0.85): damping factor d. Must be in (0, 1).
- DATARANK_MAX_CITERS (default 200): flat citer cap per paper.
- DATARANK_DIVERSITY_GAMMA (default 0, off): opt-in citer-field diversity multiplier. When > 0, N(p) is scaled by 1 + γ · H_norm, where H_norm is the normalised Shannon entropy of the citers' OpenAlex primary_topic.field distribution.
- DATARANK_LANDMARK_ELBOW (default 0, off): opt-in additive log term past a citation threshold for landmark resources.
- DATARANK_ADAPTIVE_CITER_CAP_MAX (default 0, off): opt-in per-paper cap scaling with the seed's cited_by_count.
In v5 the bounded reuse multiplier R(p) was active by default (DATARANK_REUSE_LAMBDA=0.15, DATARANK_REUSE_CAP=25). It was removed in v6 because DataCite reuse coverage is too uneven across repositories, and the env vars are no longer read.
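The opt-in diversity multiplier (DATARANK_DIVERSITY_GAMMA) can be sketched as below. The source does not say what the entropy is normalised by; this example divides by the log of the number of distinct fields observed among the citers, which is an assumption, as is the function name.

```python
from __future__ import annotations

import math
from collections import Counter

def diversity_factor(citer_fields: list[str], gamma: float = 0.0) -> float:
    """Return 1 + gamma * H_norm, or 1.0 when the feature is off."""
    if gamma <= 0 or not citer_fields:
        return 1.0
    counts = Counter(citer_fields)
    k = len(counts)
    if k < 2:
        return 1.0  # a single field carries zero entropy
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    h_norm = entropy / math.log(k)  # assumed normaliser: log of observed field count
    return 1.0 + gamma * h_norm

# Two citers split evenly across two fields gives H_norm = 1.
factor = diversity_factor(["Medicine", "Computer Science"], gamma=0.5)
```

Under this normalisation the factor stays within [1, 1 + γ], so the multiplier can only raise N(p), never lower it.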
Snapshot schema
Each recompute appends one row to datarank_snapshots with: algorithm, damping_factor, paper_count, mode (online|offline), mean_datarank, median_datarank, max_datarank, and an extra_metadata JSON column carrying every hyperparameter, the fetch scope (data_only_fetch, papers_fetched), and the algorithm-family version string.
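For orientation, a snapshot row with the columns listed above could be assembled roughly as follows. The column names come from the text; the function name, the zero defaults for an empty corpus, and the contents of the example extra_metadata dict are assumptions, and nothing here writes to a database.

```python
from __future__ import annotations

import json
import statistics

def build_snapshot_row(scores: list[float], damping: float, mode: str, extra: dict) -> dict:
    """Assemble one datarank_snapshots row as a plain dict (no database write)."""
    if mode not in ("online", "offline"):
        raise ValueError("mode must be 'online' or 'offline'")
    return {
        "algorithm": "datarank_citation_only_1hop_v6",
        "damping_factor": damping,
        "paper_count": len(scores),
        "mode": mode,
        # Zero for an empty corpus is an assumption of this sketch.
        "mean_datarank": statistics.mean(scores) if scores else 0.0,
        "median_datarank": statistics.median(scores) if scores else 0.0,
        "max_datarank": max(scores) if scores else 0.0,
        "extra_metadata": json.dumps(extra, sort_keys=True),
    }

row = build_snapshot_row(
    [1.0, 2.0, 6.0],
    damping=0.85,
    mode="online",
    extra={"max_citers": 200, "data_only_fetch": True, "papers_fetched": 3},
)
```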
Where to read the code
- backend/app/engine/datarank.py: scoring helpers (compute_sscore, compute_citer_quality_sum, compute_citer_diversity_factor) and the ALGORITHM_ID constant.
- backend/app/engine/endowment.py: base-score curve (log1p + optional landmark elbow).
- backend/app/engine/openalex_graph.py: citer fetch, deterministic sort by cited_by_count:desc, self-citation filter.
- backend/ingestion/compute_datarank_db.py: canonical batch recompute path (data-only fetch by default).
- backend/app/services/corpus_rank_service.py: cached tie-aware rank/percentile helper used by both batch and live paths.
See it in action
Search any DOI and get a DataRank score, corpus percentile, and base-vs-network breakdown in seconds.