🏆 Finalist: NIH Data Sharing Index (“S-Index”) Challenge

Methodology

DataRank v6.0 Citation-Only 1-Hop

DataRank uses a citation-only 1-hop approximation. FAIR/repository and DataCite reuse metadata remain visible for context and auditing, but do not enter the score.


What's a data paper?

Most scientific papers describe a new finding. A data paper is different: its main contribution is the data itself, such as a database, cohort, atlas, benchmark, or genomic resource that other scientists reuse in their own work. GTEx, TCGA, UK Biobank, the Framingham Heart Study, and the Protein Data Bank are familiar examples.

This distinction matters because DataRank only ranks data papers. A method paper or a review can rack up citations for reasons that have nothing to do with sharing data: it is cited because someone ran the algorithm or summarised the field. To measure data sharing, we first have to isolate the papers whose citations actually signal data reuse.

How we identify them: DrPaper

DrPaper is the AI classifier we built for the job. Under the hood it is SciBERT (a 110M-parameter language model pre-trained on scientific text), fine-tuned on roughly 4,000 hand-labeled papers. On validation it reaches an F1 of 0.9153.

We then stress-tested it on 38 known edge cases spanning 15 categories, including consortium genomics, brain atlases, cohort profiles, database updates, benchmarks, clinical databases, and genome resources. DrPaper got all 38 right. That includes landmark resources like GTEx, TCGA, the Framingham Heart Study, UK Biobank, and the RCSB Protein Data Bank, all correctly flagged as data papers because their citations come primarily from data reuse. It also correctly rejected influential methods papers (AlphaFold, DESeq2, and BWA), which are cited heavily for their algorithms, not their data.

The model, the labeled training set, and the fine-tuning notebook are all open. See Resources for direct links to the HuggingFace model, the Kaggle notebook, and the labeled dataset.

Where this shows up in the product

  • Every paper detail page shows a DrPaper confidence score and a data-paper / not-a-data-paper label.
  • The leaderboards default to data papers only. You can switch to non-data papers from the “Paper type” filter.
  • Author and institution scores sum DataRank over their data papers, so the ranking rewards data sharing rather than total output.
🐬

DOIphin: federated metadata aggregator

Before any paper can be ranked, its metadata has to be assembled, and no single source is complete. DOIphin is the engine we built for that job. Given a DOI, it queries 14+ scholarly APIs in parallel and cross-walks the results into one unified record: bibliographic metadata, open-access status, author and institution identifiers, funder/award links, and any datasets the paper deposited or reused.

Crucially, DOIphin also resolves the citation and link graph (the edges between a data paper and the works that cite it), which is the substrate DataRank scores. FAIR and DataCite reuse signals are aggregated here too, shown alongside each paper for context but deliberately kept out of the score (their coverage is too uneven across repositories to rank on).
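The parallel query and cross-walk step can be pictured with a minimal sketch. Everything here is hypothetical: the stub fetchers stand in for real API clients, the field names are invented, and the "first source listed wins" merge rule is an assumption made for illustration, not DOIphin's documented precedence logic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]
Fetcher = Callable[[str], Optional[Record]]


def fetch_crossref_stub(doi: str) -> Optional[Record]:
    # Hypothetical stand-in for a real API client; returns canned data.
    return {"title": "Example data paper", "funder": None}


def fetch_openalex_stub(doi: str) -> Optional[Record]:
    # Hypothetical stand-in for a real API client; returns canned data.
    return {"title": "Example Data Paper", "cited_by_count": 120, "funder": "NIH"}


def fetch_failing_stub(doi: str) -> Optional[Record]:
    # Simulates a source that is down.
    raise TimeoutError("source unavailable")


def safe_call(fetcher: Fetcher, doi: str) -> Optional[Record]:
    """Run one fetcher; a failing source yields None instead of aborting the merge."""
    try:
        return fetcher(doi)
    except Exception:
        return None


def crosswalk(doi: str, sources: List[Tuple[str, Fetcher]]) -> Record:
    """Query every source in parallel, then merge field by field.

    Assumption: sources are listed in priority order and the first
    non-empty value for a field wins.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        results = list(pool.map(lambda src: safe_call(src[1], doi), sources))

    provenance: Dict[str, str] = {}
    unified: Record = {"doi": doi, "provenance": provenance}
    for (name, _), record in zip(sources, results):
        if not record:
            continue
        for field, value in record.items():
            if value is None or field in unified:
                continue
            unified[field] = value
            provenance[field] = name  # remember which source supplied the field
    return unified
```

Recording provenance per field keeps the unified record auditable: a reader can see which source supplied each value.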

Sources cross-walked

CrossRef, OpenAlex, DataCite, Zenodo, Dryad, NIH RePORTER, PubMed, ROR, ORCID, and more

DOIphin is open source. See Resources for the repository and API.

v6.0: Citation-Only 1-Hop DataRank

Current

May 2026

Core equations

B(p) is the paper's own citation count on a log scale. N(p) sums the same log-scaled count over the papers that cite it (citers), each weighted by 1 / its reference count, so a citer that cites thousands of papers contributes less per edge than one that cites a handful. Self-citations are removed by matching OpenAlex author IDs. FAIR and DataCite signals are visible alongside the score for context but do not enter it, because their coverage is too uneven across repositories to use as a ranking signal.
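Read literally, the description above corresponds to the form below. This is a reconstruction from the prose, from the log1p base curve named in the technical reference, and from the offline-fallback contract DataRank(p) = (1 − d) · B(p) when N(p) = 0. It is not a copy of the published equations. The symbols c_p (citation count of p), C(p) (citers of p after self-citation removal, subject to the citer cap), and r_q (reference count of citer q) are notation introduced here.

```latex
\begin{aligned}
B(p) &= \log\bigl(1 + c_p\bigr) \\
N(p) &= \sum_{q \in C(p)} \frac{B(q)}{r_q} \\
\mathrm{DataRank}(p) &= (1 - d)\,B(p) + d\,N(p)
\end{aligned}
```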

Scope & defaults

  • Damping d = 0.85 (network term weight; the same value used by Google's original PageRank).
  • Up to 200 citers fetched per paper, sorted by their own citation count so the most influential are always included.
  • Percentile ranking covers data papers only (those identified by DrPaper). Non-data papers receive a DataRank value but no percentile.
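A minimal sketch of how these defaults could combine into a score, assuming the base curve is log1p, the network term is the sum of each citer's base score divided by its reference count, and the two terms are mixed as (1 − d) · B(p) + d · N(p). The function names are illustrative and are not the helpers in the repository.

```python
import math
from typing import Iterable, Tuple

DAMPING = 0.85     # default d from the list above
MAX_CITERS = 200   # default citer cap from the list above


def base_score(citation_count: int) -> float:
    """B(p): log-scaled citation count (log1p, so zero citations score 0)."""
    return math.log1p(citation_count)


def network_score(citers: Iterable[Tuple[int, int]],
                  max_citers: int = MAX_CITERS) -> float:
    """N(p) over (citation_count, reference_count) pairs, one pair per citer.

    Citers are sorted by their own citation count, most cited first, and
    truncated to the cap. Citers with no recorded references are skipped
    to avoid dividing by zero (an assumption of this sketch).
    """
    top = sorted(citers, key=lambda c: c[0], reverse=True)[:max_citers]
    return sum(base_score(cites) / refs for cites, refs in top if refs > 0)


def datarank(citation_count: int,
             citers: Iterable[Tuple[int, int]] = (),
             d: float = DAMPING) -> float:
    """(1 - d) * B(p) + d * N(p); with no citers this reduces to (1 - d) * B(p)."""
    return (1 - d) * base_score(citation_count) + d * network_score(citers)
```

With an empty citer list the network term is zero, which matches the offline-fallback behaviour described in the technical reference.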

Strengths

  • Simple and auditable: score depends only on citation counts and graph structure
  • No dependence on DataCite reuse coverage, which is patchy across repositories
  • Stable interpretation across sources: no metadata-weight tuning in scoring
  • 1-hop model is fast enough for live DOI streaming

Limitations

  • Depends on citation coverage and latency in external indexes
  • 1-hop approximation omits multi-hop citation propagation
  • Scores remain corpus-relative within the data-paper ranking scope

Percentile Ranking

Papers are sorted by DataRank and mapped to a percentile in [0, 100] within the data-paper corpus. Tied scores share the same percentile. The 99th percentile is the top 1%; the lowest-ranked paper sits at 0. Percentiles are refreshed on each full corpus recompute.

percentile(p) = 100 · |{q : DataRank(q) < DataRank(p)}| / (N − 1)

where N is the number of data papers in the corpus. Counting strictly lower scores in the numerator (rather than less-or-equal) keeps tied papers on the same percentile and puts the lowest-ranked paper at exactly 0.


Researcher Score

Each researcher's score is the sum of DataRank scores across their data papers indexed in the corpus. This reflects the cumulative data-sharing impact of an author's contributions to the data-paper corpus.

ResearcherScore(a) = DataRank(p1) + DataRank(p2) + … + DataRank(pk)

where p1, p2, …, pk are author a's papers that DrPaper classifies as data papers and that are indexed in the corpus.
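As a small illustration of the sum, assuming each indexed paper carries its DataRank and a data-paper flag from DrPaper. The ScoredPaper type and its field names are hypothetical, and the DOIs are placeholders.

```python
from typing import Iterable, NamedTuple


class ScoredPaper(NamedTuple):
    doi: str
    datarank: float
    is_data_paper: bool  # DrPaper label


def researcher_score(papers: Iterable[ScoredPaper]) -> float:
    """Sum DataRank over the author's data papers; other papers contribute nothing."""
    return sum(p.datarank for p in papers if p.is_data_paper)


papers = [
    ScoredPaper("10.1234/atlas", 1.5, True),
    ScoredPaper("10.1234/method", 2.0, False),  # excluded: not a data paper
    ScoredPaper("10.1234/cohort", 0.25, True),
]
print(researcher_score(papers))  # 1.75
```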

Scope

  • Author scores reflect only papers indexed in this database, not the author's full publication record
  • Co-authors on the same paper currently receive the same paper-level DataRank contribution
  • Click on any author to see which papers contributed to their score

Computation Audit

Every recompute writes a snapshot row recording the algorithm version (current: datarank_citation_only_1hop_v6), the damping factor used, the corpus size, and the score distribution. The timestamp of the most recent recompute is published in /api/v1/stats as last_computation; full snapshot history lives in the database for audit and is available on request.

Technical reference (for implementers and reviewers)

Tie-aware rank & percentile

Both batch and live paths use the same helper (compute_rank_percentile_from_counts): rank is 1 + |{q : DataRank(q) > DataRank(p)}|, percentile is 100 · |{q : DataRank(q) < DataRank(p)}| / (N − 1). Single-item corpora map to percentile 100.0. Tied scores share both rank and percentile.
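The rule above can be sketched in a few lines. This is an illustrative stand-in rather than the repository's compute_rank_percentile_from_counts. The function name and signature are assumptions; only the rank and percentile formulas and the single-item case come from the text.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def rank_and_percentile(score: float, corpus_scores: List[float]) -> Tuple[int, float]:
    """Tie-aware rank and percentile of `score` within `corpus_scores`.

    rank       = 1 + number of strictly higher scores
    percentile = 100 * number of strictly lower scores / (N - 1)
    """
    ordered = sorted(corpus_scores)
    n = len(ordered)
    strictly_lower = bisect_left(ordered, score)        # items < score
    strictly_higher = n - bisect_right(ordered, score)  # items > score
    rank = 1 + strictly_higher
    percentile = 100.0 if n <= 1 else 100.0 * strictly_lower / (n - 1)
    return rank, percentile


corpus = [1.0, 2.0, 2.0, 3.0, 5.0]
print(rank_and_percentile(2.0, corpus))  # (3, 25.0): both tied papers get this pair
print(rank_and_percentile(5.0, corpus))  # (1, 100.0)
```

Because both counts use strict comparisons, every paper in a tie sees the same number of higher and lower scores, which is why ties share rank and percentile.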

Self-citation filter

Citers are deduplicated against the seed by OpenAlex author-ID set overlap (app/engine/openalex_graph.py::filter_self_citations). A citer is dropped if it shares any author with the seed.
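A sketch of the overlap test, assuming each citer is a dict with an "author_ids" list. That key, the function name, and the short IDs are hypothetical; the real filter lives in filter_self_citations. Keeping citers that have no author IDs is also an assumption of this sketch.

```python
from typing import Dict, Iterable, List, Set


def drop_self_citations(seed_author_ids: Set[str],
                        citers: Iterable[Dict]) -> List[Dict]:
    """Keep only citers that share no author with the seed paper."""
    kept = []
    for citer in citers:
        citer_authors = set(citer.get("author_ids", []))
        # One shared author is enough to treat the citation as a self-citation.
        if seed_author_ids.isdisjoint(citer_authors):
            kept.append(citer)
    return kept


seed_authors = {"A1", "A2"}
citers = [
    {"id": "W10", "author_ids": ["A3", "A4"]},  # independent: kept
    {"id": "W11", "author_ids": ["A2", "A5"]},  # shares A2: dropped
    {"id": "W12", "author_ids": []},            # no author IDs: kept (assumption)
]
print([c["id"] for c in drop_self_citations(seed_authors, citers)])  # ['W10', 'W12']
```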

Offline fallback

If fetch_citer_neighbourhood returns nothing (network failure, OpenAlex 5xx exhaustion, or a paper outside the data-only batch fetch scope), the engine sets N(p) = 0 and DataRank(p) = (1 − d) · B(p). This is the contract every code path preserves.

Configuration knobs

  • DATARANK_DAMPING (default 0.85): damping factor d. Must be in (0, 1).
  • DATARANK_MAX_CITERS (default 200): flat citer cap per paper.
  • DATARANK_DIVERSITY_GAMMA (default 0, off): opt-in citer-field diversity multiplier. When > 0, N(p) is scaled by 1 + γ · H_norm, where H_norm is the normalised Shannon entropy of citers' OpenAlex primary_topic.field distribution.
  • DATARANK_LANDMARK_ELBOW (default 0, off): opt-in additive log term past a citation threshold for landmark resources.
  • DATARANK_ADAPTIVE_CITER_CAP_MAX (default 0, off): opt-in per-paper cap scaling with seed cited_by_count.
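The diversity multiplier behind DATARANK_DIVERSITY_GAMMA can be sketched as follows. The text does not say how the entropy is normalised, so dividing by the log of the number of distinct observed fields is an assumption. The function name and the handling of missing fields are also illustrative.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def diversity_factor(citer_fields: Iterable[Optional[str]], gamma: float) -> float:
    """Return 1 + gamma * H_norm for the citers' field distribution.

    Assumption: H_norm = H / log(K), where K is the number of distinct
    fields observed among the citers. Citers with no field are ignored.
    """
    counts = Counter(field for field in citer_fields if field)
    if gamma <= 0 or len(counts) < 2:
        return 1.0  # switched off, or no diversity to measure
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    h_norm = entropy / math.log(len(counts))
    return 1.0 + gamma * h_norm


# Two fields, evenly split: H_norm is 1, so the factor is 1 + gamma.
print(diversity_factor(["Medicine", "Computer Science"], gamma=0.5))
# A single field carries no diversity, so N(p) is left unscaled.
print(diversity_factor(["Medicine", "Medicine"], gamma=0.5))
```

Under this normalisation the multiplier stays within [1, 1 + γ], so the knob bounds how much field diversity can inflate the network term.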

In v5 the bounded reuse multiplier R(p) was active by default (DATARANK_REUSE_LAMBDA=0.15, DATARANK_REUSE_CAP=25). It was removed in v6 because DataCite reuse coverage is too uneven across repositories, and the env vars are no longer read.

Snapshot schema

Each recompute appends one row to datarank_snapshots with: algorithm, damping_factor, paper_count, mode (online|offline), mean_datarank, median_datarank, max_datarank, and an extra_metadata JSON column carrying every hyperparameter, the fetch scope (data_only_fetch, papers_fetched), and the algorithm-family version string.
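To make the row shape concrete, here is a sketch that assembles such a row as a plain dict. The column names come from the description above. The function name, the zero values for an empty corpus, and the layout inside extra_metadata are assumptions.

```python
import statistics
from typing import Any, Dict, List

ALGORITHM = "datarank_citation_only_1hop_v6"  # version string quoted in the audit section


def build_snapshot_row(scores: List[float], damping: float, mode: str,
                       extra_metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Summarise one recompute as a dict keyed by the snapshot column names."""
    if mode not in ("online", "offline"):
        raise ValueError("mode must be 'online' or 'offline'")
    has_scores = bool(scores)
    return {
        "algorithm": ALGORITHM,
        "damping_factor": damping,
        "paper_count": len(scores),
        "mode": mode,
        # Assumption: an empty corpus records 0.0 for each summary statistic.
        "mean_datarank": statistics.fmean(scores) if has_scores else 0.0,
        "median_datarank": statistics.median(scores) if has_scores else 0.0,
        "max_datarank": max(scores) if has_scores else 0.0,
        "extra_metadata": dict(extra_metadata),  # copy so the caller's dict is not shared
    }


row = build_snapshot_row(
    [1.0, 2.0, 6.0],
    damping=0.85,
    mode="online",
    extra_metadata={"data_only_fetch": True, "papers_fetched": 3},
)
print(row["mean_datarank"], row["median_datarank"], row["max_datarank"])  # 3.0 2.0 6.0
```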

Where to read the code

  • backend/app/engine/datarank.py: scoring helpers (compute_sscore, compute_citer_quality_sum, compute_citer_diversity_factor) and the ALGORITHM_ID constant.
  • backend/app/engine/endowment.py: base-score curve (log1p + optional landmark elbow).
  • backend/app/engine/openalex_graph.py: citer fetch, deterministic sort by cited_by_count:desc, self-citation filter.
  • backend/ingestion/compute_datarank_db.py: canonical batch recompute path (data-only fetch by default).
  • backend/app/services/corpus_rank_service.py: cached tie-aware rank/percentile helper used by both batch and live paths.

Source: github.com/zehrakorkusuz/sindex-portal.

See it in action

Search any DOI and get a DataRank score, corpus percentile, and base-vs-network breakdown in seconds.