Open Source

Resources & Artifacts

Every component of the DataRank pipeline is open. Explore our notebooks, datasets, APIs, and models, all freely available for research and educational use.

View on GitHub

Open-source artifacts

4,000+ labeled papers

14+ metadata sources

MIT License

Model

DrPaper — fine-tuned classifier

Our SciBERT-based data-paper classifier. It distinguishes papers whose main contribution is a shared dataset (cohort, atlas, benchmark, database) from methods, theory, and review papers. It reaches a validation F1 of 0.9153 and passes all 38 known edge cases: it accepts data resources such as GTEx, TCGA, UK Biobank, and the RCSB Protein Data Bank, and correctly rejects AlphaFold, DESeq2, and BWA.
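For readers less familiar with the metric, the sketch below shows how an F1 score for the positive class (data paper) is computed from confusion counts. The counts are hypothetical and chosen only for illustration; they are not DrPaper's actual validation results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (NOT taken from the DrPaper validation run):
# 90 data papers found, 10 false alarms, 5 data papers missed.
example_f1 = f1_score(tp=90, fp=10, fn=5)
print(round(example_f1, 4))  # 0.9231
```

F1 penalizes both false alarms and misses, which matters here because accepting a methods paper and missing a true data paper are both costly errors.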

View on HuggingFace

Notebook

DrPaper training notebook

End-to-end notebook that fine-tunes SciBERT into DrPaper on the 4K labeled corpus. Useful if you want to retrain, swap the base model, or audit the training procedure.

View on Kaggle

Dataset

4K labeled training set

More than 4,000 papers hand-labeled as data paper or not, plus a second-phase set of 38 hard edge cases spanning 15 categories (including consortium genomics, brain atlases, cohort profiles, database updates, benchmarks, clinical databases, and genome resources) used to validate DrPaper. Curated from GigaScience, Dryad, and PubMed.
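An edge-case set like this is typically scored per category, so that a failure in one category is not hidden by successes in the others. The sketch below shows one way to do that. The `EdgeCase` structure, the category assigned to each example, and the `stand_in_predict` lookup are all assumptions made for illustration; the lookup is a placeholder and is not the DrPaper model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class EdgeCase(NamedTuple):
    name: str            # short name of the paper or resource
    category: str        # edge-case category (assignment here is illustrative)
    is_data_paper: bool  # expected label


def evaluate_by_category(
    cases: Iterable[EdgeCase],
    predict: Callable[[str], bool],
) -> Dict[str, Tuple[int, int]]:
    """Return {category: (passed, total)} for a boolean classifier."""
    tally = defaultdict(lambda: [0, 0])
    for case in cases:
        counts = tally[case.category]
        counts[1] += 1
        if predict(case.name) == case.is_data_paper:
            counts[0] += 1
    return {category: (passed, total) for category, (passed, total) in tally.items()}


# A handful of illustrative cases; the real set has 38 cases in 15 categories.
CASES = [
    EdgeCase("GTEx", "consortium genomics", True),
    EdgeCase("TCGA", "consortium genomics", True),
    EdgeCase("UK Biobank", "cohort profiles", True),
    EdgeCase("AlphaFold", "method (negative)", False),
    EdgeCase("DESeq2", "method (negative)", False),
]

# Placeholder predictor: a fixed lookup standing in for a real classifier.
KNOWN_DATA_RESOURCES = {"GTEx", "TCGA", "UK Biobank"}


def stand_in_predict(name: str) -> bool:
    return name in KNOWN_DATA_RESOURCES


report = evaluate_by_category(CASES, stand_in_predict)
for category, (passed, total) in report.items():
    print(f"{category}: {passed}/{total}")
```

To audit a real model, replace `stand_in_predict` with a function that runs the classifier on the paper's title and abstract and returns its predicted label.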

View on Kaggle

API

🐬 DOIphin — federated metadata aggregator

Our open-source aggregation engine. For any DOI, it cross-walks metadata across 14+ scholarly sources (CrossRef, OpenAlex, DataCite, Zenodo, Dryad, and more) into one unified record and builds the citation/link graph that DataRank scores. It is the upstream enrichment service that powers the ingestion pipeline.
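As a rough illustration of what cross-walking into a unified record can involve, the sketch below merges per-source metadata for one DOI using a fixed source priority and records which source supplied each field. The priority order, field names, sample records, and function names are all hypothetical; this is not DOIphin's actual code or schema.

```python
from typing import Any, Dict

# Hypothetical priority: earlier sources win when two sources disagree.
SOURCE_PRIORITY = ["crossref", "datacite", "openalex"]


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common resolver prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_records(records: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-source records into one, keeping the first non-empty value."""
    unified: Dict[str, Any] = {}
    provenance: Dict[str, str] = {}
    for source in SOURCE_PRIORITY:
        for field, value in records.get(source, {}).items():
            if value in (None, "", []):
                continue  # skip empty values so a later source can fill the gap
            if field not in unified:
                unified[field] = value
                provenance[field] = source
    unified["_provenance"] = provenance
    return unified


# Made-up records for a made-up DOI, for illustration only.
raw = {
    "crossref": {"doi": "https://doi.org/10.1234/EXAMPLE", "title": "An Example Dataset", "license": ""},
    "datacite": {"doi": "10.1234/example", "license": "CC-BY-4.0"},
    "openalex": {"title": "An example dataset", "cited_by_count": 12},
}
for record in raw.values():
    if "doi" in record:
        record["doi"] = normalize_doi(record["doi"])

unified = merge_records(raw)
print(unified["title"], unified["license"], unified["_provenance"]["license"])
```

Keeping a provenance map alongside the merged fields makes it possible to trace any value in the unified record back to the source that supplied it.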

View on GitHub

How to cite

If you use any of these resources in your research, please cite our work.

@misc{thesindex2026,
  title   = {DataRank v6.0: Citation-Only 1-Hop Scoring for Scholarly Influence},
  author  = {Korkusuz, Zehra and Huang, Kuan-lin and Edmunds, Scott C.},
  year    = {2026},
  url     = {https://thesindex.org}
}

Build with DataRank

Use our open API, datasets, and models to power your own research tools and analyses.

Try DataRank

Read the methodology