Open Source
Resources & Artifacts
Every component of the DataRank pipeline is open. Explore our notebooks, datasets, APIs, and models, all freely available for research and educational use.
4 open-source artifacts
4,000+ labeled papers
14 metadata sources
MIT license
DrPaper: fine-tuned classifier
Our SciBERT-based data-paper classifier. Distinguishes papers whose main contribution is a shared dataset (cohort, atlas, benchmark, database) from methods, theory, and review papers. Validation F1 = 0.9153; passes all 38 known edge cases including GTEx, TCGA, UK Biobank, and the RCSB Protein Data Bank; correctly rejects AlphaFold, DESeq2, and BWA.
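The edge-case check described above can be pictured as a small regression harness. The following is a minimal sketch, not the project's actual validation code: `failed_edge_cases` and `toy_classifier` are hypothetical names, the case list contains only the seven examples named in the text (not the full set of 38), and the stand-in classifier is a hard-coded lookup rather than DrPaper itself.

```python
from typing import Callable, List, Tuple

# Edge cases as (paper name, expected "is a data paper" label).
# Names come from the description above; labels follow its accept/reject split.
EDGE_CASES: List[Tuple[str, bool]] = [
    ("GTEx", True),
    ("TCGA", True),
    ("UK Biobank", True),
    ("RCSB Protein Data Bank", True),
    ("AlphaFold", False),
    ("DESeq2", False),
    ("BWA", False),
]


def failed_edge_cases(
    classify: Callable[[str], bool],
    cases: List[Tuple[str, bool]],
) -> List[str]:
    """Return the names of cases where the classifier disagrees with the expected label."""
    return [name for name, expected in cases if classify(name) != expected]


def toy_classifier(name: str) -> bool:
    """Stand-in for illustration only; this is NOT DrPaper."""
    return name not in {"AlphaFold", "DESeq2", "BWA"}


failures = failed_edge_cases(toy_classifier, EDGE_CASES)
print(f"{len(EDGE_CASES) - len(failures)}/{len(EDGE_CASES)} edge cases passed")
```

To audit a real model, one would presumably replace `toy_classifier` with a function that runs the model on each paper's title and abstract and thresholds the predicted probability.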
DrPaper training notebook
End-to-end notebook that fine-tunes SciBERT into DrPaper on the 4K labeled corpus. Useful if you want to retrain, swap the base model, or audit the training procedure.
4K labeled training set
4,000+ papers hand-labeled as data paper or not, plus a second phase of 38 hard edge cases across 15 categories (including consortium genomics, brain atlases, cohort profiles, database updates, benchmarks, clinical databases, and genome resources) used to validate DrPaper. Curated from GigaScience, Dryad, and PubMed sources.
DOIphin: federated metadata aggregator
Our open-source aggregation engine. For any DOI, it cross-walks metadata across 14+ scholarly sources (CrossRef, OpenAlex, DataCite, Zenodo, Dryad, and more) into one unified record and builds the citation/link graph that DataRank scores. It is the upstream enrichment service that powers the ingestion pipeline.
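To make the idea of cross-walking concrete, here is a minimal sketch of merging per-source metadata for one DOI into a single record. It rests on assumptions: the text does not describe DOIphin's actual merge rules, so the priority order, the `crosswalk` function, the field names, and the sample records are all hypothetical. The rule illustrated is "first non-empty value wins, in source priority order", with provenance kept for each field.

```python
from typing import Dict, List, Optional

# Assumed priority order, for illustration only; not DOIphin's documented behavior.
SOURCE_PRIORITY: List[str] = ["CrossRef", "OpenAlex", "DataCite"]


def crosswalk(
    records: Dict[str, Dict[str, Optional[str]]],
    priority: List[str] = SOURCE_PRIORITY,
) -> Dict[str, Dict[str, str]]:
    """Merge per-source metadata into one record.

    For each field, keep the first non-empty value found when walking
    the sources in priority order, and remember which source supplied it.
    """
    unified: Dict[str, Dict[str, str]] = {}
    for source in priority:
        for field, value in records.get(source, {}).items():
            if value and field not in unified:
                unified[field] = {"value": value, "source": source}
    return unified


# Made-up records for a single DOI (hypothetical field names and values).
records = {
    "CrossRef": {"title": "Example dataset paper", "publisher": None},
    "OpenAlex": {"title": "Example Dataset Paper", "publisher": "Example Press"},
    "DataCite": {"license": "CC-BY-4.0"},
}

unified = crosswalk(records)
for field, entry in unified.items():
    print(f"{field}: {entry['value']} (from {entry['source']})")
```

Keeping the source next to each value makes conflicts between aggregators easy to audit later, which matters when the unified record feeds a scoring step.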
How to cite
If you use any of these resources in your research, please cite our work.
@misc{thesindex2026,
title = {DataRank v6.0: Citation-Only 1-Hop Scoring for Scholarly Influence},
author = {Korkusuz, Zehra and Huang, Kuan-lin and Edmunds, Scott C.},
year = {2026},
url = {https://thesindex.org}
}
Build with DataRank
Use our open API, datasets, and models to power your own research tools and analyses.