Demo corpus. Scores are computed on a select set of biomedical papers and datasets and may be inaccurate for papers outside this corpus, because DataRank relies on network effects that improve with scale. We aim to expand this into a fully open resource pending additional funding.

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

PeerJ (2016) · 10.7717/peerj.1621 · Source: DataRank Database

Cross-platform normalization of microarray and RNA-seq data for machine learning applications is a dataset published in PeerJ (2016). On the S index it has a DataRank of 3.7, placing it in the top 30.8% of the data-sharing corpus. It has been cited 102 times, with 87 citing works in its 1-hop citation network. Its calibrated FAIR score is 35/100.

DataRank 3.7 · Top 31% (percentile)

Dataset · Open Access · 102 citations · base score 4.6

datarank_citation_only_1hop_v6 · scope data_only · Methodology

Abstract

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.

Data sources & pipeline

Pipeline: Metadata → Data-paper check → Enrichment → Citation network → Scoring

Enrichment: Pending

FAIR Checklist

Context only (not used in score)

Findable (1/2)

Has DOI

Accessible (1/2)

Open Access

Interoperable (0/2)

Reusable (1/3)

Dataset classification

FAIR checklist signals are shown for context only and do not affect DataRank scoring.

FAIR score: 35

Top 83% by FAIR · LLM-assessed · ✓ full text read

Calibrated FAIR score — a parallel quality metric, independent of the DataRank citation score. See the full evaluation →

DataRank Breakdown

Base Score 19% · Citation Network 81%

Base Score Contribution: 0.692 (from this paper's citation signal)

Citation Network Contribution: 3.0 (from 72 citing papers with measurable signal)

Top 5 citers driving the network score

Ranked by citation count — the same ordering the engine uses when summing log1p(C_q) over citers.

Regression Shrinkage and Selection Via the Lasso
Journal of the Royal Statistical Society Series B: Statistical Methodology (1996) · 51,197 citations · DataRank 1.6
limma powers differential expression analyses for RNA-sequencing and microarray studies
Nucleic Acids Research (2015) · 42,254 citations · DataRank 1.6
Comprehensive molecular portraits of human breast tumours
Nature (2012) · 12,279 citations · DataRank 1.4
A comparison of normalization methods for high density oligonucleotide array data based on variance and bias
Bioinformatics (2003) · 8,412 citations · DataRank 1.4
Missing value estimation methods for DNA microarrays
Bioinformatics (2001) · 4,216 citations · DataRank 1.3

Why this DataRank?

DataRank blends this paper's own citation count with the influence of the papers that cite it. Here, roughly 19% comes from its base citations and 81% from the citation network (72 citing papers contributed measurable signal).
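The percentage split can be reproduced from the two contribution values shown on this page. The short sketch below is illustrative only: the function name `contribution_shares` is hypothetical, and the inputs are the rounded figures displayed here (0.692 and 3.0), so the result is approximate.

```python
# Contribution values as displayed on this page (already weighted, and rounded).
base_contribution = 0.692     # base-score share of the DataRank
network_contribution = 3.0    # citation-network share of the DataRank


def contribution_shares(base, network):
    """Return (total, base %, network %) for two additive contributions."""
    total = base + network
    return total, 100 * base / total, 100 * network / total


total, base_pct, network_pct = contribution_shares(
    base_contribution, network_contribution
)
print(f"DataRank ~ {total:.1f}; base {base_pct:.0f}%, network {network_pct:.0f}%")
# prints: DataRank ~ 3.7; base 19%, network 81%
```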

Base score B(p): log1p(citation_count) — grows sub-linearly, so a paper with 1,000 citations is not 10× a paper with 100.
Network N(p): Σ over citers of log1p(C_q) ÷ max(outdegree_q, 1). Being cited by a highly-cited paper with few references counts most.
Damping factor d = 0.85: DataRank = (1−d)·B(p) + d·N(p) — the two cards above are each already multiplied by their share.
Self-citations excluded: Citers sharing any OpenAlex author ID with this paper are filtered out before the network sum.
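The four rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DataRank engine itself: the `Citer` record, its field names, and the function names are hypothetical, and author IDs are modeled as plain strings.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Citer:
    """Hypothetical record for one citing work (field names are assumptions)."""
    cited_by_count: int                  # C_q: citations received by the citer
    outdegree: int                       # number of references the citer makes
    author_ids: frozenset = frozenset()  # author IDs, used for the self-citation filter


def base_score(citation_count):
    """B(p) = log1p(citation_count); grows sub-linearly with citations."""
    return math.log1p(citation_count)


def network_score(citers, paper_author_ids=frozenset()):
    """N(p) = sum over citers of log1p(C_q) / max(outdegree_q, 1)."""
    total = 0.0
    for q in citers:
        if q.author_ids & paper_author_ids:
            continue  # shares an author with the paper: treated as a self-citation
        total += math.log1p(q.cited_by_count) / max(q.outdegree, 1)
    return total


def datarank(citation_count, citers, paper_author_ids=frozenset(), d=0.85):
    """DataRank = (1 - d) * B(p) + d * N(p), with damping factor d."""
    b = base_score(citation_count)
    n = network_score(citers, paper_author_ids)
    return (1 - d) * b + d * n
```

With these definitions, `base_score(102)` is about 4.6, which matches the base score shown on this page, and `base_score(1000) / base_score(100)` is roughly 1.5 rather than 10, which is the sub-linear growth the first rule describes.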

Citers are pulled from OpenAlex sorted by cited_by_count:desc and capped per paper, so when the cap binds we keep the highest-signal references and the score is reproducible across reruns.
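A sort-then-cap selection of this kind could look like the sketch below. It assumes citers are already available locally as dictionaries with `id` and `cited_by_count` keys. The function name `select_citers`, the cap value, and the tie-break on `id` are all assumptions for illustration, since the page does not state the actual cap or how ties are resolved.

```python
def select_citers(citers, cap):
    """Keep the `cap` most-cited citers in a deterministic order.

    Sorting by descending cited_by_count mirrors the ordering named above.
    The secondary sort on `id` is an assumption, added so that ties cannot
    change the selection between reruns.
    """
    ranked = sorted(citers, key=lambda c: (-c["cited_by_count"], c["id"]))
    return ranked[:cap]


# Hypothetical citers; the IDs and counts are made up for illustration.
example = [
    {"id": "W3", "cited_by_count": 10},
    {"id": "W1", "cited_by_count": 500},
    {"id": "W2", "cited_by_count": 10},
    {"id": "W4", "cited_by_count": 75},
]
kept = select_citers(example, cap=3)
print([c["id"] for c in kept])  # prints: ['W1', 'W4', 'W2']
```

Because the ordering is total, feeding the same citers in a different input order yields the same capped list, which is the property that makes a score reproducible.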

Read the full methodology →
