Demo corpus. Scores are computed on a select set of biomedical papers and datasets and may be inaccurate for papers outside this corpus, because DataRank relies on network effects that improve with scale. We aim to expand this into a fully open resource pending additional funding.

Computationally Efficient Assembly of Pseudomonas aeruginosa Gene Expression Compendia

mSystems (2023) · DOI: 10.1128/msystems.00341-22 · Source: DataRank Database

Computationally Efficient Assembly of Pseudomonas aeruginosa Gene Expression Compendia is a dataset published in mSystems (2023). On the S index it has a DataRank of 0.342, placing it in the top 51.4% of the data-sharing corpus. It has been cited 6 times, with 2 citing works in its 1-hop citation network. Its calibrated FAIR score is 50/100.

DataRank: 0.342 (top 51%)

Dataset · Open Access · 6 citations · base score 1.9


datarank_citation_only_1hop_v6 · scope data_only · Methodology

Abstract

Thousands of Pseudomonas aeruginosa RNA sequencing (RNA-seq) gene expression profiles are publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). In this work, the transcriptional profiles from hundreds of studies performed by over 75 research groups were reanalyzed in aggregate to create a powerful tool for hypothesis generation and testing. Raw sequence data were uniformly processed using the Salmon pseudoaligner, and this read mapping method was validated by comparison to a direct alignment method. We developed filtering criteria to exclude samples with aberrant levels of housekeeping gene expression or an unexpected number of genes with no reported values and normalized the filtered compendia using the ratio-of-medians method. The filtering and normalization steps greatly improved gene expression correlations for genes within the same operon or regulon across the 2,333 samples. Since the RNA-seq data were generated using diverse strains, we report the effects of mapping samples to noncognate reference genomes by separately analyzing all samples mapped to cDNA reference genomes for strains PAO1 and PA14, two divergent strains that were used to generate most of the samples. Finally, we developed an algorithm to incorporate new data as they are deposited into the SRA. Our processing and quality control methods provide a scalable framework for taking advantage of the troves of biological information hibernating in the depths of microbial gene expression data and yield useful tools for P. aeruginosa RNA-seq data to be leveraged for diverse research goals.

IMPORTANCE Pseudomonas aeruginosa is a causative agent of a wide range of infections, including chronic infections associated with cystic fibrosis. These P. aeruginosa infections are difficult to treat and often have negative outcomes. To aid in the study of this problematic pathogen, we mapped, filtered for quality, and normalized thousands of P. aeruginosa RNA-seq gene expression profiles that were publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The resulting compendia facilitate analyses across experiments, strains, and conditions. Ultimately, the workflow that we present could be applied to analyses of other microbial species.

Data sources & pipeline

Pipeline: Metadata → Data-paper check → Enrichment → Citation network → Scoring

Enrichment: Pending

FAIR Checklist

Context only (not used in score)

Findable (1/2)

Has DOI

Accessible (1/2)

Open Access

Interoperable (0/2)

Reusable (1/3)

Dataset classification

FAIR checklist signals are shown for context only and do not affect DataRank scoring.

FAIR score: 50


Top 22% by FAIR · LLM-assessed · ✓ full text read

Calibrated FAIR score: a parallel quality metric, independent of the DataRank citation score.

DataRank Breakdown

Base Score 85% · Citation Network 15%

Base Score Contribution

0.292

From this paper's citation signal

Citation Network Contribution

0.0499

From 2 citing papers with measurable signal
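
The two contributions can be checked by hand against the formula given under "Why this DataRank?", DataRank = (1−d)·B(p) + d·N(p) with d = 0.85. This is a rough worked sketch: it assumes log1p is the natural logarithm of 1 + x (which matches the displayed base score of 1.9), and the value of N(p) is back-derived from the displayed contribution, not reported on the page.

```latex
\begin{align*}
B(p) &= \ln(1 + 6) \approx 1.946 \\
(1 - d)\,B(p) &= 0.15 \times 1.946 \approx 0.292 \\
d\,N(p) &= 0.0499 \;\Rightarrow\; N(p) \approx 0.0499 / 0.85 \approx 0.0587 \\
\text{DataRank} &\approx 0.292 + 0.0499 \approx 0.342 \\
\text{base share} &\approx 0.292 / 0.342 \approx 85\%, \qquad
\text{network share} \approx 0.0499 / 0.342 \approx 15\%
\end{align*}
```

Note that the 85% base share is a share of the final score, not the weight applied to B(p); the weight on the base score is 1−d = 0.15.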


Top 5 citers driving the network score

Ranked by citation count — the same ordering the engine uses when summing log1p(C_q) over citers.

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
Genome Biology · 2014 · 97,097 citations · DataRank 1.7
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
Bioinformatics · 2009 · 44,025 citations · DataRank 1.6
Near-optimal probabilistic RNA-seq quantification
Nature Biotechnology · 2016 · 10,816 citations · DataRank 1.4
Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks
Cell Systems · 2017 · 102 citations · DataRank 2.6
Pseudomonas aeruginosa lasR mutant fitness in microoxia is supported by an Anr-regulated oxygen-binding hemerythrin
Proceedings of the National Academy of Sciences · 2020 · 69 citations · DataRank 1.4

Why this DataRank?

DataRank blends this paper's own citation count with the influence of the papers that cite it. Here, roughly 85% comes from its base citations and 15% from the citation network (2 citing papers contributed measurable signal).

Base score B(p): log1p(citation_count) — grows sub-linearly, so a paper with 1,000 citations is not 10× a paper with 100.
Network N(p): Σ over citers of log1p(C_q) ÷ max(outdegree_q, 1). Being cited by a highly-cited paper with few references counts most.
Damping factor d = 0.85: DataRank = (1−d)·B(p) + d·N(p) — the two cards above are each already multiplied by their share.
Self-citations excluded: Citers sharing any OpenAlex author ID with this paper are filtered out before the network sum.
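
The four rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the engine's actual code: the `Citer` class, its field names, and the function names are hypothetical, and the citers in the usage example are made up.

```python
import math
from dataclasses import dataclass

D = 0.85  # damping factor d, as stated on the page


@dataclass(frozen=True)
class Citer:
    """Hypothetical record for one citing paper q."""
    cited_by_count: int    # C_q: citations received by q
    outdegree: int         # number of references q makes
    author_ids: frozenset  # OpenAlex author IDs of q


def base_score(citation_count: int) -> float:
    """B(p) = log1p(citation_count), natural log."""
    return math.log1p(citation_count)


def network_score(citers, paper_author_ids: frozenset) -> float:
    """N(p) = sum over non-self citers of log1p(C_q) / max(outdegree_q, 1)."""
    total = 0.0
    for q in citers:
        # Self-citation filter: skip citers sharing any author ID with p.
        if q.author_ids & paper_author_ids:
            continue
        total += math.log1p(q.cited_by_count) / max(q.outdegree, 1)
    return total


def datarank(citation_count: int, citers, paper_author_ids: frozenset):
    """Return (base contribution, network contribution, total score)."""
    base_part = (1 - D) * base_score(citation_count)
    network_part = D * network_score(citers, paper_author_ids)
    return base_part, network_part, base_part + network_part


# Usage with made-up citers; the third shares author "A1" and is dropped.
paper_authors = frozenset({"A1"})
example_citers = [
    Citer(cited_by_count=100, outdegree=50, author_ids=frozenset({"A2"})),
    Citer(cited_by_count=10, outdegree=0, author_ids=frozenset({"A3"})),
    Citer(cited_by_count=1000, outdegree=1, author_ids=frozenset({"A1", "A9"})),
]
base_part, network_part, total = datarank(6, example_citers, paper_authors)
print(f"base={base_part:.3f} network={network_part:.3f} total={total:.3f}")
```

The `max(outdegree, 1)` guard only matters for citers with no recorded references; it avoids a division by zero without changing the result for any other citer.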

Citers are pulled from OpenAlex sorted by cited_by_count:desc and capped per paper, so when the cap binds we keep the highest-signal references and the score is reproducible across reruns.
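
A sort-then-cap step of this kind might look like the sketch below. It assumes each citer is a dict carrying the OpenAlex work fields `id` and `cited_by_count`; the function name, the cap value, and the tie-break on `id` are assumptions made for illustration, since the page states neither the cap nor how ties are broken.

```python
def select_citers(citers, cap):
    """Keep the `cap` most-cited citers in a deterministic order.

    Sorting by descending cited_by_count mirrors the stated ordering;
    the secondary sort on `id` is an assumed tie-break so that equal
    counts always come out in the same order.
    """
    ordered = sorted(citers, key=lambda c: (-c["cited_by_count"], c["id"]))
    return ordered[:cap]


# Usage with made-up records: input order does not affect the result.
records = [
    {"id": "W3", "cited_by_count": 69},
    {"id": "W1", "cited_by_count": 102},
    {"id": "W2", "cited_by_count": 69},
]
top_two = select_citers(records, cap=2)
print([r["id"] for r in top_two])
```

Without a tie-break, which of two equally cited papers survives the cap would depend on the order in which they were returned, which would undermine reproducibility across reruns.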

