Demo corpus. Scores are computed on a select set of biomedical papers and datasets and may be inaccurate for papers outside this corpus, because DataRank relies on network effects that improve with scale. We aim to expand this into a fully open resource pending additional funding.

Mapping the Unseen in Practice: Comparing Latent Dirichlet Allocation and BERTopic for Navigating Topic Spaces

(2024) · 10.31235/osf.io/3nad9 · Source: DataRank Database

Mapping the Unseen in Practice: Comparing Latent Dirichlet Allocation and BERTopic for Navigating Topic Spaces is a research paper (2024). On theSindex it has a DataRank of 0.104. It has been cited 1 time.

0.104 DataRank · unranked

Open Access · 1 citation · base score 0.693

datarank_citation_only_1hop_v6 · scope data_only · Methodology

Abstract

This article focuses on the strengths and weaknesses of topic modeling for the social studies of science. For about a decade, Natural Language Processing opened new research avenues beyond traditional bibliometric approaches, such as co-citation, co-authorship, and co-word analysis. Among these, the most prevalent are latent Dirichlet allocation (LDA) and BERTopic. The first is a Bayesian probabilistic model and the latter is rooted in deep learning. It remains unclear what those differences imply in practice, and how they contribute to our sociological understanding of the inner works of science. This paper compares results obtained by LDA and BERTopic applied to the same dataset composed of all scientific articles (n=34,797) authored by all biology professors in Switzerland between 2008 and 2020. Although they differ in their operationalization, LDA and BERTopic produce topic spaces with a similar global configuration. However, major differences are observed when focusing on specific multidimensional concepts, such as gene or species. Overall, we stress that topic modeling offers a highly valuable ground for collaborative interdisciplinary research among scholars from all the social studies of science and beyond, when combined with in-depth knowledge of the object under scrutiny.

Data sources & pipeline

Pipeline: Metadata · Data-paper check · Enrichment · Citation network · Scoring

Enrichment: Pending

FAIR Checklist

Context only (not used in score)

Findable (1/2)

Has DOI

Accessible (1/2)

Open Access

Interoperable (0/2)

Reusable (0/3)

FAIR checklist signals are shown for context only and do not affect DataRank scoring.

DataRank Breakdown

Base Score 100% · Citation Network 0%

Base Score Contribution

0.104

From this paper's citation signal

Citation Network Contribution

Citation network not refreshed for this result

This paper's DataRank is currently driven only by its base citation score.

Why this DataRank?

DataRank blends this paper's own citation count with the influence of the papers that cite it. Here, roughly 100% comes from its base citations and 0% from the citation network.

Base score B(p): log1p(citation_count) — grows sub-linearly, so a paper with 1,000 citations is not 10× a paper with 100.
Network N(p): Σ over citers of log1p(C_q) ÷ max(outdegree_q, 1). Being cited by a highly-cited paper with few references counts most.
Damping factor d = 0.85: DataRank = (1−d)·B(p) + d·N(p) — the two cards above are each already multiplied by their share.
Self-citations excluded: Citers sharing any OpenAlex author ID with this paper are filtered out before the network sum.
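The formulas above can be sketched in a few lines of Python. This is only an illustration of the stated definitions, not DataRank's actual implementation: the function names and the (citation count, outdegree) tuple layout for citers are assumptions made for the example.

```python
import math

DAMPING = 0.85  # d, as stated in the breakdown above


def base_score(citation_count: int) -> float:
    """B(p) = log1p(citation_count): grows sub-linearly with citations."""
    return math.log1p(citation_count)


def network_score(citers) -> float:
    """N(p) = sum over citers q of log1p(C_q) / max(outdegree_q, 1).

    `citers` is an iterable of (citation_count_q, outdegree_q) pairs;
    this tuple layout is a hypothetical choice for the sketch.
    """
    return sum(math.log1p(c_q) / max(outdegree_q, 1) for c_q, outdegree_q in citers)


def datarank(citation_count: int, citers=(), d: float = DAMPING) -> float:
    """DataRank = (1 - d) * B(p) + d * N(p)."""
    return (1 - d) * base_score(citation_count) + d * network_score(citers)


# This page's paper: 1 citation, citation network not refreshed (no citers).
print(round(base_score(1), 3))  # 0.693
print(round(datarank(1), 3))    # 0.104
```

With one citation and an empty citer list, the sketch reproduces the base score of 0.693 and the DataRank of 0.104 shown on this page, since 0.15 × 0.693 ≈ 0.104.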

Citers are pulled from OpenAlex sorted by cited_by_count:desc and capped per paper, so when the cap binds we keep the highest-signal references and the score is reproducible across reruns.
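A minimal sketch of the citer selection described above, assuming each citer is a plain dict. Several details are assumptions rather than documented behaviour: the keys `id` and `author_ids` are hypothetical, the `cap` value is not given on this page, the ID tie-break is added only to make the ordering deterministic, and the order of filtering versus capping is a guess.

```python
def select_citers(paper_author_ids, citers, cap):
    """Drop self-citations, sort by cited_by_count descending, keep the top `cap`.

    Each citer is a dict with the hypothetical keys `id`, `cited_by_count`
    and `author_ids`.
    """
    own_authors = set(paper_author_ids)
    # A citer sharing any author ID with the target paper is a self-citation.
    kept = [c for c in citers if own_authors.isdisjoint(c["author_ids"])]
    # Highest-cited first; ties broken by ID (assumption) so reruns agree.
    kept.sort(key=lambda c: (-c["cited_by_count"], c["id"]))
    return kept[:cap]


citers = [
    {"id": "W1", "cited_by_count": 5, "author_ids": ["A1"]},
    {"id": "W2", "cited_by_count": 50, "author_ids": ["A2"]},
    {"id": "W3", "cited_by_count": 20, "author_ids": ["A9"]},  # shares author A9
    {"id": "W4", "cited_by_count": 20, "author_ids": ["A3"]},
]
selected = select_citers(["A9"], citers, cap=2)
print([c["id"] for c in selected])  # ['W2', 'W4']
```

Here the filter runs before the cap, so a self-citation never takes up one of the capped slots. Whether DataRank applies the two steps in that order is not stated on this page.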