Mapping the Unseen in Practice: Comparing Latent Dirichlet Allocation and BERTopic for Navigating Topic Spaces is a research paper (2024). On theSindex it has a DataRank of 0.104. It has been cited 1 time.
This article focuses on the strengths and weaknesses of topic modeling for the social studies of science. For about a decade, Natural Language Processing has opened new research avenues beyond traditional bibliometric approaches, such as co-citation, co-authorship, and co-word analysis. Among these, the most prevalent topic modeling techniques are latent Dirichlet allocation (LDA) and BERTopic. The former is a Bayesian probabilistic model and the latter is rooted in deep learning. It remains unclear what those differences imply in practice, and how they contribute to our sociological understanding of the inner workings of science. This paper compares results obtained by LDA and BERTopic applied to the same dataset, composed of all scientific articles (n=34,797) authored by all biology professors in Switzerland between 2008 and 2020. Although they differ in their operationalization, LDA and BERTopic produce topic spaces with a similar global configuration. However, major differences are observed when focusing on specific multidimensional concepts, such as gene or species. Overall, we stress that topic modeling offers a highly valuable ground for collaborative interdisciplinary research among scholars from all the social studies of science and beyond, when combined with in-depth knowledge of the object under scrutiny.
FAIR checklist signals are shown for context only and do not affect DataRank scoring.
Base Score Contribution: 0.104 (from this paper's citation signal)
Citation Network Contribution: 0 (citation network not refreshed for this result)
This paper's DataRank is currently driven only by its base citation score. Citation network data was not refreshed for this result.
DataRank blends this paper's own citation count with the influence of the papers that cite it. Here, roughly 100% comes from its base citations and 0% from the citation network.
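The percentage split quoted above can be reproduced with simple arithmetic. The sketch below assumes the two contributions simply add up to the total score, which matches the figures on this page (0.104 + 0 = 0.104) but is not confirmed as the actual DataRank formula; the function name `contribution_shares` is hypothetical.

```python
def contribution_shares(base: float, network: float) -> tuple[float, float]:
    """Return (base_share, network_share) as percentages of the combined score.

    Assumption: the total score is base + network. The real DataRank
    blend may weight the two parts differently.
    """
    total = base + network
    if total == 0:
        # No signal at all, so neither part contributes anything.
        return (0.0, 0.0)
    return (100.0 * (base / total), 100.0 * (network / total))


# Values shown for this paper: base 0.104, citation network 0.
base_share, network_share = contribution_shares(0.104, 0.0)
print(f"base: {base_share:.0f}%, network: {network_share:.0f}%")
```

With a zero network contribution, the base share is 100% regardless of the base value, which is why the page reports the score as driven only by base citations.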
Citers are pulled from OpenAlex sorted by cited_by_count:desc and capped per paper, so when the cap binds we keep the highest-signal references and the score is reproducible across reruns.
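A minimal sketch of what such a capped, sorted citer lookup might look like follows. It uses the public OpenAlex works endpoint with its `cites:` filter, `sort`, and `per-page` parameters. The work ID `W123`, the cap value, the helper names, and the tie-break on `id` are all illustrative assumptions, not the site's actual pipeline. OpenAlex limits page size, so a large cap would need paging, which is not shown here.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_citers_url(work_id: str, cap: int) -> str:
    """Build a query for the works citing `work_id`, most-cited first.

    `work_id` is an OpenAlex work identifier (hypothetical example: "W123").
    `cap` is an assumed per-paper limit on how many citers are kept.
    """
    params = {
        "filter": f"cites:{work_id}",
        "sort": "cited_by_count:desc",
        "per-page": cap,
    }
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def cap_citers(citers: list[dict], cap: int) -> list[dict]:
    """Keep the `cap` most-cited citers from an already fetched list.

    Ties are broken on the work id (an assumption) so that repeated
    runs over the same data return the same subset.
    """
    ranked = sorted(citers, key=lambda w: (-w["cited_by_count"], w["id"]))
    return ranked[:cap]


# Local example with made-up records; no network call is made.
sample = [
    {"id": "W3", "cited_by_count": 5},
    {"id": "W1", "cited_by_count": 40},
    {"id": "W2", "cited_by_count": 5},
]
top_two = cap_citers(sample, cap=2)
url = build_citers_url("W123", cap=2)
```

Sorting before capping is what makes the cap deterministic: the same highest-cited citers are selected on every run, as long as the underlying counts do not change.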