
RTD TIG Week: Mapping NIH Research with Embedding-Based Topic Models by Zahra Zad and Paula Fearon

Date: Thursday, June 18, 2026

We are Zahra Zad, Ph.D., and Paula Fearon, Ph.D., with the Analytics Research Institute, and we want to share with the community a resource that we created.

For large or complex portfolios, evaluators face a challenge: how can we make sense of large bodies of scientific work without relying exclusively on predefined categories or manual coding? Our team recently explored this question, creating a topic visualization of National Institutes of Health (NIH)-funded grants using embedding-based language models. Our goal was to understand how scientific topics are distributed across NIH Institutes and Centers (ICs), while experimenting with methods that can capture the relationships within research portfolios at scale.

Many existing topic or content analyses rely on administrative classifications, manually assigned categories, or counts of frequently used terms. While useful, these methods can struggle with interdisciplinary science, evolving terminology, or emerging research areas. Embedding-based approaches offer a more flexible way to represent scientific activity.

Rad Resource

Our approach used PubMedBERT, a language model trained on the biomedical literature (PubMed). Unlike keyword-based methods, embedding models like PubMedBERT look at a text’s language in context, allowing comparison of documents based on meaning. Grants can be grouped because they describe related concepts, even if they do not use the same terminology.
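To make "comparison of documents based on meaning" concrete, here is a minimal sketch of how two embeddings are commonly compared, using cosine similarity. The grant titles and the four-dimensional vectors are invented for illustration; a model such as PubMedBERT would produce much longer vectors, and this is not necessarily how similarity was computed in the work described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings": made-up titles and made-up numbers that
# stand in for the vectors an embedding model would return.
toy_embeddings = {
    "Tumor suppressor pathways in breast carcinoma": [0.9, 0.1, 0.2, 0.0],
    "Oncogenic signaling in mammary malignancies": [0.8, 0.2, 0.3, 0.1],
    "Sleep regulation in adolescent populations": [0.1, 0.9, 0.0, 0.3],
}


def most_similar(title, embeddings):
    """Return the other title whose vector is closest to `title` by cosine similarity."""
    query = embeddings[title]
    others = [t for t in embeddings if t != title]
    return max(others, key=lambda t: cosine_similarity(query, embeddings[t]))


if __name__ == "__main__":
    query_title = "Tumor suppressor pathways in breast carcinoma"
    print(most_similar(query_title, toy_embeddings))
```

In this toy data the two cancer-related titles use different wording but have vectors pointing in nearly the same direction, so they are matched to each other rather than to the sleep-related title.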

We created embeddings from grant titles and abstracts, clustered grants into topics, then used these results to develop a beta interactive tool, TopicVista, to explore the distribution of topics across NIH ICs. TopicVista revealed both expected patterns and areas of overlap across the NIH funding landscape. Many topics are concentrated within the expected ICs, such as “Cancer Biology” within the National Cancer Institute (CA). Other topics appeared across multiple ICs (e.g., “Data Management and Bioinformatics”). These areas of overlap may point to opportunities for collaboration across ICs, or they may raise useful questions about how ICs delineate their distinct roles and priorities. Our tool, code, and data are freely available here.
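Once each grant has a topic label, the distribution across ICs can be summarized with a simple cross-tabulation. The sketch below assumes the clustering step has already run and is only an illustration: the grant rows are invented, the helper names are hypothetical, and the 50% cutoff for calling a topic "cross-cutting" is an arbitrary choice rather than a rule used in TopicVista.

```python
from collections import Counter, defaultdict

# Hypothetical (topic, IC code) pairs, one per grant. The counts are made up.
grants = [
    ("Cancer Biology", "CA"),
    ("Cancer Biology", "CA"),
    ("Cancer Biology", "CA"),
    ("Cancer Biology", "CA"),
    ("Cancer Biology", "GM"),
    ("Data Management and Bioinformatics", "LM"),
    ("Data Management and Bioinformatics", "GM"),
    ("Data Management and Bioinformatics", "HG"),
    ("Data Management and Bioinformatics", "CA"),
]


def topic_by_ic(rows):
    """Count grants per IC within each topic."""
    table = defaultdict(Counter)
    for topic, ic in rows:
        table[topic][ic] += 1
    return table


def top_ic_share(ic_counts):
    """Return the IC with the most grants in a topic and its share of that topic."""
    total = sum(ic_counts.values())
    ic, count = ic_counts.most_common(1)[0]
    return ic, count / total


def cross_cutting_topics(table, max_share=0.5):
    """Topics in which no single IC holds more than `max_share` of the grants."""
    return sorted(
        topic for topic, counts in table.items()
        if top_ic_share(counts)[1] <= max_share
    )


if __name__ == "__main__":
    table = topic_by_ic(grants)
    for topic, counts in table.items():
        ic, share = top_ic_share(counts)
        print(f"{topic}: largest IC is {ic} with {share:.0%} of grants")
    print("Cross-cutting:", cross_cutting_topics(table))
```

A summary like the largest-IC share is only one way to flag overlap; whether a flagged topic reflects real collaboration potential or simply a broad cluster still needs a reader who knows the portfolio.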

For evaluators, embedding-based topic models create new possibilities. First, they can reveal how research areas relate by content rather than administrative category, helping evaluators identify overlap, fragmentation, convergence, and gaps across organizations or initiatives. Second, these approaches can support analysis of interdisciplinarity and emerging science by surfacing connections conventional classifications may miss. Third, visual topic maps can make complex funding landscapes easier to communicate, helping decision makers and stakeholders more intuitively understand research activity.

Importantly, these methods do not replace substantive expertise or evaluative judgment. Models such as PubMedBERT reflect patterns in scientific language; interpretation still requires domain knowledge, validation, and methodological transparency. But these approaches expand the evaluator’s toolkit, especially for working with larger and less structured datasets.

For the RTD TIG community and other evaluators, the broader opportunity may lie in how advances in natural language processing can strengthen evaluation practice itself. As computational methods become more accessible, embedding-based analyses offer evaluators new ways to examine research systems, organizational strategy, and scientific ecosystems at a scale that was previously difficult to achieve.

TopicVista represents an initial effort to explore NIH research portfolios using embedding-based topic models. The topics and visualizations have undergone limited manual review, and we view this as the beginning of an iterative process rather than a finished product. We welcome feedback on the tool, its interpretations, and potential applications, and would be delighted to connect with others interested in refining these approaches or collaborating on future work.

Two-panel visualization of NIH-funded grant applications by research topic. The left panel is a horizontal bar chart showing the number of applications in each topic, with larger topics including Cancer Biology, Public Health and Caregiving, Neuroscience and Sleep Regulation, HIV Immunology and Vaccines, and Neurodegenerative Diseases. The right panel is a bubble chart showing how applications within each topic are distributed across NIH Institutes and Centers. Some topics are concentrated in a few institutes, while others appear across many institutes, indicating areas of overlap across the NIH portfolio.

The American Evaluation Association is celebrating RTD TIG Week with our colleagues in the Research Technology and Development TIG. All of the blog contributions this week come from our RTD TIG members. Do you have questions, concerns, kudos, or content to extend this aea365 contribution? Please add them in the comments section for this post on the aea365 webpage so that we may enrich our community of practice. Would you like to submit an aea365 Tip? Please send a note of interest to aea365@eval.org. aea365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators.