Date: Friday, November 28, 2025
Hi! We’re Gizelle Gopez, Mani Keita Fakeye, Pavan Kumar, Jaime McCall, Aaron Landrum, and Jenica Reed, from Deloitte’s Evaluation and Research for Action Center of Excellence. As evaluators, we’re always exploring ways to improve efficiency and rigor when working with large, complex data sets across multiple sites. This challenge was front and center in a recent project where our team needed to qualitatively code hundreds of interview transcripts from multiple organizations. Qualitative coding is one of our favorite analysis methods, but it can be time consuming, especially with large amounts of data from interview and focus group transcripts.
Large Language Models (LLMs) are advanced computer algorithms designed to read, analyze, and generate text using predictive analytics; they are the engine behind Generative AI tools. They are used for a variety of processes, such as identifying patterns, synthesizing large volumes of text, speeding up code debugging, and testing AI models and applications. Given the volume of data involved in our recent study, we used an LLM to conduct qualitative coding more efficiently.
To create the deductive codebook, we identified concepts from a literature review on the needs of social entrepreneurs. We then instructed the LLM through prompts that reflected the codebook and the desired outputs. These prompts included detailed rules on what to do (view segments holistically, consider context provided by adjacent segments, etc.) and what to avoid (focusing on keywords, breaking the text down sentence-by-sentence, etc.) when coding. After training the model this way, we had the LLM take the first pass at coding the transcripts.
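The prompting approach described above can be sketched in code. This is a minimal illustration only, not the team's actual implementation: the codebook entries, rule wording, and the `build_coding_prompt` helper are hypothetical, and the resulting prompt would be sent to whatever LLM provider you use.

```python
# Sketch: assembling a deductive-coding prompt from a codebook and rules.
# All codebook entries and rule text below are illustrative, not the
# actual codebook from the study described in this post.

CODEBOOK = {
    "access_to_capital": "Mentions of funding, grants, loans, or investment needs.",
    "mentorship": "Mentions of advice, coaching, or guidance from experienced peers.",
}

CODING_RULES = [
    "View each segment holistically rather than sentence-by-sentence.",
    "Consider context provided by adjacent segments.",
    "Do not code based on keyword matches alone.",
    "Apply every code that fits; a segment may receive multiple codes.",
]

def build_coding_prompt(segment: str) -> str:
    """Combine the codebook, the coding rules, and one transcript
    segment into a single instruction prompt for an LLM."""
    codebook_text = "\n".join(f"- {name}: {defn}" for name, defn in CODEBOOK.items())
    rules_text = "\n".join(f"- {rule}" for rule in CODING_RULES)
    return (
        "You are a qualitative coder. Apply codes from the codebook below.\n\n"
        f"Codebook:\n{codebook_text}\n\n"
        f"Rules:\n{rules_text}\n\n"
        f"Transcript segment:\n{segment}\n\n"
        "Return the applicable codes with a brief justification for each."
    )

prompt = build_coding_prompt(
    "We struggled to find seed funding, and a mentor helped us approach investors."
)
# `prompt` would then be submitted to the LLM; a human reviews every result.
```

Keeping the codebook and rules as data, separate from the prompt template, makes it easier to revise the codebook between coding passes without rewriting the instructions.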
The LLM applied parent and child codes to the transcripts, capturing the content but often missing nuances such as sentiment or the context in which statements were made. Coding based on key words or phrases was applied successfully.
The LLM applied the codes by searching for specific words or sentences, but it was unable to determine when to consider an entire section of text as context. Additionally, the LLM often coded only one or two sentences within a paragraph, potentially missing other sentences that also needed coding. When multiple concepts were present, the LLM did not apply more than one code to a given text segment.
Since the model was trained to recognize specific text based on the codebook, it focused on identifying words or definitions. However, it did not detect themes, patterns, or other contexts that could be used as potential inductive codes.
The team’s involvement was essential in developing the codebook, validating the output, and identifying themes and nuances that the LLM missed. Responsible use of these tools is paramount as organizations leverage LLMs and refine their role in qualitative analyses—ultimately, humans are responsible for the validity of the findings first and foremost.
Using an LLM validated our code application and sparked discussions about our process and results, confirming that human involvement is essential to producing quality output and using AI responsibly.
Have you used LLMs or other tools for analyzing qualitative data? What lessons or tips do you have for others on how to use these tools effectively? Leave a comment below!
This posting contains general information only, does not constitute professional advice or services, and should not be used as a basis for any decision or action that may affect your business. Deloitte shall not be responsible for any loss sustained by any person who relies on this posting.
As used in this posting, “Deloitte” means Deloitte Consulting LLP, a subsidiary of Deloitte LLP. Please see www.deloitte.com/us/about for a detailed description of our legal structure. Certain services may not be available to attest clients under the rules and regulations of public accounting.
Copyright © 2025 Deloitte Development LLC. All rights reserved.
The American Evaluation Association is hosting Cluster, Multi-site, and Multi-level Evaluation TIG week. The contributions this week come from our CMME TIG members. Do you have questions, concerns, kudos, or content to extend this AEA365 contribution? Please add them in the comments section for this post on the AEA365 webpage so that we may enrich our community of practice. Would you like to submit an AEA365 Tip? Please send a note of interest to AEA365@eval.org. AEA365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators. The views and opinions expressed on the AEA365 blog are solely those of the original authors and other contributors. These views and opinions do not necessarily represent those of the American Evaluation Association, and/or any/all contributors to this site.