With an influx of NLP researchers into education, many are attempting to measure the pedagogical qualities of LLM-based tutors in objective ways, using pedagogical benchmarks as guideposts to facilitate technical iteration.
However, education is not a natural science; it is closer to a social and design science. Solely measuring a dialogue tutor with respect to objective and generalisable qualities either reduces away much of what education is, or unknowingly incorporates value-laden decisions that an individual may see as objective through their own worldview.
Here, I describe an idea that aims to evaluate dialogue tutors in a manner that facilitates:
1. Capturing both objective and subjective qualities of tutoring ***(composable rubric)***
2. Deriving insights that are not only generalisable, but also capture nuance ***(inductive analysis)***
3. Incorporating researcher interpretation for depth, whilst also enabling automated application for scale ***(LLM-as-a-judge)***
Gaining both depth and scale is not novel, and forms the foundation of many mixed-methods approaches. However, this evaluation breaks away from viewing an LLM as an autonomous tutor, instead focusing on the extent to which design intentions are carried over into the discourse produced between the student and the dialogue tutor. Such an evaluation would be performed many times across many different designs, hence time-efficiency is paramount.
Hence, this is simply my pragmatic mixing of existing ideas in a time-efficient manner, making it reasonable to apply to individual designs whilst not ignoring the depth and subjectivities of education.
# 1. A composable rubric
The discourse data is qualitative, but to gain scalable insights we need to reduce it to quantitative metrics. However, many prior rubrics focus solely on an objective notion of pedagogy and overlook the contextual and value-laden nature of education. Hence, here is a composable rubric that spans the subjective to the objective, evaluating both the subjective design intentions evidenced within the produced discourse and the more objective good practices of tutoring.
The following rubric categories are listed in order from most subjective to most objective.
1. Meeting learning objectives
2. Subjective conversational qualities
3. Grounding in pedagogical theories
4. General tutoring qualities
5. Generic conversational qualities
Learning objectives (1) are laden with values about what education should be for, which is a highly subjective and contextual decision. Subjective conversational qualities (2) are ways of speaking that we intend the entire conversation to have; for example, if the tutor is used in a school in England, we may wish it to use British English. The pedagogical theories (3) are design decisions grounded in theory; they are somewhat flexible but depend on the forms of knowledge deemed valuable within the learning objectives, which can be mapped through frameworks such as [[The Knowledge-Learning-Instruction Framework|KLI]]. However, such a mapping cannot be modelled in an entirely deterministic manner given the plethora of confounding variables within education that we cannot account for; the application of theories cannot be purely 'grounded' in literature, but requires human design judgements based on experience and intuition. The general tutoring qualities (4) are more universal, covering qualities found in older dialogue tutoring research (e.g. [[AutoTutor]]) such as 'relevance' and 'perceptivity', though their specific interpretation within the discourse is still somewhat subjective, so the rating criteria of each indicator require some subjective decisions for the application context. Lastly, the generic conversational qualities (5) are non-controversial and the most objective, where criteria such as 'accuracy' and 'fluency', concerning the naturalness and grammatical correctness of the produced discourse, can be taken from NLP papers.
The set of indicators for (4) and (5) can remain constant across designs, and is what we often see in current research (as of writing, 6/8/2025). Meanwhile, the rubrics for (1), (2), and (3) are specific to each pedagogic intent. To retain some scalability and evaluative comparability in such a system, I propose the creation of a set of pedagogical intentions, each composed of a context, learning objectives, and the respective desired theory. Different systems created with the intention of instantiating the same pedagogical intention can then be compared against each other, to see how well those intentions are embodied in the discourse produced with real students. Though of course, this is much more time-consuming than benchmarks or than evaluating instruction-following capabilities in individual utterances.
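To make this concrete, here is a minimal sketch of how such a composable rubric could be represented; the names and Python representation are my own illustrative assumptions, not a finished design. The constant indicators for (4) and (5) are shared across designs, whilst (1)–(3) come from the pedagogical intention:

```python
from dataclasses import dataclass

@dataclass
class Indicator:
    """A single rateable quality with operationalised rating criteria."""
    name: str
    criteria: str  # e.g. a description of each point on a 1-5 scale

@dataclass
class PedagogicalIntention:
    """A shareable specification from which the design-specific rubric parts derive."""
    context: str                            # e.g. "KS3 science revision in England"
    learning_objectives: list[Indicator]    # category 1
    conversational_style: list[Indicator]   # category 2
    theory_grounding: list[Indicator]       # category 3

# Categories 4 and 5 stay constant across designs.
GENERAL_TUTORING = [Indicator("relevance", "..."), Indicator("perceptivity", "...")]
GENERIC_CONVERSATIONAL = [Indicator("accuracy", "..."), Indicator("fluency", "...")]

def compose_rubric(intention: PedagogicalIntention) -> list[Indicator]:
    """Compose the full rubric: intention-specific parts plus the constant parts."""
    return (intention.learning_objectives
            + intention.conversational_style
            + intention.theory_grounding
            + GENERAL_TUTORING
            + GENERIC_CONVERSATIONAL)
```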
# 2. Inductive analysis
Rubric application is inherently reductive in that it is a deductive process, reducing qualitative data into a pre-defined set of boxes that can never be comprehensive of all nuances. To overcome this, many researchers advocate for mixed-methods approaches, where qualitative and quantitative methods derive knowledge that supplements each other's shortfalls; however, this can be very time-consuming when applied to individual designs. We could combine both processes to improve time-efficiency.
In the rubric-based evaluation of such systems, a researcher must apply the rubric to a sample of the produced dialogue to calculate inter-rater reliability (before automation with LLM-as-a-judge). In this process, the researcher incrementally goes through each of the above 5 rubric categories, making sense of the indicators, understanding the dialogue, and applying the indicators. Hence, we can leverage the familiarisation they gain as a basis for [[Memo|memoing]] qualities that fall outside the rubric, as well as other thoughts and ideas. Here, the categories of the rubric inherently act as a guideline directing attention to areas of interest, which makes the later integration of qualitative and quantitative insights easier, but diverts the process from [[Grounded theory|grounded theory]] (though that is not the intention here anyway).
Afterwards, the memos themselves can be treated as a raw qualitative dataset on which we can perform open coding, then group the codes into higher-order categories through thematic coding; the resulting themes can explain, expand, or contradict the rubric analysis results.
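As a minimal sketch of the data this step produces (all names here are hypothetical), each memo can carry its dialogue and the rubric category that prompted it, with open codes attached later and then grouped:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Memo:
    """A researcher note captured whilst applying the rubric to one dialogue."""
    dialogue_id: str
    rubric_category: Optional[int]  # 1-5 if prompted by a category, None if free-floating
    text: str
    codes: list[str] = field(default_factory=list)  # open codes, attached later

def group_by_code(memos: list[Memo]) -> dict[str, list[Memo]]:
    """Collect memos under each open code, as raw material for thematic grouping.

    Merging and abstracting codes into higher-order themes remains a human
    judgement; this only gives the integration step a concrete structure.
    """
    themes: dict[str, list[Memo]] = {}
    for memo in memos:
        for code in memo.codes:
            themes.setdefault(code, []).append(memo)
    return themes
```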
# 3. LLM as a judge
The process of manually applying rubrics is time-consuming, hence many current works reduce the number of indicators or the amount of text that is evaluated (e.g. looking at individual utterances). That is, adding further indicators makes our generalisable insights more comprehensive, but comes at the cost of time; richer texts containing full dialogues between student and tutor capture more qualities of the interaction, but take longer for the researcher to understand well enough to apply the rubric.
This can be aided by using LLM-as-a-judge to automatically apply the rubrics, once acceptable inter-rater reliability between the human rater and the model has been established.
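As a minimal sketch of that reliability check (the sample data, weighting choice, and threshold below are all assumptions on my part), agreement on an ordinal rating scale could be computed with weighted Cohen's kappa from scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Human and LLM scores for one indicator over a shared sample (hypothetical data).
human_ratings = [4, 3, 5, 2, 4, 4, 1, 3]
llm_ratings   = [4, 3, 4, 2, 5, 4, 2, 3]

# Quadratic weighting penalises large disagreements more, suiting ordinal scales.
kappa = cohen_kappa_score(human_ratings, llm_ratings, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")

# Only automate this indicator if agreement is acceptable; the threshold is a judgement call.
if kappa < 0.6:
    print("Agreement too low: refine the indicator's operationalisation and re-rate.")
```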
However, unlike traditional rater training, which involves coming to a shared sense of understanding where much of that knowledge remains implicit, the use of LLM-as-a-judge requires a strict operationalisation of each indicator. That said, this is useful in forcing the researcher to create a well-defined rubric that could also be reused in the future.
When experimenting, I used a reasoning model (o4-mini) prompted to rate a dialogue with respect to only 1 indicator at a time, forced to explicitly note its reasoning process before coming to a decision; this seemed to give promising results. My other experiments, attempting to have the model come to conclusions directly or reason about more than 1 indicator at a time, led to poor results.
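A minimal sketch of this one-indicator-at-a-time setup, assuming the OpenAI Python SDK; the prompt wording, function name, and JSON output shape are illustrative rather than what I actually ran:

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def rate_indicator(dialogue: str, indicator_name: str, criteria: str) -> dict:
    """Rate one full dialogue against a single indicator, reasoning before deciding."""
    prompt = (
        f"You are rating a tutoring dialogue on ONE indicator: {indicator_name}.\n"
        f"Rating criteria:\n{criteria}\n\n"
        f"Dialogue:\n{dialogue}\n\n"
        "First, write out your reasoning step by step. Then, on the final line, "
        'output only JSON of the form {"score": <1-5>}.'
    )
    response = client.chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Split the explicit reasoning from the final-line JSON verdict.
    reasoning, _, verdict = text.rpartition("\n")
    return {"reasoning": reasoning, **json.loads(verdict)}
```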
Finally, applying this to all produced dialogue gives time-efficiency whilst still gaining insights at scale. However, this comes at the cost of API fees, which can be expensive when applying 1 indicator at a time, using a reasoning model, and forcing the generation of explanations.
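As a rough back-of-the-envelope of why this adds up (every number below is a made-up assumption, not a measured cost), rating one indicator per call multiplies quickly:

```python
# Back-of-the-envelope cost estimate; every number here is an assumption.
n_dialogues = 200
n_indicators = 25               # one API call per (dialogue, indicator) pair
tokens_per_call = 4_000         # dialogue + criteria in, reasoning + verdict out
usd_per_million_tokens = 2.0    # hypothetical blended input/output rate

calls = n_dialogues * n_indicators
total_tokens = calls * tokens_per_call
cost = total_tokens / 1_000_000 * usd_per_million_tokens
print(f"{calls} calls, ~{total_tokens:,} tokens, ~${cost:.0f}")
# -> 5000 calls, ~20,000,000 tokens, ~$40
```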
# 4. Conclusion
This is simply my pragmatic mixing of existing thoughts to capture depth and subjectivities, but with an increase in time-efficiency that could make it practical to apply to individual designs.
You may have seen a diagram like the following used to describe the relative advantages of rich and thick data. On it, I have summarised the areas of researcher effort for this idea.
![[Pasted image 20250807163723.png]]
Though, such crossing over of different fields with different philosophical outlooks is sure to garner hate from all directions :/ Sociologists may argue that I am reducing the nuances of education to an automated rubric with some poor attempt to shoehorn in qualitative methods whilst ignoring the necessity of naturalistic conditions, whilst NLP researchers may see it as a time-intensive process that does not even evaluate a generalisable tutor. However, I do not see it being possible to build bridges that reconcile both tensions in their entirety, given that this incompatibility is inherently philosophical.