Here I ramble about some of my ideas on pedagogically evaluating LLM-based dialogue tutors. Note that different academic communities often centre around different [[Research paradigm|research paradigms]], each with a semi-strict mapping to accepted methodologies. Mixing methodologies in interdisciplinary ways opens avenues for criticism, where different researchers' meanings of rigour begin to clash. It seems that this leads evaluations to shrink in scope until the proposition becomes unassailable with respect to a targeted community. Nonetheless, I explore some ideas here that cross over fields with different philosophical outlooks and are sure to garner criticism from all directions :/ Sociologists may argue that I am reducing the nuances of education to an automated rubric with some poor attempt to shoehorn in qualitative methods whilst ignoring the necessity of naturalistic conditions, whilst NLP researchers may see it as a time-intensive process that does not even evaluate a generalisable tutor. However, I do not see it as possible to build bridges that reconcile these tensions in their entirety, given that the incompatibility is inherently philosophical; nonetheless, I believe the results could have pragmatic usefulness.

---

# 1. Capturing the subjective, objective, scale and nuance

*Date created: Aug 6, 2025*

With an influx of NLP researchers into education, many are attempting to measure the pedagogical qualities of LLM-based tutors in objective ways, with pedagogical benchmarks that can act as guideposts to facilitate technical iteration. However, education is not a natural science; it is closer to a social and design science. Measuring a dialogue tutor solely with respect to objective and generalisable qualities either reduces away much of what education is, or unknowingly incorporates value-laden decisions that an individual sees as objective through their own worldview. Here, I describe an idea that aims to evaluate dialogue tutors in a manner that facilitates:

1. Capturing both objective and subjective qualities of tutoring ***(composable rubric)***
2. Deriving insights that are not only generalisable, but also capture nuance ***(inductive analysis)***
3. Incorporating researcher interpretation for depth, whilst facilitating automated application for scale ***(LLM-as-a-judge)***

Gaining both depth and scale is not novel and forms the foundation of many mixed-methods approaches. However, this evaluation breaks away from viewing an LLM as an autonomous tutor and instead focuses on the extent to which design intentions are carried over into the discourse produced between the student and the dialogue tutor. Such an evaluation would be performed many times across many different designs, hence time-efficiency is paramount. So this is simply my pragmatic mixing of existing ideas into something time-efficient enough to be applied to individual designs, whilst not ignoring the depth and subjectivities of education.

You may have seen a diagram like the following when describing the relative advantages of rich and thick data. On it, I have summarised the areas of researcher effort for this idea.

![[Pasted image 20250807163723.png]]

## 1.1 A composable rubric

The discourse data is qualitative, but to gain scalable insights we need to reduce it to quantitative metrics. However, many prior rubrics focus solely on an objective notion of pedagogy and overlook the contextual and value-laden nature of education.
Hence, here is a composable rubric that ranges from the subjective to the objective, evaluating both the subjective design intentions evidenced within the produced discourse and the more objective good practices of tutoring. The rubric categories are listed below in order from subjective to objective.

1. Meeting learning objectives
2. Subjective conversational qualities
3. Grounding in pedagogical theories
4. General tutoring qualities
5. Generic conversational qualities

Learning objectives (1) are laden with values about what education should be for, which is a highly subjective and contextual decision. Subjective conversational qualities (2) are ways of speaking that we intend the entire conversation to have; for example, if the tutor is used in a school in England, we may wish it to use British English. The pedagogical theories (3) are design decisions grounded in theory, which are somewhat flexible but depend on the forms of knowledge deemed valuable within the learning objectives, which can be mapped through frameworks such as [[The knowledge-learning-instruction framework|KLI]]. However, such mapping cannot be modelled in an entirely deterministic manner given the plethora of confounding variables within education that we cannot account for; the application of theories cannot be purely 'grounded' in literature, but requires human design judgements based on experience and intuition. The general tutoring qualities (4) are more universal, covering qualities we see in older dialogue tutoring research (eg. [[AutoTutor]]) such as 'relevance' and 'perceptivity', though their specific interpretation within the discourse is still somewhat subjective, so the rating criteria of each indicator require some subjective decisions for the application context. Lastly, the generic conversational qualities (5) are non-controversial and the most objective, where criteria such as 'accuracy' and 'fluency' can be taken from NLP papers, concerning the grammatical correctness and naturalness of the produced discourse.

The set of indicators for (4) and (5) can remain constant across designs, and is what we most often see in current research (as of 6/8/2025). Meanwhile, the rubrics for (1), (2), and (3) are specific to each pedagogic intent. To retain some sense of scalability and evaluative comparability for such a system, I propose the creation of a set of pedagogical intentions, each composed of a context, learning objectives, and the respective desired theory. Different systems created with the intention of instantiating a given pedagogical intention can then be compared against each other to see how well those intentions are embodied in the discourse produced with real students. Though of course, this is much more time-consuming than benchmarks or evaluating instruction-following capabilities in individual utterances.
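To make the composability concrete, here is a minimal sketch of how such a rubric could be represented; the class names, the example indicators, and the `compose_rubric` helper are illustrative assumptions of mine rather than an existing implementation.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Category(IntEnum):
    """Rubric categories ordered from most subjective (1) to most objective (5)."""
    LEARNING_OBJECTIVES = 1
    SUBJECTIVE_CONVERSATIONAL = 2
    PEDAGOGICAL_THEORY = 3
    GENERAL_TUTORING = 4
    GENERIC_CONVERSATIONAL = 5


@dataclass
class Indicator:
    name: str
    category: Category
    criteria: str  # operationalised rating criteria, written for the application context


@dataclass
class PedagogicalIntention:
    """A shareable unit: context + learning objectives + desired theories (categories 1-3)."""
    context: str
    learning_objectives: list[str]
    desired_theories: list[str]
    design_specific_indicators: list[Indicator] = field(default_factory=list)


# Categories (4) and (5) can stay constant across designs.
GENERAL_INDICATORS = [
    Indicator("relevance", Category.GENERAL_TUTORING,
              "Tutor turns address the student's current need."),
    Indicator("fluency", Category.GENERIC_CONVERSATIONAL,
              "Utterances are natural and grammatically correct."),
]


def compose_rubric(intention: PedagogicalIntention) -> list[Indicator]:
    """Combine design-specific indicators (1-3) with the fixed general ones (4-5)."""
    return sorted(intention.design_specific_indicators + GENERAL_INDICATORS,
                  key=lambda indicator: indicator.category)
```

The point is simply that the design-specific half of the rubric travels with the pedagogical intention, so two systems built for the same intention can be scored against an identical composed rubric.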
### 1.1.1 Addressing some criticisms related to generalisability of tutoring qualities

On the specific indicators I used in these categories, I received feedback that I was missing qualities that belong in 'general tutoring qualities', and that I could simply take these from the literature. However, much of the past literature with rubric evaluations of tutoring dialogue comes from ITSs and STEM problem-solving, which is a much more narrowly defined domain than the general quality of tutoring. Given that the developers of such technology often come from CS backgrounds, where the primary purpose of knowledge is to solve problems, this bias can create an illusion of generalisability for tutoring qualities that are in fact domain-specific pedagogical strategies.

Some papers suggested to me include: collaborative dialogue patterns in naturalistic one-to-one tutoring (Graesser), what do human tutors do? (VanLehn), and unifying AI tutor evaluation (Maurya). All three come from areas focused on problem-solving, but setting that aside, I will focus on the last one, which criticises prior evaluations for being subjective protocols and benchmarks whilst providing dimensions that are themselves mostly specific to problem-solving. One of their dimensions takes 'revealing the answer' to be a bad trait, whilst positing that their evaluation is not subjective and is a unification of pedagogy. Don't get me wrong, I love the notion of [[Unified theory of pedagogy|unification]], but positing a lack of subjectivity in such a relaxed way represents, to me, the core issues of the current state of research into LLMs for education (as of 17/9/2025).

## 1.2 Inductive analysis

Rubric application is inherently reductive: it is a deductive process that sorts qualitative data into a pre-defined set of boxes, which can never be comprehensive of all nuances. To overcome this, many researchers advocate mixed-methods approaches in which qualitative and quantitative methods derive knowledge that supplements each other's shortfalls; however, this can be very time-consuming when applied to individual designs. We could combine both processes to improve time-efficiency.

In a rubric-based evaluation of such systems, a researcher must apply the rubric to a sample of the produced dialogue to calculate inter-rater reliability (before automation with LLM-as-a-judge). In this process, the researcher incrementally goes through each of the five rubric categories above, making sense of the indicators, understanding the dialogue, and applying the indicators. We can leverage the familiarisation they gain here as a basis for [[Memo|memoing]] qualities that fall outside the rubric, as well as other thoughts and ideas. The categories of the rubric inherently act as guiding questions that direct attention to areas of interest and encourage deeper reflection, i.e. "What qualities related to 'category name' did I observe that were NOT captured by the rubric indicators?". This makes later integration of qualitative and quantitative insights easier, but it diverts from [[Grounded theory|grounded theory]] (though that is not the intention anyway). Afterwards, the memos themselves can be treated as a raw qualitative dataset on which we can perform open coding and then group codes into higher-order categories through thematic coding, which can be used to explain, expand, or contradict the rubric analysis results.

## 1.3 LLM as a judge

Manually applying rubrics is time-consuming, hence many current works reduce the number of indicators or the amount of text that is evaluated (eg. looking at individual utterances). That is, adding further indicators makes our generalisable insights more comprehensive, but comes at the cost of time; richer texts containing full dialogue between student and tutor capture more qualities of the interaction, but take longer for the researcher to understand well enough to apply the rubric. This can be aided by using an LLM-as-a-judge to automatically apply the rubrics, once some inter-rater reliability with it has been calculated. However, unlike traditional rater training, which involves coming to a shared understanding where much of that knowledge remains implicit, LLM-as-a-judge requires a strict operationalisation of each indicator. This is nonetheless useful, as it forces the researcher to create a well-defined rubric that could also be leveraged in future work. When experimenting, I used a reasoning model (o4-mini) prompted to rate a dialogue with respect to only one indicator at a time, and forced to explicitly note its reasoning process before coming to a decision, which seemed to give promising results. My other experiments, in which the model came directly to conclusions or reasoned over more than one indicator at a time, led to poor results. Finally, applying this to all of the dialogue gives time-efficiency whilst still gaining insights at scale. However, this comes at the cost of API fees, which can be expensive when applying one indicator at a time, using a reasoning model, and forcing the generation of explanations.
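As a rough illustration of the single-indicator setup, here is a minimal sketch assuming the OpenAI Python client; the prompt wording, the `rate_indicator` function, and the 'SCORE:' output convention are my own assumptions rather than the exact prompts I used.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are rating a tutoring dialogue against ONE rubric indicator.

Indicator: {name}
Rating criteria: {criteria}

Dialogue transcript:
{dialogue}

First write out your reasoning: quote the turns relevant to this indicator and explain
how they meet or fail the criteria. Only then give a final line of the form 'SCORE: <1-5>'."""


def rate_indicator(dialogue: str, name: str, criteria: str) -> tuple[str, int]:
    """Rate one dialogue against one indicator, returning (reasoning text, score)."""
    response = client.chat.completions.create(
        model="o4-mini",
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(name=name, criteria=criteria, dialogue=dialogue),
        }],
    )
    text = response.choices[0].message.content
    # Naive parse of the final 'SCORE: n' line; a stricter output format would be safer.
    score = int(text.rsplit("SCORE:", 1)[-1].strip().split()[0])
    return text, score
```

Each dialogue-indicator pair becomes a separate call, which is where the API cost mentioned above comes from; agreement with the human-applied rubric on the reliability sample would be checked before trusting the automated scores.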
---

# TO CLEAN

# 2. Binarising discourse

*Date created: Nov 6, 2025*

Metrics, and the difficulty of looking at an entire conversation. Problems:

- It is difficult to define criteria and ratings for an entire discourse; whilst it is important to catch relationships across the conversation, it can be quite difficult for LLMs to rate entire discourses.
- But this should not mean that we only look at individual generated utterances, as the qualities of dialogue tutoring are realised in the actual conversation.
- That is, we should not just do single-turn evaluation and use that to form the entire evaluation.
- One thing we could do is decompose the conversation into small units (messages) that can be analysed: a list of tutor messages, a list of student messages, or even a list of tutor -> student or student -> tutor message pairs.

We can decompose particular metrics that we care about, for example accuracy, and apply them to each individual unit of analysis (see the sketch below). We can then start creating metrics from conversations that can be used to analyse correlations with some kind of 'ground truth' (learning gains, testing, etc.), and start connecting: design -> metric -> ground truth.

- Better understand the in-context behaviours.
- Uses: this can evaluate the quality of the dialogue tutor in terms of whether it follows the pedagogical strategies we wanted it to, without assuming objectivity in one metric or that every design must improve on everything, but instead trying to create the more subtle connections. Each pedagogical strategy can also be decomposed into the particular metrics that are useful for it, essentially capturing much more about the statefulness of the interaction, eg. spaced repetition, advance organisers, use of proceduralisation; these are combinations of particular states.
- Be careful not to treat the metrics as ground truth; they are just pointers and should not be optimised for.
- I think the main thing is in creating metrics or lists that can capture information about the conversation; of course, binarising things means we do not capture the nuances.
- What effects does state have: ordering, actions, quantity, sparsity of actions, etc.?
- This works really well with LLM-as-a-judge given the decomposition.
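Here is a rough sketch of what that decomposition could look like, assuming a simple message format; the `Message` class and the `judge` callable (standing in for a per-unit LLM-as-a-judge call) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    role: str  # "tutor" or "student"
    text: str


def tutor_student_pairs(conversation: list[Message]) -> list[tuple[Message, Message]]:
    """Decompose a conversation into (tutor message, following student message) pairs."""
    return [(prev, nxt) for prev, nxt in zip(conversation, conversation[1:])
            if prev.role == "tutor" and nxt.role == "student"]


def binary_metric(conversation: list[Message], criterion: str,
                  judge: Callable[[str, str, str], bool]) -> float:
    """Apply a binary criterion to each unit and return the proportion satisfying it.

    `judge` stands in for an LLM-as-a-judge call that answers True/False for one
    unit at a time, e.g. "the tutor message is factually accurate".
    """
    units = tutor_student_pairs(conversation)
    if not units:
        return 0.0
    hits = sum(judge(criterion, tutor.text, student.text) for tutor, student in units)
    return hits / len(units)
```

Per-conversation proportions like this are the kind of metric that could then be carried into the design -> metric -> ground truth chain above.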
There is novelty here as well:

- Being able to test how well stateful design intentions are carried out in the pedagogy; not the generalisable qualities, but rather whether our intended strategies are evidenced.
- [[Research infrastructure]]: having something that facilitates simple and quick experimentation.
- Being able to share and borrow metrics that somebody else may have created.
- Being able to generate metrics from the design itself, to see how well the pedagogy (or otherwise) is being met.
- Over time, if a community engages with the same design, we get to accumulate the discourse and try to identify relations, validating two things: design -> metric, and metric -> learning (a small sketch of the latter follows below).
- This is essentially testing the pedagogical theories, by going through metrics and validating these two links separately.
- The main advantage is the tying together of rationalism and empiricism: we are not solely relying on existing datasets for evaluation and trying to squeeze as many insights as possible from them; rather, we get to actually be experimental in trying different things.
- Closing this loop, and supporting it in a time-efficient way, is probably one of the best things we can do for the learning sciences. How do we accumulate theory if we cannot change the conditions? With a super micro view on things there will be so much noise from real-world conditions, and pedagogy cannot be isolated from this.
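For the metric -> learning link, a minimal sketch of what a first check could look like, assuming per-student metric scores and normalised learning gains have already been collected (all numbers below are placeholder values, not real data):

```python
import numpy as np

# Placeholder per-student data from one deployed design:
# proportion of tutor turns judged to follow the intended strategy, and the
# student's normalised learning gain from external pre/post testing.
strategy_adherence = np.array([0.42, 0.78, 0.55, 0.91, 0.30, 0.67])
learning_gain = np.array([0.10, 0.35, 0.22, 0.41, 0.05, 0.28])

# Pearson correlation as a crude first look at the metric -> learning link;
# a real analysis would need far more data and would have to handle confounds.
r = np.corrcoef(strategy_adherence, learning_gain)[0, 1]
print(f"metric -> learning correlation: r = {r:.2f}")
```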