Mechanistic interpretability is a field of research in explainable AI (XAI) that aims to understand the internal workings of neural networks, and it can aid in steering model behaviour. Although neural networks are often described as black boxes, this is not for lack of information but because of the difficulty of interpreting it: we technically have access to everything the model computes during inference, but it takes the form of very high-dimensional vectors that are hard to comprehend.

With respect to education, interpretability and controllability are vital.

- [[Interpretability]]. If we give a suggestion to a student, it is often important to know why that recommendation was made.
- [[Controllability]]. LLMs are increasingly being used to make [[Learning design|learning design]] decisions, thereby handing value-laden pedagogical choices over to the latent knowledge derived from the training data. The hope is that mechanistic interpretability will allow us to direct the pedagogic behaviour of LLMs by isolating useful features. The difficulty with many current methods is whether such complex behaviour can be represented as a vector learnt from contrastive (positive and negative) examples of a pedagogical trait; a minimal sketch of this approach follows below. Nonetheless, it is another [[LLM-control methods|LLM-control method]] that can be used in combination with others.

However, my initial investigations with Goodfire's sparse autoencoder libraries do not show much promise: the isolated traits often operate at the level of grammar and style, such as an encouraging tone, British English, or conciseness. Whilst these can be useful for freeing up instructions that would otherwise need to be conveyed through in-context learning, the side effects, strange artefacts and inconsistency in the targeted property, are problematic (as of 30/9/2025).
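
To make the contrastive-vector idea concrete, below is a minimal sketch, assuming GPT-2 via Hugging Face transformers rather than Goodfire's libraries: it computes a steering vector as the difference of mean hidden-state activations between positive and negative examples of a trait, then adds it back in at inference through a forward hook. The layer index, example prompts, and scaling coefficient are illustrative placeholders, not values from any experiment.

```python
# Sketch of contrastive activation steering (difference-of-means steering vector).
# Assumptions: GPT-2 via Hugging Face transformers; LAYER, prompts, and alpha are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # hypothetical middle layer at which to steer

# Toy contrastive examples of a pedagogical trait: Socratic prompting vs direct answers.
positive = ["What do you already know about fractions?",
            "How might you check that answer yourself?"]
negative = ["The answer is 3/4.",
            "You add the numerators and keep the denominator."]

def mean_hidden_state(prompts, layer):
    """Mean hidden-state activation at `layer`, averaged over tokens and prompts."""
    states = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer].mean(dim=1))  # average over tokens
    return torch.cat(states).mean(dim=0)

# Steering vector = mean activation on positive examples minus mean on negative examples.
steering_vector = mean_hidden_state(positive, LAYER) - mean_hidden_state(negative, LAYER)

def add_steering(module, inputs, output, alpha=4.0):
    """Forward hook that nudges the block's output along the steering vector."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vector
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tokenizer("Explain how to add 1/2 and 1/4.", return_tensors="pt")
generated = model.generate(**ids, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
hook.remove()
```

Whether a single such vector can capture a genuinely complex pedagogical behaviour, rather than a surface stylistic correlate of it, is exactly the open question raised above.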