For the latest snapshot version, use Spring AI 1.1.3.

Evaluation Testing

Testing AI applications requires evaluating the generated content to ensure the AI model has not produced a hallucinated response.

One approach to evaluating a response is to use the AI model itself for the evaluation. Select the AI model best suited for the evaluation, which may be a different model than the one used to generate the response.

The Spring AI interface for evaluating responses is Evaluator, defined as follows:

@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}

The input to the evaluation is defined as EvaluationRequest:

public class EvaluationRequest {

	private final String userText;

	private final List<Content> dataList;

	private final String responseContent;

	public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
		this.userText = userText;
		this.dataList = dataList;
		this.responseContent = responseContent;
	}

  ...
}
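Since Evaluator is a functional interface, a custom evaluator can be expressed as a lambda. The following is a minimal, self-contained sketch of that idea; the record types and the string-containment check are simplified stand-ins for illustration, not the actual Spring AI classes or evaluation logic.

```java
import java.util.List;

// Simplified stand-in types for illustration (NOT the actual Spring AI classes).
record EvaluationRequest(String userText, List<String> dataList, String responseContent) {}
record EvaluationResponse(boolean pass, String feedback) {}

@FunctionalInterface
interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest request);
}

public class EvaluatorSketch {
    public static void main(String[] args) {
        // A toy evaluator: pass only if the response quotes any retrieved document.
        // A real evaluator would instead delegate this judgment to an AI model.
        Evaluator groundedInContext = request -> {
            boolean pass = request.dataList().stream()
                .anyMatch(doc -> request.responseContent().contains(doc));
            return new EvaluationResponse(pass, pass ? "grounded" : "not grounded in context");
        };

        EvaluationRequest request = new EvaluationRequest(
            "Where does Anacletus live?",
            List.of("Anacletus lives in the forest."),
            "Anacletus lives in the forest.");

        System.out.println(groundedInContext.evaluate(request).pass());
    }
}
```

The single abstract method keeps custom evaluators lightweight: any logic that maps an EvaluationRequest to an EvaluationResponse, whether heuristic or model-backed, fits the same contract.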

RelevancyEvaluator

The RelevancyEvaluator is an implementation of the Evaluator interface, designed to assess the relevance of AI-generated responses against provided context. This evaluator helps assess the quality of a RAG flow by determining if the AI model’s response is relevant to the user’s input with respect to the retrieved context.

The evaluation is based on the user input, the AI model's response, and the context information. It uses a prompt template to ask the AI model whether its response is relevant to the user input and the context.

This is the default prompt template used by the RelevancyEvaluator:

Your task is to evaluate if the response for the query
is in line with the context information provided.

You have two options to answer. Either YES or NO.

Answer YES, if the response for the query
is in line with context information otherwise NO.

Query:
{query}

Response:
{response}

Context:
{context}

Answer:

You can customize the prompt template by providing your own PromptTemplate object via the .promptTemplate() builder method. See Custom Template for details.

Usage in Integration Tests

Here is an example of using the RelevancyEvaluator in an integration test to validate the result of a RAG flow that uses the RetrievalAugmentationAdvisor:

@Test
void evaluateRelevancy() {
    String question = "Where does the adventure of Anacletus and Birba take place?";

    RetrievalAugmentationAdvisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(VectorStoreDocumentRetriever.builder()
            .vectorStore(pgVectorStore)
            .build())
        .build();

    ChatResponse chatResponse = ChatClient.builder(chatModel).build()
        .prompt(question)
        .advisors(ragAdvisor)
        .call()
        .chatResponse();

    EvaluationRequest evaluationRequest = new EvaluationRequest(
        // The original user question
        question,
        // The retrieved context from the RAG flow
        chatResponse.getMetadata().get(RetrievalAugmentationAdvisor.DOCUMENT_CONTEXT),
        // The AI model's response
        chatResponse.getResult().getOutput().getText()
    );

    RelevancyEvaluator evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    EvaluationResponse evaluationResponse = evaluator.evaluate(evaluationRequest);

    assertThat(evaluationResponse.isPass()).isTrue();
}

You can find several integration tests in the Spring AI project that use the RelevancyEvaluator to test the functionality of the QuestionAnswerAdvisor (see tests) and the RetrievalAugmentationAdvisor (see tests).

Custom Template

The RelevancyEvaluator uses a default template to prompt the AI model for evaluation. You can customize this behavior by providing your own PromptTemplate object via the .promptTemplate() builder method.

The custom PromptTemplate can use any TemplateRenderer implementation (by default, it uses a renderer based on the StringTemplate engine). The important requirement is that the template contains the {query}, {response}, and {context} placeholders.
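A quick way to catch a missing placeholder is to check the custom template string before wiring it into the evaluator. The sketch below is a hypothetical helper, not part of Spring AI; it simply verifies that all three required placeholders are present.

```java
public class CustomTemplateCheck {
    // The three placeholders the RelevancyEvaluator template must contain.
    static final String[] REQUIRED = {"{query}", "{response}", "{context}"};

    // Returns true only if every required placeholder appears in the template.
    static boolean hasRequiredPlaceholders(String template) {
        for (String placeholder : REQUIRED) {
            if (!template.contains(placeholder)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String custom = """
            Decide whether the answer is supported by the context.
            Reply with YES or NO only.

            Question: {query}
            Answer: {response}
            Supporting context: {context}
            """;
        System.out.println(hasRequiredPlaceholders(custom));
    }
}
```

A check like this is cheap insurance: a template missing one of the placeholders would otherwise fail only at evaluation time, when the variables cannot be substituted.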

FactCheckingEvaluator

The FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of AI-generated responses against provided context. This evaluator helps detect and reduce hallucinations in AI outputs by verifying whether a given statement (the claim) is logically supported by the provided context (the document).

The claim and document are submitted to the AI model for evaluation. Smaller, more efficient AI models dedicated to this purpose are available, such as Bespoke's Minicheck, which helps reduce the cost of performing such checks compared to flagship models like GPT-4. Minicheck is also available through Ollama.

Usage

The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter:

public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
  this.chatClientBuilder = chatClientBuilder;
}

The evaluator uses the following prompt template for fact-checking:

Document: {document}
Claim: {claim}

where {document} is the context information and {claim} is the AI model's response to be evaluated.
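To make the substitution concrete, the following self-contained sketch fills the two placeholders with a plain String.replace. This is a simplified stand-in for the template rendering Spring AI performs internally, shown only to illustrate the final prompt the model receives.

```java
import java.util.Map;

public class FactCheckPrompt {
    // Naive placeholder substitution, standing in for the real template renderer.
    static String render(String template, Map<String, String> variables) {
        String rendered = template;
        for (var entry : variables.entrySet()) {
            rendered = rendered.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return rendered;
    }

    public static void main(String[] args) {
        String template = "Document: {document}\nClaim: {claim}";
        String prompt = render(template, Map.of(
            "document", "The Earth is the third planet from the Sun.",
            "claim", "The Earth is the fourth planet from the Sun."));
        // The model then answers whether the claim is supported by the document.
        System.out.println(prompt);
    }
}
```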

Example

Here is an example of how to use the FactCheckingEvaluator with an Ollama-based ChatModel, specifically the Bespoke-Minicheck model:

@Test
void testFactChecking() {
  // Set up the Ollama API
  OllamaApi ollamaApi = new OllamaApi("http://localhost:11434");

  ChatModel chatModel = new OllamaChatModel(ollamaApi,
      OllamaOptions.builder().model(BESPOKE_MINICHECK).numPredict(2).temperature(0.0d).build());
  // Create the FactCheckingEvaluator
  var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

  // Example context and claim
  String context = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";
  String claim = "The Earth is the fourth planet from the Sun.";

  // Create an EvaluationRequest
  EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

  // Perform the evaluation
  EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

  assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");
}