MLflow practice notes: trying out LLM-as-a-Judge, supported since 2.8

Introduction

I tried out MLflow.
I had heard that it can do LLM-as-a-Judge.

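Running the "Evaluate with a static dataset" code

I worked through the documentation I found by searching.
https://docs.databricks.com/ja/mlflow/llm-evaluate.html#evaluate-with-a-static-dataset
The linked page is the Japanese translation; the section title in the original English is "Evaluate with a static dataset".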

The example contains two question-answering cases:

Question: What is MLflow?
Question: What is Spark?

Each question comes with a ground truth and a prediction.
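To make the data layout concrete, here is a minimal sketch of the three columns this post refers to ("inputs", "ground_truth", "predictions"). The answer strings are placeholders I made up, not the text from the documentation example, and `check_columns` is a hypothetical helper, not part of MLflow.

```python
# Column-oriented layout using the column names that appear in this post.
eval_columns = {
    "inputs": ["What is MLflow?", "What is Spark?"],
    # Placeholder strings (assumption): substitute the real reference answers.
    "ground_truth": ["<reference answer about MLflow>", "<reference answer about Spark>"],
    # Placeholder strings (assumption): substitute the answers being judged.
    "predictions": ["<model answer about MLflow>", "<model answer about Spark>"],
}


def check_columns(columns):
    """Return the row count, raising if the columns differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()))


n_rows = check_columns(eval_columns)
print(n_rows)
```

With pandas installed, `pd.DataFrame(eval_columns)` would turn this into a DataFrame with one row per question, which is the kind of object I passed as `data`.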

As I understand it, the situation is that we want to evaluate the predictions against the ground truth.
The example calls mlflow.evaluate() to do LLM-as-a-Judge¹.

results = mlflow.evaluate(
    data=eval_data,
    targets="ground_truth",
    predictions="predictions",
    extra_metrics=[mlflow.metrics.genai.answer_similarity()],
    evaluators="default",
)

The extra_metrics argument is what specifies LLM-as-a-Judge (or rather, one way of doing it).

From the MLflow 2.8 release blog

See "MLflow 2.8: Automated Evaluation"².

We've extended the MLflow Evaluation API to support GenAI metrics and evaluation examples.

The phrase "MLflow Evaluation API" is a link, and the mlflow.evaluate() documentation it leads to points on to "Evaluating with LLMs".
https://mlflow.org/docs/latest/models.html#evaluating-with-llms

You get out-of-the-box metrics like toxicity, latency, tokens and more, alongside some GenAI metrics that use GPT-4 as the default judge, like faithfulness, answer_correctness, and answer_similarity.

Environment & script

Python 3.11.8
pip install mlflow tiktoken openai
- mlflow==2.11.3
- tiktoken==0.6.0
- openai==1.16.2
Set the OPENAI_API_KEY environment variable beforehand.

Checking the evaluation results

Running the script gives:

See aggregated evaluation results below:
{'answer_similarity/v1/mean': 3.5, 'answer_similarity/v1/variance': 0.25, 'answer_similarity/v1/p90': 3.9}
See evaluation table below:
            inputs  ...                 answer_similarity/v1/justification
0  What is MLflow?  ...  The provided output has moderate semantic simi...
1   What is Spark?  ...  The provided output aligns closely with the ta...

[2 rows x 5 columns]
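The aggregated numbers can be checked by hand. The justifications quoted later in this post give the two scores as 3 and 4; the sketch below assumes (my assumption, not something I confirmed in the MLflow source) that the variance is the population variance and that p90 uses linear interpolation between sorted values. `percentile_linear` is a hypothetical helper written for this check.

```python
import statistics


def percentile_linear(values, q):
    """q-th percentile using linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


scores = [3, 4]  # the per-row answer_similarity scores reported in this post

summary = {
    "mean": statistics.mean(scores),
    "variance": statistics.pvariance(scores),  # population variance (assumption)
    "p90": percentile_linear(scores, 90),      # linear interpolation (assumption)
}
print(summary)
```

Under those assumptions the three values agree with the mean of 3.5, variance of 0.25, and p90 of 3.9 in the output above.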

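I remembered that MLflow results can be viewed in the browser with mlflow ui³, so I took a look there too.

What did MLflow actually do behind the scenes?

With this setup, mlflow.metrics.genai.answer_similarity uses GPT-4 as the judge!⁴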

https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_similarity

Defaults to “openai:/gpt-4”.

The evaluation prompt appears to live here:
https://github.com/mlflow/mlflow/blob/v2.11.3/mlflow/metrics/genai/prompts/v1.py#L99

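It evaluates how semantically similar the prediction is to the ground truth (see definition)
It grades on a five-point scale from 1 to 5 (giving abstract definitions of the scores; see grading_prompt)
It supplies examples for scores 2 and 4 (default_examples)
- "What is MLflow?" is used as the example, but the prediction I had locally was not the same text as in this prompt
The justification appears to hold GPT-4's reason for each score
- The MLflow question (scored 3): Therefore, it demonstrates moderate, but not substantial, semantic similarity.
- The Spark question (scored 4): it demonstrates substantial semantic similarity.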
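The three ingredients listed above (a definition, a grading rubric, and scored examples) can be sketched as a single prompt string. This is only an illustration of the general shape of an LLM-as-a-Judge prompt: `build_judge_prompt`, the section labels, and all the wording are my own placeholders, not the contents of MLflow's v1.py.

```python
def build_judge_prompt(definition, grading_prompt, examples, question, prediction, target):
    """Assemble a judge prompt from a definition, a rubric, and scored examples."""
    example_blocks = [
        f"Example output: {ex['output']}\n"
        f"Example score: {ex['score']}\n"
        f"Example justification: {ex['justification']}"
        for ex in examples
    ]
    parts = [
        f"Metric definition:\n{definition}",
        f"Grading rubric:\n{grading_prompt}",
        "Examples:\n" + "\n\n".join(example_blocks),
        f"Input: {question}\nOutput: {prediction}\nTarget: {target}",
        "Reply with a score from 1 to 5 and a justification.",
    ]
    return "\n\n".join(parts)


# All strings below are hypothetical placeholders.
prompt = build_judge_prompt(
    definition="How semantically similar the output is to the target.",
    grading_prompt="Score 1 means little similarity; score 5 means the same meaning.",
    examples=[
        {"output": "<low-similarity answer>", "score": 2, "justification": "<why 2>"},
        {"output": "<high-similarity answer>", "score": 4, "justification": "<why 4>"},
    ],
    question="What is Spark?",
    prediction="<model answer>",
    target="<reference answer>",
)
print(prompt)
```

The judge model's reply would then be parsed into the score and justification columns seen in the evaluation table.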

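Conclusion

I tried out MLflow's LLM-as-a-Judge on two rows of data.
It turns out you can do it by passing the ground truth and predictions to mlflow.evaluate() and configuring LLM-as-a-Judge, that is, by specifying the extra_metrics argument.
I have no real experience with this, so my understanding may well be wrong, but I think of mlflow.evaluate() as something that takes a single model and evaluates its predictions. My reading is that the extension here is that you no longer have to specify a model: passing just the two kinds of data, ground truth and predictions, is enough.

MLflow's concepts are all very new to me, so I plan to keep experimenting with it.

P.S. What got me interested in MLflow

It was the MLOps study group meetup (MLOps勉強会) in February.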

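¹ I understand this to be a concept introduced in this paper. I learned about it from this article.
² I suspect the blog post's title says "Part 2" because it also covers the content of this second installment.
³ Reference article
⁴ I also found the pull request that made the switch (I don't know the background yet).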

nikkie-ftnext's diary
