OpenAIのPrompt engineeringのドキュメントから、LLMの出力をGPTに評価させるケースのプロンプト例を理解する

はじめに

OpenAIのドキュメントの中にはプロンプトエンジニアリングを扱ったガイドがあります。
https://platform.openai.com/docs/guides/prompt-engineering

その中で「分量多いな」と避けていた部分に、1日1エントリの強制力で切り込みます。
LLMの出力をGPTで評価させるシーンで参照できそうな箇所です

はじめに
目次
Prompt engineering、6つの戦略（Six strategies）
戦略「Test changes systematically」
戦術「Evaluate model outputs with reference to gold-standard answers」
- パターン1：回答に含んでいてほしい点（人が与える）をGPTはLLMの出力から引用し、引用だけからその点が推論できるか評価する
- パターン2：質問へのLLMの回答について、理想の回答と一緒にGPTに与え、GPTは重なり具合や矛盾の有無を分析する
終わりに

Prompt engineering、6つの戦略（Six strategies）

今回対象にするドキュメントは冒頭で、よりよい結果を得るための6つの戦略（「Six strategies for getting better results」）が紹介されています。
https://platform.openai.com/docs/guides/prompt-engineering/six-strategies-for-getting-better-results

Write clear instructions
Provide reference text
Split complex tasks into simpler subtasks
Give the model time to "think"
Use external tools
Test changes systematically

これらの戦略（strategy）には、それぞれ戦術（tactic）がひも付きます。
戦略が抽象的なプロンプトの方針で、戦術が具体なプロンプトの書き方なのかなと理解しています。

このドキュメント自体は非常に学びが深いと思っているのですが、今回読んでいくのは「Test changes systematically」（変更を体系的に検査する）です¹。

戦略「Test changes systematically」

https://platform.openai.com/docs/guides/prompt-engineering/strategy-test-changes-systematically

まず、検査するサンプルサイズが小さいと、真に改善したのか、全体としては改善していなくて見た範囲でたまたまよかっただけなのかの区別が難しいと述べられます（表もあります）

Looking at a few examples may hint at which is better, but with small sample sizes it can be hard to distinguish between a true improvement or random luck.

ここではEvaluation procedures（評価手順）をevalsと呼んでいますが、よい評価手順が満たす性質が3つ

Representative of real-world usage

Contain many test cases for greater statistical power

Easy to automate or repeat

実世界のユースケースを代表していて、サンプルサイズは大きく、自動化や繰り返しがしやすい。

その後に

Evaluation of outputs can be done by computers, humans, or a mix.

と続きますが、以降ではcomputer（計算機）による評価の話がメイントピックです。

結びの一文は以下で、model-based evals（モデルによる評価）がワークするよう実験することを促しています

We encourage experimentation to figure out how well model-based evals can work for your use case.

この戦略に紐づく戦術は1つで「Evaluate model outputs with reference to gold-standard answers」（モデルの出力を理想の回答への参照で評価する）
https://platform.openai.com/docs/guides/prompt-engineering/tactic-evaluate-model-outputs-with-reference-to-gold-standard-answers

具体的なプロンプトを読み解いていきます。

戦術「Evaluate model outputs with reference to gold-standard answers」

ドキュメント中の題材は、アポロ計画です。

パターン1：回答に含んでいてほしい点（人が与える）をGPTはLLMの出力から引用し、引用だけからその点が推論できるか評価する

GPTで評価する対象のLLMの出力は以下の3つ：

(a)

Neil Armstrong is famous for being the first human to set foot on the Moon. This historic event took place on July 21, 1969, during the Apollo 11 mission.

(b)

Neil Armstrong made history when he stepped off the lunar module, becoming the first person to walk on the moon.

(c)

In the summer of '69, a voyage grand,
Apollo 11, bold as legend's hand.
Armstrong took a step, history unfurled,
"One small step," he said, for a new world.

これらは1つ1つuserプロンプトとして与えます。
systemプロンプトは以下：

You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer:

- Neil Armstrong was the first person to walk on the moon.
- The date Neil Armstrong first walked on the moon was July 21, 1969.

For each of these points perform the following steps:

1 - Restate the point.
2 - Provide a citation from the answer which is closest to this point.
3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind.
4 - Write "yes" if the answer to 3 was yes, otherwise write "no".

Finally, provide a count of how many "yes" answers there are. Provide this count as {"count": <insert count here>}.

2点評価させようとしています。

ニール・アームストロングが月を歩いた最初の人間である
ニール・アームストロングが初めて月を歩いた日は1969年7月21日である

GPTによる評価手順は4段階（サブタスクに分けている例ですね）

評価する点を再度述べます
入力から評価する点と最も近い箇所を引用します
このトピックを知らない人が2の引用だけを読んで、評価する点を推論できるかを考えます
- 結論（推論できる／できない）の前に理由も述べます（考える時間を与えていますね）
3の回答がyesなら"yes"と、そうでないなら"no"と出力します

評価点すべてについて上の1〜4を繰り返した後、"yes"の数を数えてJSON形式で返します。
今回は評価点が2点なので、2/2が最良ですね²

ドキュメントからは「Open in Playground」のリンクがあり、簡単に結果を確認できます。

先日リリースされた「gpt-4-0125-preview」を指定しました³。
簡単に試せるだけでなく、temperatureを0に指定できる点がよいですね⁴

(a)を渡した場合の出力の一部です

1. Neil Armstrong was the first person to walk on the moon.
- Citation: "Neil Armstrong is famous for being the first human to set foot on the Moon."
- Explanation: The citation directly states that Neil Armstrong was the first human to walk on the moon. Someone reading this sentence would not need any additional information to understand that Neil Armstrong was the first person to do so.
- Answer: yes

引用は直接述べている（The citation directly states）と言っていますね。

(c)の場合です

1. Neil Armstrong was the first person to walk on the moon.
- Citation: "Armstrong took a step, history unfurled."
- Explanation: The citation implies that Armstrong's action of taking a step was historically significant, suggesting it was a first or unique event. However, without prior knowledge, one might not directly infer that he was the first person to walk on the moon, only that his action was important.
- Answer: no

重要なことをしたと言っているけれど、前提知識がないと月を歩いたかは分からないので、no評価。
かしこいな〜

パターン2：質問へのLLMの回答について、理想の回答と一緒にGPTに与え、GPTは重なり具合や矛盾の有無を分析する

GPTによる評価にはバリエーションがあり、別のパターンも紹介されています。

このパターンでは質問への回答（LLMによる出力を想定）がいくつかあります。
GPTに質問への回答それぞれを理想回答と評価させます。

質問：
What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time.

ニール・アームストロングが有名な出来事とそれが起きた日（UTC）を問う質問です。

理想回答はこちら（Expert Answerと書かれています）

Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969.

質問への回答3つ

(d)

Didn't he walk on the moon or something?

(e)

On the 21st of July 1969, Neil Armstrong became the second person to walk on the moon, following after Buzz Aldrin.

(f)

At approximately 02:56 UTC on July 21st 1969, Neil Armstrong became the first human to set foot on the lunar surface, marking a monumental achievement in human history.

userプロンプトは以下の形式です（デリミタを使って構造化した明確なプロンプトですね）

Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""

Submitted Answer: """<ここに(d)または(e)または(f)が来ます（userプロンプトは全部で3つ）>"""

Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""

systemプロンプトは以下です

Use the following steps to respond to user inputs. Fully restate each step before proceeding. i.e. "Step 1: Reason...".

Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).

Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.

Step 3: Output a JSON object structured like: {"type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping", "contradiction": true or false}

3つのサブタスクを与えられています。
またstep-by-stepでと考える時間も与えられていますね。

expert answerと比較してsubmitted answerの情報は、disjoint（支離滅裂）、equal（等しい）、subset（部分集合）、superset（上位の集合）、overlapping（一部が重なる）のいずれか判定
submitted answerはexpert answerに対して矛盾するか判定
1と2を踏まえ、JSON形式で出力

3の出力イメージです

{"type_of_overlap": "overlapping", "contradiction": false}

なるほど、正解との比較も評価する手順を細かく与えたら成り立つ（とされている）のですね

終わりに

OpenAIのPrompt engineeringのドキュメントから、Strategy「Test changes systematically」とそのTactic「Evaluate model outputs with reference to gold-standard answers」のプロンプト例を読み解きました。
プロンプトが長くて後回しにしていましたが、モデルによる評価の例として学びがありました。
今回は知識として仕入れただけなので、これらのパターンでうまくいくのか、また、GPT並みにかしこいと言われるオープンソースLLMも評価者の候補に入れて、手元でも試してみたいな〜

今回対象外とした部分、DeepLearning.AIの講座と重なる所も多いです ↩
(a)が2/2、(b)は1/2、(c)は0/2となる例です↩
操作ミスでGPT-3.5で実行してしまったのですが、(c)をyes 2/2と評価したので評価に使うならGPT-4の方がいいのではと思いました↩
temperatureについて ↩

nikkie-ftnextの日記

イベントレポートや読書メモを発信