LLMへの攻撃についてのサーベイ論文「Breaking Down the Defenses」で知った Prompt Injection の論文メモ

はじめに

ふふっ。Python界の田中琴葉、nikkieです。

LLMへの攻撃の1つ、プロンプトインジェクションについて、サーベイ論文から代表的な論文をいくつか知りました。
論文を読んでいる中での中間アウトプットです。

Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models

LLMへの攻撃についてのサーベイ論文です。
v2で読んでいます。
サーベイ論文ですが、ページ数は少なく通読しやすく感じます¹

This paper presents a comprehensive survey of the various forms of attacks targeting LLMs (略) （Abstractより）

LLMへの攻撃のTaxonomy（分類学）がこちら（Figure 1）

「Technique to Attack」に3つあるうち、Prompt Injection（3.2）を見ていきます。

Prompt Injection

This section outlines attacker strategies to manipulate LLM behavior using carefully designed malicious prompts (3.2 p.3)

7つの主要な分野に研究を整理するとしています。
先述のTaxonomyにしたがって、4つの分類で見ていきます。

Objective Manipulation

PromptInject

Indirect Prompt Injection

Propane（※Taxonomyは「Objective Manipulation」だが、分野「Prompt Manipulation Frameworks」に記載）
論文自体はv1からタイトルが変わった模様

Prompt Leaking

HouYi

we subsequently formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. (Abstract)

Malicious Content Generation

AutoDAN

AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. (Abstract)

Prompt Packer（※Taxonomyは「Malicious Content Generation」だが、分野「Prompt Manipulation Frameworks」に記載）

In this paper, we introduce an innovative technique for obfuscating harmful instructions: Compositional Instruction Attacks (CIA), which refers to attacking by combination and encapsulation of multiple instructions. (Abstract)

Training Data Manipulation

ProAttack

In this study, we propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt, which uses the prompt itself as a trigger.

このサーベイ論文におけるJailbreak

3.1にあります

This section delves into jailbreak attacks on LLMs, detailing strategies to exploit model vulnerabilities for unauthorized actions, (3.1 p.2)

JailbreakもPrompt Injectionも、より詳しい定義はサーベイ論文中に見つかっていませんが、ここまでに引いた部分を訳すと

Jailbreak: 許可されていない行動のために、LLMの脆弱性につけこむ
Prompt Injection: 悪意のあるプロンプトでLLMの振る舞いを操作

となるかなと思います。
ですが訳してみて、重なる部分があるように思われてきたような。3.1を読んでみるのが宿題です

3.2の中でもAutoDANは元論文ではJailbreak、このサーベイ論文ではPrompt Injectionと扱いが分かれていますね。
概念としても似ているんでしょうか。

Prompt Engineering Guideより、JailbreakとPrompt Injection

この論文以外では、Prompt Engineering Guideも見てみました。

www.promptingguide.ai

プロンプトインジェクションは、行動を変更する巧妙なプロンプトを使用して、モデルの出力を乗っ取ることを目的としています。
(英語版 Prompt injection is a type of LLM vulnerability where a prompt containing a concatenation of trusted prompt and untrusted inputs lead to unexpected behaviors,)

プロンプトリークはプロンプトインジェクションに含まれているという扱いです。

ジェイルブレイクより

一部のモデルは、倫理に反する命令には応答しないが、要求が巧妙に文脈化されている場合は回避できます。
(英語版 Some modern LLMs will avoid responding to unethical instructions provide in a prompt due to the safety policies implemented by the LLM provider. However, it has been shown that it is still possible to bypass those safety policies and guardrails using different jailbreaking techniques.)

終わりに

LLMへの攻撃についてのサーベイ論文から、Prompt Injectionについて代表的な論文を知りました。
Prompt Injectionと呼ばれる攻撃は具体例が話題になることもあり言葉として認識していましたが、「こんなにも研究があったのか！」という感想です。
githubにソースが公開されているものはコードも含めて、引き続きインプットしていきたいと思っています。

LLMに悪いことをしてはいけません。

サーベイと言ったら50ページは超えてくる印象があります↩

nikkie-ftnextの日記

イベントレポートや読書メモを発信