Unstructured can read both web URLs and local PDFs! Better yet, all the user has to do is call the partition function, which is really easy! (Using an arXiv paper as the example)

Introduction

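The posts under #ラブライバーに見て欲しいアイマス公式絵 had me in tears 😭😭
This is nikkie, still in a dreamy afterglow from 異次元フェス.

This post is a first practice swing with an interesting library I recently learned about.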

Unstructured

We get your data LLM-ready

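Translated loosely, this means something like "getting your data ready for use with LLMs."
The selling point seems to be that Unstructured can handle data from any kind of source.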

80% of enterprise data exists in difficult-to-use formats like HTML, PDF, CSV, PNG, PPTX, and more.
Unstructured effortlessly extracts and transforms complex data for use with every major vector database and LLM framework.

A hands-on trial article by npaka

LangChain uses it¹

LangChain provides UnstructuredURLLoader, which fetches the content of a page when you pass it a URL².
https://python.langchain.com/docs/integrations/document_loaders/url

The implementation is around here. It calls Unstructured's partition function.
https://github.com/langchain-ai/langchain/blob/v0.0.350/libs/community/langchain_community/document_loaders/url.py#L129-L142

partition

Unstructured seems to be built around several concepts (see Core Functionality).
One of them is Partitioning.
https://unstructured-io.github.io/unstructured/core/partition.html

Partitioning functions in unstructured allow users to extract structured content from a raw unstructured document.

In other words, the Partitioning functions let you extract structured content from a raw, unstructured document.

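It appears to handle a wide variety of data sources, but this time I try it with a web URL and a PDF.
I read the following paper³ in two ways: from the web, and from a downloaded PDF.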

Environment

Python 3.11.4.
To handle PDFs, I ran pip install 'unstructured[pdf]'⁴.

Library versions

antlr4-python3-runtime==4.9.3
backoff==2.2.1
beautifulsoup4==4.12.2
certifi==2023.11.17
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
coloredlogs==15.0.1
contourpy==1.2.0
cryptography==41.0.7
cycler==0.12.1
dataclasses-json==0.6.3
Deprecated==1.2.14
effdet==0.4.1
emoji==2.9.0
filelock==3.13.1
filetype==1.2.0
flatbuffers==23.5.26
fonttools==4.46.0
fsspec==2023.12.2
huggingface-hub==0.19.4
humanfriendly==10.0
idna==3.6
iopath==0.1.10
Jinja2==3.1.2
joblib==1.3.2
kiwisolver==1.4.5
langdetect==1.0.9
layoutparser==0.3.4
lxml==4.9.3
MarkupSafe==2.1.3
marshmallow==3.20.1
matplotlib==3.8.2
mpmath==1.3.0
mypy-extensions==1.0.0
networkx==3.2.1
nltk==3.8.1
numpy==1.26.2
omegaconf==2.3.0
onnx==1.15.0
onnxruntime==1.15.1
opencv-python==4.8.1.78
packaging==23.2
pandas==2.1.4
pdf2image==1.16.3
pdfminer.six==20221105
pdfplumber==0.10.3
pikepdf==8.9.0
Pillow==10.1.0
portalocker==2.8.2
protobuf==4.25.1
pycocotools==2.0.7
pycparser==2.21
pyparsing==3.1.1
pypdf==3.17.2
pypdfium2==4.25.0
pytesseract==0.3.10
python-dateutil==2.8.2
python-iso639==2023.12.11
python-magic==0.4.27
python-multipart==0.0.6
pytz==2023.3.post1
PyYAML==6.0.1
rapidfuzz==3.5.2
regex==2023.10.3
requests==2.31.0
safetensors==0.4.1
scipy==1.11.4
six==1.16.0
soupsieve==2.5
sympy==1.12
tabulate==0.9.0
timm==0.9.12
tokenizers==0.15.0
torch==2.1.1
torchvision==0.16.1
tqdm==4.66.1
transformers==4.36.0
typing-inspect==0.9.0
typing_extensions==4.9.0
tzdata==2023.3
unstructured==0.11.2
unstructured-inference==0.7.15
unstructured.pytesseract==0.3.12
urllib3==2.1.0
wrapt==1.16.0

From a web URL

https://unstructured-io.github.io/unstructured/core/partition.html#partition-html

Besides local HTML files, partition_html also accepts a URL (this is what LangChain's UnstructuredURLLoader uses).
This time I pass https://www.arxiv-vanity.com/papers/2305.14283/, the paper as rendered by arXiv Vanity⁵.

from unstructured.partition.html import partition_html

html_elements = partition_html(url="https://www.arxiv-vanity.com/papers/2305.14283/")

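That is all it takes to fetch the paper.
html_elements is a list made up of the parts that Unstructured has structured.
The Unstructured documentation includes an example that joins them into a single string like this⁶.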

"\n\n".join(str(el) for el in html_elements)

From a local PDF

https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf

Here I pass the PDF file downloaded from arXiv.

from unstructured.partition.pdf import partition_pdf

pdf_elements = partition_pdf("2305.14283.pdf")

Just as with the web page, the content is extracted!
It may be working this smoothly because the paper is in English.
I have not tested Japanese PDFs.
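
Since both calls return a list of elements, one way to compare what the HTML route and the PDF route picked up is to count the elements by class and to join them into text. The following is only a minimal sketch: count_by_type and to_text are hypothetical helper names of my own, and they rely on nothing beyond type(el).__name__ and str(el), the same str(el) that the documentation's join example uses.

```python
from collections import Counter


def count_by_type(elements):
    """Count elements by their class name (for example Title or NarrativeText)."""
    return Counter(type(el).__name__ for el in elements)


def to_text(elements, sep="\n\n"):
    """Join the string form of every element into one string."""
    return sep.join(str(el) for el in elements)
```

With the variables from this post, count_by_type(html_elements) and count_by_type(pdf_elements) could be printed side by side to get a rough feel for how differently the two sources were split up.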

The `partition` facade

Unstructured also has a user-friendly partition function.
https://unstructured-io.github.io/unstructured/core/partition.html#partition
You don't have to pick the right partition_hoge function yourself! (Unstructured does the dispatching for you.)

from unstructured.partition.auto import partition

html_elements = partition(url="https://www.arxiv-vanity.com/papers/2305.14283/", content_type="text/html")
pdf_elements = partition("2305.14283.pdf")

⚠️ When going through partition to reach partition_html with a URL, I currently had to specify content_type.

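I think the partition function is an instance of the Facade design pattern (I just read about that in the ちょうぜつ book!).
The user only specifies a file and calls partition; the facade dispatches to the right function.
That is very easy to use.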
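
To make the facade idea concrete, here is a minimal sketch in plain Python. It is not how Unstructured implements partition: read_html, read_pdf, and read_any are hypothetical stand-ins that I made up, and the dispatch simply looks at the file extension.

```python
from pathlib import Path
from typing import Callable


# Hypothetical format-specific readers (stand-ins, not Unstructured functions).
def read_html(filename: str) -> list[str]:
    return [f"html elements from {filename}"]


def read_pdf(filename: str) -> list[str]:
    return [f"pdf elements from {filename}"]


# The facade keeps the mapping from file type to reader in one place.
_READERS: dict[str, Callable[[str], list[str]]] = {
    ".html": read_html,
    ".pdf": read_pdf,
}


def read_any(filename: str) -> list[str]:
    """Facade: the caller passes a file name and the right reader is chosen."""
    suffix = Path(filename).suffix.lower()
    try:
        reader = _READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return reader(filename)
```

The caller only ever sees read_any, which is the property that makes partition pleasant to use: adding a new format means registering one more reader, without changing any calling code.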

Even better, Unstructured also exposes the per-format partition functions, so you can combine them with the facade (simple and easy!).
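
One way such a combination might look is sketched below, assuming the source is either an http(s) URL of an HTML page or a local file path. is_web_url and load_elements are hypothetical helper names; the Unstructured calls are the same ones used earlier in this post (partition_html with url=, and partition with a file name).

```python
from urllib.parse import urlparse


def is_web_url(source: str) -> bool:
    """Return True when source looks like an http(s) URL."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def load_elements(source: str):
    """Use partition_html for URLs and the partition facade for local files."""
    # Imported lazily so that is_web_url works even without Unstructured installed.
    from unstructured.partition.auto import partition
    from unstructured.partition.html import partition_html

    if is_web_url(source):
        # Assumption: the URL points at an HTML page.
        return partition_html(url=source)
    return partition(filename=source)
```

Calling load_elements("https://www.arxiv-vanity.com/papers/2305.14283/") or load_elements("2305.14283.pdf") would then cover both cases from this post without passing content_type by hand.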

Conclusion

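This post looked at partition in the Unstructured library.
For the web page and the PDF I tried, the partition function read the data in with very little code!

I was impressed by how easily I could get the content of an arXiv paper.
This looks like it will make a lot of things go faster.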

¹ I had come across it before in this article
² Things change quickly, so here is the original file as well: https://github.com/langchain-ai/langchain/blob/v0.0.350/docs/docs/integrations/document_loaders/url.ipynb
³ The Rewrite-Retrieve-Read paper
⁴ ref: https://github.com/Unstructured-IO/unstructured/blob/0.11.2/setup.py#L122
⁵ How I learned about it is described here
⁶ From the code below the table at https://unstructured-io.github.io/unstructured/core/partition.html#partitioning

nikkie-ftnextの日記
