Commit 9b50fe1

enhance: add document summarizer example (#107)
* enhance: add document summarizer example Signed-off-by: Grant Linville <[email protected]>
1 parent f039adc commit 9b50fe1

7 files changed, with 104 additions and 1 deletion

docs/README-USECASES.md

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ Depending on the context window supported by the LLM, you can either send a larg

 ### Summarization

-Here is a GPTScript that sends a large document in batches to the LLM and produces a summary of the entire document. [Link to example here]
+Here is a GPTScript that sends a large document in batches to the LLM and produces a summary of the entire document. [hamlet-summarizer](../examples/hamlet-summarizer)

 Here is a GPTScript that reads the content of a large SQL database and produces a summary of the entire database. [Link to example here]

examples/hamlet-summarizer/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
venv/

examples/hamlet-summarizer/Hamlet.pdf

511 KB
Binary file not shown.

examples/hamlet-summarizer/README.md

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
# Hamlet Summarizer

This is an example tool that summarizes the contents of a large document in chunks.

The example document we are using is the Shakespeare play Hamlet. It is about 51000 tokens
(according to OpenAI's tokenizer for GPT-4), so it can fit within GPT-4's context window,
but this serves as an example of how larger documents can be split up and summarized.
This example splits it into chunks of 10000 tokens.

Hamlet PDF is from https://nosweatshakespeare.com/hamlet-play/pdf/.

## Design

The script consists of three tools: a top-level tool that orchestrates everything, a summarizer that
will summarize one chunk of text at a time, and a Python script that ingests the PDF and splits it into
chunks and provides a specific chunk based on an index.

The summarizer tool looks at the entire summary up to the current chunk and then summarizes the current
chunk and adds it onto the end. In the case of models with very small context windows, or extremely large
documents, this approach may still exceed the context window, in which case another tool could be added to
only give the summarizer the previous few chunk summaries instead of all of them.
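
A minimal sketch of what such an extra tool might look like, assuming each chunk summary in summary.txt is separated from the next by a blank line (the summarizer is instructed to append two newlines after each one). The function name `recent_summaries` and the default of three summaries are hypothetical and are not part of this example; a summary that itself contains blank lines would be split incorrectly by this approach.

```python
def recent_summaries(summary_text: str, keep: int = 3) -> str:
    """Return only the last `keep` chunk summaries from the running summary."""
    if keep <= 0:
        return ""
    # Assumption: a blank line marks the boundary between two chunk summaries.
    parts = [p.strip() for p in summary_text.split("\n\n") if p.strip()]
    # Slicing from the end keeps the most recent summaries in their original order.
    return "\n\n".join(parts[-keep:])
```

A tool built on this would read summary.txt, pass its contents to `recent_summaries`, and hand the result to the summarizer in place of the full file.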

## Run the Example

```bash
# Create a Python venv
python3 -m venv venv

# Source it
source venv/bin/activate

# Install the packages
pip install -r requirements.txt

# Set your OpenAI key
export OPENAI_API_KEY=your-api-key

# Run the example
gptscript --cache=false hamlet-summarizer.gpt
```

examples/hamlet-summarizer/hamlet-summarizer.gpt

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
tools: hamlet-summarizer, sys.read, sys.write

First, create the file "summary.txt" if it does not already exist.

You are a program that is tasked with fetching partial summaries of a play called Hamlet.

Call the hamlet-summarizer tool to get each part of the summary. Begin with index 0. Do not proceed
until the tool has responded to you.

Once you get "No more content" from the hamlet-summarizer, stop calling it.
Then, print the contents of the summary.txt file.

---
name: hamlet-summarizer
tools: hamlet-retriever, sys.read, sys.append
description: Summarizes a part of the text of Hamlet. Returns "No more content" if the index is greater than the number of parts.
args: index: (unsigned int) the index of the portion to summarize, beginning at 0

You are a theater expert, and you're tasked with summarizing part of Hamlet.
Get the part of Hamlet at index $index.
Read the existing summary of Hamlet up to this point in summary.txt.

Summarize the part at index $index. Include as many details as possible. Do not leave out any important plot points.
Do not introduce the summary with "In this part of Hamlet", "In this segment", or any similar language.
If a new character is introduced, be sure to explain who they are.
Add two newlines to the end of your summary and append it to summary.txt.

If you got "No more content" just say "No more content". Otherwise, say "Continue".

---
name: hamlet-retriever
description: Returns a part of the text of Hamlet. Returns "No more content" if the index is greater than the number of parts.
args: index: (unsigned int) the index of the part to return, beginning at 0

#!python3 main.py "$index"
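
For readers new to GPTScript, the control flow that these three tools describe can be approximated in plain Python. This is only a sketch: in the actual script the LLM decides when to call each tool, and `run_summarizer`, `make_retriever`, `get_part`, and `summarize_part` are hypothetical stand-ins for the orchestrator, main.py, the hamlet-retriever tool, and the hamlet-summarizer prompt respectively.

```python
from typing import Callable, List

NO_MORE = "No more content"


def run_summarizer(get_part: Callable[[int], str],
                   summarize_part: Callable[[str, str], str]) -> str:
    """Summarize parts 0, 1, 2, ... until the retriever reports the end."""
    summary = ""  # plays the role of summary.txt
    index = 0
    while True:
        part = get_part(index)
        if part == NO_MORE:
            break
        # The summarizer sees the summary so far plus the current part;
        # its output is appended, followed by two newlines.
        summary += summarize_part(summary, part) + "\n\n"
        index += 1
    return summary


def make_retriever(parts: List[str]) -> Callable[[int], str]:
    """Stand-in for main.py: return a part by index, or the end marker."""
    def get_part(index: int) -> str:
        return parts[index] if 0 <= index < len(parts) else NO_MORE
    return get_part
```

For example, `run_summarizer(make_retriever(["act one", "act two"]), lambda so_far, part: part.upper())` walks both parts in order and stops at index 2, when the retriever returns the end marker.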

examples/hamlet-summarizer/main.py

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
import tiktoken
import sys
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import TokenTextSplitter

index = int(sys.argv[1])
docs = PyMuPDFReader().load("Hamlet.pdf")

combined = ""
for doc in docs:
    combined += doc.text

splitter = TokenTextSplitter(
    chunk_size=10000,
    chunk_overlap=10,
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode)

pieces = splitter.split_text(combined)

if index >= len(pieces):
    print("No more content")
    sys.exit(0)

print(pieces[index])
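
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified model of token chunking over a plain list of tokens. It is a sketch under stated assumptions and not how `TokenTextSplitter` is implemented (the real splitter also tries to break on separators, so its chunk boundaries and count may differ). `chunk_tokens` is a hypothetical helper.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int,
                 chunk_overlap: int) -> List[Sequence[int]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    # Each new chunk starts this many tokens after the previous one.
    stride = chunk_size - chunk_overlap
    chunks: List[Sequence[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += stride
    return chunks
```

Under this model, a document of 51000 tokens with `chunk_size=10000` and `chunk_overlap=10` yields six chunks, so the orchestrator would call the summarizer for indices 0 through 5 and receive "No more content" at index 6.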

examples/hamlet-summarizer/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
tiktoken==0.6.0
llama-index-core==0.10.14
llama-index-readers-file==0.1.6
