enhance: add document summarizer example (#107)

g-linville · web-flow · commit 9b50fe1ae0bd · 2024-03-05T08:29:01.000-07:00
* enhance: add document summarizer example

Signed-off-by: Grant Linville <grant@acorn.io>
diff --git a/docs/README-USECASES.md b/docs/README-USECASES.md
@@ -83,7 +83,7 @@ Depending on the context window supported by the LLM, you can either send a larg
 
 ### Summarization
 
-Here is a GPTScript that sends a large document in batches to the LLM and produces a summary of the entire document. [Link to example here]
+Here is a GPTScript that sends a large document in batches to the LLM and produces a summary of the entire document. [hamlet-summarizer](../examples/hamlet-summarizer)
 
 Here is a GPTScript that reads the content of a large SQL database and produces a summary of the entire database. [Link to example here]
 
diff --git a/examples/hamlet-summarizer/.gitignore b/examples/hamlet-summarizer/.gitignore
@@ -0,0 +1 @@
+venv/
diff --git a/examples/hamlet-summarizer/Hamlet.pdf b/examples/hamlet-summarizer/Hamlet.pdf
diff --git a/examples/hamlet-summarizer/README.md b/examples/hamlet-summarizer/README.md
@@ -0,0 +1,40 @@
+# Hamlet Summarizer
+
+This is an example tool that summarizes the contents of a large document in chunks.
+
+The example document we are using is the Shakespeare play Hamlet. It is about 51000 tokens
+(according to OpenAI's tokenizer for GPT-4), so it can fit within GPT-4's context window,
+but this serves as an example of how larger documents can be split up and summarized.
+This example splits it into chunks of 10000 tokens.
+
+Hamlet PDF is from https://nosweatshakespeare.com/hamlet-play/pdf/.
+
+## Design
+
+The script consists of three tools: a top-level tool that orchestrates everything, a summarizer that
+summarizes one chunk of text at a time, and a Python script that ingests the PDF, splits it into
+chunks, and returns the chunk at a given index.
+
+The summarizer tool looks at the entire summary up to the current chunk and then summarizes the current
+chunk and adds it onto the end. In the case of models with very small context windows, or extremely large
+documents, this approach may still exceed the context window, in which case another tool could be added to
+only give the summarizer the previous few chunk summaries instead of all of them.
+
+## Run the Example
+
+```bash
+# Create a Python venv
+python3 -m venv venv
+
+# Source it
+source venv/bin/activate
+
+# Install the packages
+pip install -r requirements.txt
+
+# Set your OpenAI key
+export OPENAI_API_KEY=your-api-key
+
+# Run the example
+gptscript --cache=false hamlet-summarizer.gpt
+```
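The Design section of the README notes that another tool could give the summarizer only the previous few chunk summaries instead of all of them. A minimal Python sketch of that idea follows. It is not part of this commit: `recent_summaries` and `keep` are hypothetical names, and the sketch assumes that summaries in summary.txt are separated by blank lines, which the summarizer prompt asks for but the model may not always honor. A summary that contains its own blank lines would be split into several entries.

```python
def recent_summaries(summary_text: str, keep: int = 3) -> str:
    """Return only the last `keep` entries of the running summary.

    Hypothetical helper, not part of the example. Assumes entries are
    separated by blank lines (two newlines), as the prompt requests.
    """
    if keep <= 0:
        return ""
    # Split on blank lines and drop empty fragments (e.g. the trailing one).
    parts = [p.strip() for p in summary_text.split("\n\n") if p.strip()]
    # Keep the tail so the summarizer sees only the most recent context.
    return "\n\n".join(parts[-keep:])


if __name__ == "__main__":
    running = "Act one summary.\n\nAct two summary.\n\nAct three summary.\n\n"
    print(recent_summaries(running, keep=2))
```

A tool built on this could be given to the summarizer in place of reading summary.txt in full, which bounds the prompt size no matter how long the document is.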
diff --git a/examples/hamlet-summarizer/hamlet-summarizer.gpt b/examples/hamlet-summarizer/hamlet-summarizer.gpt
@@ -0,0 +1,35 @@
+tools: hamlet-summarizer, sys.read, sys.write
+
+First, create the file "summary.txt" if it does not already exist.
+
+You are a program that is tasked with fetching partial summaries of a play called Hamlet.
+
+Call the hamlet-summarizer tool to get each part of the summary. Begin with index 0. Do not proceed
+until the tool has responded to you.
+
+Once you get "No more content" from the hamlet-summarizer, stop calling it.
+Then, print the contents of the summary.txt file.
+
+---
+name: hamlet-summarizer
+tools: hamlet-retriever, sys.read, sys.append
+description: Summarizes a part of the text of Hamlet. Returns "No more content" if the index is greater than or equal to the number of parts.
+args: index: (unsigned int) the index of the portion to summarize, beginning at 0
+
+You are a theater expert, and you're tasked with summarizing part of Hamlet.
+Get the part of Hamlet at index $index.
+Read the existing summary of Hamlet up to this point in summary.txt.
+
+Summarize the part at index $index. Include as many details as possible. Do not leave out any important plot points.
+Do not introduce the summary with "In this part of Hamlet", "In this segment", or any similar language.
+If a new character is introduced, be sure to explain who they are.
+Add two newlines to the end of your summary and append it to summary.txt.
+
+If you got "No more content" just say "No more content". Otherwise, say "Continue".
+
+---
+name: hamlet-retriever
+description: Returns a part of the text of Hamlet. Returns "No more content" if the index is greater than or equal to the number of parts.
+args: index: (unsigned int) the index of the part to return, beginning at 0
+
+#!python3 main.py "$index"
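The control flow that the top-level prompt asks the LLM to follow (start at index 0, call the summarizer, stop on "No more content") can be sketched as an ordinary loop. This is only an illustration of the intended behavior, not how GPTScript executes the prompt. `summarize_part` is a hypothetical callable standing in for the hamlet-summarizer tool, and the `max_parts` safety cap is an assumption that does not exist in the script.

```python
from typing import Callable

SENTINEL = "No more content"


def run_summary_loop(summarize_part: Callable[[int], str],
                     max_parts: int = 1000) -> int:
    """Call `summarize_part` with 0, 1, 2, ... until it returns the sentinel.

    Returns the number of parts that were summarized. `max_parts` is a
    hypothetical guard against a tool that never returns the sentinel.
    """
    index = 0
    while index < max_parts:
        # Wait for each response before moving on, as the prompt requires.
        if summarize_part(index) == SENTINEL:
            break
        index += 1
    return index


if __name__ == "__main__":
    # Stand-in summarizer: pretends the document has three parts.
    def fake_summarizer(i: int) -> str:
        return SENTINEL if i >= 3 else "Continue"

    print(run_summary_loop(fake_summarizer))
```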
diff --git a/examples/hamlet-summarizer/main.py b/examples/hamlet-summarizer/main.py
@@ -0,0 +1,24 @@
+import tiktoken
+import sys
+from llama_index.readers.file import PyMuPDFReader
+from llama_index.core.node_parser import TokenTextSplitter
+
+index = int(sys.argv[1])
+docs = PyMuPDFReader().load("Hamlet.pdf")
+
+combined = ""
+for doc in docs:
+    combined += doc.text
+
+splitter = TokenTextSplitter(
+    chunk_size=10000,
+    chunk_overlap=10,
+    tokenizer=tiktoken.encoding_for_model("gpt-4").encode)
+
+pieces = splitter.split_text(combined)
+
+if index < 0 or index >= len(pieces):
+    print("No more content")
+    sys.exit(0)
+
+print(pieces[index])
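To make the `chunk_size` and `chunk_overlap` parameters in main.py concrete, here is a minimal standard-library sketch of overlapping windows over a token list. It is an illustration of the concept only: `split_tokens` and `get_piece` are hypothetical names, the tokens are plain integers rather than tiktoken output, and the real `TokenTextSplitter` may behave differently, for example around word boundaries.

```python
from typing import List, Sequence, Union


def split_tokens(tokens: Sequence[int], chunk_size: int,
                 chunk_overlap: int) -> List[List[int]]:
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


def get_piece(pieces: Sequence[List[int]], index: int) -> Union[List[int], str]:
    """Mirror the retriever's contract: a piece, or the sentinel when out of range."""
    if index < 0 or index >= len(pieces):
        return "No more content"
    return pieces[index]


if __name__ == "__main__":
    pieces = split_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
    print(pieces)                 # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
    print(get_piece(pieces, 3))   # No more content
```

With 10,000-token chunks and an overlap of 10, as in main.py, consecutive chunks share only a small seam, so the running summary in summary.txt carries nearly all of the cross-chunk context.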
diff --git a/examples/hamlet-summarizer/requirements.txt b/examples/hamlet-summarizer/requirements.txt
@@ -0,0 +1,3 @@
+tiktoken==0.6.0
+llama-index-core==0.10.14
+llama-index-readers-file==0.1.6
