Skip to content

Commit

Permalink
Support page label for PDF (#3346)
Browse files Browse the repository at this point in the history
* wip

* wip

* wip

* wip

* wip

* cr

* wip

* wip

* wip

* docs

* wip

* wip

* wip

* fix merge

* wip
  • Loading branch information
Disiok authored May 15, 2023
1 parent d0da47e commit 3baffc7
Show file tree
Hide file tree
Showing 3 changed files with 226 additions and 7 deletions.
214 changes: 214 additions & 0 deletions docs/examples/citation/pdf_page_reference.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,214 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "ffa328f9",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/suo/miniconda3/envs/llama/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"source": [
"from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex, download_loader, GPTRAKEKeywordTableIndex"
]
},
{
"cell_type": "markdown",
"id": "54fa1547-3d32-46ea-9d75-6b167c1e97c3",
"metadata": {},
"source": [
"Set service context to enable streaming"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3a003a97-874c-4807-b017-37270ea7a682",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from llama_index import LLMPredictor, ServiceContext\n",
"from langchain import OpenAI\n",
"\n",
"service_context = ServiceContext.from_defaults(\n",
" llm_predictor=LLMPredictor(\n",
" llm=OpenAI(temperature=0, model_name=\"text-davinci-003\", streaming=True)\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "479d39ec-3def-438c-ae63-7fb7062ef74b",
"metadata": {
"tags": []
},
"source": [
"Load document and build index"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0ab64036-518b-488a-9bc5-888534c4bf50",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"reader = SimpleDirectoryReader(input_files=['../data/10k/lyft_2021.pdf'])\n",
"data = reader.load_data()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "aa856570-ae8a-4fe6-8f7c-59a698308fdb",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"index = GPTVectorStoreIndex.from_documents(data, service_context=service_context)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "40133993-2413-444c-a2a3-796cf9e263f8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"query_engine = index.as_query_engine(\n",
" streaming=True, \n",
" similarity_top_k=3\n",
")"
]
},
{
"cell_type": "markdown",
"id": "11a91d5a-470a-4896-8a10-732f1e29643f",
"metadata": {},
"source": [
"Stream response with page citation"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a2d96a34-481c-4330-9618-8bae1aa087d6",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"• Decreased demand for our platform leading to decreased revenues and decreased earning opportunities for drivers on our platform (Page 6)\n",
"• Establishing new health and safety requirements for ridesharing and updating workplace policies (Page 6)\n",
"• Cost-cutting measures, including lay-offs, furloughs and salary reductions (Page 18)\n",
"• Delays or prevention of testing, developing or deploying autonomous vehicle-related technology (Page 18)\n",
"• Reduced consumer demand for autonomous vehicle travel resulting from an overall reduced demand for travel (Page 18)\n",
"• Impacts to the supply chains of our current or prospective partners and suppliers (Page 18)\n",
"• Economic impacts limiting our or our current or prospective partners’ or suppliers’ ability to expend resources on developing and deploying autonomous vehicle-related technology (Page 18)\n",
"• Decreased morale, culture and ability to attract and retain employees (Page 18)\n",
"• Reduced demand for services on our platform or greater operating expenses (Page 18)\n",
"• Decreased revenues and earnings (Page 18)"
]
}
],
"source": [
"response = query_engine.query(\"What was the impact of COVID? Show statements in bullet form and show page reference after each statement.\")\n",
"response.print_response_stream()"
]
},
{
"cell_type": "markdown",
"id": "28251b19-ad4a-48e5-be85-2ec60b9e37e3",
"metadata": {},
"source": [
"Inspect source nodes"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ea3a9505-23e7-482f-a516-87c3ccd3da70",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-----\n",
"Text:\t Impact of COVID-19 to our BusinessThe ongoing COVID-19 pandemic continues to impact communities in the United States, Canada and globally. Since the pandemic began in March 2020,governments and private businesses - at the recommendation of public health officials - have enacted precautions to mitigate the spread of the virus, including travelrestrictions and social distancing measures in many regions of the United States and Canada, and many enterprises have instituted and maintained work from homeprograms and limited the number of employees on site. Beginning in the middle of March 2020, the pandemic and these related responses caused decreased demand for ourplatform leading to decreased revenues as well as decreased earning opportunities for drivers on our platform. Our business continues to be impacted by the COVID-19pandemic. Although we have seen some signs of demand improving, particularly compared to the dema ...\n",
"Metadata:\t {'page_label': '6'}\n",
"Score:\t 0.823\n",
"-----\n",
"Text:\t storing unrented and returned vehicles. These impacts to the demand for and operations of the different rental programs have and may continue to adversely affectour business, financial condi tion and results of operation.• The COVID-19 pandemic may delay or prevent us, or our current or prospective partners and suppliers, from being able to test, develop or deploy autonomousvehicle-related technology, including through direct impacts of the COVID-19 virus on employee and contractor health; reduced consumer demand forautonomous vehicle travel resulting from an overall reduced demand for travel; shelter-in-place orders by local, state or federal governments negatively impactingoperations, including our ability to test autonomous vehicle-related technology; impacts to the supply chains of our current or prospective partners and suppliers;or economic impacts limiting our or our current or prospective partners’ or suppliers’ ability to expend resources o ...\n",
"Metadata:\t {'page_label': '18'}\n",
"Score:\t 0.811\n",
"-----\n",
"Text:\t and unpredictable effects of COVID-19, we are not currently in a position to forecast the expected impact of COVID-19 on our financialand operating results. Our business could be adversely affected by natural d isasters, public health crises, political crises, economic downturns or other unexpected events.A significant natural disaster, such as an earthquake, fire, hurricane, tornado, flood or significant power outage, could disrupt our operations, mobile networks, theInternet or the operations of our third-party technology providers. In particular, our corporate headquarters are located in the San Francisco Bay Area, a region known forseismic activity and increasingly for fires. The impact of climate change may increase these risks. In addition, any public health crises, such as the COVID-19 pandemic,other epidemics, political crises, such as terrorist attacks, war and other political or social instability and other geopolitical developments, or other catastrophic events,whether ...\n",
"Metadata:\t {'page_label': '18'}\n",
"Score:\t 0.806\n"
]
}
],
"source": [
"for node in response.source_nodes:\n",
" print('-----')\n",
" text_fmt = node.node.text.strip().replace('\\n', ' ')[:1000]\n",
" print(f\"Text:\\t {text_fmt} ...\")\n",
" print(f'Metadata:\\t {node.node.extra_info}')\n",
" print(f'Score:\\t {node.score:.3f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "673aa070-cefe-4e1b-bc16-e2a64bbb5f67",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Binary file added docs/examples/data/10k/lyft_2021.pdf
Binary file not shown.
19 changes: 12 additions & 7 deletions llama_index/readers/file/docs_reader.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,26 +18,31 @@ def load_data(
) -> List[Document]:
"""Parse file."""
try:
import PyPDF2
import pypdf
except ImportError:
raise ImportError(
"PyPDF2 is required to read PDF files: `pip install PyPDF2`"
"pypdf is required to read PDF files: `pip install pypdf`"
)
text_list = []
with open(file, "rb") as fp:
# Create a PDF object
pdf = PyPDF2.PdfReader(fp)
pdf = pypdf.PdfReader(fp)

# Get the number of pages in the PDF document
num_pages = len(pdf.pages)

# Iterate over every page
docs = []
for page in range(num_pages):
# Extract the text from the page
page_text = pdf.pages[page].extract_text()
text_list.append(page_text)
text = "\n".join(text_list)
return [Document(text, extra_info=extra_info)]
page_label = pdf.page_labels[page]

metadata = {"page_label": page_label}
if extra_info is not None:
metadata.update(extra_info)

docs.append(Document(page_text, extra_info=metadata))
return docs


class DocxReader(BaseReader):
Expand Down

1 comment on commit 3baffc7

@sandorkonya
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Disiok this is a nice addition! It makes it possible to show the corresponding page of the pdf in an interactive way (like on a canvas) if we wanted to.

Please sign in to comment.