Advanced PDF Document Processing with Llama Index v0.10 and Llama Parse


The recent release of Llama Index v0.10, together with the new Llama Parse offering, brings significant advancements in parsing documents with embedded tables and figures. This post dives into these updates to see whether they live up to their promise.

Overview of Llama Index v0.10 and Llama Parse

First, let's delve into Llama Index v0.10. This release, along with the Llama Cloud platform, marks a step toward making Llama Index a next-generation, production-ready data framework for LLM applications. Key changes include:

  1. Core vs. Hub: The core abstractions now live in the llama-index-core package, while third-party integrations are installed as separate packages and catalogued on Llama Hub.
  2. Service Context Removal: The previously cumbersome ServiceContext object has been deprecated in favor of a single global Settings object (see the sketch after this list).
  3. Growing Integrations: The number of third-party integrations has increased significantly.

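As a minimal sketch of the new Settings-based configuration, using the same models as the code later in this post:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# In v0.10 a single global Settings object replaces the old ServiceContext
Settings.llm = OpenAI(model='gpt-3.5-turbo')
Settings.embed_model = OpenAIEmbedding(model='text-embedding-ada-002')
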
Llama Index remains a data framework focused on context augmentation, that is, injecting retrieved data into the prompt in the LLM's context window. This grounding helps generate more accurate responses and reduces hallucinations.

Understanding Llama Parse

Llama Parse is a proprietary parsing algorithm designed for documents with embedded objects, such as tables and figures. It allows for more complex RAG systems by handling semi-structured data (like tables) and unstructured data (like text).

Testing Llama Parse

To evaluate Llama Parse, two documents were chosen: an Nvidia 10-K filing and a report on AI and the future of teaching and learning. These documents contain a mix of text and complex figures.

Results

  1. Parsing Speed: Speed was inconsistent, especially with the recommended recursive retriever, which sometimes took minutes to process a single request (see the timing sketch after this list).
  2. Tabular Data Extraction: The extraction of tabular data was very good when it worked.
  3. Figure Extraction: No figure extraction is available yet; this feature is still under development.
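
To put a rough number on the speed observation above, you can time a query with the standard library. This sketch assumes the query_engine built in the Code Implementation section below:

import time

start = time.perf_counter()
response = query_engine.query("What is the gross carrying amount of total amortizable intangible assets for January 29, 2023?")
print(f"Answered in {time.perf_counter() - start:.1f} seconds")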

Code Implementation

Setting Up Llama Parse

# Initialize Llama Parse (reads LLAMA_CLOUD_API_KEY from the environment)
parser = LlamaParse(
    result_type='markdown',  # parsed output is returned as markdown
    language='en',
    num_workers=2            # parse multiple files in parallel
)

# PDF files to upload
pdf_files = ['nvidia_10k.pdf', 'ai_report.pdf']

# Upload and parse the documents (parsing happens remotely on Llama Cloud)
documents = parser.load_data(pdf_files)
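
To sanity-check the parse, print the start of the first returned document; embedded tables should come back as markdown tables:

# Each file is returned as one or more Document objects whose text is markdown
print(len(documents), "documents parsed")
print(documents[0].text[:500])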

Creating a Query Engine

# (Imports for these classes appear in the full script below.)

# Set up the LLM and embedding model via the global Settings object
Settings.llm = OpenAI(model='gpt-3.5-turbo')
Settings.embed_model = OpenAIEmbedding(model='text-embedding-ada-002')

# Split the parsed markdown into text nodes and table objects
node_parser = MarkdownElementNodeParser(llm=OpenAI(model='gpt-3.5-turbo'), num_workers=2)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# Create a vector store index over both text nodes and table objects
index = VectorStoreIndex(nodes=base_nodes + objects)

# Rerank retrieved nodes with a BGE cross-encoder before answering
reranker = FlagEmbeddingReranker(model='BAAI/bge-reranker-large', top_n=5)

# The index's query engine performs recursive retrieval over the table objects
query_engine = index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker]
)

Querying the Data

# Query examples
query1 = "Who is the executive VP of operations and how old are they?"
response1 = query_engine.query(query1)
print(response1)

query2 = "What is the gross carrying amount of total amortizable intangible assets for January 29, 2023?"
response2 = query_engine.query(query2)
print(response2)
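
Each response also exposes the chunks it retrieved, which makes it easy to verify that a table (rather than surrounding prose) supplied the answer:

# Inspect which nodes the recursive retriever used for the second query
for node_with_score in response2.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:200])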

Here is a step-by-step guide to set up a project using Llama Parse and Llama Index v0.10, complete with a requirements.txt file and a Python script (main.py). We'll also include explanations for each part of the code.

Step 1: Create requirements.txt

First, create a file named requirements.txt to list the dependencies. Note that Llama Index v0.10 requires the OpenAI 1.x client, and the BGE reranker used below ships as a separate integration package; exact version pins may vary.

llama-index>=0.10.0
llama-parse>=0.1.0
llama-index-postprocessor-flag-embedding-reranker>=0.1.0
FlagEmbedding
openai>=1.1.0

Step 2: Create main.py

Next, create a file named main.py and add the following code. This script will initialize Llama Parse, parse the documents, and create a query engine to retrieve data.

import os
import asyncio

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_parse import LlamaParse

# Set your API keys
os.environ['LLAMA_CLOUD_API_KEY'] = 'your_llama_cloud_api_key_here'
os.environ['OPENAI_API_KEY'] = 'your_openai_api_key_here'

# Function to initialize Llama Parse and parse documents
async def parse_documents():
    # Initialize Llama Parse
    parser = LlamaParse(
        result_type='markdown',  # return parsed output as markdown
        language='en',
        num_workers=2            # parse files in parallel
    )

    # PDF files to upload
    pdf_files = ['nvidia_10k.pdf', 'ai_report.pdf']

    # Upload and parse the documents (aload_data is the async variant of load_data)
    documents = await parser.aload_data(pdf_files)
    return documents

# Function to create a query engine
def create_query_engine(documents):
    # Set up the LLM and embedding model via the global Settings object
    Settings.llm = OpenAI(model='gpt-3.5-turbo')
    Settings.embed_model = OpenAIEmbedding(model='text-embedding-ada-002')

    # Split the parsed markdown into text nodes and table objects
    node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model='gpt-3.5-turbo'),
        num_workers=2
    )
    nodes = node_parser.get_nodes_from_documents(documents)
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

    # Create a vector store index over both text nodes and table objects
    index = VectorStoreIndex(nodes=base_nodes + objects)

    # Rerank retrieved nodes with a BGE cross-encoder before answering
    reranker = FlagEmbeddingReranker(model='BAAI/bge-reranker-large', top_n=5)

    # The query engine performs recursive retrieval over the table objects
    query_engine = index.as_query_engine(
        similarity_top_k=15,
        node_postprocessors=[reranker]
    )

    return query_engine

# Main function to run the process
async def main():
    # Parse documents
    documents = await parse_documents()

    # Create query engine
    query_engine = create_query_engine(documents)

    # Example queries (aquery is the async variant of query)
    query1 = "Who is the executive VP of operations and how old are they?"
    response1 = await query_engine.aquery(query1)
    print("Query 1 Response:", response1)

    query2 = "What is the gross carrying amount of total amortizable intangible assets for January 29, 2023?"
    response2 = await query_engine.aquery(query2)
    print("Query 2 Response:", response2)

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())

Step 3: Set Up Your Environment

  1. Install Dependencies: Run the following command to install the required libraries.

    pip install -r requirements.txt

  2. Add API Keys: Replace 'your_llama_cloud_api_key_here' and 'your_openai_api_key_here' with your actual API keys in the main.py file. You can create a Llama Cloud API key from the Llama Cloud dashboard.

  3. Prepare PDF Files: Ensure you have the PDF files (nvidia_10k.pdf and ai_report.pdf) in the same directory as your main.py script or adjust the paths accordingly.

Step 4: Run the Script

Run the script to parse the documents and query the data.

python main.py

Code Explanation

  1. Environment Variables: The API keys for Llama Cloud and OpenAI are set using environment variables; LLAMA_CLOUD_API_KEY authenticates the Llama Parse client, and OPENAI_API_KEY authenticates the LLM and embedding calls.

    os.environ['LLAMA_CLOUD_API_KEY'] = 'your_llama_cloud_api_key_here'
    os.environ['OPENAI_API_KEY'] = 'your_openai_api_key_here'
  2. Parsing Documents: The parse_documents function initializes Llama Parse and uploads the PDF files for parsing; aload_data sends the files to the Llama Cloud API and returns the parsed markdown as Document objects.

    async def parse_documents():
        parser = LlamaParse(
            result_type='markdown',
            language='en',
            num_workers=2
        )
        pdf_files = ['nvidia_10k.pdf', 'ai_report.pdf']
        documents = await parser.aload_data(pdf_files)
        return documents
  3. Creating Query Engine: The create_query_engine function configures the global Settings, splits the parsed markdown into text nodes and table objects, indexes both, and wraps the index in a query engine that reranks retrieved nodes.

    def create_query_engine(documents):
        Settings.llm = OpenAI(model='gpt-3.5-turbo')
        Settings.embed_model = OpenAIEmbedding(model='text-embedding-ada-002')
        node_parser = MarkdownElementNodeParser(
            llm=OpenAI(model='gpt-3.5-turbo'),
            num_workers=2
        )
        nodes = node_parser.get_nodes_from_documents(documents)
        base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
        index = VectorStoreIndex(nodes=base_nodes + objects)
        reranker = FlagEmbeddingReranker(model='BAAI/bge-reranker-large', top_n=5)
        query_engine = index.as_query_engine(
            similarity_top_k=15,
            node_postprocessors=[reranker]
        )
        return query_engine
  4. Main Function: The main function orchestrates parsing the documents and querying the data.

    async def main():
        documents = await parse_documents()
        query_engine = create_query_engine(documents)
        query1 = "Who is the executive VP of operations and how old are they?"
        response1 = await query_engine.aquery(query1)
        print("Query 1 Response:", response1)
        query2 = "What is the gross carrying amount of total amortizable intangible assets for January 29, 2023?"
        response2 = await query_engine.aquery(query2)
        print("Query 2 Response:", response2)
  5. Running the Script: The script is executed using asyncio.run(main()).

    if __name__ == "__main__":
        asyncio.run(main())

This setup will allow you to parse PDFs, create a query engine, and retrieve information from the documents using Llama Parse and Llama Index.


Conclusion

Out of the box, Llama Parse is a good starting point, especially for those not looking to build custom solutions. However, it is not yet suitable for latency-critical applications. The tabular extraction works well, but figure extraction and speed improvements are areas that need further development.

Llama Parse and Llama Index v0.10 show promise in simplifying the handling of complex documents with embedded data, and we look forward to future updates that will enhance their capabilities even further.
