PDF Chaos to Structured Insights with Gemini File Search

Financial data isn’t scarce.
Usable financial data is.

Financial services are among the most regulated industries in the world. Regulators require financial companies to publish a vast amount of information and set strict rules on what must be disclosed. However, the rules on how, or in what structure, that information should be shared are often vague.

The result? A plethora of valuable information lying in PDF documents, theoretically accessible to the public but practically useless for generating insights.

Traditionally, this gap has been filled by data brokers whose business model is to convert this unstructured, scattered data into a structured format available at a single source. For users, this has meant a stark choice: pay high subscription fees to data brokers or don’t use the information at all.

In this post, I’ll walk through how we built a workflow that converts PDF documents into structured, queryable data using Gemini.

The Challenge: Mutual Fund Factsheets

Monthly mutual fund factsheets are a prime example. Asset Management Companies (AMCs) publish a tremendous amount of data in these documents: fund managers, benchmark indices, expense ratios, and detailed portfolio holdings. Yet extracting this data from 100+ page PDFs is tedious, error-prone, and incredibly time-consuming.

We took up a project: KnowYourFund. Our goal is to mine this data into a structured format automatically, democratizing access to financial insights.
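To make "structured format" concrete, here is a minimal sketch of what one extracted factsheet record could look like. The class names, field names, and the weight sanity check are illustrative assumptions, not the actual KnowYourFund schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Holding:
    name: str
    weight_pct: float  # share of the portfolio, in percent


@dataclass
class FundFactsheet:
    # Hypothetical field names, chosen to mirror the data points listed above.
    fund_name: str
    fund_managers: list[str]
    benchmark_index: str
    expense_ratio_pct: float
    holdings: list[Holding] = field(default_factory=list)


def parse_factsheet(raw_json: str) -> FundFactsheet:
    """Turn a JSON string (for example, a model response) into a typed record."""
    data = json.loads(raw_json)
    holdings = [
        Holding(h["name"], float(h["weight_pct"]))
        for h in data.get("holdings", [])
    ]
    # Basic sanity check: portfolio weights should not exceed 100% in total.
    total = sum(h.weight_pct for h in holdings)
    if total > 100.0 + 1e-6:
        raise ValueError(f"holding weights sum to {total:.2f}%, above 100%")
    return FundFactsheet(
        fund_name=data["fund_name"],
        fund_managers=list(data["fund_managers"]),
        benchmark_index=data["benchmark_index"],
        expense_ratio_pct=float(data["expense_ratio_pct"]),
        holdings=holdings,
    )
```

Once every factsheet is parsed into a typed record like this, questions such as "which funds share a benchmark?" become ordinary queries rather than manual PDF reading.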

The Solution: Can we use NotebookLM?

Many of us have used tools like NotebookLM to query PDF documents and can imagine how well it would suit this task. We wanted to build something similar: an automated pipeline that could convert thousands of PDF pages into structured data.

To achieve this, we used the Gemini File Search API, which, in our experience, is the API service that comes closest to what NotebookLM offers.

Why Gemini File Search?

Implementing a Retrieval-Augmented Generation (RAG) system from scratch is complex. You have to handle parsing, chunking, vector storage, indexing, and retrieval.
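To give a feel for the moving parts, here is a deliberately tiny, from-scratch sketch of the chunking, indexing, and retrieval steps in plain Python. It is a toy under stated assumptions: the "embedding" is a simple term count rather than a learned vector, the index is an in-memory list rather than a vector database, and all function names are hypothetical. A production system would also need PDF parsing, persistence, and a real embedding model.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """'Vector store': each chunk paired with its vector, kept in memory."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Even in this toy form, every step carries a design decision (chunk size, overlap, similarity measure), which is exactly the work a managed service takes on.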

Gemini File Search allowed us to:

Remove Complexity: The API takes away all the heavy lifting of managing vector databases and chunking strategies.
Accelerate Time to Market: We focused on the extraction logic, not the infrastructure.
Lower Costs: By utilizing efficient filtering, we drastically reduced token usage compared to naive RAG approaches.
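The cost point is easiest to see with a small illustration. The sketch below is not the File Search API; it is a plain-Python picture of the general idea that narrowing candidates by metadata before retrieval shrinks the context sent to the model. The metadata keys (`amc`, `period`), the sample texts, and the word-count size estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    amc: str     # hypothetical metadata key: which fund house published it
    period: str  # hypothetical metadata key: factsheet month, e.g. "2025-01"


def filter_chunks(chunks: list[Chunk], **wanted: str) -> list[Chunk]:
    """Keep only chunks whose metadata matches every requested key/value."""
    return [
        c for c in chunks
        if all(getattr(c, key) == value for key, value in wanted.items())
    ]


def approx_size(chunks: list[Chunk]) -> int:
    """Very rough size estimate: whitespace-separated words, not model tokens."""
    return sum(len(c.text.split()) for c in chunks)


# Made-up sample data, for illustration only.
corpus = [
    Chunk("expense ratio is disclosed here", amc="AMC A", period="2025-01"),
    Chunk("holdings table for another fund house", amc="AMC B", period="2025-01"),
    Chunk("older expense ratio disclosure", amc="AMC A", period="2024-12"),
]

scoped = filter_chunks(corpus, amc="AMC A", period="2025-01")
print(approx_size(corpus), "words before filtering;", approx_size(scoped), "after")
```

The saving grows with the corpus: a question about one fund in one month only needs to consider that factsheet's chunks, not every document in the store.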

Live Demo & Future Plans

We have started by extracting information for a few AMCs and created a portal to showcase the extracted data.

Visit knowyourfund.thrivegen.ai to explore.

We will be expanding this to cover more AMCs and showcase further insights, including features like “Ask Fund” (conversational queries) and cross-period analysis.

Replicability: Democratizing Document Search

The PDF data mining problem is not limited to Mutual Funds. Using this same structure, organizations can implement LLM search on their own documents as well. Consider it a custom “NotebookLM” for your company’s data.

Because the Gemini File Search API offers RAG as a managed service, it makes document search accessible and affordable, which can speed up innovation across industries, whether the documents are legal contracts, medical records, or scientific research.

If you’re working with document-heavy data — financial, legal, or otherwise — I’d love to hear how you’re approaching it.