Extract Data From PDF
Automate the conversion of unstructured documents into organized, searchable data records in Google Sheets, streamlining data management and eliminating manual entry.
Last updated: October 1, 2025
Key Takeaways
- Unstructured to structured - Converts document content from a Needle collection into organized rows in Google Sheets using AI
- Handles pagination automatically - A loop with offset-based pagination processes collections of any size, fetching files in batches
- Customizable extraction - You define the questions the AI should answer for each document, so it extracts exactly what you need
- Google Sheets output - Extracted data is written directly to a spreadsheet using AI-driven Google Sheets tools
What This Workflow Does
This workflow processes documents stored in a Needle collection, reads their contents, and uses AI to extract specific information you define (such as title, summary, dates, amounts, or any custom field). The extracted data is then written to Google Sheets as structured rows. It is triggered manually, so you run it whenever you have a new batch of documents to process.
Use cases:
- Extracting key fields from invoices (amounts, vendors, dates)
- Summarizing and cataloging research papers or reports
- Pulling structured data from contracts (parties, terms, dates)
How It Works
| Step | What Happens |
|---|---|
| 1. Manual Trigger | You start the workflow manually when you have documents to process |
| 2. Loop with Pagination | A loop iterates through the collection using offset-based pagination (batches of 20 files) |
| 3. List Files | The List Files node fetches the next batch of files from your Needle collection |
| 4. Transform (Flatten) | Flattens the paginated results into a single list of files |
| 5. Get File Contents | Fetches the text content of each individual file from the collection |
| 6. AI Extraction | GPT-4.1 reads the file content and extracts structured fields (title, summary, file_name) based on your prompt |
| 7. AI Write to Sheets | A second GPT-4.1 node writes the extracted data to Google Sheets using upsert, find, update, and add row tools |
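Steps 2 to 4 (loop, list, flatten) can be sketched in plain Python. This is a minimal illustration, not how Needle implements its nodes: `fetch_all_files` and `fake_list_files` are hypothetical names, and the `(offset, limit)` call shape is an assumption standing in for the List Files node. The batch size of 20 and the 20-iteration cap come from the tables on this page.

```python
from typing import Callable

BATCH_SIZE = 20       # files per batch, as described in this workflow
MAX_ITERATIONS = 20   # loop cap, as described in this workflow


def fetch_all_files(
    list_files: Callable[[int, int], list[dict]],
    batch_size: int = BATCH_SIZE,
    max_iterations: int = MAX_ITERATIONS,
) -> list[dict]:
    """Fetch files batch by batch using an offset, then flatten the batches."""
    batches: list[list[dict]] = []
    for iteration in range(max_iterations):
        offset = iteration * batch_size          # offset-based pagination
        batch = list_files(offset, batch_size)   # stand-in for the List Files node
        if not batch:
            break                                # no files remain
        batches.append(batch)
        if len(batch) < batch_size:
            break                                # a short batch means this was the last page
    # Stand-in for the Transform (Flatten) step: list of batches -> single list
    return [file for batch in batches for file in batch]


def fake_list_files(offset: int, limit: int) -> list[dict]:
    """Hypothetical collection of 45 files, used only to exercise the loop."""
    files = [{"file_name": f"doc_{i}.pdf"} for i in range(45)]
    return files[offset:offset + limit]


all_files = fetch_all_files(fake_list_files)
print(len(all_files))
```

With the defaults, the loop stops after at most 20 × 20 = 400 files, which is why larger collections need a higher iteration cap or batch size.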
Workflow Nodes
| Node | Role |
|---|---|
| Manual Trigger | Starts the workflow on demand |
| Loop | Paginates through the collection (condition: up to 20 iterations while files remain) |
| List Files | Fetches a batch of files from the Needle collection with an offset expression |
| Transform | Flattens the list of batched results into a single file list |
| Get File Contents | Retrieves the text content of each file from the collection |
| AI Extract (GPT-4.1) | Extracts structured fields (title, summary, file_name) from each file's content |
| AI Write to Sheets (GPT-4.1) | Writes extracted data to Google Sheets using upsert_row, find_row, update_cell, and add_multiple_rows tools |
Setup Instructions
- Add the "Extract Data From PDF" template to your Needle workspace
- Create a Needle collection and upload the documents you want to process (PDFs, Word docs, etc.)
- Update the collectionId in the List Files and Get File Contents nodes to point to your collection
- Create a Google Sheets connector and select it in the AI Write to Sheets node
- Set up your target Google Sheet with the column headers you want (e.g., File, Answer 1, Answer 2, etc.)
- Update the AI Write to Sheets prompt with your Google Sheet URL and column names
- Edit the AI Extract prompt to define the questions you want answered for each document
- Run the workflow manually
Customization
| What You Can Change | How |
|---|---|
| Extraction questions | Edit the prompt in the AI Extract node to ask different questions (dates, amounts, names, codes, etc.) |
| Structured output fields | Update the structured output schema in the AI Extract node to match your extraction needs |
| Google Sheet columns | Change the column headers in your sheet and update the AI Write to Sheets prompt accordingly |
| Batch size | Adjust the loop condition and offset logic to process smaller or larger batches |
| AI model | Swap the model in either AI node if you prefer a different provider |
| Target collection | Change the collectionId to process a different set of documents |
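As an illustration of the "Structured output fields" row, a schema for the extraction step might look like the sketch below. The JSON Schema style is an assumption, since the exact format the AI Extract node expects is not specified here. The fields `title`, `summary`, and `file_name` come from the template; `invoice_date` and `total_amount` are hypothetical additions, and `missing_fields` is an illustrative helper, not part of Needle.

```python
# Assumed JSON-Schema-style description of one extracted record.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "file_name": {"type": "string"},
        "title": {"type": "string"},
        "summary": {"type": "string"},
        # Hypothetical extra fields for an invoice use case
        "invoice_date": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["file_name", "title", "summary"],
}


def missing_fields(record: dict, schema: dict = EXTRACTION_SCHEMA) -> list[str]:
    """Return the required field names that are absent from an extracted record."""
    return [name for name in schema["required"] if name not in record]


record = {"file_name": "invoice_001.pdf", "title": "Invoice 001"}
print(missing_fields(record))
```

If you add fields to the schema, the column headers in your sheet and the AI Write to Sheets prompt should be updated to match.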
FAQ
Q: What document formats are supported? A: Needle collections support PDFs, Word documents, and other common document formats. The workflow reads whatever text content the collection has extracted.
Q: How many documents can this handle? A: The loop supports up to 20 iterations with 20 files per batch, so it can process up to 400 files (20 × 20) in a single run. For larger collections, raise the iteration limit in the loop condition or increase the batch size.
Q: Can I extract more than just title and summary? A: Yes. Edit the AI Extract prompt and structured output schema to include any fields you need, such as dates, monetary amounts, reference numbers, or custom categories.
Q: Does it overwrite existing rows in Google Sheets? A: The AI Write to Sheets node uses upsert logic, so it can update existing rows or add new ones depending on how you configure the prompt and matching criteria.
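The upsert behavior described in the last answer can be sketched conceptually as follows. This is an in-memory illustration only: it does not call Google Sheets or the node's tools, `upsert_by_key` is a hypothetical helper, and matching on the "File" column is an assumption based on the example headers in the setup instructions.

```python
def upsert_by_key(rows: list[dict], new_row: dict, key: str = "File") -> str:
    """Update the row whose key column matches new_row, or append it if none does."""
    for row in rows:
        if row.get(key) == new_row[key]:
            row.update(new_row)      # existing row found: overwrite its values
            return "updated"
    rows.append(dict(new_row))       # no match: add as a new row
    return "added"


# Hypothetical sheet contents, keyed by file name
sheet = [{"File": "a.pdf", "Answer 1": "old title"}]

print(upsert_by_key(sheet, {"File": "a.pdf", "Answer 1": "new title"}))
print(upsert_by_key(sheet, {"File": "b.pdf", "Answer 1": "another title"}))
print(len(sheet))
```

Choosing a stable matching column, such as the file name, is what keeps repeated runs from adding duplicate rows for the same document.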