Workflow

Extract Data From PDF

Automate the conversion of unstructured documents into organized, searchable data records in Google Sheets, streamlining data management and eliminating manual entry.

Needle Team

Last updated

October 1, 2025

Connectors used

Google Sheets

Tags

PDF Data Extraction, Document Automation, Google Sheets, OCR Processing, Document Processing, Document Extraction, Document Conversion, Document Transformation, Document Management, Document Workflow, Needle Collections

Key Takeaways

  • Unstructured to structured - Converts document content from a Needle collection into organized rows in Google Sheets using AI
  • Handles pagination automatically - A loop with offset-based pagination fetches files in batches of 20, up to 400 files per run with the default loop limit
  • Customizable extraction - You define the questions the AI should answer for each document, so it extracts exactly what you need
  • Google Sheets output - Extracted data is written directly to a spreadsheet using AI-driven Google Sheets tools

What This Workflow Does

This workflow processes documents stored in a Needle collection, reads their contents, and uses AI to extract specific information you define (such as title, summary, dates, amounts, or any custom field). The extracted data is then written to Google Sheets as structured rows. It is triggered manually, so you run it whenever you have a new batch of documents to process.

Use cases:

  • Extracting key fields from invoices (amounts, vendors, dates)
  • Summarizing and cataloging research papers or reports
  • Pulling structured data from contracts (parties, terms, dates)

How It Works

Step | What Happens
1. Manual Trigger | You start the workflow manually when you have documents to process
2. Loop with Pagination | A loop iterates through the collection using offset-based pagination (batches of 20 files)
3. List Files | The List Files node fetches the next batch of files from your Needle collection
4. Transform (Flatten) | Flattens the paginated results into a single list of files
5. Get File Contents | Fetches the text content of each individual file from the collection
6. AI Extraction | GPT-4.1 reads the file content and extracts structured fields (title, summary, file_name) based on your prompt
7. AI Write to Sheets | A second GPT-4.1 node writes the extracted data to Google Sheets using upsert, find, update, and add row tools
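
The pagination and flatten steps above can be sketched in Python. This is only an illustration of the pattern, not Needle's implementation: `list_files` is a hypothetical callable standing in for the List Files node, and the in-memory `fake_list_files` exists purely so the sketch runs on its own.

```python
from typing import Callable

BATCH_SIZE = 20      # files per batch, as described in the steps above
MAX_ITERATIONS = 20  # loop cap, as described in the steps above


def fetch_all_files(
    list_files: Callable[[int, int], list[dict]],
    batch_size: int = BATCH_SIZE,
    max_iterations: int = MAX_ITERATIONS,
) -> list[dict]:
    """Collect batches with offset-based pagination, then flatten them."""
    batches: list[list[dict]] = []
    for i in range(max_iterations):
        offset = i * batch_size
        batch = list_files(offset, batch_size)  # hypothetical List Files call
        if not batch:
            break  # no files remain
        batches.append(batch)
        if len(batch) < batch_size:
            break  # a short batch means this was the last page
    # Flatten the list of batches into a single list of files.
    return [f for batch in batches for f in batch]


# Hypothetical in-memory collection used only to exercise the loop.
_FAKE_COLLECTION = [{"id": f"file-{n}", "name": f"doc-{n}.pdf"} for n in range(45)]


def fake_list_files(offset: int, limit: int) -> list[dict]:
    return _FAKE_COLLECTION[offset:offset + limit]


files = fetch_all_files(fake_list_files)
```

With 45 files in the stand-in collection, the loop makes three calls (offsets 0, 20, and 40) and stops on the short final batch.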

Workflow Nodes

Node | Role
Manual Trigger | Starts the workflow on demand
Loop | Paginates through the collection (condition: up to 20 iterations while files remain)
List Files | Fetches a batch of files from the Needle collection with an offset expression
Transform | Flattens the list of batched results into a single file list
Get File Contents | Retrieves the text content of each file from the collection
AI Extract (GPT-4.1) | Extracts structured fields (title, summary, file_name) from each file's content
AI Write to Sheets (GPT-4.1) | Writes extracted data to Google Sheets using upsert_row, find_row, update_cell, and add_multiple_rows tools
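
To make the hand-off between the two AI nodes concrete, here is a minimal sketch of what a structured record with the three default fields might look like once parsed. It assumes the extraction node returns a JSON object; `ExtractedRecord` and `parse_extraction` are hypothetical names for illustration, not part of Needle.

```python
import json
from dataclasses import dataclass

# The three default fields named in the node table above.
REQUIRED_FIELDS = ("title", "summary", "file_name")


@dataclass
class ExtractedRecord:
    title: str
    summary: str
    file_name: str


def parse_extraction(raw_json: str) -> ExtractedRecord:
    """Parse a JSON string (assumed model output) into a typed record."""
    data = json.loads(raw_json)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return ExtractedRecord(**{key: str(data[key]) for key in REQUIRED_FIELDS})


# Made-up sample output, for illustration only.
sample = '{"title": "Q3 Report", "summary": "Quarterly results.", "file_name": "q3.pdf"}'
record = parse_extraction(sample)
```

Rejecting records with missing fields before they reach the write step keeps half-filled rows out of the sheet.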

Setup Instructions

  1. Add the "Extract Data From PDF" template to your Needle workspace
  2. Create a Needle collection and upload the documents you want to process (PDFs, Word docs, etc.)
  3. Update the collectionId in the List Files and Get File Contents nodes to point to your collection
  4. Create a Google Sheets connector and select it in the AI Write to Sheets node
  5. Set up your target Google Sheet with the column headers you want (e.g., File, Answer 1, Answer 2)
  6. Update the AI Write to Sheets prompt with your Google Sheet URL and column names
  7. Edit the AI Extract prompt to define the questions you want answered for each document
  8. Run the workflow manually
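
Steps 5 and 6 depend on the sheet's column headers matching what the write prompt refers to. The sketch below shows that alignment, assuming the example headers from step 5; `to_row` is a hypothetical helper, and in the actual workflow the AI Write to Sheets node performs this mapping from your prompt.

```python
# Example headers from step 5; replace with your own column names.
HEADERS = ["File", "Answer 1", "Answer 2"]


def to_row(headers: list[str], record: dict[str, str]) -> list[str]:
    """Order a record's values by column header, leaving gaps blank."""
    return [record.get(header, "") for header in headers]


# Made-up record for illustration only.
row = to_row(HEADERS, {"File": "invoice-001.pdf", "Answer 1": "Acme Corp"})
```

A header with no matching answer yields an empty cell, so a mismatch between the prompt and the sheet shows up as blank columns.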

Customization

What You Can Change | How
Extraction questions | Edit the prompt in the AI Extract node to ask different questions (dates, amounts, names, codes, etc.)
Structured output fields | Update the structured output schema in the AI Extract node to match your extraction needs
Google Sheet columns | Change the column headers in your sheet and update the AI Write to Sheets prompt accordingly
Batch size | Adjust the loop condition and offset logic to process smaller or larger batches
AI model | Swap the model in either AI node if you prefer a different provider
Target collection | Change the collectionId to process a different set of documents
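
Because the extraction questions and the structured output fields have to change together, it can help to derive both from one definition. The following is a sketch under assumptions: the prompt wording and the JSON Schema shape are illustrative, the field names are made up, and the exact schema format the AI Extract node expects may differ.

```python
# Hypothetical invoice questions keyed by output field name.
QUESTIONS = {
    "vendor": "Which company issued the invoice?",
    "amount": "What is the total amount due?",
    "date": "What is the invoice date?",
}


def build_prompt(questions: dict[str, str]) -> str:
    """Build an extraction prompt listing one question per output field."""
    lines = [
        "Answer each question using only the document text.",
        "Return a JSON object with these keys:",
    ]
    lines += [f"- {key}: {question}" for key, question in questions.items()]
    return "\n".join(lines)


def build_schema(questions: dict[str, str]) -> dict:
    """Build a JSON-Schema-style object with one string field per question."""
    return {
        "type": "object",
        "properties": {key: {"type": "string"} for key in questions},
        "required": list(questions),
    }


prompt = build_prompt(QUESTIONS)
schema = build_schema(QUESTIONS)
```

Keeping a single source for both avoids a prompt that asks for a field the schema does not allow.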

FAQ

Q: What document formats are supported? A: Needle collections support PDFs, Word documents, and other common document formats. The workflow reads whatever text content the collection has extracted.

Q: How many documents can this handle? A: The loop supports up to 20 iterations with 20 files per batch, so it can process up to 400 files in a single run. You can adjust the loop condition for larger collections.
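
The 400-file figure is simply iterations multiplied by batch size. A small sketch of that arithmetic, with hypothetical helper names, shows how to size the loop for a larger collection:

```python
import math


def max_files_per_run(iterations: int = 20, batch_size: int = 20) -> int:
    """Upper bound on files processed in one run."""
    return iterations * batch_size


def iterations_needed(total_files: int, batch_size: int = 20) -> int:
    """Loop iterations required to cover a collection of a given size."""
    return math.ceil(total_files / batch_size)


default_capacity = max_files_per_run()       # 20 * 20 = 400
needed_for_1000 = iterations_needed(1000)    # 1000 / 20 = 50
```

A 1,000-file collection would therefore need the loop limit raised from 20 to 50 at the default batch size.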

Q: Can I extract more than just title and summary? A: Yes. Edit the AI Extract prompt and structured output schema to include any fields you need, such as dates, monetary amounts, reference numbers, or custom categories.

Q: Does it overwrite existing rows in Google Sheets? A: The AI Write to Sheets node uses upsert logic, so it can update existing rows or add new ones depending on how you configure the prompt and matching criteria.
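
Upsert behavior can be illustrated with plain Python. This is a sketch of the general pattern only, assuming rows are matched on a "File" column; it operates on an in-memory list and does not call the Google Sheets tools the node actually uses.

```python
def upsert_row(rows: list[dict], new_row: dict, key: str = "File") -> str:
    """Update the row whose key matches, otherwise append a new row."""
    for row in rows:
        if row.get(key) == new_row.get(key):
            row.update(new_row)  # overwrite matching columns in place
            return "updated"
    rows.append(dict(new_row))
    return "added"


# Made-up sheet contents for illustration only.
sheet = [{"File": "a.pdf", "Answer 1": "old"}]
first = upsert_row(sheet, {"File": "a.pdf", "Answer 1": "new"})
second = upsert_row(sheet, {"File": "b.pdf", "Answer 1": "other"})
```

Re-running the workflow on the same documents would then refresh existing rows instead of duplicating them, provided the matching column is unique per file.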
