How to Split a Large PDF with AI for eDiscovery and Relativity
If you work in litigation support or eDiscovery, you know the problem: opposing counsel hands over a 5,000-page PDF containing hundreds of individual documents - emails, contracts, memos, and attachments - all combined as a single PDF. Before any of that content can be reviewed, coded, or produced, someone has to figure out where each document starts and ends.
That process is called document unitization (sometimes also called LDD - logical document determination), and it has traditionally been one of the most tedious, manual steps in the entire eDiscovery workflow. AI is changing that.
What Does It Mean to "Split" a Large PDF?
Splitting a large PDF isn't just about cutting it into equal chunks. In a legal context, splitting means identifying the logical boundaries between documents within a single file. Page 1 might be the first page of an email, page 3 might be an attached contract, and page 7 might be an entirely different email chain.
A proper split must:
- Detect where one document ends and the next begins
- Identify document types (email, letter, memo, spreadsheet, fax cover sheet)
- Extract metadata like dates, authors, recipients, and subject lines
- Handle attachments by linking them to their parent documents
- Capture Bates numbers and ranges for each unitized document, where they exist
Manual unitization typically requires a trained reviewer to examine every page, decide where the boundaries fall, and re-type the metadata for each document. For a 5,000-page PDF, this can take days of focused work.
How AI Splits Large PDFs
Modern vision-language models can examine each page of a PDF (both the text content and the visual layout) and determine document boundaries with high accuracy. Here is how an AI-powered pipeline works:
1. OCR and Layout Extraction
The PDF is first processed through optical character recognition (OCR) to extract both the text and the spatial layout of every page. This captures not just what words appear, but where they appear on the page: headers at the top, signatures at the bottom, tabular data in the middle.
2. Visual Page Analysis
A vision-language model examines each page image alongside the extracted text. It can recognize visual patterns that signal document boundaries: letterheads, email headers ("From / To / Subject / Date"), fax cover sheets, signature blocks, and changes in formatting.
3. Boundary Detection
The model determines whether each page is a first page (starting a new document) or a continuation of the previous document. It considers multiple signals:
- Does the page have a new email header or letterhead?
- Does it reference a different date, author, or email subject?
- Is there a visual break? Does the orientation change? Is this the same formatting style as the last page?
- Does the content logically continue from the previous page (e.g., signals of a nested or threaded email)?
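Once every page carries a first-page or continuation label, turning those labels into document ranges is simple bookkeeping. The sketch below is illustrative only: the function name `group_pages` and the list-of-booleans input are assumptions made for this example, not part of any particular tool.

```python
from typing import List, Optional, Tuple


def group_pages(first_page_flags: List[bool]) -> List[Tuple[int, int]]:
    """Turn per-page 'starts a new document' flags into (start, end) page ranges.

    Pages are 1-indexed and ranges are inclusive. The first page always opens
    a document, even if its flag is False.
    """
    ranges: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for page_number, is_first in enumerate(first_page_flags, start=1):
        if is_first or start is None:
            if start is not None:
                # Close the document that was open before this page.
                ranges.append((start, page_number - 1))
            start = page_number
    if start is not None:
        # Close the final document at the last page of the PDF.
        ranges.append((start, len(first_page_flags)))
    return ranges


# Pages 1, 3, and 7 were labeled as first pages.
flags = [True, False, True, False, False, False, True]
print(group_pages(flags))  # [(1, 2), (3, 6), (7, 7)]
```

The hard part remains the per-page decision itself; this step only shows that a single wrong flag shifts or merges whole documents, which is why boundary accuracy matters so much downstream.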
4. Metadata Extraction
For each identified document, the model extracts structured metadata where available: document titles, email header information, and Bates or control numbers already stamped on the pages.
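As a rough illustration of the Bates part of this step, the sketch below pulls a stamp from OCR text with a regular expression and derives a begin/end range for one document. The stamp format (2 to 6 uppercase letters, an optional separator, then 6 to 8 digits) is an assumption made for this example; real productions use many different formats, and the names `find_bates` and `bates_range` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumed stamp format for illustration: PREFIX + optional separator + digits.
BATES_PATTERN = re.compile(r"\b([A-Z]{2,6})[-_ ]?(\d{6,8})\b")


def find_bates(page_text: str) -> Optional[str]:
    """Return the first Bates-like stamp on a page, normalized to PREFIX+digits."""
    match = BATES_PATTERN.search(page_text)
    return f"{match.group(1)}{match.group(2)}" if match else None


def bates_range(page_texts: List[str]) -> Optional[Tuple[str, str]]:
    """Return (first stamp, last stamp) across a document's pages, if any were found."""
    stamps = [find_bates(text) for text in page_texts]
    found = [stamp for stamp in stamps if stamp]
    return (found[0], found[-1]) if found else None


pages = ["Dear Ms. Lee ... ABC-0000123", "page with no legible stamp", "Sincerely ... ABC0000125"]
print(bates_range(pages))  # ('ABC0000123', 'ABC0000125')
```

A pattern like this only works on clean OCR text with a known stamp format; pages with missing or misread stamps would need a fallback, such as inferring the number from neighboring pages.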
5. Attachment Linking
Emails often reference attachments ("Please see attached contract"). The AI tracks these references and links attachment documents to their parent emails, preserving the family relationships that are critical for legal review.
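The resulting family structure can be represented with a parent ID on each attachment. The sketch below uses a deliberately naive rule, assumed only for illustration: any non-email document that follows an email mentioning an attachment is treated as that email's attachment, until the next email appears. The `Doc` class and `link_families` function are hypothetical names; a real pipeline would rely on the model's judgment about which documents belong together.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    doc_id: str
    doc_type: str                      # e.g. "email", "contract", "memo"
    mentions_attachment: bool = False  # e.g. text contains "see attached"
    parent_id: Optional[str] = None    # filled in for attachments


def link_families(docs: List[Doc]) -> List[Doc]:
    """Assign parent_id to documents that follow an email referencing attachments."""
    current_parent: Optional[str] = None
    for doc in docs:
        if doc.doc_type == "email":
            # A new email starts a new family only if it references an attachment.
            current_parent = doc.doc_id if doc.mentions_attachment else None
        elif current_parent is not None:
            doc.parent_id = current_parent
    return docs


docs = link_families([
    Doc("D1", "email", mentions_attachment=True),
    Doc("D2", "contract"),
    Doc("D3", "email"),
    Doc("D4", "memo"),
])
print([(d.doc_id, d.parent_id) for d in docs])
# [('D1', None), ('D2', 'D1'), ('D3', None), ('D4', None)]
```

Storing the relationship as a parent ID on the child keeps it easy to export later as a family or attachment-range field in a load file.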
Why eDiscovery Teams Need Automated Unitization
In eDiscovery, document unitization sits between collection and review. If unitization is wrong, everything downstream suffers. Review is slower, privilege calls are harder, and productions may be inaccurate.
Scale is the core problem. At a typical manual processing rate of 300 pages per hour, a 5,000-page PDF takes more than 16 hours of tedious, focused work. A single litigation matter can easily involve multiple collections of that size.
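The arithmetic behind that estimate is easy to rerun for your own page counts. In the sketch below, the 300 pages-per-hour rate comes from the figure above, while the 8-hour working day is an assumption added for illustration.

```python
def manual_hours(pages: int, pages_per_hour: float = 300) -> float:
    """Estimated reviewer hours to unitize a PDF by hand."""
    return pages / pages_per_hour


def working_days(hours: float, hours_per_day: float = 8) -> float:
    """Convert hours to working days (8-hour day is an assumption)."""
    return hours / hours_per_day


hours = manual_hours(5000)
print(round(hours, 1))                # 16.7
print(round(working_days(hours), 1))  # 2.1
```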
Consistency matters. Different reviewers make different boundary decisions. One person might treat a two-page fax cover sheet and attached letter as one document; another might split them. AI applies the same rules to every page.
Speed enables earlier case assessment. When unitization takes days or weeks, attorneys are waiting to start reviewing key documents. Automated processing that completes in hours means earlier insights and better case strategy.
Loading Split Documents into Relativity
For teams using Relativity (or similar review platforms like Everlaw or DISCO), the output format matters as much as the splitting accuracy. A properly unitized production should include:
Concordance DAT Load File
The standard import format for Relativity is a Concordance DAT file - a delimited text file where each row represents one document and columns contain metadata fields. The file uses specific delimiters:
- Column separator: Unicode character 0x14
- Text qualifier: thorn character (þ, Unicode 0xFE)
- Newline within fields: registered trademark symbol (®, Unicode 0xAE)
A well-formed DAT file lets you map fields directly to Relativity's workspace fields during import - document type, date, author, Bates range, and family relationships all come in cleanly.
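To make the delimiters concrete, here is a minimal sketch that builds DAT rows in memory. The field names (`BegBates`, `EndBates`, `DocType`, `Subject`) and the helper `dat_row` are illustrative assumptions; your workspace's field names and the file encoding your import expects may differ.

```python
FIELD_SEP = "\x14"  # column separator (character 0x14)
QUOTE = "\xfe"      # thorn, wraps every field value
NEWLINE = "\xae"    # registered trademark sign, replaces newlines inside a field


def dat_row(values) -> str:
    """Build one DAT line: each value wrapped in thorns, joined by 0x14."""
    cells = []
    for value in values:
        # Normalize line endings, then swap real newlines for the placeholder
        # so that one document always occupies exactly one physical line.
        text = str(value).replace("\r\n", "\n").replace("\r", "\n")
        cells.append(f"{QUOTE}{text.replace(chr(10), NEWLINE)}{QUOTE}")
    return FIELD_SEP.join(cells)


header = dat_row(["BegBates", "EndBates", "DocType", "Subject"])
row = dat_row(["ABC0000001", "ABC0000002", "Email", "Status update\nQ3 numbers"])
print(repr(row))
```

The first line of the file is conventionally a header row built the same way, which is what makes field mapping possible at import time.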
Opticon OPT Cross-Reference
The OPT file tells Relativity which image files correspond to which documents. Each line maps a Bates number to an image file path, with a flag that is "Y" on the first page of a document and left empty on continuation pages.
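A short sketch of how such lines could be generated follows. It assumes the commonly used seven-column Opticon layout (image key, volume, image path, document break, folder break, box break, page count); the volume name, file paths, and the function name `opt_lines` are hypothetical values chosen for this example.

```python
from typing import List, Tuple


def opt_lines(volume: str, pages_by_doc: List[List[Tuple[str, str]]]) -> List[str]:
    """Build OPT lines from documents given as lists of (bates_number, image_path)."""
    lines = []
    for pages in pages_by_doc:
        for index, (bates, path) in enumerate(pages):
            is_first = index == 0
            doc_break = "Y" if is_first else ""            # empty on continuation pages
            page_count = str(len(pages)) if is_first else ""
            # Folder break and box break are left empty in this sketch.
            lines.append(",".join([bates, volume, path, doc_break, "", "", page_count]))
    return lines


docs = [
    [("ABC0000001", r".\IMAGES\ABC0000001.tif"), ("ABC0000002", r".\IMAGES\ABC0000002.tif")],
    [("ABC0000003", r".\IMAGES\ABC0000003.tif")],
]
for line in opt_lines("VOL001", docs):
    print(line)
# ABC0000001,VOL001,.\IMAGES\ABC0000001.tif,Y,,,2
# ABC0000002,VOL001,.\IMAGES\ABC0000002.tif,,,,
# ABC0000003,VOL001,.\IMAGES\ABC0000003.tif,Y,,,1
```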
TIFF or PDF Images
Individual document images - either single-page TIFFs (Group IV compression, the industry standard) or multi-page PDFs named by their beginning Bates numbers. Relativity can display these in its image viewer during review.
Extracted Text
Full-text files for each document enable searching across the collection. Each text file corresponds to one document and is named to match its Bates identifier.
When all four components are present and properly formatted, the import into Relativity is straightforward: point the Relativity Desktop Client (or Import/Export Job) at the DAT file, map the fields, specify the image and text paths, and load.
What to Look for in an AI Unitization Tool
Not all automated splitting tools are equal. When evaluating options, consider:
- Does it handle mixed document types? Real-world PDFs contain emails, attachments, memos, faxes, and printed spreadsheets all interleaved. The tool needs to handle all of them.
- Does it extract metadata? Splitting without metadata extraction only does half the job. You still need dates, authors, and document types for review.
- Does it preserve family relationships? Email-attachment linking is critical for privilege review and production.
- Does it produce Relativity-ready load files? If you need to import into a review platform, DAT/OPT output saves you a separate processing step.
- What happens to your data? For sensitive litigation documents, look for tools that process on isolated infrastructure and do not retain data after processing completes.
Getting Started
UnitizeAI processes large PDFs through a fully automated pipeline: OCR extracts text and layout, a vision-language model identifies boundaries and extracts metadata, and the output is packaged into Relativity-ready load files (DAT, OPT, TIFFs, and extracted text) or simplified PDF deliverables.
Upload your PDF, and the platform handles the rest - from page-level analysis through to production-ready output. No manual page-turning required.
Try UnitizeAI today and see how AI unitization works on your documents.