Large Text Indexing

This article describes the factors that affect the indexing of documents containing very large bodies of text.


Reveal has a 16 MB indexing limit that applies to all text sets. Text sets are searchable groups of text defined by import stream, for example extracted text, optical character recognition (OCR), spreadsheet, translation, or transcription.

In general, documents larger than 16 MB are considered to contain too many words, and too many vague or repeated terms, to be useful in search. The indexing limit therefore balances practical review search requirements against performance.

Reveal analytics require indexed text. Email threading, near-duplicate detection, entity extraction, clustering, emotional intelligence assessment, and classification vectors are all generated as part of indexing. Documents that do not have indexed text are therefore not subject to analytics.

In practical terms, what matters is not the size of the source file but the size of the rendering created from it. For example, a 20 MB document might produce a 10 MB HTML rendering, which would be indexed for the Native/HTML text set. Conversely, if a 5 MB file (a spreadsheet, for example) produced a rendering larger than 16 MB, indexing of that rendering would fail.

NOTE: The native and text file sizes differ from the expanded file sizes. The expanded file size is the size of the text set created.
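The size comparison described above can be sketched in a few lines of Python. This is only an illustration, not Reveal's implementation: the function name `is_indexable`, the reading of 16 MB as 16 * 1024 * 1024 bytes, and the choice to treat a size exactly at the limit as indexable are all assumptions.

```python
# Hypothetical sketch of the size check; not Reveal's actual code.
MB = 1024 * 1024              # assumption: 1 MB is counted as 1,048,576 bytes
INDEX_LIMIT_BYTES = 16 * MB   # the 16 MB indexing limit described in this article


def is_indexable(expanded_size_bytes: int, limit_bytes: int = INDEX_LIMIT_BYTES) -> bool:
    """Return True if a text set of this expanded size fits under the limit.

    The check uses the size of the rendering (the text set created),
    not the size of the source file.
    """
    # Assumption: a text set exactly at the limit is still indexable.
    return expanded_size_bytes <= limit_bytes


# A 20 MB native that renders to 10 MB of HTML fits under the limit.
print(is_indexable(10 * MB))  # True
# A 5 MB spreadsheet that expands to a 17 MB rendering does not.
print(is_indexable(17 * MB))  # False
```

Note that the source file size never enters the check; only the expanded size of each text set does.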

Example:

The index log below shows the text set for which each document was indexed and the error encountered. For the "Too large" errors listed, the document is still indexable if its Extracted/OCR text is less than 16 MB, regardless of the size of the original native. Most often, large documents like these are visible and searchable at the text level, but not in the native rendering.

[Image: RM Index Log - Too Large Items]

From this list you can see that ItemID 7865 (Begdoc MJF_CNTRL00000579) would not be searchable, since it failed for both the Extracted text and Native/HTML text sets. The other documents in the list should have searchable text indexed, so a keyword search can retrieve them.
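The reasoning used to read the log can be sketched as follows: a document is unsearchable only when every one of its text sets failed to index. The data layout here is hypothetical, since the actual index log format is not described in this article. The sketch assumes each document has exactly two text sets, labeled "Extracted/OCR" and "Native/HTML", and every ItemID other than 7865 is made up for illustration.

```python
from collections import defaultdict

# Assumption: the text sets each document could be indexed under.
EXPECTED_TEXT_SETS = frozenset({"Extracted/OCR", "Native/HTML"})


def unsearchable_items(failures, expected=EXPECTED_TEXT_SETS):
    """Return the sorted ItemIDs whose text sets all failed to index.

    `failures` is an iterable of (item_id, text_set) pairs, one per
    "Too large" entry in a (hypothetical) parsed index log.
    """
    failed_sets = defaultdict(set)
    for item_id, text_set in failures:
        failed_sets[item_id].add(text_set)
    # An item is unsearchable only if every expected text set failed.
    return sorted(item for item, sets in failed_sets.items() if expected <= sets)


failures = [
    (7865, "Extracted/OCR"),  # ItemID from the example above
    (7865, "Native/HTML"),
    (1001, "Native/HTML"),    # hypothetical ItemID
    (1002, "Native/HTML"),    # hypothetical ItemID
]
print(unsearchable_items(failures))  # [7865]
```

Items that failed for only one text set are left out of the result because their remaining text set is still indexed and searchable.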

 

Last Updated 8/06/2024