Build your own AI models leveraging previously reviewed data.
Introduction
Utilizing reviewed historical data to build models can be a cost-effective way to leverage existing information and expertise. When managed properly, it can help identify initial sets of responsive documents before manual expert review starts.
Not all cases are suitable for AI Modeling. Before proceeding, please take the following two factors into account.
- Model Suitability
- Not all reviewed cases are appropriate for model building. For example, cases where documents contain a wide range of issues, but only a single Yes or No tag is being applied, may lack specificity.
- Data Sufficiency
- Data size - the minimum number of documents required to build a case with Analytics is 300, and at least a few thousand documents are expected to build a model effectively.
- Available sample size - although it is possible to build a model with fewer reviewed documents, it is recommended to have at least a few hundred Yes and No samples.
Reach out to the Reveal consulting team for an AI Model building session if additional assistance is necessary before proceeding with the following workflow.
Process Flow
Workflow Steps
Data Preparation
The first step in AI Model building is to get data loaded into a Reveal case. Here are two scenarios to consider.
- Scenario 1 - data is already present in an existing Reveal case.
- If the case is already in Reveal, it is possible to use the existing case directly to save on time and cost.
- Scenario 2 - data exists in a non-Reveal platform.
- If the data exists in a different platform, it needs to be exported and ingested into Reveal first.
- For AI model building purposes, we recommend exporting the review tags together with the following metadata and extracted text:
From/To/CC/BCC/Subject/Date Time Sent/Filename/Date Time Last Modified/Last Author/Responsive Tag
- For document selection, when ingesting data into a new Reveal case for AI Model building purposes, consider excluding the following types of documents:
- Documents with numeric values only, for example, Excel/CSV files with very little text content.
- Documents with no recognizable text, for example, image files or files with missing or poor-quality OCR text.
- Documents with oversized text content. By default, any document with text over 16MB is ignored by analytics. Setting an internally agreed-upon maximum text size (for example, 10MB) will help eliminate issues down the road.
- Documents with file types that are not well suited to training the specific model. For example, if the model is meant to detect pricing discussions, exclude non-email documents and use only emails.
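For illustration only, the sketch below shows one way to pre-filter an export manifest against the exclusion criteria above before ingestion. The field names (file_type, extracted_text), file-type list, and the 10MB cutoff are assumptions for the example, not Reveal fields or defaults.

```python
# Illustrative pre-ingestion filter; field names and thresholds are assumptions, not Reveal settings.
MAX_TEXT_BYTES = 10 * 1024 * 1024  # example internally agreed maximum text size (10 MB)
EXCLUDED_TYPES = {"xlsx", "csv", "jpg", "png", "tif"}  # example numeric-only or image-only file types

def keep_for_training(doc: dict) -> bool:
    """Return True when a document passes the exclusion checks listed above."""
    text = (doc.get("extracted_text") or "").strip()
    if not text:                                        # no recognizable text / bad OCR
        return False
    if doc.get("file_type", "").lower() in EXCLUDED_TYPES:
        return False
    if len(text.encode("utf-8")) > MAX_TEXT_BYTES:      # oversized text content
        return False
    return True

# Hypothetical export manifest rows.
docs = [
    {"file_type": "msg", "extracted_text": "Re: Q3 pricing discussion ..."},
    {"file_type": "jpg", "extracted_text": ""},
]
training_candidates = [d for d in docs if keep_for_training(d)]
```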
Once the data is in a Reveal case, the next step is to create an AI-enabled tag.
Note
It is recommended to create a new AI-enabled tag for model building instead of reusing existing tags. A new AI-enabled tag gives users more control over sampling and training.
To add a new AI-enabled tag, go to Project Admin > Tags > Add Tag and Choices and set up Yes/No choices. Click Add, then confirm the corresponding classifier is created successfully in the Supervised Learning tab.
Refer to Reveal Online Knowledge article Supervised Learning Overview for more details.
Build Samples for Training Purposes
Once the new AI-enabled tag is created, the next step is to select samples and transfer review choices over from the previous review.
For sample selection, consider the following two scenarios.
- Scenario 1 - there are enough Yes/No samples for both training and validation purposes. For example, all documents in the case have been previously tagged.
In this scenario, use the built-in Sample Documents feature to select enough samples.
Note
Typically, the sample universe for Yes or No does not need more than 10,000 documents.
a. Create a search to find all documents previously tagged as Yes, excluding the document types listed in Scenario 2 of the Data Preparation section above.
b. Go to Grid > Sample, and for Number of documents, enter the desired number of sample documents, for example, 5,000.
c. Save the samples to a Work or Search folder.
d. Open the saved Work/Search folder and update all documents to Yes for the newly created AI-enabled tag (see Bulk Tag Documents for additional details).
e. Repeat the above steps to create samples for documents tagged No.
Note
To enhance sampling results, in steps a. and b. above, consider distributing samples across the top clusters by creating individual searches for each cluster, incorporating the cluster ID as part of the search criteria and consolidating the results afterward. Cluster IDs for the top clusters can be located in the search box by clicking the ellipses button (...) in each cluster panel and selecting Add Cluster to Search.
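For illustration only, the following sketch shows one way to spread a sample quota across the top clusters in proportion to their size before building the per-cluster searches described above. The cluster IDs, document counts, and overall quota are hypothetical.

```python
# Hypothetical cluster sizes: cluster ID -> number of previously tagged documents in that cluster.
cluster_sizes = {"C101": 4200, "C205": 2600, "C318": 1200, "C412": 800}
total_quota = 5000  # desired overall sample size

total_docs = sum(cluster_sizes.values())
quota_per_cluster = {
    cluster_id: round(total_quota * size / total_docs)
    for cluster_id, size in cluster_sizes.items()
}
print(quota_per_cluster)  # {'C101': 2386, 'C205': 1477, 'C318': 682, 'C412': 455}
```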
- Scenario 2 - there are a limited number of previously reviewed documents in the case.
- In this scenario, use the 80/20 rule to split the samples into a training set and a testing set if validation is required later.
- For example, in a case with 10,000 documents previously tagged as Yes, randomly select 8,000 documents as Yes samples for training and leave the remaining 2,000 documents for validation purposes later, as illustrated in the sketch after this list.
- The steps to build samples are the same as Scenario 1 in this section.
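A minimal sketch of the 80/20 split described above, assuming the previously tagged Yes document IDs have been exported to a list outside Reveal; the IDs and seed are placeholders.

```python
import random

# Hypothetical list of document IDs previously tagged as Yes.
yes_doc_ids = [f"DOC-{i:05d}" for i in range(10_000)]

random.seed(42)                             # fixed seed so the split is reproducible
random.shuffle(yes_doc_ids)

split_point = int(len(yes_doc_ids) * 0.8)   # 80% of the samples go to training
training_yes = yes_doc_ids[:split_point]    # 8,000 documents: tag Yes on the AI-enabled tag
validation_yes = yes_doc_ids[split_point:]  # 2,000 documents: held back for validation
```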
Build Models
Once the Yes/No samples are in place, follow the steps below to finish building the model.
- Open the corresponding Classifier page, click Run Full Process to submit a request to build the model.
- Wait until the Classifier Status returns to Ready and the Score Distribution chart is populated under Tagging and Scoring.
Note
After the classifier finishes processing, it may take up to an hour for the AI scores to be synchronized to the front end. Users should wait until scores are viewable and populated in the front-end Grid before moving to the Validation phase.
- The model is now ready for publishing. Refer to Add AI Model to Library for additional details on publishing to the model library.
Test Models (optional)
To test how the model performs, consider the following scenarios.
- Scenario 1 (across cases) - if another case is available for testing, follow the steps below to test out the new model.
a. Create a new AI-enabled tag in the second case, import the model, and click Run Full Process to begin scoring (see Supervised Learning Overview for more details).
b. In the Grid view, sort the documents by AI score in descending order and select the top 100 (the number can vary depending on the case).
c. Review the documents selected above and count how many are responsive to the issue.
d. Calculate the precision across this top-scored set using the following formula:
- Precision = (Number of responsive documents counted in c. above) / (Total number of documents selected in b. above)
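As a small sketch of this top-scored check, assuming the spot-review results have been tallied outside Reveal; the counts below are placeholders.

```python
# Placeholder counts from reviewing the 100 highest-scored documents in the second case.
top_selected = 100      # b. documents selected from the top of the score-sorted Grid
responsive_found = 83   # c. documents confirmed responsive during the spot review

precision_at_top = responsive_found / top_selected
print(f"Precision across the top {top_selected} documents: {precision_at_top:.2f}")  # 0.83
```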
- Scenario 2 (within the same case) - if all documents in the case have been previously reviewed, the following steps describe how to build a Control Set for evaluating model performance.
- Increase the batch size to the desired number for the Control Set. For example, if 2,000 documents are desired for the Control Set, then the Batch Size should be set to 2,000.
- Use AI Batches to create a Control Set batch that contains the desired number of documents.
- Use bulk tagging to copy the previously reviewed results onto the Control Set documents above.
- Search for all documents within the Control Set batch that were tagged as Yes under the previous review tag.
- Bulk tag the search results as Yes under the newly created AI-enabled tag.
- Repeat the same steps to transfer No choices.
- Wait until the tag synchronization finishes, then open the Control Set page to check Precision, Recall, and F1 scores.
- Scenario 3 (within the same case) - only a limited number of documents are available for validation.
In this scenario, Precision/Recall can be calculated manually once AI scores are visible in the front end after the model has finished building.
a. Search for documents tagged as Yes under the previous review tag that are also part of the validation set saved earlier (Scenario 2 - Step 2 above). Note down this number.
b. Choose a threshold score at or above which a document is recognized as Yes by the model, for example, 60.
c. Search for documents within the validation set that score at or above the threshold. Note down this number.
d. Search for documents within the validation set that score at or above the threshold and were tagged as Yes under the previous review tag. Note down this number.
e. Calculate the Recall/Precision/F1 values using the formulas below:
- Recall = (Number from d. above) / (Number from a. above)
- Precision = (Number from d. above) / (Number from c. above)
- F1 = 2 * (Precision * Recall) / (Precision + Recall)
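The same arithmetic as a short sketch, using placeholder counts for the searches in steps a., c., and d. above.

```python
# Placeholder counts from the three searches described above.
yes_in_validation = 2000    # a. previously tagged Yes within the validation set
scored_at_or_above = 1900   # c. validation documents scoring at or above the threshold
yes_and_scored = 1700       # d. documents counted in both a. and c.

recall = yes_and_scored / yes_in_validation
precision = yes_and_scored / scored_at_or_above
f1 = 2 * precision * recall / (precision + recall)
print(f"Recall={recall:.2f}  Precision={precision:.2f}  F1={f1:.2f}")  # Recall=0.85  Precision=0.89  F1=0.87
```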