Practical Considerations In Using Predictive Coding

BY GARETH EVANS AND JENNIFER REARDEN

Predictive coding has tremendous appeal, at least in theory. As a practical matter, however, many have been deterred from using it because various hurdles can arise. Nevertheless, with some forethought and preparation, and by involving those with the right expertise, many of the hurdles can be overcome, or at least minimized, and parties may more often realize the potential benefits of predictive coding.

What Is Predictive Coding?

Predictive coding—often referred to as “technology assisted review” or “TAR”—uses mathematical and statistical algorithms to determine whether documents are likely to be relevant. To do so, it utilizes machine learning, in which reviewers code sample documents drawn from the overall document population. Essentially, the predictive coding tool identifies other documents in the population that share similar features with the sample documents coded as “positive” (i.e., relevant or responsive) or “negative” (i.e., irrelevant or non-responsive).

How Does It Work?

To understand how to make predictive coding practical, you first need to have a general understanding of how it works.

The traditional workflow for predictive coding has involved commencing machine learning with a “seed set” of pre-coded documents. The seed set can consist of a sample selected at random, through the use of initial search terms, documents already determined to be relevant, or through other means.

After processing the seed set, machine learning is then refined through iterative review of “training sets.” These are batches of documents that the tool selects for reviewers to code until the predictive coding model is “stabilized,” i.e., when additional training does not result in any meaningful improvement in results.

Some predictive coding tools select training documents strategically instead of just randomly, e.g., documents that appear to be close to the boundary between “positive” and “negative,” or samples from clusters of similar documents. Using these techniques, the model may achieve stabilization more quickly.

The tool then applies the learning from the seed and training sets to the entire document population. It identifies the likelihood that the remaining documents are either “positive” or “negative,” often with relevance scores. A higher score does not necessarily mean that a document is more relevant, but rather that the tool has determined that it has a greater likelihood of being relevant.
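To make the iterative workflow concrete, the following is a minimal sketch of a training loop of the kind described above. It is illustrative only: it assumes scikit-learn's TfidfVectorizer and LogisticRegression as a stand-in classifier, a hypothetical get_reviewer_coding function representing human review of a batch, and an arbitrary stability threshold. Actual predictive coding tools use their own proprietary models, selection strategies and stabilization measures.

# Illustrative sketch only: a simplified predictive coding training loop.
# Assumes scikit-learn; real e-discovery tools use proprietary models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_predictive_model(documents, seed_labels, get_reviewer_coding,
                           batch_size=200, stability_threshold=0.01):
    """documents: list of document texts.
    seed_labels: dict mapping document index -> 1 ("positive") or 0 ("negative")
                 for the pre-coded seed set.
    get_reviewer_coding: hypothetical callable that asks human reviewers to
                         code a batch of document indices and returns {index: 0 or 1}.
    """
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(documents)
    labels = dict(seed_labels)          # start from the seed set
    previous_scores = None

    while True:
        model = LogisticRegression(max_iter=1000)
        model.fit(features[list(labels)], [labels[i] for i in labels])

        # Score every document: probability of being "positive" (likely relevant).
        scores = model.predict_proba(features)[:, 1]

        # "Stabilization": stop when further training barely changes the scores.
        if previous_scores is not None:
            if abs(scores - previous_scores).mean() < stability_threshold:
                break
        previous_scores = scores

        # Select the next "training set": uncoded documents closest to the
        # positive/negative boundary (score near 0.5), i.e., strategic selection.
        uncoded = [i for i in range(len(documents)) if i not in labels]
        uncoded.sort(key=lambda i: abs(scores[i] - 0.5))
        labels.update(get_reviewer_coding(uncoded[:batch_size]))

    return model, scores

The boundary-based selection step mirrors the strategic sampling described above; a tool that selects training documents purely at random would simply draw each batch randomly instead.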
Predictive coding can also be effective on foreign language documents, including Asian languages.
Quality Control

An additional step that is frequently taken, although not always deemed necessary, is to “validate” the effectiveness of predictive coding through a quality control check. Reviewers code a random sample drawn from the overall document population, excluding documents from the seed and training sets. This sample is known as the “control sample” or “validation sample.”

The coding of the control sample is then compared to the tool’s decisions on the same documents. If the number of “false positives” and “false negatives” in the predictive coding results—as compared to the control sample—is acceptable, the training is complete. If not, you may seek to improve the results with further training.
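As a rough illustration, this validation step reduces to counting agreements and disagreements between the reviewers' coding of the control sample and the tool's decisions on the same documents. The sketch below is hypothetical; what counts as "acceptable," and the statistics a given tool actually reports, vary by tool and by matter.

# Illustrative sketch: comparing reviewer coding of a control sample
# against the predictive coding tool's decisions on the same documents.
def validate(control_sample_coding, tool_decisions):
    """Both arguments map document ID -> 1 ("positive") or 0 ("negative").
    control_sample_coding covers only the random control sample, which
    excludes seed and training set documents."""
    false_positives = false_negatives = agreements = 0
    for doc_id, human_call in control_sample_coding.items():
        tool_call = tool_decisions[doc_id]
        if tool_call == human_call:
            agreements += 1
        elif tool_call == 1:      # tool said relevant, reviewer said not
            false_positives += 1
        else:                     # tool said not relevant, reviewer said relevant
            false_negatives += 1

    total = len(control_sample_coding)
    return {
        "agreement_rate": agreements / total,
        "false_positive_rate": false_positives / total,
        "false_negative_rate": false_negatives / total,
    }

If the false positive and false negative rates are acceptable, training is complete; otherwise, further training rounds may improve the results.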
Review Before Production

A few years ago, when predictive coding first gained some notoriety as a technology for document review, some envisioned documents being blindly produced after only the “computer” reviewed them.

The typical workflow that has emerged in practice, by contrast, is to review, prior to any production, documents that the predictive coding tool has identified as likely relevant. This allows for false positives—i.e., irrelevant documents—and privileged documents to be removed before production.
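In workflow terms, this pre-production review is a human pass over the tool's likely-relevant set, with documents coded irrelevant or privileged being withheld. The sketch below is hypothetical; the cutoff score, the review categories and the handling of withheld material vary by matter and by tool.

# Illustrative sketch: human review of the tool's likely-relevant documents
# before production, removing false positives and privileged material.
def build_production_set(scores, cutoff, review_document):
    """scores: dict of document ID -> relevance score from the tool.
    cutoff: score above which a document is treated as likely relevant.
    review_document: hypothetical callable returning "relevant", "irrelevant",
                     or "privileged" after human review of a document ID."""
    production_set, privilege_log = [], []
    for doc_id, score in scores.items():
        if score < cutoff:
            continue                      # not identified as likely relevant
        decision = review_document(doc_id)
        if decision == "privileged":
            privilege_log.append(doc_id)  # withheld and logged
        elif decision == "relevant":
            production_set.append(doc_id)
        # "irrelevant" = a false positive; simply not produced
    return production_set, privilege_log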
Continuous Training

Predictive coding technology has been evolving. One noteworthy development has been the appearance of tools utilizing a training methodology known as “continuous active learning” or “CAL.” CAL, in effect, combines the training and final review phases described above.
After initially training the predictive model with a seed set, a CAL tool will present reviewers with documents that it has identified as likely relevant and others it has strategically selected for training. The review continues—and the model is continuously trained—until all the relevant documents have been found at the desired rate of recall.

Vendors of CAL tools claim that they train the predictive model faster and that reviewers end up reviewing fewer irrelevant documents than with other tools.
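A minimal sketch of such a continuous active learning loop follows. It is hypothetical: it reuses the illustrative classifier approach above, assumes an estimate of the total number of relevant documents for the recall calculation, and relies on the same stand-in get_reviewer_coding function. Real CAL tools also mix in strategically selected documents and apply their own stopping criteria.

# Illustrative sketch: a simplified continuous active learning (CAL) loop,
# combining training and final review. Reviewers keep coding the documents
# the model currently ranks as most likely relevant until a target recall
# is reached.
from sklearn.linear_model import LogisticRegression

def continuous_active_learning(features, seed_labels, get_reviewer_coding,
                               estimated_relevant_total, target_recall=0.80,
                               batch_size=100):
    """features: document-feature matrix (e.g., TF-IDF), one row per document.
    estimated_relevant_total: an estimate (e.g., from a random sample) of how
    many relevant documents exist in the whole population."""
    labels = dict(seed_labels)
    found_relevant = sum(labels.values())

    while found_relevant < target_recall * estimated_relevant_total:
        model = LogisticRegression(max_iter=1000)
        model.fit(features[list(labels)], [labels[i] for i in labels])
        scores = model.predict_proba(features)[:, 1]

        # Present reviewers with the highest-scoring uncoded documents, i.e.,
        # the ones the model has identified as most likely relevant.
        uncoded = [i for i in range(features.shape[0]) if i not in labels]
        if not uncoded:
            break  # every document has been reviewed
        uncoded.sort(key=lambda i: scores[i], reverse=True)
        batch = get_reviewer_coding(uncoded[:batch_size])

        labels.update(batch)
        found_relevant += sum(batch.values())

    return labels  # the reviewers' coding doubles as the final review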
What’s in It for the Producing Party?

For the producing party, significantly increased speed, substantial cost savings and improved accuracy are among the potential benefits.

GARETH EVANS and JENNIFER REARDEN are partners at Gibson, Dunn & Crutcher’s Orange County and New York offices, respectively.