Welcome to Module 2-J.
Jason Baron on Search – How Do You Find Anything When You Have a Billion Emails?
Jason R. Baron is a lawyer, writer, editor, and important thought leader in the field of e-discovery. He is also a good friend and one of the wittiest people you will ever meet. Jason worked tirelessly for The Sedona Conference for many years, including serving as Co-Chair of the Working Group on Electronic Document Retention and Production. He was also the Editor-in-Chief of The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval in E-Discovery (2007), and The Sedona Conference Commentary on Achieving Quality in the E-Discovery Process (2009). Jason is also a co-founder of TREC Legal Track, the only scientific group that investigates the problems of search in the legal field. He also teaches e-discovery at the University of Maryland, and at countless CLE events around the country.
This module goes deep into the mind of Baron and the important writings he has edited and projects he supervised. It includes three videos where he talks to law students about search and some of his projects. There are few teachers who are more enthused, deep, and humorous than Mr. Baron. Relax and spend as much time as you need to study this important module.
Jason Baron is also well-known as a pioneer of a dialogue between the disciplines of law and information science. This led to his co-founding the TREC Legal Track with Doug Oard. Jason Baron’s efforts to bridge the disciplines of law and information science are driven by his desire to help the law cope with the sudden explosion in the volume of information. Jason was on the front line of this problem because his old employer, NARA (National Archives and Records Administration), among other things, handles White House email litigation and other federal records disputes. He lives in a world where the management of billions of emails and government records is routine. He understands far better than most the need of law to work with science to cope with these issues.
NARA keeps the permanent records of the U.S. government, including the emails and other records of the White House and Presidential Libraries. As difficult as it is to search for one relevant email in a universe of 200,000,000, the situation is getting worse all of the time. Jason expects the Obama administration, if it goes for two terms, to generate over a billion emails by 2017. These kinds of Carl Sagan type numbers (“Billions and Billions!”) help motivate Jason to think long and hard about the future of search and explain why he has reached out to information scientists for help. See, e.g., National Institute of Standards and Technology TREC Legal Track, the general TREC conference, and the DESI III at ICAIL 2009 workshop in Barcelona.
Jason shared his thoughts and science outreach efforts with about 60 law students at the University of Florida in a class that Bill Hamilton and I taught. Jason spoke for two hours before students who had previously read Jason’s scholarly writings on the subject and the landmark cases. Paul & Baron, Information Inflation: Can The Legal System Adapt?, 13 Rich. J.L. & Tech. 10 (2007); Baron, Jason, Editor, The Sedona Conference® Best Practices Commentary on Search & Retrieval Methods (Aug. 2007); Baron, Jason, E-discovery and the Problem of Asymmetric Knowledge (Presentation at the Mercer Law School Ethics in the Digital Age Symposium, Nov. 2008); Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 2007 WL 1585452 (D.D.C. June 1, 2007); United States v. O’Keefe, 2008 WL 449729 (D.D.C. Feb. 18, 2008); Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md., May 29, 2008). This is the perfect format to give Jason sufficient time to present a full overview of his ideas and projects in this area. It also led to excellent questions and discussions, which are hard to come by in a typical attorney CLE program where few, if any, of the attendees actually study the material in advance.
I say Jason Baron had time to provide a full overview of his ideas because it would take a day or more to flesh out all of the details of his work on this subject. Typically, e-discovery CLEs include search as part of a curriculum, and, at best, you only hear Jason Baron as part of a panel with limited time. I know because I’ve been on two search panels with him. That is better than nothing, but not really adequate for a full airing of his views or mine.
Here is a short excerpt of a presentation he gave to law students at the University of Florida. Please note that it is a recording of a pro bono lecture Mr. Baron gave in 2009 at the University of Florida, Levin College of Law. The lecture has been copied for use here with his permission.
Tobacco Litigation e-Discovery
Jason begins his presentation with a story of his experience assisting trial attorneys in the Department of Justice on the tobacco litigation team. In the early 2000s, the team at DOJ worked with various agency counsel (including Jason representing NARA) on the task of responding to discovery requests from the tobacco industry in U.S. v. Philip Morris. This was a mammoth e-discovery project. There were 1,726 Requests to Produce propounded by tobacco companies against 30 federal agencies for tobacco related records.
The hardest part of the project was the search of 32 million Clinton era email records. It started with Jason and his team studying the requests and “dreaming up” 12 keyword combinations to search/cull the 32 million emails. They ran some tests on samples and then had the good sense to do something that was then new and daring: they told the tobacco company requesting parties what the search terms were and invited them to participate. The tobacco company lawyers responded favorably and suggested some new terms that were then explored. This was followed by more sampling to find “noisy” terms, that is, keyword terms that generated too many false positives (Marlboro, PMI, TI, etc.). The results were reported back to the opposing counsel and a consensus was reached as to additional terms to be used in the search protocol. Then and only then was the full search run against the 32 million emails. Here is an example that Jason gave of one of the Boolean search strings that was used in the search:
(((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro)) AND NOT (tobacco* OR cigarette* OR smoking OR tar OR nicotine OR smokeless OR synar amendment OR philip morris OR r.j. reynolds OR (“brown and williamson”) OR (“brown & williamson”) OR bat industries OR liggett group)
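To make the "AND NOT" noise filters in that string concrete, here is a minimal Python sketch of how two of its clauses could be evaluated against a single email. It is only an illustration: the helper names and the three sample emails are hypothetical, and plain substring matching stands in for whatever search engine was actually used in the case.

```python
def contains(text: str, phrase: str) -> bool:
    """Case-insensitive substring match; a stand-in for a real search index."""
    return phrase.lower() in text.lower()


def term_without_noise(text: str, term: str, noise_phrases: list[str]) -> bool:
    """Evaluate `term AND NOT (noise_1 OR noise_2 ...)` at the document level."""
    if not contains(text, term):
        return False
    return not any(contains(text, noise) for noise in noise_phrases)


# Two clauses taken from the search string above.
CLAUSES = [
    ("marlboro", ["upper marlboro"]),
    ("pmi", ["presidential management intern"]),
]


def hits(text: str) -> bool:
    """True when any clause matches (the clauses are joined by OR)."""
    return any(term_without_noise(text, term, noise) for term, noise in CLAUSES)


# Hypothetical emails, invented for illustration only.
emails = [
    "Draft talking points on Marlboro advertising to teens",
    "Directions to the courthouse in Upper Marlboro, Maryland",
    "Welcome to the new Presidential Management Intern (PMI) class",
]
results = [hits(email) for email in emails]
print(results)  # [True, False, False]
```

Note the cost of this kind of exclusion: an email that mentions both the cigarette brand and the town of Upper Marlboro would also be dropped, which is one reason the terms were tested on samples before the full search was run.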
As a result of the search, 99% of the documents were culled out. But that still left 320,000 emails, plus attachments. About half of those were found to be relevant, which, in my experience, is a high precision ratio. Of the relevant emails and attachments, about 20% were found to be privileged. They were logged and withheld, and the 80% balance of relevant files were produced. Although I am sure the documents uncovered were of some help to both sides, the sad truth is, none were ever used as an exhibit at trial.
The One Percent Solution Does Not Scale
The parties in the tobacco litigation were, under Jason’s leadership, able to cooperate and agree upon Boolean search parameters that reduced the total universe to be reviewed for production by 99%. That is, in my experience, a very high cull rate. The use of keyword-based culling alone can rarely, if ever, go beyond the one percent barrier. That is especially true in a negotiated term setting. In the tobacco case the government was willing to search the one percent remaining after culling, here 320,000 emails. The case was big enough (billions of dollars were at stake) and the U.S. government could afford the millions of dollars required for the review and production.
Jason then explained that the core problem is that the one percent solution does not scale. The government could afford to review and produce one percent of the Clinton era email, but cannot afford to review and produce one percent of Bush’s email, which equals 2 million emails (1% of 200,000,000 = 2,000,000), much less the expected email of Obama (1% of 1,000,000,000 = 10,000,000). What would it cost and how long would it take to review ten million emails (1% of 1 billion)? Jason estimates it would cost at least $20 million and take a team of 100 lawyers working 10-hour days, seven days a week, over 28 weeks. I personally think that is an underestimate in time and cost. But regardless, it is far more than the federal government can afford or is willing to pay for a discovery request (even if, in my opinion, not Jason’s, some of the judges on the D.C. Circuit Court of Appeals do not appear to care how much discovery costs, as the decision In Re Fannie Mae Litigation suggests).
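The arithmetic behind the scaling problem can be checked in a few lines of Python. This sketch uses only the figures quoted above; the per-hour review rate and hourly cost it prints are not stated in the lecture, they are simply what those figures imply when divided out.

```python
# Email volumes quoted in the text; 1% survives keyword culling.
collections = {
    "Clinton": 32_000_000,
    "Bush": 200_000_000,
    "Obama (projected)": 1_000_000_000,
}

# Integer division keeps the counts exact.
review_sets = {name: total // 100 for name, total in collections.items()}
for name, count in review_sets.items():
    print(f"{name}: {count:,} emails left to review")

# Staffing estimate quoted in the text for the ten million email review.
lawyers, hours_per_day, days_per_week, weeks = 100, 10, 7, 28
total_hours = lawyers * hours_per_day * days_per_week * weeks

# Derived values (implied by the quoted figures, not stated in the source).
implied_docs_per_hour = review_sets["Obama (projected)"] / total_hours
implied_cost_per_hour = 20_000_000 / total_hours

print(f"Total reviewer hours: {total_hours:,}")
print(f"Implied review rate: {implied_docs_per_hour:.0f} emails per lawyer-hour")
print(f"Implied cost: ${implied_cost_per_hour:.0f} per lawyer-hour")
```

In other words, the estimate assumes each lawyer reviews roughly 51 emails an hour, every hour, for more than six months without a day off, at a cost of about $102 per hour.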
Here is how Jason summed up the problem of scale in his talk to U.F. law students:
One percent of a billion after a keyword search is too much. Something has got to change… You have to take that huge volume and somehow cut down the haystack as much as possible that’s reasonable to do searches against, and then those searches need to be more efficient than what they are today. But that problem is a hard one; doing efficient searches is very hard.
Jason then explained some of the many reasons that search of large, heterogeneous data collections is so hard to do. They include such things as “Polysemy,” which means ambiguous terms (e.g., “George Bush,” “strike”), “Synonymy,” which means variation in describing the same person or thing in a multiplicity of ways (e.g., “diplomat,” “consul,” “official,” “ambassador,” etc.), and “Pace of Change,” which refers to the never-ending development of new communication media and languages (e.g., Twitter, text messaging, and computer gaming, i.e. “POS,” “1337”).
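A toy example may help show how synonymy and polysemy defeat a simple keyword search. The four documents below are invented for illustration, and the whole-word matching is only a rough stand-in for a real keyword search tool.

```python
import re

# Hypothetical mini-collection, invented for illustration only.
documents = {
    1: "The ambassador met the consul before the trade talks.",
    2: "A senior diplomat cabled Washington about the talks.",
    3: "The union called a strike at the plant.",
    4: "The memo discussed an air strike on the depot.",
}


def keyword_hits(keyword: str) -> list[int]:
    """Return the ids of documents containing the keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]


# Synonymy: document 1 is about diplomats but never uses the word, so it is missed.
diplomat_hits = keyword_hits("diplomat")
print(diplomat_hits)  # [2]

# Polysemy: "strike" retrieves both the labor dispute and the military action.
strike_hits = keyword_hits("strike")
print(strike_hits)  # [3, 4]
```

Synonymy costs you relevant documents that you never see, while polysemy fills your results with documents you did not want. Multiply both effects across millions of emails and the difficulty Jason describes becomes clear.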
The Myth of Search & Retrieval
Most litigation lawyers today do not understand just how hard it is to search large data-sets. They think that when they request production of “all” relevant documents (and now ESI), that “all or substantially all” will in fact be retrieved by existing manual or automated search methods. This is a myth. The corollary of this myth is that the use of “keywords” alone in automated searches will reliably produce all or substantially all documents from a large document collection. Again, most litigators think this is true, but it is not. That is not just Jason’s opinion, or my opinion, it is what scientific, peer-reviewed research has shown to be true.
A study by information scientists David Blair and M.E. Maron in 1985 revealed a significant gap or disconnect between lawyers’ perceptions of their ability to ferret out relevant documents and their actual ability to do so. The study involved a 40,000 document case (350,000 pages). The lawyers estimated that a keyword search process uncovered 75% of the relevant documents, when in fact it had only found 20%! Blair, David C., & Maron, M. E., An evaluation of retrieval effectiveness for a full-text document-retrieval system, Communications of the ACM, Volume 28, Issue 3 (March 1985); Also see: Dabney, The Curse of Thamus: An Analysis of Full-Text Legal Document Retrieval, 78 Law Libr. J. 5 (1986).
The myth of the effectiveness of keyword search is perpetuated by some e-discovery vendors who claim that a very high rate of “recall” (i.e., finding all relevant documents) is easily obtainable, provided you use their software product or service. In fact, research performed by information scientists and lawyers at the National Institute of Standards and Technology TREC Legal Track has again confirmed that keyword search alone still finds only about 20%, on average, of relevant ESI in the search of a large data-set, although alternative methods are beginning to achieve better results.
Electronic documents that are relevant to a request for information, and are retrieved by a search process, are referred to as “True Positives.” These are the files we want. We do not want a search to retrieve irrelevant files. The irrelevant files that are not retrieved are called “True Negatives.” In an ideal, perfect world, our automated search would find all relevant files, and only relevant files. We would have 100% True Positives and 100% True Negatives. But in reality, it never works that way, at least not in large sets of data. In reality, a search retrieves both relevant files and irrelevant files. The irrelevant files retrieved are called False Positives.
In information science, “Precision” is the share of the retrieved files that are relevant, that is, the True Positives divided by the sum of the True Positives and the False Positives. Precision is good; it means you spend less time reviewing irrelevant files. That saves money and thus is very important to real world e-discovery. In the Blair and Maron study, for instance, the Precision was 79%, while the Recall was only 20%. That means that 79% of the documents retrieved by the search were relevant, a high rate of Precision in my experience, but 80% of the relevant documents were not retrieved.
The relevant documents that are not found by a search are called “False Negatives.” The “Recall” rate is the share of all relevant documents that the search actually finds, that is, the True Positives divided by the sum of the True Positives and the False Negatives. Thus, in the Blair and Maron study, which was again confirmed in the TREC study, for every 100 relevant files in the collection, the keyword search identified only 20, the True Positives, and failed to see 80, the False Negatives. In an ideal, perfect search, which again is impossible for large data-sets, you would find all relevant documents and achieve a 100% Recall. Information science research has discovered that in the search of large data-sets there is a typical trade-off between Recall and Precision, such that the higher your Precision, the lower your Recall, and vice versa.
Thus, for example, if your search only uncovered five documents, and you were lucky enough that all five were relevant, then you would have 100% Precision. There would be no False Positives. But in that circumstance, you would likely have attained a very low Recall rate. You may have found five relevant files, but left behind another five hundred. Thus, in that example, your Recall would be 5/505, or slightly less than one percent (.99%). That is the basic stuff of search analysis. The next instructional step after that, in my opinion, requires venturing into the world of sampling and thus is one of those things that requires a full day seminar, and (horrors) more math.
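The five-document example can be worked through in a short Python sketch using the standard information retrieval definitions: Precision is True Positives divided by everything retrieved, and Recall is True Positives divided by everything relevant. The function names are my own, chosen for illustration.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of retrieved documents that are relevant: TP / (TP + FP)."""
    retrieved = true_pos + false_pos
    return true_pos / retrieved if retrieved else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of relevant documents that were retrieved: TP / (TP + FN)."""
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant else 0.0


# Five relevant documents found, none irrelevant, five hundred relevant ones missed.
p = precision(true_pos=5, false_pos=0)
r = recall(true_pos=5, false_neg=500)
print(f"Precision: {p:.0%}, Recall: {r:.2%}")  # Precision: 100%, Recall: 0.99%

# Blair and Maron: of every 100 relevant files, 20 were found and 80 were missed.
blair_maron_recall = recall(true_pos=20, false_neg=80)
print(f"Blair and Maron Recall: {blair_maron_recall:.0%}")  # 20%
```

Notice that Precision never looks at the False Negatives. A search can score perfectly on Precision while missing nearly everything that matters, which is why both numbers have to be reported together.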
TREC Legal Track
The Recall-Precision trade-off is a problem well known to all of the participants in the Legal Track of the TREC conferences. The Legal Track supervises an open data search experiment and sponsors an annual meeting where the results are discussed and debated in academic fashion. The participants were primarily professors and their students from information science departments, plus a few attorneys like Jason, plus a few e-discovery vendors as well. In addition to Jason Baron, the coordinators for the 2008 TREC Legal Track were Bruce Hedin, Ph.D., Douglas W. Oard, Ph.D., and Stephen Tomlinson. See the official, lengthy report on the 2007 TREC Legal Track. Also see Sedona Conference Open Letter on the 2008 TREC Legal Track.
The TREC Conference series is sponsored by the National Institute of Standards and Technology (NIST). It is designed to promote research into the science of information retrieval in general and has a number of different fields of study, or “Tracks.” The first TREC conference was in 1992. The 15th Conference was held in 2006, where Jason Baron and his colleague Doug Oard at the University of Maryland convinced the TREC conference to begin a new Legal Track for the study of problems faced by attorneys searching large data sets to respond to discovery requests. The TREC Legal Track was thus born in 2006 and continued until 2011.
TREC Legal Track set up a search problem using hypothetical legal complaints and “requests to produce” with over 100 categories created to date. The requests are drafted by members of The Sedona Conference with litigation experience. “Boolean negotiations” were then conducted by a control group of expert attorneys simulating real-world conditions. They agreed upon baseline keyword search terms with Boolean operators and wildcards to retrieve data relevant to the requested categories. These categories varied tremendously from the dry and serious shown in the example below, to the slightly whimsical, such as a category requesting all documents making a connection between the music and songs of Peter, Paul, and Mary, Joan Baez, or Bob Dylan, and the sale of cigarettes. Here is the example provided as to how the negotiations went for one of the 100 topics:
Request Number: 52
Request Text: Please produce any and all documents that discuss the use or introduction of high-phosphate fertilizers (HPF) for the specific purpose of boosting crop yield in commercial agriculture.
Proposal by Defendant (recipient of discovery): “high-phosphate fertilizer!” AND (boost! w/5 “crop yield”) AND (commercial w/5 agricultur!)
Rejoinder by Plaintiff (requestor of discovery): (phosphat! OR hpf OR phosphorus OR fertiliz!) AND (yield! OR output OR produc! OR crop OR crops)
Final Query (as agreed to by the parties): ((“high-phosphat! fertiliz!” OR hpf) OR ((phosphat! OR phosphorus) w/15 (fertiliz! OR soil))) AND (boost! OR increas! OR rais! OR augment! OR affect! OR effect! OR multipl! OR doubl! OR tripl! OR high! OR greater) AND (yield! OR output OR produc! OR crop OR crops)
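The gap between the Defendant's narrow proposal and the Plaintiff's broad rejoinder can be illustrated with a rough Python sketch. It rests on assumptions the materials do not spell out: that "!" is a root expander matching any word ending, and that "w/5" means within five words in either order, as in common legal research query syntax. Distance is measured between the first words of each phrase, which is a simplification, and the two sample documents are invented.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words (hyphens split words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def parse_phrase(phrase: str) -> list[str]:
    """Split a query phrase into terms, keeping a trailing '!' root expander."""
    return re.findall(r"[a-z0-9]+!?", phrase.lower())


def term_matches(token: str, term: str) -> bool:
    """'fertiliz!' matches any token starting with 'fertiliz'; otherwise exact."""
    if term.endswith("!"):
        return token.startswith(term[:-1])
    return token == term


def positions(tokens: list[str], phrase: str) -> list[int]:
    """Start positions where every term of the phrase matches consecutively."""
    terms = parse_phrase(phrase)
    last_start = len(tokens) - len(terms)
    return [
        i for i in range(last_start + 1)
        if all(term_matches(tokens[i + k], term) for k, term in enumerate(terms))
    ]


def within(tokens: list[str], left: str, right: str, n: int) -> bool:
    """Assumed meaning of `left w/n right`: start positions at most n words apart."""
    return any(
        abs(a - b) <= n
        for a in positions(tokens, left)
        for b in positions(tokens, right)
    )


def any_term(tokens: list[str], phrases: list[str]) -> bool:
    """OR over a list of terms or phrases."""
    return any(positions(tokens, phrase) for phrase in phrases)


def defendant_query(text: str) -> bool:
    tokens = tokenize(text)
    return (
        bool(positions(tokens, "high-phosphate fertilizer!"))
        and within(tokens, "boost!", "crop yield", 5)
        and within(tokens, "commercial", "agricultur!", 5)
    )


def plaintiff_query(text: str) -> bool:
    tokens = tokenize(text)
    return any_term(tokens, ["phosphat!", "hpf", "phosphorus", "fertiliz!"]) and any_term(
        tokens, ["yield!", "output", "produc!", "crop", "crops"]
    )


# Hypothetical documents, invented for illustration only.
doc_a = "High-phosphate fertilizers were introduced to boost the crop yield of commercial agriculture."
doc_b = "Phosphorus levels in the soil raised output on several farms."

narrow = [defendant_query(doc_a), defendant_query(doc_b)]
broad = [plaintiff_query(doc_a), plaintiff_query(doc_b)]
print(narrow)  # [True, False]
print(broad)   # [True, True]
```

The Defendant's query only finds the document that tracks the request almost word for word, while the Plaintiff's query also sweeps in the second document, along with a great many irrelevant ones in a real collection. The negotiated final query sits between the two.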
A search was then made of the chosen public document database using the agreed protocols. In 2006, 2007, and 2008 TREC used the nearly 7 million document database from the tobacco litigation. These documents are a set of OCR-scanned TIFF files. The next study in the summer and fall of 2009 used the Enron litigation public data-set. This collection has no OCR scanning errors and thus, in my opinion, is more reflective of modern practice.
The various search teams participating then ran their own searches of the same database. Up until 2008 most of the participating teams were information scientists from universities, but in 2008 two e-discovery vendors joined the project, H5 and Clearwell Systems. The public database is, of course, totally unstructured and disorganized, and, like real life, it is filled with spelling errors, scanning errors, and language idiosyncrasies. The search teams used various automated methods and protocols to try to locate documents in the database responsive to various categories.
The experiment was, among other things, designed to evaluate the Precision and Recall of the various search methods used by the teams and to compare their results with the arms-length, expert attorney negotiated search terms. The negotiated keyword search method did about the same as the original Blair and Maron study, with an average recall rate of approximately 22% based on sampling. This means that once again approximately 78% of the relevant documents were not found by the approach now most commonly employed by attorneys. Some of the automated search methods used by the various teams beat this 22% Recall rate, but usually not by much, and not consistently over all categories. The degree of success depended upon the particular category. But, I am pleased to report higher recall, up to 81%, was achieved for at least one individual topic in the so-called “Interactive task,” which more closely models e-discovery practice than TREC’s set piece “ad hoc” task. The Interactive task used “Topic Authorities” drawn from the ranks of the Sedona Conference who acted in the role of senior litigators giving advice and feedback about the topics to the participating teams, and participating teams spent far greater overall resources in attempting to respond to one or two or three topics only. I am not sure if the 81% figure was for the Bob Dylan topic or what, but it is a far cry from the 22% average of keyword searches and shows great hope for the future of search.
The TREC Legal Track was grappling with three fundamental issues. In Jason’s words:
(1) How can one go about improving rates of Recall and Precision (so as to find a greater number of relevant documents, while spending less overall time, cost, etc., sifting through noise?)
(2) What alternatives to keyword searching exist?
(3) Are there ways in which to benchmark alternative search methodologies so as to evaluate their efficacy?
The importance of TREC to the legal community has already been recognized by many scholars and at least two leading jurists, Judge Grimm in Victor Stanley and Judge Scheindlin in Securities and Exchange Commission v. Collins & Aikman Corp., 2009 WL 94311 (S.D.N.Y., Jan. 13, 2009).
In 2011 Jason Baron wrote an article providing his thoughts on search, search ethics, and some of what he and others have learned from TREC. Baron, J.R., Law in the Age of Exabytes: Some Further Thoughts on “Information Inflation” and Current Issues in e-Discovery Search, Richmond Journal of Law and Technology, Vol. XVII, Issue 3 (Spring 2011). This is an important article to read. For one last excellent, albeit difficult article on search, see: Grossman and Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Richmond Journal of Law and Technology, Vol. XVII, Issue 3 (Spring 2011). It is helpful for those who want to go deeper into TREC and search, but will not be part of any testing for this program. Yes, the Grossman and Cormack article is that difficult.
The old days of simple keyword search for relevant documents are coming to an end. We can no longer afford its gross inefficiencies and its outrageous expense. There is simply too much data in lawsuits today to continue using this method of search from the 1980s. It was only able to recall 20% of the relevant information when it first started in the 1980s, and still does little better than that today, even in the hands of experts. My guess is that average lawyers with no special expertise in keyword search are only achieving Recall of 10% to 15%, but like the attorneys in the Blair and Maron study, think they are getting most of it. The power of myth is strong.
There has got to be a better way than negotiated keyword search. Many people are working on this problem right now, myself included, and breakthroughs are imminent. As Jason Baron put it at the end of his session at U.F.:
We are just at the beginning, sort of the dawn of some new paradigm in the law. There is something happening out there, something different – and you can feel it.
Jason Baron on the TREC Legal Track
Here is an exclusive video explanation by Jason Baron for the students of this class on the TREC Legal Track.
Study this second video lecture which Mr. Baron made especially for this class.
SUPPLEMENTAL READING: Read Jason Baron’s Law in the Age of Exabytes: Some Further Thoughts on “Information Inflation” and Current Issues in e-Discovery Search. Write a memo summarizing one fact, or insight, that you found particularly interesting and explain why.
EXERCISE. What did you find most interesting about these videos by Jason Baron? Next, think about what Jason Baron had to say about the future of the legal profession in this video, and other materials he has written, including the Did You Know? video with Ralph Losey. What impact do you think future technologies and inventions will have on your future life as a lawyer? How might it impact what you will be doing (as a lawyer) 10 years from now? Thirty years from now? Really take the time to think this through. This is your future we are talking about here. How will you help make it happen?
Students are invited to leave a public comment below. Insights that might help other students are especially welcome. Let’s collaborate!
Copyright Ralph Losey 2015