Welcome to Module 4-J.
Hash, Search and Privacy.
The three essays in this module cover one of the most powerful technological tools in e-discovery, the hash algorithm. You need to know about hash and understand its power to be an e-discovery lawyer in the Twenty-First Century, especially if you ever handle trade secret cases. This class will help you master this important subject and see some of its applications in litigation.
Computer hash is a one-way mathematical algorithm (it is not encryption, because a hash value cannot be decrypted back into the original data) that forms the mathematical foundation of e-discovery. Hashing generates a unique alphanumeric value to identify a particular computer file, group of files, or even an entire hard drive. As an example, the hash of the animated GIF file is shown above. The unique alphanumeric value of a computer file is called its “hash value.” Hash is also known in mathematical parlance as the “condensed representation” or “message digest” of the original message. It is more popularly known today as a “digital fingerprint.” Hash is the bedrock of e-discovery because the digital fingerprint lets you verify the authenticity of data and detect any alteration, whether negligent or intentional. Hash also allows for the identification of particular files, and the easy filtration of duplicate documents, a process called “deduplication” that is essential to all e-discovery document processing.
Hash is my favorite e-discovery technology. I became fascinated by its great potential as a safeguard for electronic evidence in the future, and ended up reading and experimenting with this algorithm in depth. Ultimately, I wrote a forty-four page law review article on the subject. HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007). There I discuss hash at length and review just about every case that mentions it. The article has 174 footnotes to provide reference to almost everything on the subject that might be of interest to a lawyer or others in the e-discovery field. As the title suggests, I make a specific proposal in the article for the adoption of an e-discovery file naming protocol based on hash to replace the paper-oriented Bates stamp. For more background on this law review article see my prior blog about it. For information on how the use of hash, instead of Bates stamps, is much more efficient and saves money in e-discovery processing, see my other blog The Days of the Bates Stamp Are Numbered.
Technically, hashing is based on the substitution and transposition of data by various mathematical formulas. Thus the process is called “hashing,” in the linguistic sense of “to chop and mix.” The hash value is commonly represented as a short string of random-looking letters and numbers, which are actually binary data written in hexadecimal notation. Hash is commonly called a file’s “fingerprint” because it represents its absolute uniqueness.
If two computer files are identical, then they will have the same hash value. Even if the files have different names, if their contents are exactly the same, they will have the same hash. This allows for easy identification and elimination of redundant documents, the deduplication process mentioned above. But if you so much as change a single comma in a thousand-page text, it will have a completely different hash value than the original. There are no similarities in the hash values based on similarities in the files. Each value is unique. That is how the math in all hashing works.
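For readers who want to see deduplication in action, here is a minimal sketch in Python using the standard hashlib library. The file names and contents are hypothetical, and real e-discovery processing software is far more elaborate; the sketch only illustrates the principle that identical contents hash to the same value regardless of file name.

```python
import hashlib
from collections import defaultdict

def group_by_hash(files):
    """Group file names by the SHA-1 hash value of their contents."""
    groups = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha1(content).hexdigest()
        groups[digest].append(name)
    return dict(groups)

# Hypothetical files: two identical copies under different names,
# and a third copy with a single comma removed.
files = {
    "contract_final.txt": b"The parties agree to the terms, as stated.",
    "copy_of_contract.txt": b"The parties agree to the terms, as stated.",
    "contract_edited.txt": b"The parties agree to the terms as stated.",
}

groups = group_by_hash(files)
duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)
```

The two identically worded files land in the same group even though their names differ, while the copy missing one comma gets a group of its own.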
Many kinds of effective hash formulas have been invented, but two are in wide use today: the SHA-1 and MD5 algorithms. Both are very effective, in that mathematicians conjecture that it is “computationally infeasible” for two different files to produce the same hash value. That is why hashing is commonly employed in data transmissions to verify that the integrity of a file has been maintained in transmission. If you hash the file received, and it does not produce the same hash value, then it has been corrupted, and at least one byte is not the same as the original. It is a guaranteed way of verifying the integrity of an electronic file.
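The integrity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names file_hash and verify, the file name exhibit.bin, and the sample contents are all hypothetical, and the demonstration uses a temporary file in place of a real transmission.

```python
import hashlib
import os
import tempfile

def file_hash(path, algorithm="sha1", chunk_size=65536):
    """Hash a file in chunks so that large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex, algorithm="sha1"):
    """Return True if the file's hash value matches the expected value."""
    return file_hash(path, algorithm) == expected_hex.lower()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "exhibit.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"original evidence file")
    sent_hash = file_hash(path)          # value recorded by the sender
    intact = verify(path, sent_hash)     # the receiver re-hashes the file

    # Simulate corruption of a single byte in transit.
    with open(path, "wb") as f:
        f.write(b"original evidence filf")
    corrupted_ok = verify(path, sent_hash)

print(intact, corrupted_ok)
```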
Software to run both the SHA-1 and MD5 hash analysis of files is widely available, easy to use and free. I use a HashTab Shell Extension to Windows, available for free at http://www.beeblebrox.org/software.php. The hash value of any file can be instantly determined, regardless of the type of electronic file, including graphics. For instance, the hash values of a Word document I am working on now are:
If I only change one comma in this multipage document, all else remaining the same, the hash values are now:
Although the two files have only this trivial difference, there are no similarities in these hash values, proving that hashing will detect even the slightest file alteration.
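Anyone can repeat this experiment. The following Python sketch uses a hypothetical block of repeated text in place of my Word document, removes one comma, and computes both hash values before and after. The helper names are my own for illustration. Note that a few character positions in the two values may coincide by pure chance, since each position holds only one of sixteen possible hexadecimal characters.

```python
import hashlib

def hash_values(data):
    """Return the MD5 and SHA-1 hash values of the data as hexadecimal text."""
    return {
        "MD5": hashlib.md5(data).hexdigest().upper(),
        "SHA-1": hashlib.sha1(data).hexdigest().upper(),
    }

def matching_positions(a, b):
    """Count the character positions at which two hash values agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

# Hypothetical stand-in for a multipage document.
original = ("Hashing detects alterations, even trivial ones. " * 1000).encode("utf-8")
altered = original.replace(b",", b"", 1)  # remove a single comma

before = hash_values(original)
after = hash_values(altered)
for name in before:
    print(name, before[name], "->", after[name],
          "| matching positions:", matching_positions(before[name], after[name]))
```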
Hashing can also be used to determine when fields or segments within files are identical, even though the entire file might be quite different. This requires special software, but again is commonly available from many e-discovery vendors, for a price. This software allows you to hash only portions of a file. Thus, for instance, you can hash only the body of an email, the actual message, to determine whether it is identical with another email, even when the “reference” or the “to” and “from” fields are different. This allows for an important filtering process called “near de-duplication.”
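Commercial tools do this in their own ways, which I cannot speak to in detail. As a rough sketch of the idea only, the Python standard library can separate an email's header fields from its body and hash just the body. The sample messages and the function name body_hash are hypothetical, and the sketch assumes simple single-part text messages.

```python
import hashlib
from email import message_from_string

def body_hash(raw_email):
    """Hash only the message body, ignoring the header fields.

    Assumes a simple, single-part text email.
    """
    msg = message_from_string(raw_email)
    body = msg.get_payload()
    return hashlib.sha1(body.encode("utf-8")).hexdigest()

# Two hypothetical emails: same message, different recipient.
email_a = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 pricing\n"
    "\n"
    "Please see the attached resin price list.\n"
)
email_b = (
    "From: alice@example.com\n"
    "To: carol@example.com\n"
    "Subject: Q3 pricing\n"
    "\n"
    "Please see the attached resin price list.\n"
)

whole_file_match = (
    hashlib.sha1(email_a.encode("utf-8")).hexdigest()
    == hashlib.sha1(email_b.encode("utf-8")).hexdigest()
)
body_match = body_hash(email_a) == body_hash(email_b)
print(whole_file_match, body_match)
```

Hashed as whole files, the two emails do not match, because the "to" fields differ. Hashed by body alone, they do.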
Trade Secrets Case Uses MD5 Hash and Keyword Search to Protect Defendants’ Rights – Magistrate’s Privilege Waiver Order Is Reversed
A District Court Judge in Philadelphia reversed a Magistrate’s order requiring a defendant in a trade secret case to produce a forensic image of two of its computers. Bro-Tech Corp. v. Thermax, Inc., 2008 WL 724627 (E.D. Pa. March 17, 2008). The computers in question were defendant’s servers located in Michigan and India. The order required production of full images to plaintiff’s counsel.
The defendant was willing to produce forensic images to plaintiff’s computer forensic expert, not its legal counsel. Defendant wanted to protect its confidential information on these servers by limiting the expert’s search to the trade secret documents, or files that might contain information about these secrets. Accordingly, defendant would only agree to allow the expert to search for files with matching MD5 hash values, matching file names, or files containing plaintiff’s keywords. Hash value searches are often used in trade secret cases. See, e.g., Creative Science Systems, Inc. v. Forex Capital Markets, LLC, 2006 WL 870970, at *4 (N.D. Cal. 2006). As I explained at pages 17-20 of my article, HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007), “the irreversibility quality of hashing makes it possible to perform a hash search of a computer for specific hash values without revealing the actual contents of the computer searched.”
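To make the idea of a hash search concrete, here is a minimal Python sketch. It is not the procedure any expert in this case actually used. The function names, file names and contents are hypothetical, and a temporary folder stands in for a server image. The point to notice is that the searcher needs only a list of hash values and reports only which files match; nothing else on the drive is read by a human.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hash value of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_search(root, known_hashes):
    """Return the paths of files under root whose MD5 value is in known_hashes."""
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if md5_of_file(path) in known_hashes:
                hits.append(path)
    return sorted(hits)

with tempfile.TemporaryDirectory() as server:  # stand-in for a server image
    samples = {
        "formula.doc": b"trade secret formula (hypothetical)",
        "renamed_copy.doc": b"trade secret formula (hypothetical)",
        "unrelated.txt": b"cafeteria menu (hypothetical)",
    }
    for name, content in samples.items():
        with open(os.path.join(server, name), "wb") as f:
            f.write(content)

    # The searching party supplies hash values only, not the documents.
    known_hashes = {hashlib.md5(b"trade secret formula (hypothetical)").hexdigest()}
    hits = [os.path.basename(p) for p in hash_search(server, known_hashes)]

print(hits)
```

The renamed copy is caught along with the original, because the hash value depends on the contents and not on the file name.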
Further, defendant was only willing to allow these searches of its servers if it could protect its attorney-client communications and work product. To do this, defendant proposed the standard procedure typically used for productions of this kind. See Playboy Enterprises v. Wells, 60 F. Supp.2d 1050 (S.D. Cal. 1999). After plaintiff’s expert performed the search of the forensic images, the files found would first be produced to defendant for a privilege review. Defendant would have a right to remove any privileged files, prepare a log of the files removed, and produce the rest to the plaintiff.
Judge Cynthia M. Rufe agreed with the defendant. She held that it was clear legal error for the magistrate to require production of the forensic images “without any limitation as to the scope of the disclosure or prior filtering for privileged or work-product materials that the images might hold.” In other words, she reversed because the order was too broad and did not protect defendant’s secrecy rights. Instead, the Magistrate erroneously assumed that the defendant had waived all of its confidentiality rights to all of the information on the servers by the mere act of having these servers examined by its forensic expert.
Before I go into the intricacies of the waiver argument, it is helpful to review the case background. It is a trade secret action brought by Bro-Tech against one of its competitors, Thermax, and seven former employees who went to work for Thermax USA, Ltd. The plaintiff, Bro-Tech Corporation, a/k/a “The Purolite Company,” designs and manufactures chemical solutions, namely ion exchange resins, used to remove impurities from water and air. The twenty-eight page amended complaint alleges twelve causes of action:
Purolite asserts the following causes of action: (1) misappropriation of trade secrets; (2) misappropriation of trade secrets through inevitable disclosure; (3) common law unfair competition; (4) breach of contract; (5) breach of the duty of loyalty; (6) tortious interference with existing and prospective business relationships; (7) conversion; (8) violation of the Computer Fraud and Abuse Act, 18 U.S.C. § 1030; (9) commercial disparagement; (10) unjust enrichment; (11) violation of the Racketeer Influenced and Corrupt Organizations Act, 18 U.S.C. §1962(c) and (d); and (12) civil conspiracy.
Defendants responded by denying all allegations, and the competitor corporation, Thermax, counter-sued. Thermax alleged that Bro-Tech was intentionally interfering with its relationships with its customers by making false accusations that Thermax stole Bro-Tech’s trade secrets. They also claimed that Bro-Tech itself stole trade secrets, in a kind of two-wrongs-cancel-each-other-out defense, known as an “unclean hands” affirmative defense (it seldom works). In other words, this is a typical trade secret case with competent counsel on both sides. In fact, dozens of lawyers from Philadelphia and New York have appeared of record in this case, including Baker & McKenzie for the defendants.
The amended complaint seeks, among other things, temporary and permanent injunctive relief requiring the return of any trade secrets that the individual defendants took with them or disclosed to their new employer, Thermax. Apparently to avoid a temporary injunction hearing early in the case, the defendants, in 2005, agreed to a Stipulation and Order (“the May 23 Order”) that “imposed an ongoing obligation on Defendants to return to Plaintiffs any Purolite files in their possession, and then to purge said files from their possession, custody and/or control.” Bro-Tech Corp. v. Thermax, Inc., supra at *1.
In late 2007, plaintiff deposed the defendant’s computer forensic expert, Stephen Wolfe, of the Huron Consulting Group. Wolfe testified that he had searched forensic images of defendant, Thermax’s Michigan and India servers, to see if they contained the hash values, file names, or keywords used by plaintiff’s expert, Lawrence Golden, to identify plaintiff’s trade secret files. Here is how the court described it:
Wolfe searched India and Michigan servers for (1) the unique electronic “fingerprints” (or MD5 hash values) of all Purolite documents identified as such in this litigation; (2) the file names of the identified Purolite documents; and (3) certain search terms drawn from the Golden Exhibits.
Id. at FN 8.
Wolfe admitted in his deposition that his search uncovered a number of matching files. Wolfe then filtered out files that were obviously false hits, such as standard application files that happened to contain the keywords. He then submitted the rest of the files with hits to Thermax’s legal counsel for review. Wolfe did not actually review the contents of the India and Michigan files himself, but he did review the contents of files on other Thermax computers. The court explains that:
. . . hits in the India or Michigan servers apparently were not substantively evaluated by Wolfe, but were categorized and identified according to more superficial file characteristics, filtered for “false hits” by reference to external attributes, and submitted to Thermax’s counsel for review of the actual content of the files.
The plaintiff responded to this testimony by arguing that the hits Wolfe admitted finding on Thermax’s servers in India and Michigan showed that the May 23rd Order had not been followed. The order required Thermax to return and purge any trade secrets on all of its computers. Plaintiff argued that it was therefore entitled to production of the full images of these servers and moved to compel. Magistrate Judge Carol Wells agreed after an evidentiary hearing that production was required to permit a determination of whether Defendants had violated the May 23rd Order. Judge Wells ordered the production of the full images to “designated counsel only.” Bro-Tech v. Thermax, 2008 U.S. Dist. LEXIS 8970 (Feb. 7, 2008).
Defendant appealed the Magistrate Judge’s ruling to the District Court Judge arguing clear legal error on two grounds. First, they argued:
that before any disclosure of the contents of the India and Michigan servers to counsel for Purolite occurs, Thermax has the legal right to filter the information to be disclosed in order to remove any attorney-client communications or work product material therein.
Id. at *2.
Second, defendants argued that:
they should be required to disclose to Purolite (after a review for privileged materials) only files which yield hits during a targeted search of the India and Michigan servers for evidence of Purolite files, and not, as the February 7 Order requires, to disclose the entire content of the India and Michigan servers for Plaintiffs’ counsel’s review.
Plaintiff argued that the magistrate’s order should be upheld because only inspection of the entire India and Michigan servers by Plaintiff’s counsel could ensure that no violation of the order had occurred. Plaintiff also argued that defendant had waived privilege to any confidential content on these servers “by disclosing the servers to Stephen Wolfe, who authored an expert report for Defendants, albeit one which did not, in any way, concern the content of the India or Michigan servers.” Id.
The magistrate erroneously found waiver on the basis of Rule 26(a)(2)(B), FRCP. This is the expert witness rule that requires a party to disclose all material considered by its expert in formulating an expert report to an opposing party. Plaintiff argued that this disclosure applied to all otherwise privileged materials, regardless of whether the expert actually examined the materials or relied upon them in a report. For authority, plaintiff relied upon Synthes Spine Co., L.P. v. Walden, 232 F.R.D. 460, 463-464 (E.D. Pa. 2005) (disclosure requirements of Rule 26(a)(2)(B) override all claims of attorney-client privilege), and Vitalo v. Cabot Corp., 212 F.R.D. 478, 479 (E.D. Pa. 2002) (overrides work product privilege).
Defendant countered that Wolfe had not examined these two servers as a testifying expert, but rather as a consultative expert, and so Rule 26(a)(2)(B) did not apply. Wolfe had examined and prepared reports on other computers owned by defendants, and thus was a testifying expert for these other computers. But he had not prepared a report to be used as evidence on the Michigan and India servers. Instead, he had only examined these computers to help the corporate defendant, Thermax, evaluate its case. Thus, he was only a consultative expert, and not a testifying expert, as to these two servers.
Although not discussed in this opinion, Thermax probably also argued that even if Wolfe had been a testifying witness as to these servers, and thus Rule 26(a)(2)(B) did apply, its privilege could only be waived as to specific attorney-client communications actually disclosed to Wolfe and relied upon by him to form the expert opinion stated in the report. Since Wolfe testified that he never examined the contents of any files on these servers, there was no disclosure, and, of course, no reliance.
Judge Rufe rejected the Magistrate’s over-broad construction of privilege waiver and allowed defendant to protect its privileged communications. Here is the Judge’s discussion and analysis of the law.
When privileged communications or work product materials are voluntarily disclosed to a third party, the privilege is waived. [FN18] An exception to this rule exists for disclosures to third parties which are necessary for the client to obtain adequately informed legal advice. [FN19] Under this exception, Thermax has not waived its privilege or work product protections in the India and Michigan server files disclosed to Wolfe. When searching these files, Wolfe was functioning in his capacity as “a non-testifying expert, retained by the lawyer to assist the lawyer in preparing the clients’s case.” [FN20] Thermax did not waive any protections it might have in the India and Michigan servers by disclosing them to Wolfe for consultative expert assistance in this litigation. Accordingly, this Order must provide for a privilege and work product filter.
This was obviously the correct decision, not only for the reasons stated, but also because Wolfe had only looked at information about the files (names, hash, and whether they contained key words chosen by plaintiff), and had not actually examined the contents of the files themselves. Further, only a small percentage of the files on these servers had these matching characteristics.
Here is Judge Rufe’s actual holding reversing the Magistrate’s order:
*3 In this instance, the Court must overrule as contrary to law that portion of the February 7 Order which compels Thermax to produce to Plaintiffs the entire India and Michigan servers for Plaintiffs’ review, without regard for privilege, on Rule 26(a)(2)(B) grounds. Wolfe repeatedly stated under oath that the India and Michigan servers were outside the scope of his expert report, and that he did not consider them in his testifying expert role. [FN15] Instead, his expert report exclusively concerned the contents of other devices. Because the information on the India and Michigan servers was not disclosed to or considered by Wolfe for purposes of his expert report, Rule 26(a)(2)(B) does not apply to the materials on those servers, and does not provide a legal basis for requiring their disclosure to Purolite.
Although Judge Rufe agreed with defendants that they had a right to protect their privileges, she did want a search of these servers performed to determine whether defendants had retained any of plaintiff’s trade secret information in violation of the prior stipulated order:
Notwithstanding the foregoing ruling, the Court wholly agrees with the Magistrate Judge that, in present circumstances, a significant measure of disclosure of the contents of the India and Michigan servers is necessary to ensure that Thermax has not retained Purolite information in violation of the May 23 Order. The fact that Wolfe’s electronic search of the India and Michigan servers using search terms designed to find Purolite information yielded numerous hits suggests the strong possibility (if not providing conclusive proof) that Purolite information is improperly contained in those servers. Furthermore, the parties agree that some disclosure is now necessary, although they disagree on the proper scope of the disclosure. [FN16] Thus, disclosure of the images, to some extent, shall be required.
Id. at *3.
Judge Rufe suggests that if the limited disclosure does reveal any intentional violation of the prior court order to return and purge any trade secrets, then a full search of the imaged server hard drives might be permitted. Such an inspection would include deleted files and slack space, and this might provide further evidence of intentional violation of the order or spoliation:
*4 The Court finds that there is not, at present, evidence of an intentional violation of the May 23 Order by Defendants, as would warrant full disclosure. We know too little about the contents of the files that yielded hits during Wolfe’s search of the India and Michigan servers to reach such a conclusion at this time. Wolfe’s search may have yielded false hits, or may otherwise have signaled files that were properly in Thermax’s possession; conversely, the hits may indicate a Thermax violation. Lacking clear evidence of an intentional violation, the Court will not impose the type of disclosure ordered previously in materially different circumstances involving Defendant Sachdev. Instead, a more measured, yet still significant, disclosure will be required.
Based on these findings, the court followed defendant’s suggested protocol for limited production and required the following:
*5 (1) Within three (3) days of the date of this Order, Defendants’ counsel shall produce to Plaintiffs’ computer forensic expert forensically sound copies of the images of all electronic data storage devices in Michigan and India of which Huron Consulting Group (“Huron”) made copies in May and June 2007. These forensically sound copies are to be marked “CONFIDENTIAL–DESIGNATED COUNSEL ONLY”;
(2) Review of these forensically sound copies shall be limited to:
(a) MD5 hash value searches for Purolite documents identified as such in this litigation;
(b) File name searches for the Purolite documents; and
(c) Searches for documents containing any term identified by Stephen C. Wolfe in his November 28, 2007 expert report;
(3) All documents identified in these searches by Plaintiffs’ computer forensic expert will be provided to Defendants’ counsel in electronic format, who will review these documents for privilege;
(4) Within seven (7) days of receiving these documents from Plaintiffs’ computer forensic expert, Defendants’ counsel will provide all such documents which are not privileged, and a privilege log for any withheld or redacted documents, to Plaintiffs’ counsel. Plaintiffs’ counsel shall not have access to any other documents on these images;
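The three kinds of search permitted in paragraph (2) of the order can be pictured with a short Python sketch. This is only an illustration of the logic, not the forensic tooling the experts would actually use. The function name, file names, contents and search term are all hypothetical, and the keyword test here is a simple case-insensitive text match.

```python
import hashlib

def limited_review(files, known_md5, known_names, search_terms):
    """Return {file name: [reasons]} for files meeting any of the three criteria."""
    names_lower = {n.lower() for n in known_names}
    hits = {}
    for name, content in files.items():
        reasons = []
        if hashlib.md5(content).hexdigest() in known_md5:
            reasons.append("hash")                      # criterion (a)
        if name.lower() in names_lower:
            reasons.append("file name")                 # criterion (b)
        text = content.decode("utf-8", errors="ignore").lower()
        if any(term.lower() in text for term in search_terms):
            reasons.append("search term")               # criterion (c)
        if reasons:
            hits[name] = reasons
    return hits

# Hypothetical contents of a server image.
files = {
    "resin_formula.doc": b"Proprietary ion exchange resin formula",
    "renamed.doc": b"Proprietary ion exchange resin formula",
    "notes.txt": b"Customer asked about ion exchange pricing",
    "memo.txt": b"Lunch menu for Friday",
}
known_md5 = {hashlib.md5(b"Proprietary ion exchange resin formula").hexdigest()}
known_names = {"resin_formula.doc"}
search_terms = ["ion exchange"]

hits = limited_review(files, known_md5, known_names, search_terms)
for name, reasons in hits.items():
    print(name, reasons)
```

Only the files with hits would go to defendants’ counsel for privilege review under paragraph (3). The file with no hit never leaves the image. Notice also that the keyword search flags a file that may be entirely innocent, which is the false positive problem the court was concerned about.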
Judge Rufe has, I think, done the right thing under these circumstances. A waiver of attorney-client privilege should never be implied from a forensic expert’s mere review of a party’s computer. Otherwise, parties would be chilled from employing experts and other skillful persons to help them to evaluate a case. Would justice really be served by uneducated guesses, or blind ignorance? Do we really want to discourage clients from telling their lawyer the full story for fear that their secrets will not be safe?
It was obviously not defendant’s intent to waive its privileges in this case. The Magistrate Judge’s finding of waiver appears to have been a kind of improper punishment of defendant for its assumed violation of the prior court order. But, as Judge Rufe implies, that is putting the cart before the horse. The violation of the order has not yet been proven. The hits Wolfe testified to may all be false positives resulting from overly broad keywords by plaintiff’s expert.
In any event, even if a violation is later proven by, for instance, multiple hash value matches (which is a common way to prove trade secret theft), this would still not justify stripping defendants of their attorney-client privilege. It might justify sanctions and further search of the computers. It might even result in defendant’s loss of the case on all twelve counts. But even a losing defendant has a right to communicate with their lawyer in private. It is unfair to deprive a litigant of this fundamental right as a punishment for perceived misconduct.
The United States Supreme Court has repeatedly recognized, since at least 1826, that the attorney-client privilege is a fundamental right. Public interest demands maintenance of the privilege so that a client may communicate freely and confidentially with his attorney. In Chirac v. Reinicker, 11 Wheat. (24 U.S.) 280, 294 (1826), the Supreme Court, through Justice Joseph Story, declared that “it is indispensable for the purposes of private justice” that our legal system preserve the confidentiality of facts “communicated by client to counsel” in confidence. Later, in Blackburn v. Crawfords, 3 Wall. (70 U.S.) 175, 192-193 (1865), the Supreme Court quoted with approval the following statement from an earlier English case: “If the [attorney-client] privilege did not exist at all, everyone would be thrown upon his own legal resources. Deprived of all professional assistance, a man would not venture to consult any skilful person, or would only dare to tell his counsel half his case.”
The judiciary should be wary of unwarranted intrusions upon this essential right. Judge Cynthia Rufe, like Justice Story before her, was correct to reverse the Magistrate Judge and uphold the attorney-client privilege.
Case where Police Use Hash to Catch a Perp and My Favored Truncated Hash Labeling System to ID the Evidence
Part of my discipline as an e-discovery specialist is to try to read (or at least skim) every published opinion on the subject. Lots of attorneys specializing in this area do that. But there is one other type of case I also read: every opinion that uses the word “hash.” No, I do not need help from Narcotics or Overeaters Anonymous. The kind of hash I am addicted to is purely algorithmic. This hash comes in many flavors, but the best known, and the ones usually employed in e-discovery, are called MD5 hash, SHA-1 hash, or the latest and greatest, SHA-2 hash.
As I explain in my blog Hash page, hash is the mathematical foundation of e-discovery and the most powerful tool of any forensic investigator. It reveals the unique mathematical fingerprint of every computer file that allows for perfect identification and authentication of electronic evidence. I became fascinated with the powers of hash in 2006 and ended up writing a lengthy law review article on the subject. HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007).
In the process of researching the original law review article, I am pretty sure I read every legal opinion and legal article ever written that mentions hash. I also read a few scientific and cryptological articles, most of which I did not really understand. Having put that much time and effort into the subject, I try to keep up by reading every new legal opinion or article mentioning hash. That is why I have a standing search for all cases using the term, and automatically receive a copy of them by email as soon as they are published. I can be in the middle of dinner and my BlackBerry will buzz, alerting me to a new hash case. Lest you think that’s a tad weird, I am willing to bet that there are a few other hash enthusiasts out there (Craig Ball comes to mind) who do the same thing. (See Craig Ball’s excellent article “In Praise of Hash” at pg. 52.)
Hash and Child Pornography
Most of the hash cases I see have nothing to do with e-discovery per se. Instead, they are usually criminal law cases, typically cases involving one of the most disgusting of crimes, child pornography. Police have been using hash to catch perps in this area for years. Hash is an effective tool for this because it allows police to know if certain child pornography is located on a computer, usually videos or still photos, by looking to see if the hash values for these files are present. That is a bit of an over-simplification, but suffice it to say that there are lists of hash values that are known to be associated with computer files which are unquestionably child pornography. New York Attorney General Andrew Cuomo explained the process in a press release in June 2008 announcing a deal with major Internet providers to block major sources of child pornography:
As part of the undercover investigation, the Attorney General’s office developed a new system for identifying online content that contains child pornography. Every online picture has a unique “Hash Value” that, once identified and collected, can be used to digitally match the same image anywhere else it is distributed. By building a library of the Hash Values for images identified as being child pornography, the Attorney General’s investigators were able to filter through tens of thousands of online files at a time, speedily identifying which Internet Service Providers were providing access to child pornography images.
U.S. v. Warren
A district court in Missouri mentions hash. U.S. v. Warren, 2008 WL 3010156 (E.D.Mo. 2008). Warren is a case considering and rejecting a motion to suppress evidence, namely computer video files of underage teens having sex. The motion to suppress was based on a series of hyper-technical challenges to the affidavit which the St. Louis police submitted to the judge to receive a search warrant of defendant’s computer. The affidavit explained how the police had searched the Internet for files “whose digital SHA-1 value was identical to that of a file known to contain child pornography.” They found a computer with an Internet Protocol address of 70 … 167 offering to share one such known file, and then subpoenaed AT&T to get the physical address of the subscriber with that IP address. The computer was located in Affton, Missouri.
The police detective’s affidavit explained how the hash values and offer to upload established “that a computer in Missouri was ‘offering to participate in the distribution of known child pornography.’” Based on this affidavit, the judge found probable cause to issue the search warrant of the computers located in Warren’s home. The police then went to his home, found no one there, forced entry, and seized his computer. Warren himself later came along, and, foolishly enough, voluntarily came to the police station, waived his right to counsel several times, and spoke at length to the police. The opinion includes extensive excerpts of the taped interview, which Warren later argued was made in violation of his right to legal counsel.
The defendant’s technical search warrant objections forced the court to delve into many of the characteristics and evidentiary properties of hash. For that reason alone, the case is useful to any practitioner trying to better understand the subject. But what is really special about the case, at least for me, is the system of hash file identification used by the court to identify the offending video tape at issue in this case. That video computer file was the key piece of evidence, the “smoking gun.”
Six-Place Hash Truncation Naming Protocol
The opinion by Magistrate Judge David D. Noce in Warren is unusual and special because it is the first case to use the truncated hash value labeling system I proposed in HASH: The New Bates Stamp. My article was not mentioned, and apparently Judge Noce was not aware of it. He used the six-place hash truncation system I proposed in my article because it was, in his words, “convenient” to do so, and because the detectives had used that system in their affidavits and testimony. I doubt the police detectives had read my law review article either, which makes their use of the abbreviation system all the more important. It shows that it is a natural and reasonable thing to do, although this is the first time it has been utilized or mentioned in a legal opinion.
So what is the six-place hash truncation system which I proposed that these Missouri officials are now in fact using? Before I can answer that, I have to go into a little more depth about hash and Bates stamps. HASH: The New Bates Stamp not only explains hash and its importance to e-discovery, it also argues for the legal profession and e-discovery industry to adopt a new type of electronic document naming protocol that uses hash values, instead of sequential numbering, to identify electronic evidence. I argue that the time has come for the legal profession to abandon Nineteenth Century Bates stamp paper mentality, and adopt Twenty-First Century ESI hash mentality. I proposed that sequential Bates stamps be replaced by non-linear, intrinsic hash values.
The hash values would not only identify ESI, they would authenticate it too, something the lowly Bates stamp could never do. But the problem with using hash values to identify ESI, instead of Bates stamps, is that hash values are too long and awkward for the human mind. Here is what a typical forty place hexadecimal SHA-1 hash value looks like: 2B37BC6257556E954F90755DDE5DB8CDA8D76619.
Police detectives, lawyers and judges cannot go around describing computer files used as evidence with such long alphanumerics. It is too cumbersome a name to replace the Bates stamp. So my common sense proposal, which Judge Noce in Warren calls “convenient,” is to only use the first and last three places of the hash value, instead of all forty. So the hash value above becomes the much more manageable 2B3 … 619. That truncated hash value becomes a pretty good document name, and, in my opinion and that of many others, should replace the arbitrary Bates stamp.
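The six-place truncation described above can be sketched in a few lines of Python. This is only an illustration: the function name truncate_hash and its places parameter are hypothetical and do not come from the article or the Warren opinion.

```python
def truncate_hash(hash_value: str, places: int = 3) -> str:
    """Return the first and last `places` characters of a hash value.

    Hypothetical helper for illustration; the default of 3 gives the
    six-place form discussed in the text.
    """
    cleaned = hash_value.strip()
    if len(cleaned) <= 2 * places:
        raise ValueError("hash value is too short to truncate")
    return f"{cleaned[:places]} … {cleaned[-places:]}"


# The forty-place SHA-1 example value quoted in the text.
full_sha1 = "2B37BC6257556E954F90755DDE5DB8CDA8D76619"
print(truncate_hash(full_sha1))  # prints: 2B3 … 619
```

The full value should still be stored alongside the short name, since only the full hash can authenticate the file.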
Turns out that the detectives in Missouri were already following this six-place truncation protocol at the time my article was published in June 2007. Perhaps they and other law enforcement agencies have been using this system for years. I do not know for sure, although I doubt it has been a widespread practice. I have talked to many e-discovery forensic experts about the hash naming proposal over the past two years. Many of these experts did police work before going into e-discovery, and none ever mentioned having done this before. Also, it certainly does not appear in the legal literature on the subject, that is, until U.S. v. Warren.
Hexadecimal Values v. Base32 Number System
At first, I was disappointed to see that Judge Noce’s introduction of the truncated hash value naming protocol was flawed with two obvious technical errors. See if you can catch them:
The search turned up a list of files, including one with a 32-character alpha-numeric SHA1 designation of “H4V … UTI.” Fn4
FN4 – For convenience, in this opinion the SHA1 value set out in full in the search warrant affidavit will be referred to as “H4V … UTI.” The affidavit defined the term “SHA1” (also known as “SHA-1”) as being a mathematical algorithm that uses the Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST), along with the National Security Agency (NSA) . . . Basically the SHA1 is an algorithm for computing a condensed representation of a message or data file like a fingerprint.
Warren at *1.
First of all, the SHA-1 hash generates a 40-character hexadecimal string, not 32-character. The other kind of hash, MD5 hash, is the one that uses a 32 character string, not SHA-1. For this reason, my first reaction was that the Judge, or police, mixed up the two different types of hash, and meant to say 40 characters, not 32.
But then there seemed to be yet another, even bigger mistake. The letters H, V, U, T and I should not have been in the hash value name. The values generated in e-discovery work to represent SHA-1 and MD5 hash are always hexadecimal. That is a numerical system with a base of 16. This is typically represented by the numbers 0–9 for the first ten values, and A, B, C, D, E, and F to represent the last six, for a total of sixteen. In other words, a hexadecimal value does not employ any letters after F. Yet, the so-called SHA-1 alphanumeric stated in the Warren opinion uses the letters H, V, U, T and I: “H4V … UTI.”
I thought the police or Judge Noce must have messed things up, but I also seemed to remember reading somewhere that there were other ways to express hash values, and anyway, I am always very careful before I tell a judge that he or she is wrong. So doing a little online research, I learned that there are indeed other ways to display hash values using different binary based number systems, typically the 32 base or 64 base number systems. Base32 is defined in IETF RFC 3548 as using the characters A-Z and 2-7, while Base64 is defined in IETF PEM RFC 1421 as using the characters A-Z, a-z, 0-9, / and +.
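To see how one hash can be written in all three number systems, here is a minimal sketch using Python's standard hashlib and base64 modules. The input b"abc" is just a sample message chosen for illustration, and the variable names are my own. A SHA-1 hash is always 160 bits, so it takes 40 hexadecimal characters at 4 bits each, but only 32 Base32 characters at 5 bits each.

```python
import base64
import hashlib

# SHA-1 always produces 20 raw bytes (160 bits), whatever the input size.
digest = hashlib.sha1(b"abc").digest()

as_hex = digest.hex().upper()                         # 4 bits per character
as_base32 = base64.b32encode(digest).decode("ascii")  # 5 bits per character
as_base64 = base64.b64encode(digest).decode("ascii")  # 6 bits per character

for label, text in [("hex", as_hex), ("Base32", as_base32), ("Base64", as_base64)]:
    print(f"{label:7} {len(text):2} characters  {text}")
```

The three strings look nothing alike, yet they all stand for the same 160-bit value. The Base64 form comes out at 28 characters here because the encoder pads the end with one "=" sign.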
My Online Investigation of Base32 Hash Math Led to a Shocking Discovery
Coming back to the Warren opinion, the hash values “H4V … UTI” are not hexadecimal, but they could be either Base32 or Base64. At this point, I did a little more online research about Base32 hash, and quickly found that there are many websites where you can locate music and videos to download based on their hash values. Almost right away, by simply using Google, I located a site where you can find media to download based upon their SHA-1 Base32 value. It then took less than a minute to find the web page where the Base32 SHA-1 hash values were listed that began with “H4V.” That is how all of the media on the site was listed, in order based upon the first three characters of their Base32 hash values.
There were 83 entries on the webpage whose hash values began with H4V. The site included listings of music and videos ranging from Beethoven’s Symphony No. 9 to a video of Lee Trevino’s Golf Instruction. One video listing which was 11.1 MB in size had a disturbing title that suggested it could contain the kind of porn referenced in Warren. It was dated May 29, 2003. I clicked on its hash value button and saw that the full SHA-1 hash value for this video was H4VIBLSKAZ477WRTKH7IURE6NXEDCUTI.
When I saw that hash value, it shook me up. The first and last three values exactly matched the hash described in Warren: H4V … UTI. My academic investigation of the mathematical properties of hash had led me right to the smoking gun in Warren! I knew from my article, and the research of Bill Speros described in footnote 168, that this match of the first and last three values meant there was a 98.6% probability that this was the exact same file referenced in Warren. Mr. Warren was charged with a felony for distributing this same video. I think it is a crime to even have it on your computer.
I do not know for sure if it is the same file, since the Warren opinion nowhere states the full hash value, but in view of the description of this video, it is just too much of a coincidence for it not to be. It was astonishing on many levels to see just how quickly you can find a file like this on the Internet, simply by knowing the first three hash numbers.
It is probably not possible to actually download or view the file from this website. I do not really know for sure, since that would involve clicking on this file, which I was not about to do. But when I clicked on the link for Beethoven’s Symphony No. 9, a piece of media which I do not find morally reprehensible, it took me to another web page. This page had links to other computers where you may in fact have been able to download Beethoven’s music. (I did not try, recognizing that might be a copyright violation.) At that point, the referring website included a statement that it “ONLY HAS INFO ABOUT FILES, AND DOES NOT OFFER ANY FILES FOR DOWNLOAD.” Still, if any law enforcement agency wants to contact me for the full website address, including Cuomo’s group, I would be happy to provide it. It is really very easy to find, and so I assume the proper authorities are already well aware of this site and its hash values, or lack thereof. I am certainly no police officer, and even if I were, I would not have the stomach for this kind of investigative work. Reading the email of parties in civil suits is about as horrid as I can handle.
Judge Noce Was Right
This little investigation proved to me that Judge Noce and the St. Louis police were correct. There is a SHA-1 hash that has 32 places, not 40, and it can use the whole alphabet, not just A-F.
The hash value H4V … UTI is indeed a correct truncation of a full SHA-1 hash value to its first and last three places. But it is a SHA-1 hash that is expressed in Base32, not hexadecimal. Although the hash values used in e-discovery are almost always hexadecimal, the hash values used in “Peer-to-Peer” websites include a variety of different numerical systems, frequently including the Base32 system.
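Because hexadecimal and Base32 are only two ways of writing the same 160 bits, a hash value can be converted from one notation to the other without access to the underlying file. The following is a minimal sketch in Python. The helper names hex_to_base32 and base32_to_hex are hypothetical, and the sample input is an arbitrary string hashed for illustration, not any file discussed in Warren.

```python
import base64
import hashlib


def hex_to_base32(hex_value: str) -> str:
    """Rewrite a hexadecimal hash value in Base32 (characters A-Z and 2-7)."""
    return base64.b32encode(bytes.fromhex(hex_value)).decode("ascii")


def base32_to_hex(base32_value: str) -> str:
    """Rewrite a Base32 hash value in upper-case hexadecimal."""
    return base64.b32decode(base32_value).hex().upper()


# Arbitrary sample data, hashed only to obtain a realistic SHA-1 value.
sample_hex = hashlib.sha1(b"sample evidence file").hexdigest().upper()
sample_b32 = hex_to_base32(sample_hex)

print(len(sample_hex), sample_hex)   # 40 places in hexadecimal
print(len(sample_b32), sample_b32)   # 32 places in Base32

# Converting back recovers the original hexadecimal value exactly.
print(base32_to_hex(sample_b32) == sample_hex)
```

One practical consequence is that a six-place truncation only means something if everyone knows which notation was truncated, since the first and last three places of the hexadecimal and Base32 forms of the same hash will generally differ.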
In addition, in my brief investigation of the P2P webs, I learned that countless P2P type websites now commonly use the first three places of hash values as a convenient shorthand naming system. For all I know, the “perps” may also. As Judge Noce says, it is the convenient thing to do. So when will the e-discovery vendors start doing so too?
SUPPLEMENTAL READING: If you have not already read Ralph Losey’s law review article on hash, you should definitely read it now.
EXERCISE: Download one of the free hashing software programs available online and find the hash value of some of your files. Change a single comma in one of your larger word files and see what happens to your hash. Now change the name of the file and see what happens. Tell us about it. Also tell us something new you learned from actual hands-on use of a hash program.
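If you would rather script the experiment than download a program, here is a minimal sketch in Python. It assumes a throwaway text file created in a temporary folder; the file names, the memo text, and the helper sha1_of_file are all hypothetical. You will learn more by trying it on your own files.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path: str) -> str:
    """Hash a file's contents in chunks and return the hex SHA-1 value."""
    hasher = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest().upper()


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "memo.txt")
    with open(original, "w", encoding="utf-8") as handle:
        handle.write("Dear counsel, please preserve all files.")
    hash_before = sha1_of_file(original)

    # Step 1: rename the file without touching its contents.
    renamed = os.path.join(folder, "renamed_memo.txt")
    os.rename(original, renamed)
    hash_after_rename = sha1_of_file(renamed)

    # Step 2: remove a single comma from the contents.
    with open(renamed, "w", encoding="utf-8") as handle:
        handle.write("Dear counsel please preserve all files.")
    hash_after_edit = sha1_of_file(renamed)

print("same hash after rename:", hash_before == hash_after_rename)
print("same hash after edit:  ", hash_before == hash_after_edit)
```

The hash is computed from the file's contents only, so the name plays no part in it.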
Discretionary Bonus Exercise: There is a lot of cool stuff out there on hash and P2P. Find and read a few articles. Note that many of the old hashes named in this class have now been broken. They are no longer secure, but still function fine as an identification shorthand.
Students are invited to leave a public comment below. Insights that might help other students are especially welcome. Let’s collaborate!
Copyright Ralph Losey 2015