If you are a part of the Litigation Support community, you have no doubt been asked for a complete list of file types that cannot or should not be processed. What was your answer?
My answer has always been: There is no definitive list of file types that fall within that category.
Here’s why and some tools to help you figure out what you don’t need…
Some may argue that the National Software Reference Library (NSRL), maintained by the National Institute of Standards and Technology (NIST), is just such a list. According to NIST, it contains “…[information that] will help alleviate much of the effort involved in determining which files are important as evidence on computers or file systems that have been seized as part of criminal investigations. The [Reference Data Set] RDS is a collection of digital signatures of known, traceable software applications.” This is a great start for getting rid of files that are unnecessary and known to NIST, but it is by no means all-encompassing.
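To make the idea of filtering against a set of known signatures concrete, here is a minimal sketch in Python. It assumes you have already exported the known signatures as a set of uppercase SHA-1 hex strings; loading them from the RDS itself depends on the format you download and is not shown. The function names `sha1_of_file` and `split_known_unknown` are hypothetical, chosen for illustration only.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def split_known_unknown(paths, known_hashes):
    """Separate files whose hash appears in known_hashes from the rest.

    known_hashes is assumed to be a set of uppercase SHA-1 hex strings.
    """
    known, unknown = [], []
    for path in paths:
        bucket = known if sha1_of_file(path) in known_hashes else unknown
        bucket.append(Path(path))
    return known, unknown
```

Files that land in the "unknown" list are the ones that still need a decision, which is why the rest of this post is necessary. Run something like this only against a working copy, never against the original evidence.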
So, what can be done?
Before we get into details, let’s first define “processing.” Some think of processing as ingesting data and extracting metadata and text from that data. Others see processing as converting a native file into an image (usually a TIFF or a JPG). Many define processing as a combination of these things, along with performing OCR on data that did not have text to extract. The EDRM lists processing as encompassing the following:
- Aims: Perform actions on [Electronically Stored Information] ESI to allow for metadata preservation, itemization, normalization of format, and data reduction via selection for review.
- Goal: Identify ESI items appropriate for review and production as per project requirements.
A quick word about metadata:
Most documents have at least a bit of metadata (data about the data) that can be extracted, many have “extractable” text, and most can be included within a review database. This, however, does not mean that you will be able to review the document. In most instances, if a document cannot be readily opened by an installed application, it’s likely that the same document cannot be converted to an image format.
If you’re having issues working with the file, your opposing counsel will likely encounter issues too, and we all know what that means!
Now, back to our originally scheduled discussion…
A huge piece of the answer lies within your specific data set and the focus of your case. Is the focus of your case on intellectual property? If so, you might need to keep all of those program files around for review and, possibly, production. Or, is yours an employment case, all about retaliation? In that instance, you’ll want to push the program/system files aside and focus on the email messages and general documentation. Ask yourself what type of information needs to be produced to the other side and then make decisions about what files to keep for review.
One place to start identifying the files you’ll want to keep is in the original file path. If you have collected the data (in a forensically sound manner) as it was originally kept and had that data inventoried, the original file path information should be readily available and identifiable. An inventory of your data can be generated by a service provider or after you have made a copy of the original data (again, in a forensically sound manner), using any of a variety of applications that serve the purpose of generating an inventory of files.
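As a rough illustration of what such an inventory contains, the sketch below walks a folder and records each file's full path, name, and size, then writes the result to a CSV. It is a simplified stand-in for a purpose-built inventory tool, not a forensic application; the function names and column names are hypothetical, and it should only ever be pointed at a working copy of the data.

```python
import csv
import os

FIELDS = ["path", "name", "size_bytes"]


def build_inventory(root):
    """Walk root and return one record per file, keeping its full path."""
    records = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(folder, name)
            records.append({
                "path": full_path,
                "name": name,
                "size_bytes": os.path.getsize(full_path),
            })
    return records


def write_inventory(records, csv_path):
    """Write the inventory records to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
```

With the paths in a spreadsheet-friendly format, sorting or filtering by folder becomes a simple exercise.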
Here’s an example: If your case is an employment matter, files that are in the “…\Program Files\” or the “…\Windows\” folders can usually be put aside. These are all of the files that your computer uses to run programs and to remember what image you have chosen to use as the desktop wallpaper. It is worth taking a look at the content of these types of folders, just to make sure nothing has been hidden there. However, most people use folders to stay organized and most people don’t make a habit of navigating to program or system folders to store their every-day files.
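If your inventory is available as a list of original Windows paths, setting these folders aside might look something like the sketch below. The folder list is an assumption for illustration (I have added "Program Files (x86)" alongside the two folders mentioned above), and the function names are hypothetical. Note that the files are only set aside, not deleted, so you can still take that look to make sure nothing has been hidden there.

```python
from pathlib import PureWindowsPath

# Folder names to set aside; adjust for your matter. This list is an
# illustrative assumption, not a definitive set.
SYSTEM_FOLDERS = {"program files", "program files (x86)", "windows"}


def is_system_path(original_path):
    """Return True if any folder in the path is a known system folder."""
    # parts[:-1] skips the file name so only folders are compared.
    folders = PureWindowsPath(original_path).parts[:-1]
    return any(folder.lower() in SYSTEM_FOLDERS for folder in folders)


def set_aside_system_files(paths):
    """Split paths into (keep, set_aside) based on their folders."""
    keep, set_aside = [], []
    for path in paths:
        (set_aside if is_system_path(path) else keep).append(path)
    return keep, set_aside
```

`PureWindowsPath` parses Windows-style paths without touching the file system, so this works on an inventory even when you are not running Windows.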
If you take a look in the “My Documents” or “Documents” folder on a user’s machine, you’re more likely to find a better crop of documents. You may see a folder structure that looks something like this:
- Sales Information
- Staff Information
  - Employee 001
  - Employee 002
  - Employee 003
In this structure, you’ll likely want to take a pretty close look at the files in the “Staff Information” folder and its sub-folders. But, don’t ignore the other folders. Maybe someone incorrectly filed a document or maybe someone intentionally saved a file in an unrelated folder to throw others off of the trail.
Another quick way to determine what to keep and what to ignore is to grab a list of file extensions. These are the three or four (sometimes fewer, sometimes more, sometimes none at all) characters that come after the last period in a file name. In the file name “JanuaryInvoices.xlsx,” the “XLSX” is the file extension. Your computer might be set to hide the extensions of more common file types. This setting can be changed in Windows.
WARNING: One file type that will most likely end up on your “do not process” list is the EXE. This is the common extension of the files that prompt a program to start/run. It’s best to avoid these guys, if at all possible, unless your case specifically calls for you to produce them. If that happens, you’ll likely need to review an entire program and not the EXE alone.
TIP: Those pesky “thumbs.db” files. You’ve likely seen them and wondered what they are. According to Microsoft, “[A] Thumbs.db stores graphics, movie, and some document files, then generates a preview of the folder contents using a thumbnail cache.” These files are not useful to your review efforts and will usually return an error when you try to view them natively or convert them to images.
There are a variety of applications that will extract a file’s extension from its name and place the extension in a field or column. From there, you can group or tally the list of extensions and find which file types you want to keep. Just don’t forget about keeping families together! In other words, you might not want to remove a DLL file if it’s attached to an email message. If you do, be prepared to explain why there is an attachment referenced by an email, but no attachment to be found in your production set.
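If you would rather see the mechanics than rely on an application, pulling out and tallying extensions can be sketched in a few lines of Python. The function names and the "(none)" label for files without an extension are my own choices for illustration.

```python
from collections import Counter
from pathlib import PureWindowsPath


def extension_of(file_name):
    """Return the extension in uppercase, or "(none)" if there is none."""
    suffix = PureWindowsPath(file_name).suffix  # e.g. ".xlsx"
    return suffix[1:].upper() if suffix else "(none)"


def tally_extensions(file_names):
    """Count how many files carry each extension."""
    return Counter(extension_of(name) for name in file_names)
```

Calling `tally_extensions(names).most_common()` lists the extensions from most to least frequent, which is a handy first look at what a data set contains. Keep in mind that an extension is only a label on the file name; it tells you what the file claims to be, not what it actually is.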
If you’re not sure about the software that can be used to open a certain type of file, you can either search for the file extension itself using a search engine or use one of the many sites that contain file extensions and their commonly associated applications. My go-to site is FILExt, but there are several others out there that contain great information.
Using a compiled list of file extensions can help you knock out hundreds, maybe thousands of unwanted files at once. Remember to watch out for families!
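One way to respect families while culling by extension is to make the keep-or-cull decision once per family, based on the top-level item, and apply it to every member. The sketch below assumes each record carries hypothetical `doc_id`, `parent_id` (None for a top-level item), and `extension` fields; your review platform will have its own field names and its own way of tracking families, so treat this as one possible policy rather than the rule.

```python
def cull_by_extension(records, unwanted_extensions):
    """Split records into (kept, culled) without breaking up families.

    The decision is made on the top-level item of each family, so an
    attachment always travels with its parent email.
    """
    unwanted = {ext.upper() for ext in unwanted_extensions}
    by_id = {rec["doc_id"]: rec for rec in records}

    def top_level(rec):
        # Follow parent links until reaching an item with no parent.
        while rec["parent_id"] is not None:
            rec = by_id[rec["parent_id"]]
        return rec

    kept, culled = [], []
    for rec in records:
        family_extension = top_level(rec)["extension"].upper()
        (culled if family_extension in unwanted else kept).append(rec)
    return kept, culled
```

Under this policy, a DLL attached to an email message stays with the message, while a loose DLL sitting on its own in a folder is culled.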
In the end, file content is what will ultimately be reviewed and produced. Once you’ve used the NIST list, case-specific information, locations, and extensions to cull down the data, you are left with content. There are a variety of applications that will organize your data’s content to quickly get to the documents you really need to review. However, content-based culling is an entirely different post for another day.