Let’s talk about data load files. If you know me, you know I’m wordy and this is no exception. So, grab a coffee (or whatever) and maybe a snack and read on!
Be on the lookout for my next couple of posts: Understanding Image Load Files and Troubleshooting Load Files.
If you have been in the litigation support industry, this will likely be a familiar topic. However, if you are new to all of this, here’s a look at Data Load Files and what the heck they mean…
For most review applications a “flat” load file is used to populate fields. The “flat” part means that the load file is a text file that contains a unique identifier along with various pieces of information (metadata/data) about each document. Each row/line within the file usually represents one document. These files are considered “flat” because they are not databases, which often contain several related tables–related by a unique identifier. These related tables provide the non-flat dimensions. There is at least one review application that uses a Microsoft Access Database (MDB) as its load file. We won’t dig into this here, because it is a discussion all on its own, but it is worth mentioning.
Unique identifier…what’s that? When you are dealing with documents during witness prep, depositions, or trial, you want each document to have a unique reference number (e.g, Bates number, exhibit number, etc.), so when you discuss it, everyone knows exactly which document you mean. Review software thinks the same way: It needs to identify a document based on something unique, so it knows what you mean when you load/reference information about the document. The unique identifier in a data load file is usually either a Control Number or a Bates Number, but it can be something other than that, as long as the identifier is unique to that document. The best part about the unique identifier is that it can be used later to load additional information, if necessary. Exciting!
Here’s an example of a “standard,” flat data load file (usually has a .dat file extension: PRODABC0001.dat):
Let’s break the down the information within this load file:
Within the “flat” file, the first row (header row) should contain the names of the fields. You can see a snippet of this in the image above. These field names provide information to let you, the human, know which fields will be imported (aka loaded) into the database. The review software into which you are loading the information should give you an option to skip this first row of the load file, so that the field names aren’t mistakenly loaded as a document. Most of the time, the review application doesn’t care about that first row, as long as you choose to pull the information in using the order presented in the file. However, there are some review applications that will detect that first row and use the information to match up the fields in the file to the fields in the application.
If you receive a load file that doesn’t contain header information, you will want to contact the party that provided the load file and ask them to either provide a new file with the necessary header information or to provide you with a list of what each field is. If they opt (or argue) for the second choice, make sure they let you know the field names in the order they appear in the load file. Otherwise, you’re looking at additional work trying to pinpoint which field content is supposed to go into which field (e.g., date fields can get mixed up easily).
Symbols that separate each field’s content (delimiters) are used to tell the software where one field ends and the next one begins.
A quick exercise:
- Open Excel
- Type information into the first three or four columns
- Now, save that file as a CSV (Comma delimited) file
- Open the CSV in NotePad or some other text editor
What you should notice about the CSV version of the Excel file is that each piece of information is separated by a comma and, in the second example, each is surrounded by quotes and separated by commas. If you open the CSV in Excel, those pieces of information will fall back into the correct columns. This is because the commas and quotes are telling Excel where each field (column) should be separated.
In this example the p’s surrounding the information are equivalent to the quotes and the square in between each is equivalent to the comma:
Wouldn’t it be easiest to use commas and quotes? Sure it would! However, commas and quotes are incredibly common characters and can cause issues when trying to load the data to the database. Avoid using commas and quotes as delimiters as much as possible.
The “Quote” Delimiter
The little “p’s” (lower case Thorns) around the field names and content tell the review software that the content between the two symbols is the information to be placed (imported) within that field.
The “Comma” Delimiter
In the example, the square (device control 4) between the Thorns is telling the review software that one field has ended and a new field is about to start.
Another Delimiter You Might See
There is one symbol that you will see if there are new lines (hard returns) within the content of the field. This will often be a registration mark ® (ASCII 174). The review application will need to know what symbol to use–even if it doesn’t exist in the load file. Using the registration mark will normally serve this purpose.
When you are preparing to load the data into your chosen review application, most will let you simply type in the symbols (like the comma and quote or see Other Delimiters below); however, in most instances, you may need to know the ASCII code for those symbols. For the symbols you see above, the Thorns are ASCII 254 and the squares are ASCII 20.
Thankfully, the delimiters discussed so far are pretty standard in load files and the ASCII information is shown in the review software when you are choosing the delimiters to use, so you don’t necessarily have to memorize them or look them up every time. And, even better, some review applications will automatically detect the symbols used and populate them for you!
As I said, the delimiters discussed above are considered “standard” delimiters, but you may see other delimiters like pipes (|) and carets (^). Here’s a pipe and caret example (usually has a .txt file extension):
I said it before, but it’s worth repeating – Try to avoid using commas and quotes as delimiters, as they can confuse the database into thinking that the fields break where they really don’t. This is especially true when someone’s name (e.g., LastName, FirstName) is included in a data load file.
The information that will be used to fill in the fields within the database (data)…
In this example, I have focused on the first field referenced in the load file, Begin Bates, but you can have just about as many fields in the load file as you want, including:
- Email Subject
- File Name
- Date Created
- Native File Path
- Text Path
As long as each of these fields is delimited by your chosen symbols, the software into which you are importing the information should recognize the difference between each field and place the “fielded” information into the correct field.
Most applications expect you to indicate which field within the database matches with which field in the load file. This is called, “field mapping.” For example, the load file contains a field called “To,” but your database field is called “Email To.” When you are preparing to load the information, you will need to let the application know that the information in the “To” field from the load file should be loaded into the “Email To” field in the database. In other words, you are mapping “To” and “Email To.”
Number of Documents
Each row within the load file, minus the header row, represents the information about one document. If you scroll to the bottom of the load file and can see how many lines there are, you can subtract one, ignore the blank last row, and know exactly how many documents are represented in this file.
Based on the information in the example above, I know that there are 2,688 documents represented in the file.
Note that one major piece of information that might throw you off is when the documents’ extracted text or OCR is included in the load file. I recommend avoiding this, if at all possible. Including the text within the load file can make the file cumbersome and can cause loading issues. If the extracted text or OCR is included in the load file, you will likely have a hard time determining the number of documents.
Hopefully, you now have a better understanding of data/metadata load files. However, if you’d like to learn from another perspective, take a look at one of these:
There are many more resources out there!