Saturday, April 14, 2012

eDiscovery - Lower in the Stack pt.IV - Wrap

The question put forth in this project, in which I built a small eDiscovery processing environment, gathered test data and ran it through Concordance and FreeEed, was 'How do we balance cost, effectiveness and reliability in handling this data?'

I don't think there is a definitive answer to that question; I think it sits at the front of the mind of anyone whose job it is to manage data -- for business management and intelligence, regulatory compliance, litigation, research or hobby -- and I think we saw some of why it can present a moving target. And this was without even using buzzwords like 'cloud', practices like 'de-duplication', or anachronisms like 'paper' -- all valid and important considerations for many who face these challenges professionally.

The industry-standard Electronic Discovery Reference Model shows the breadth and depth of these issues -- and why I provided the caveat that I'd be looking at Small-to-Medium-Data, where manual intervention is possible. Plenty of defendants still scan redacted paper documents to PDF and produce them on CD-Rs.

[EDRM Graphic]


It should not be forgotten that it was just six years ago that the FRCP were amended to speak to the growth of electronically stored information, and still, it is said that for eDiscovery, judges normally care more about results than methodology. And by the same token, most any eDiscovery tool -- in the proper hands -- can be employed successfully.

As an eDiscovery sys admin with a background in social scientific research, I quickly recognized the intersection of these two paths. In their own way, both seek to collect, process and organize data about human action, and from it divine patterns and present them. So I started this blog with that in mind; I used eDiscovery tools for collection and processing, and communication science tools for analysis. I asked questions that I thought were of interest in both fields. And as it turned out, I was in good company. It was an excellent union. All that was missing was system administration, so by going 'lower in the stack' I closed the loop completely.

The "CapitalToomey Data Center" has undergone some changes since this project's outset. They will be outlined in future posts, but at least now I've got control of more components of the eDiscovery and Social Science research frameworks. I will keep on the lookout for interesting topics to post in this general area. I welcome your input, and thank you for reading.

Finally, if you are interested in sharing your results with the test data from this experiment, I would very much like to hear about it - in comments here or via Twitter.

Tuesday, April 3, 2012

eDiscovery - Lower in the Stack pt.III - FreeEed

This is part 3 of a project I started to explore the underpinnings of the platforms I'd been using (and taking for granted) for eDiscovery and data analysis for the past five-plus years. Starting from scratch, I posed the question: how to balance cost, effectiveness and reliability in handling this data?

"This data" is about 98 MB of mixed email and edocs that typify the material filling the hard drives and file servers everywhere, all the time. I have most recently used FreeEed to process this data, and here is a flyover - from 30,000 down to maybe 5,000 feet.


Setup


FreeEed is an open source eDiscovery platform under active development and growing in features and capabilities. It's designed to run (in the JRE) standalone on Linux, Mac and Windows machines, or on Hadoop clusters, including in Amazon's cloud with data in S3. So it has a great deal of scalability. I have it running on a Windows Server 2003 machine, where the data also resides, but it certainly has the ability to work across the network.



The UI is sparse. You set up a project with a few options, including target data locations, output delimiters, and data de-NISTing, and save it to a *.project file, which can be used to keep track of these settings across multiple sessions.



From there, your work runs in two phases - staging and processing. Staging copies the target files to zip archives in an output subfolder. Processing pulls from these archives to create a metadata file and native and text archives, in their own output subfolder.
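
To make the two phases concrete, here is a minimal Python sketch of what staging amounts to conceptually: walk the target folders and roll the files into numbered zip archives in an output subfolder. To be clear, this is my own illustration, not FreeEed's actual code; the paths and chunk size are invented for the example.

import os
import zipfile

def stage(source_dir, output_dir, chunk_mb=50):
    """Copy every file under source_dir into numbered zip archives,
    rolling over to a new archive once the current one passes chunk_mb."""
    os.makedirs(output_dir, exist_ok=True)
    archive_num, bytes_in_chunk = 1, 0
    zf = zipfile.ZipFile(os.path.join(output_dir, "staging.1.zip"), "w")
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            if bytes_in_chunk > chunk_mb * 1024 * 1024:
                zf.close()
                archive_num += 1
                bytes_in_chunk = 0
                zf = zipfile.ZipFile(
                    os.path.join(output_dir, f"staging.{archive_num}.zip"), "w")
            # store each file relative to the source so the archive mirrors the tree
            zf.write(path, arcname=os.path.relpath(path, source_dir))
            bytes_in_chunk += os.path.getsize(path)
    zf.close()

stage(r"C:\testdata", r"C:\freeeed-output\staging")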




[Screenshot: Job 1001 - My Test Project - one of two runs]






Results

For reference, here's a rundown on the source data:
Source Data
Files: 208

Extensions:
doc  41
pdf  73
xls  19
xlsx  1
htm  4
html  18
eml  4
txt  8
csv  1
pst  3
ppt  12
[none]  3
1b  1
exe  1
db  1
zip  5
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg 2

From the processing history log, I found the following results:




Staging
Files: 213





Processing
Files: 2892

Extensions:
doc  45
pdf  76
xls  20
xlsx  1
htm  22
html  18
eml  2665
txt  8
csv  1
pst  0
ppt  12
[none] 3
1b  1
exe  1
db  1
zip 0
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg 2
.DS_Store  5


First off, here is an excellent illustration of my point about the importance and challenges of quality forensics in the eDiscovery pipeline. The original 208 files were copied on a flash drive from one Windows machine to another (hence the thumbs.db) and worked on there; the data itself has not changed since the project's beginning. The growth of the source population to 213 can be attributed to five .DS_Store files that most likely appeared when I shared the directory across the network to a Mac workstation. Whether or not such a slip-up would be defensible I can't say, but it's definitely sloppy.


Further, there are other differences in the doc counts between collection, staging and processing, including differences from the Concordance session. The most obvious is the thousands of emails extracted from the pst files, which Concordance didn't touch. We also see that containers -- the zip and pst files -- are not themselves treated as records, but their contents are. This is correct behavior.
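
A quick way to pin down exactly where the counts diverge is to tally extensions on both sides and diff them. Here is a small Python sketch of that check; the two folder paths are placeholders for wherever your source data and processed natives actually live.

import os
from collections import Counter

def extension_counts(folder):
    """Tally lowercase file extensions under folder ('[none]' when absent)."""
    counts = Counter()
    for _root, _dirs, files in os.walk(folder):
        for name in files:
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            counts[ext or "[none]"] += 1
    return counts

source = extension_counts(r"C:\testdata")
processed = extension_counts(r"C:\freeeed-output\native")

# print only the extensions whose counts changed between collection and processing
for ext in sorted(set(source) | set(processed)):
    if source[ext] != processed[ext]:
        print(f"{ext:>10}: {source[ext]} -> {processed[ext]}")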


Excel spreadsheets are handled nicely by FreeEed. This example, which has five worksheets, exploded into five records in Concordance, whereas the FreeEed record looks like this:


HOUSE 1
   
   
   
    ADRESS    ?
    # UNITS    15
    TAX VALUE    ?
    RENT    85000 YEAR
    ASKING PRICE    $450,000.00
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    N/A
    CONTACTS


HOUSE (2)
    ADRESS    5225 S 21 ST
    # UNITS    2 BEDROOM
    Building Size    824 SQ. FT.
    SALES INFO    $24,500 1993
    ASKING PRICE    $42,500.00
    TAX VALUE    $33,000.00
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1    5213 S 21 ST (528 SQ FT)
    ASSESED VALUE    $55,200.00
    SOLD PRICE    $30,000 1997
    HOUSE 2    5240 S 20 ST (1200 SQ FT)
    ASSESED VALUE    $39,300
    SOLD PRICE    $65,000 2002
    HOUSE 3    5229 S 21 ST (529 SQ FT)
    ASSESED VALUE    $31,800
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    N/A
    CONTACTS


HOUSE (3)
    ADRESS    1134 S 32 ST
    # UNITS    5 (8 BEDROOMS)
    Building Size    3023 SQ FT
    SALES INFO    47250 (2001)
    ASKING PRICE    $111,000.00
    TAX VALUE    $123,000.00
    RENT    1800 MONTH
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED


HOUSE (4)
    ADRESS    1003 BERT MURPHY (OLD BELVUE)
    # UNITS    1
    Building Size    ?
    SALES INFO    ?
    ASKING PRICE    $69,000.00
    TAX VALUE    ?
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    ROOF KITCHEN BATH AND BASEMENT
    CONTACTS


HOUSE (5)
    ADRESS
    # UNITS
    Building Size
    SALES INFO
    ASKING PRICE
    TAX VALUE
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED
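
For the curious, here is a sketch of the two behaviors, using Python's openpyxl as a stand-in (neither product actually uses it, and the filename is made up): flattening the whole workbook into one record's text, versus emitting one record per worksheet.

from openpyxl import load_workbook

def sheet_text(ws):
    """Flatten one worksheet into tab-separated lines of cell values."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell) for cell in row)
        for row in ws.iter_rows(values_only=True)
    )

wb = load_workbook("houses.xlsx", read_only=True)

# FreeEed-style: the whole workbook becomes one record's extracted text
one_record = "\n\n".join(sheet_text(ws) for ws in wb.worksheets)

# Concordance-style: each worksheet becomes its own record
five_records = {ws.title: sheet_text(ws) for ws in wb.worksheets}

print(f"1 record of {len(one_record)} chars vs {len(five_records)} separate records")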


Here is also a bit of very good news. Recall the PDF I sneakily renamed to *.xls. FreeEed was not fooled, and extracted the text correctly.

Microsoft® Windows® and the Windows logo are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.

Netscape® and Netscape Navigator are registered trademarks of Netscape Communications
Corporation in the United States and other countries. The Netscape logo and the Netscape
product and service names are also trademarks of Netscape Communications Corporation in
the United States and other countries.
This text was copied from FreeEed's text output file for this doc



There is no OCR engine on board FreeEed, so any text in flat PDFs and image files will remain undiscovered. Also, note that these files were not de-NISTed, which would have removed the thumbs.db and .DS_Store files, but would also have skewed the count results even further.
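
Conceptually, de-NISTing is just hash culling against the NIST NSRL list of known system files. Here is a minimal Python sketch of the idea; the hash set is a tiny stand-in for the real NSRL corpus, which runs to tens of millions of entries.

import hashlib
import os

# tiny stand-in for the NSRL known-file hash set (the real list is enormous)
KNOWN_SYSTEM_HASHES = {
    "d41d8cd98f00b204e9800998ecf8427e",  # MD5 of an empty file, as an example
}

def md5_of(path):
    """Hash a file in 64 KB blocks so large files don't land in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def denist(folder):
    """Yield only the files whose hashes are NOT on the known-system-file list."""
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if md5_of(path) not in KNOWN_SYSTEM_HASHES:
                yield path

survivors = list(denist(r"C:\testdata"))
print(f"{len(survivors)} files survive de-NISTing")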

After this initial assessment, the output looks to be ready for review. Going forward, I would refine my ability to tie off doc counts from collection to staging to processing to output, in order to account for everything. However, this is not yet prime time, and for the purposes of this experiment, it's best we move on.

I was unable to transform the output from previous versions of FreeEed into neat delimited load files. However, with some prompt and effective support from Mark (thanks, Mark!), we're now pumping out the following fields for each record:

UPI
File Name
Custodian
Source Device
Source Path
Production Path
Modified Date
Modified Time
Time Offset Value
processing_exception
master_duplicate
text
To
From
CC
BCC
Date Sent
Time Sent
Subject
Date Received
Time Received


The UPI links each record to its text file; the native filename, in the native folder, is a concatenation of UPI and File Name. I've successfully loaded portions of the output into SQL Server, but the whole load file was not ready at press time. However, I have shared the data, which I encourage you to run through your own eDiscovery platform; please share what you get. A subsequent project in the works here will feature more on FreeEed processing in production. Please stay tuned.
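
For anyone attempting the same, here is a Python sketch of how such a load file can be walked, attaching native and text paths to each record. The delimiter, folder layout and text-file naming are my assumptions to be matched to your own project settings, not documented FreeEed behavior.

import csv
import os

OUTPUT = r"C:\freeeed-output"

def load_records(loadfile_path, delimiter="\t"):
    """Read the metadata load file and attach native/text paths to each record."""
    records = []
    with open(loadfile_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            upi = row["UPI"]
            # native filename is the concatenation of UPI and File Name
            row["native_path"] = os.path.join(OUTPUT, "native", upi + row["File Name"])
            # assumed: one text file per record, named by UPI
            row["text_path"] = os.path.join(OUTPUT, "text", upi + ".txt")
            records.append(row)
    return records

for rec in load_records(os.path.join(OUTPUT, "metadata.txt"))[:5]:
    print(rec["UPI"], rec["Custodian"], rec["native_path"])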


Thoughts


Here are some of my observations following this trial run.

  • When processing completed, I had to pull the old copy-and-paste maneuver from the processing history window into a text file in order to save the log in my output folder. I'd like it to just go there when processing is done.
  • Output folders are automatically placed in FreeEed's home folder. They contain staging copies of the source data, plus metadata, doc text files and additional native copies for production. This can quickly add up to a lot of data, especially at the volumes to which this application can scale; it may in fact be an artifact of the product's Hadoop architecture. However, my application server is not necessarily the best place for all this -- it's really just the place on my network where Java lives. So, for my standalone instance, I would like the option to configure these output folder locations, by project.
  • In looking through the logs and trying to track down error files, or to compare originals to output versions, I often wished for the original source path information -- the full path. This would involve more robust logging at staging time, but with its ability to NIST-cull and to recurse folders across a network, I think this could give FreeEed a more prominent role in the collections process. In the future, it could maybe even connect directly to the Exchange server. But for now at least, I would like each file's original source path logged.
  • >>>Further, this staging log could potentially become its own metadata file, linking custodian, source path, parent-attachment information and collection date by UPI -- a more robust review platform necessarily uses multiple tables. There's a sketch of the idea just after this list. </spitballing>
  • I do like the exceptions folder, where docs that could not be processed are copied for further investigation. I would also like an exceptions log, showing what's in there, where it came from and what the error message was. I know most of this information can already be found in the metadata and Processing History log, but it would be convenient to have it separately.
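
Here is the promised sketch of that staging-manifest idea, in Python: one row per collected file, linking custodian, full source path, collection date and a hash by UPI. The UPI here is just a counter and every name is hypothetical; FreeEed assigns its own identifiers.

import csv
import hashlib
import os
from datetime import datetime, timezone

def write_staging_manifest(source_dir, custodian, manifest_path):
    """Log one tab-delimited row per collected file: UPI, custodian,
    full source path, collection date and an MD5 for later verification."""
    collected = datetime.now(timezone.utc).isoformat()
    with open(manifest_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["UPI", "Custodian", "Source Path", "Collection Date", "MD5"])
        upi = 0
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                upi += 1
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    digest = hashlib.md5(f.read()).hexdigest()  # fine for small files
                writer.writerow([f"{upi:08d}", custodian, path, collected, digest])

write_staging_manifest(r"C:\testdata", "toomey", r"C:\freeeed-output\staging_manifest.tsv")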


Next


I think this new phase on CapitalToomey.blogspot.com is off to a good start. I've set up a passable processing environment and tested two tools thereupon. In the fourth and final post to this project, I'll lay out in a bit more detail the setup of my system, and the direction I'm looking to take for future projects.

Thank you very much for reading. Any input would be very much appreciated. Please check back for future posts.