Saturday, April 14, 2012

eDiscovery - Lower in the Stack pt.IV - Wrap

The question put forth in this project, in which I built a small eDiscovery processing environment, gathered test data and ran it through Concordance and FreeEed, was 'How do we balance cost, effectiveness and reliability in handling this data?'

I don't think there is a definitive answer to that question; I think it sits at the front of the mind of anyone whose job it is to manage data -- for business management and intelligence, regulatory compliance, litigation, research or hobby -- and I think we saw some of why it can present a moving target. And this was without even using buzzwords like 'cloud', practices like 'de-duplication', or anachronisms like 'paper' -- all valid and important considerations for many who face these challenges professionally.

The industry-standard Electronic Discovery Reference Model shows the breadth and depth of these issues -- and why I provided the caveat that I'd be looking at Small-to-Medium-Data, where manual intervention is possible. Plenty of defendants still scan redacted paper documents to PDF and produce them on CD-Rs.

[EDRM Graphic]


It should not be forgotten that it was just six years ago that the FRCP were amended to speak to the growth of electronically stored information, and still, it is said that for eDiscovery, judges normally care more about results than methodology. And by the same token, most any eDiscovery tool -- in the proper hands -- can be employed successfully.

As an eDiscovery sys admin with a background in social scientific research, I quickly recognized the intersection of these two paths. In their own way, both seek to collect, process and organize data about human action, and from it divine patterns and present them. So I started this blog with that in mind; I used eDiscovery tools for collection and processing, and communication science tools for analysis. I asked questions that I thought were of interest in both fields. And as it turned out, I was in good company. It was an excellent union. All that was missing was system administration, so by going 'lower in the stack' I closed the loop completely.

The "CapitalToomey Data Center" has undergone some changes since this project's outset. They will be outlined in future posts, but at least now I've got control of more components of the eDiscovery and Social Science research frameworks. I will keep on the lookout for interesting topics to post in this general area. I welcome your input, and thank you for reading.

Finally, if you are interested in sharing your results with the test data from this experiment, I would very much like to hear about it - in comments here or via Twitter.

Tuesday, April 3, 2012

eDiscovery - Lower in the Stack pt.III - FreeEed

This is part 3 of a project I started to explore the underpinnings of the platforms I'd been using (and taking for granted) for eDiscovery and data analysis for the past five-plus years. Starting from scratch, I posed the question: how to balance cost, effectiveness and reliability in handling this data?

"This data" is about 98 MB of mixed email and edocs that typify the material filling the hard drives and file servers everywhere, all the time. I have most recently used FreeEed to process this data, and here is a flyover - from 30,000 down to maybe 5,000 feet.


Setup


FreeEed is an open source eDiscovery platform under active development and growing in features and capabilities. It's designed to run (in the JRE) standalone on Linux, Mac and Windows machines, or on Hadoop clusters, including in Amazon's cloud with data in S3. So it has a great deal of scalability. I have it running on a Windows Server 2003 machine, where the data also resides, but it certainly has the ability to work across the network.



The UI is sparse. You set up a project with a few options, including target data locations, output delimiters, and data de-NISTing, and save it to a *.project file, which can be used to keep track of these settings across multiple sessions.



From there, your work runs in two phases - staging and processing. Staging copies the target files to zip archives in an output subfolder. Processing pulls from these archives to create a metadata file and native and text archives, in their own output subfolder.
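
To make the two phases concrete, here is a minimal Python sketch of what staging amounts to conceptually: walk the target folders and roll the files into numbered zip archives in an output subfolder. To be clear, this is my own illustration, not FreeEed's actual code; the paths and chunk size are invented for the example.

import os
import zipfile

def stage(source_dir, output_dir, chunk_mb=50):
    """Copy every file under source_dir into numbered zip archives,
    rolling over to a new archive once the current one passes chunk_mb."""
    os.makedirs(output_dir, exist_ok=True)
    archive_num, bytes_in_chunk = 1, 0
    zf = zipfile.ZipFile(os.path.join(output_dir, "staging.1.zip"), "w")
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            if bytes_in_chunk > chunk_mb * 1024 * 1024:
                zf.close()
                archive_num += 1
                bytes_in_chunk = 0
                zf = zipfile.ZipFile(
                    os.path.join(output_dir, f"staging.{archive_num}.zip"), "w")
            # store each file relative to the source so the archive mirrors the tree
            zf.write(path, arcname=os.path.relpath(path, source_dir))
            bytes_in_chunk += os.path.getsize(path)
    zf.close()

stage(r"C:\testdata", r"C:\freeeed-output\staging")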




[Screenshot: Job 1001 - My Test Project - one of two runs]






Results

For reference, here's a rundown on the source data:
Source Data
Files: 208

Extensions:
doc  41
pdf  73
xls  19
xlsx  1
htm  4
html  18
eml  4
txt  8
csv  1
pst  3
ppt  12
[none]  3
1b  1
exe  1
db  1
zip  5
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg 2

From the processing history log, I found the following results:




Staging
Files: 213





Processing
Files: 2892

Extensions:
doc  45
pdf  76
xls  20
xlsx  1
htm  22
html  18
eml  2665
txt  8
csv  1
pst  0
ppt  12
[none] 3
1b  1
exe  1
db  1
zip 0
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg 2
.DS_Store  5


First off, here is an excellent illustration of my point about the importance and challenges of quality forensics in the eDiscovery pipeline. The original 208 files were copied on a flash drive from one Windows machine to another (hence the thumbs.db) and worked on there; the data itself has not changed since the project's beginning. The growth of the source population to 213 can be attributed to five .DS_Store files that most likely appeared when I shared the directory across the network to a Mac workstation. Whether or not such a slip-up would be defensible I can't say, but it's definitely sloppy.


Further, there are other differences in the doc counts between collection, staging and processing, including differences from the Concordance session. The most obvious is the thousands of emails extracted from the pst files, which Concordance didn't touch. We also see that containers -- the zip and pst files -- are not themselves treated as records, but their contents are. This is correct behavior.
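
A quick way to pin down exactly where the counts diverge is to tally extensions on both sides and diff them. Here is a small Python sketch of that check; the two folder paths are placeholders for wherever your source data and processed natives actually live.

import os
from collections import Counter

def extension_counts(folder):
    """Tally lowercase file extensions under folder ('[none]' when absent)."""
    counts = Counter()
    for _root, _dirs, files in os.walk(folder):
        for name in files:
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            counts[ext or "[none]"] += 1
    return counts

source = extension_counts(r"C:\testdata")
processed = extension_counts(r"C:\freeeed-output\native")

# print only the extensions whose counts changed between collection and processing
for ext in sorted(set(source) | set(processed)):
    if source[ext] != processed[ext]:
        print(f"{ext:>10}: {source[ext]} -> {processed[ext]}")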


Excel spreadsheets are handled nicely by FreeEed. This example, which has five worksheets, exploded into five records in Concordance, whereas the FreeEed record looks like this:


HOUSE 1
   
   
   
    ADRESS    ?
    # UNITS    15
    TAX VALUE    ?
    RENT    85000 YEAR
    ASKING PRICE    $450,000.00
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    N/A
    CONTACTS


HOUSE (2)
    ADRESS    5225 S 21 ST
    # UNITS    2 BEDROOM
    Building Size    824 SQ. FT.
    SALES INFO    $24,500 1993
    ASKING PRICE    $42,500.00
    TAX VALUE    $33,000.00
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1    5213 S 21 ST (528 SQ FT)
    ASSESED VALUE    $55,200.00
    SOLD PRICE    $30,000 1997
    HOUSE 2    5240 S 20 ST (1200 SQ FT)
    ASSESED VALUE    $39,300
    SOLD PRICE    $65,000 2002
    HOUSE 3    5229 S 21 ST (529 SQ FT)
    ASSESED VALUE    $31,800
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    N/A
    CONTACTS


HOUSE (3)
    ADRESS    1134 S 32 ST
    # UNITS    5 (8 BEDROOMS)
    Building Size    3023 SQ FT
    SALES INFO    47250 (2001)
    ASKING PRICE    $111,000.00
    TAX VALUE    $123,000.00
    RENT    1800 MONTH
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED


HOUSE (4)
    ADRESS    1003 BERT MURPHY (OLD BELVUE)
    # UNITS    1
    Building Size    ?
    SALES INFO    ?
    ASKING PRICE    $69,000.00
    TAX VALUE    ?
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    ROOF KITCHEN BATH AND BASEMENT
    CONTACTS


HOUSE (5)
    ADRESS
    # UNITS
    Building Size
    SALES INFO
    ASKING PRICE
    TAX VALUE
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED
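
For the curious, here is a sketch of the two behaviors, using Python's openpyxl as a stand-in (neither product actually uses it, and the filename is made up): flattening the whole workbook into one record's text, versus emitting one record per worksheet.

from openpyxl import load_workbook

def sheet_text(ws):
    """Flatten one worksheet into tab-separated lines of cell values."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell) for cell in row)
        for row in ws.iter_rows(values_only=True)
    )

wb = load_workbook("houses.xlsx", read_only=True)

# FreeEed-style: the whole workbook becomes one record's extracted text
one_record = "\n\n".join(sheet_text(ws) for ws in wb.worksheets)

# Concordance-style: each worksheet becomes its own record
five_records = {ws.title: sheet_text(ws) for ws in wb.worksheets}

print(f"1 record of {len(one_record)} chars vs {len(five_records)} separate records")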


Here is also a bit of very good news. Recall the PDF I sneakily renamed to *.xls. FreeEed was not fooled, and extracted the text correctly.

Microsoft® Windows® and the Windows logo are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.

Netscape® and Netscape Navigator are registered trademarks of Netscape Communications
Corporation in the United States and other countries. The Netscape logo and the Netscape
product and service names are also trademarks of Netscape Communications Corporation in
the United States and other countries.
This text was copied from FreeEed's text output file for this doc



There is no OCR engine on board FreeEed, so any text in flat PDFs and image files will remain undiscovered. Also, note that these files were not de-NISTed, which would have removed the thumbs.db and .DS_Store files, but would also have skewed the count results even further.
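
Conceptually, de-NISTing is just hash culling against the NIST NSRL list of known system files. Here is a minimal Python sketch of the idea; the hash set is a tiny stand-in for the real NSRL corpus, which runs to tens of millions of entries.

import hashlib
import os

# tiny stand-in for the NSRL known-file hash set (the real list is enormous)
KNOWN_SYSTEM_HASHES = {
    "d41d8cd98f00b204e9800998ecf8427e",  # MD5 of an empty file, as an example
}

def md5_of(path):
    """Hash a file in 64 KB blocks so large files don't land in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def denist(folder):
    """Yield only the files whose hashes are NOT on the known-system-file list."""
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if md5_of(path) not in KNOWN_SYSTEM_HASHES:
                yield path

survivors = list(denist(r"C:\testdata"))
print(f"{len(survivors)} files survive de-NISTing")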

After this initial assessment, the output looks to be ready for review. Going forward, I would refine my ability to tie off doc counts from collection to staging to processing to output, in order to account for everything. However, this is not yet prime time, and for the purposes of this experiment, it's best we move on.

I was unable to transform the output from previous versions of FreeEed into neat delimited load files. However, with some prompt and effective support from Mark (thanks, Mark!), we're now pumping out the following fields for each record:

UPI
File Name
Custodian
Source Device
Source Path
Production Path
Modified Date
Modified Time
Time Offset Value
processing_exception
master_duplicate
text
To
From
CC
BCC
Date Sent
Time Sent
Subject
Date Received
Time Received


The UPI links each record to its text file; the native filename, in the native folder, is a concatenation of UPI and File Name. I've successfully loaded portions of the output into SQL Server, but the whole load file was not ready at press time. However, I have shared the data, which I encourage you to run through your own eDiscovery platform; please share what you get. A subsequent project in the works here will feature more on FreeEed processing in production. Please stay tuned.
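
For anyone attempting the same, here is a Python sketch of how such a load file can be walked, attaching native and text paths to each record. The delimiter, folder layout and text-file naming are my assumptions to be matched to your own project settings, not documented FreeEed behavior.

import csv
import os

OUTPUT = r"C:\freeeed-output"

def load_records(loadfile_path, delimiter="\t"):
    """Read the metadata load file and attach native/text paths to each record."""
    records = []
    with open(loadfile_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            upi = row["UPI"]
            # native filename is the concatenation of UPI and File Name
            row["native_path"] = os.path.join(OUTPUT, "native", upi + row["File Name"])
            # assumed: one text file per record, named by UPI
            row["text_path"] = os.path.join(OUTPUT, "text", upi + ".txt")
            records.append(row)
    return records

for rec in load_records(os.path.join(OUTPUT, "metadata.txt"))[:5]:
    print(rec["UPI"], rec["Custodian"], rec["native_path"])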


Thoughts


Here are some of my observations following this trial run.

  • When processing completed, I had to pull the old copy-and-paste maneuver from the processing history window into a text file in order to save the log in my output folder. I'd like it to just go there when processing is done.
  • Output folders are automatically placed in FreeEed's home folder. They contain staging copies of the source data, plus metadata, doc text files and additional native copies for production. This can quickly add up to a lot of data, especially at the volumes to which this application can scale; it may in fact be an artifact of the product's Hadoop architecture. However, my application server is not necessarily the best place for all this -- it's really just the place on my network where Java lives. So, for my standalone instance, I would like the option to configure these output folder locations, by project.
  • In looking through the logs and trying to track down error files, or to compare originals to output versions, I often wished for the original source path information -- the full path. This would involve more robust logging at staging time, but with its ability to NIST-cull and to recurse folders across a network, I think this could give FreeEed a more prominent role in the collections process. In the future, it could maybe even connect directly to the Exchange server. But for now at least, I would like each file's original source path logged.
  • >>>Further, this staging log could potentially become its own metadata file, linking custodian, source path, parent-attachment information and collection date by UPI -- a more robust review platform necessarily uses multiple tables. There's a sketch of the idea just after this list. </spitballing>
  • I do like the exceptions folder, where docs that could not be processed are copied for further investigation. I would also like an exceptions log, showing what's in there, where it came from and what the error message was. I know most of this information can already be found in the metadata and Processing History log, but it would be convenient to have it separately.
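
Here is the promised sketch of that staging-manifest idea, in Python: one row per collected file, linking custodian, full source path, collection date and a hash by UPI. The UPI here is just a counter and every name is hypothetical; FreeEed assigns its own identifiers.

import csv
import hashlib
import os
from datetime import datetime, timezone

def write_staging_manifest(source_dir, custodian, manifest_path):
    """Log one tab-delimited row per collected file: UPI, custodian,
    full source path, collection date and an MD5 for later verification."""
    collected = datetime.now(timezone.utc).isoformat()
    with open(manifest_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["UPI", "Custodian", "Source Path", "Collection Date", "MD5"])
        upi = 0
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                upi += 1
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    digest = hashlib.md5(f.read()).hexdigest()  # fine for small files
                writer.writerow([f"{upi:08d}", custodian, path, collected, digest])

write_staging_manifest(r"C:\testdata", "toomey", r"C:\freeeed-output\staging_manifest.tsv")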


Next


I think this new phase on CapitalToomey.blogspot.com is off to a good start. I've set up a passable processing environment and tested two tools thereupon. In the fourth and final post to this project, I'll lay out in a bit more detail the setup of my system, and the direction I'm looking to take for future projects.

Thank you very much for reading. Any input would be very much appreciated. Please check back for future posts.