Monday, March 5, 2012

eDiscovery - Lower in the Stack

A recent career upgrade, however welcome, cut off my access to the LAW PreDiscovery and Clearwell platforms for play and experimentation. This presented both the obligation and the opportunity to explore avenues and issues I might not otherwise have encountered. Specifically, I had to get lower in the eDiscovery stack.

A stable, flexible domain with ample storage and powerful workstations was no longer my personal playground. Step one was to build a replacement.

Close enough?

It's neater up there now, but I thought this photo was funny enough to share. Here's the current rundown on the CapitalToomey Data Center: atop my family's WiFi, in the attic, I've got a Windows Server 2003 box running Concordance 9.58 and FreeEed 3.5. There's also an Acer Veriton M420 running FreeNAS that has been through a proof of concept but not yet been filled with permanent drives (perhaps more on this later). So my small bit of case data is housed on its own drive in the one work server. But it all works, so I'm happy with that at least. (As a quick side note, I've been using Microsoft's RDP Client for Mac, and it's been great.)

So, having reestablished a "work" environment, my question was the one facing thousands of companies, law firms, consultants and litigants everywhere, right now: How do we balance cost, effectiveness and reliability in handling this data?

eDiscovery for Small-to-Medium Data 
If you have potentially relevant discoverable data that's too big to fit on a CD-R, you're probably in need of technological assistance in collecting and reviewing it. In the coming series of posts, I will review two potential solutions for eDiscovery on this scale: Concordance and FreeEed*. I will use them both to process a fairly standard batch of e-docs and emails, comparing the processes and results, and offering some observations from my own experience along the way.

The Data
The initial dataset for this project combines three sources: the data LexisNexis provided for Concordance certification training in 2007, a small portion of the Enron emails, and the FreeEed test data. It totals about 98 MB.

One very important aspect of any eDiscovery project that we will not be looking at here, however, is collection. Pulling data from your stores in a thorough and dependable way, without trampling the metadata and potentially invalidating its production, is a field of expertise, and many series of experiments, in and of itself.

So, once it was "collected," I used Robocopy to place the data onto the work server. Here is part of the nice log it creates:

------------------------------------------------------------------------------

                Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :        18        18         0         0         0         0
    Files :       208       208         0         0         0         0
    Bytes :   97.93 m   97.93 m         0         0         0         0
    Times :   0:00:12   0:00:12                       0:00:00   0:00:00

    Speed :             8014316 Bytes/sec.
    Speed :             458.582 MegaBytes/min.

    Ended : Sun Mar 04 21:01:15 2012


And here is a report of the distribution of file extensions in this set.

doc  41
pdf  73
xls  19
xlsx  1
htm  4
html  18
eml  4
txt  8
csv  1
pst  3
ppt  12
[none]  3
1b  1
exe  1
db  1
zip  5
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg  2
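
A tally like this can be reproduced with a few lines of Python. This is only a sketch, not the tool used to produce the report above: the function names `count_extensions` and `format_report` are made up for illustration, and it assumes extensions should be compared case-insensitively and that files without one are grouped under `[none]`.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Tally every file under root by its lowercased extension.

    Files with no extension are counted under '[none]' (an assumption
    made here to mirror the report's layout).
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower().lstrip(".")
            counts[ext or "[none]"] += 1
    return counts


def format_report(counts):
    """Render the tally as 'extension  count' lines, most common first."""
    return "\n".join(f"{ext}  {n}" for ext, n in counts.most_common())
```

Calling `print(format_report(count_extensions(some_folder)))` on the copied data folder would print one line per extension. Note that a container such as a pst or zip counts as a single file here, however many items it holds.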

These two logs are important, and will be used to validate our processing results.
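
Before processing anything, the two logs can be checked against each other: the extension counts should add up to the number of files Robocopy reports as copied. The Python sketch below does that under stated assumptions. It expects the `Files :` summary row to have the six columns shown in the log above and the report to be "extension count" pairs, one per line. The function names are hypothetical.

```python
import re


def robocopy_file_totals(log_text):
    """Return (total, copied, failed) from the 'Files :' summary row."""
    match = re.search(
        r"^\s*Files\s*:\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)",
        log_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Files :' summary row found")
    total, copied, _skipped, _mismatch, failed, _extras = map(int, match.groups())
    return total, copied, failed


def parse_extension_report(report_text):
    """Parse 'extension count' lines into a dict, ignoring anything else."""
    counts = {}
    for line in report_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def logs_agree(log_text, report_text):
    """True if every file was copied, none failed, and the tallies match."""
    total, copied, failed = robocopy_file_totals(log_text)
    reported = sum(parse_extension_report(report_text).values())
    return copied == total and failed == 0 and reported == copied
```

The same comparison can be repeated later against whatever document counts a processing tool reports, keeping in mind that containers such as pst and zip files expand into many more items once processed.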

And now we're ready to start consuming this stuff. I have made the first runs, and will start putting together the results to share.

Thank you for reading. Please check back soon. I will post updates to my Twitter feed.


* It's important to note that FreeEed, designed as it is to run on a Hadoop cluster, should be able to scale way way WAY beyond the scope of my experiments. Here's hoping I soon get to the point where I can test those abilities. For now, I'll focus on usability and reliability.
