A stable, flexible domain with ample storage and powerful workstations was no longer my personal playground. Step one was to build a replacement.
Close enough?
It's neater up there now, but I thought this photo funny enough to share. Here's the current rundown on the CapitalToomey Data Center: Atop my family's WiFi, in the attic, I've got a Windows Server 2003 box running Concordance 9.58 and FreeEed 3.5. There's also an Acer Veriton M420 running FreeNAS that has been through a proof of concept but not yet filled with permanent drives (perhaps more on this later). So my small bit of case data is housed on its own drive in the one work server. It all works, though, so I'm happy with that at least. (As a quick side note, I've been using Microsoft's RDP Client for Mac, and it's been great.)
So, having reestablished a "work" environment, my question was the one facing thousands of companies, law firms, consultants and litigants everywhere, right now: How do we balance cost, effectiveness and reliability in handling this data?
eDiscovery for Small-to-Medium Data
If you have potentially relevant discoverable data that's too big to fit on a CD-R, you're probably in need of technological assistance in collecting and reviewing it. In the coming series of posts, I will review two potential solutions for eDiscovery on this scale: Concordance and FreeEed*. I will use them both to process a fairly standard batch of e-docs and emails, comparing the processes and results, and offering some observations from my own experience along the way.
The Data
The initial dataset for this project combines what LexisNexis provided for Concordance certification training in 2007, a bit of the Enron emails, and the FreeEed test data. It totals about 98 MB.
One very important aspect of any eDiscovery project that we will not be looking at here, however, is collection. Pulling data from your stores in a thorough and dependable way, without trampling the metadata and potentially invalidating its production... that is a field of expertise, and many series of experiments, in and of itself.
So, once it was "collected," I used RoboCopy to place the data onto the work server. Here is part of the nice log it creates (I'll sketch the command itself just after it):
------------------------------------------------------------------------------
               Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :        18        18         0         0         0         0
   Files :       208       208         0         0         0         0
   Bytes :   97.93 m   97.93 m         0         0         0         0
   Times :   0:00:12   0:00:12                       0:00:00   0:00:00
   Speed :             8014316 Bytes/sec.
   Speed :             458.582 MegaBytes/min.
   Ended : Sun Mar 04 21:01:15 2012
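For anyone following along, a RoboCopy command roughly like this one produces that kind of log; the source and destination paths here are placeholders rather than my actual share names:

robocopy C:\Collected\CaseData D:\CaseData /E /COPY:DAT /R:1 /W:1 /NP /TEE /LOG:D:\Logs\casedata_copy.log

/E picks up the whole folder tree (including empty directories), /COPY:DAT preserves data, attributes and timestamps, and /LOG plus /TEE writes the summary to a file while still showing it on screen.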
And here is a report of the distribution of file extensions in this set (a quick way to reproduce such a report is sketched just after the list).
doc 41
pdf 73
xls 19
xlsx 1
htm 4
html 18
eml 4
txt 8
csv 1
pst 3
ppt 12
[none] 3
1b 1
exe 1
db 1
zip 5
jpg 6
odp 1
odt 1
docx 1
pps 1
wav 1
msg 2
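If you want to generate a breakdown like that yourself, a PowerShell pipeline along these lines will do it. Treat it as a sketch: the path is a placeholder, and the Extension property keeps its leading dot (".doc" rather than "doc").

# Count files by extension under the collected-data folder (path is a placeholder).
Get-ChildItem D:\CaseData -Recurse |
    Where-Object { -not $_.PSIsContainer } |   # skip directories
    Group-Object Extension |                   # groups like ".doc", ".pdf"
    Sort-Object Count -Descending |
    Format-Table Name, Count -AutoSize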
These two logs are important, and will be used to validate our processing results.
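In practice, that validation is mostly a counting exercise: the copy log says 208 files went over, so whatever each tool reports having processed gets checked against that number and against the extension breakdown (keeping in mind that containers like PSTs and ZIPs expand into more documents than there are files on disk). A minimal PowerShell sketch, assuming a hypothetical per-document export from the review tool:

# processed_docs.csv is a hypothetical export, one row per processed document.
$onDisk    = (Get-ChildItem D:\CaseData -Recurse | Where-Object { -not $_.PSIsContainer }).Count
$processed = (Import-Csv D:\Exports\processed_docs.csv).Count
"On disk: $onDisk   Processed: $processed   Difference: $($onDisk - $processed)"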
And now we're ready to start consuming this stuff. I have made the first runs, and will start putting together the results to share.
Thank you for reading. Please check back soon. I will post updates to my Twitter feed.
* It's important to note that FreeEed, designed as it is to run on a Hadoop cluster, should be able to scale way, way, WAY beyond the scope of my experiments. Here's hoping I soon get the chance to test those abilities. For now, I'll focus on usability and reliability.