Wednesday, May 2, 2012

The CapitalToomey Data Center

I said that my work space has gotten neater. In fact, it has gone most of the way to neat-o!

Actually, the details I'm about to share are fairly pedestrian, but I do want to document the environment in which I run the experiments detailed here. Also, I'm pretty excited about finding Dia, a "free Visio" solution. Check out the diagrammatic pyrotechnics:



This setup serves both "work" and home needs - data shares, backups, media, wireless, etc. Here are the highlights:

NAS - FreeNAS 8.0.2 with 2 x 2TB hard drives in RAID1 mirror configuration, which I built into an older Acer Veriton. It serves Windows and Apple shares, including Time Machine backups for the Mini and my laptop, which runs on Lion. This is especially cool because when Lion was released, it was discovered that the change to its authentication protocol for Time Machine had made a lot of third party NAS solutions incompatible. No issues here.

ED - Windows Server 2003 on a 3GHz desktop I got at Gig when I first moved to Albany after college. A long time ago. It's running SQL Express and VS Express, the JRE for things like FreeEed and WORDij, and is a web server. Right now, all it's got is BugTracker.net, which is basically my to-do list and notebook. It's made available outside by DynDNS, which is a lot cheaper -- and I imagine safer -- than buying a static IP address. It has two hard drives -- SQL Server is not using the NAS as of now. I usually connect to ED via RDC for Mac.

Mini - This Leopard Mac Mini had been the family desktop since 2007, until we upgraded recently and retired it. It's now the iTunes Home Sharing server, whose library lives on the NAS.

Router - I'm happy with this Netgear -N router. My house has three floors plus a garden apartment. The router is on the third floor, and its wireless reaches all the way downstairs with no problem. It's split into two networks, so my tenants are not on the network shown above, but nevertheless have wireless access to the Internet and their own machines. It's got port forwarding to send http traffic to ED, and a built-in client for DynDNS.

Like most living systems, this will continue to develop. But for now it's working and stable. It's most of what I need for the work done for this blog, which can now continue!

I'd love to see your network! Please share in the comments, or offer suggestions. Thank you very much for reading.

Saturday, April 14, 2012

eDiscovery - Lower in the Stack pt.IV - Wrap

The question put forth in this project, where I built a small eDiscovery processing environment, gathered test data and ran it through Concordance and FreeEed, was 'How do we balance cost, effectiveness and reliability in handling this data?'

I don't think there is a definitive answer to that question; I think it is at the front of the mind of anyone whose job it is to manage data -- for business management and intelligence, regulatory compliance, litigation, research or hobby -- and I think we saw some of why it can present a moving target. And this was without even using buzzwords like 'cloud', standards like 'de-duplication', or anachronisms like 'paper' -- all valid and important considerations to many who face these challenges professionally.

The industry-standard Electronic Discovery Reference Model shows the breadth and depth of these issues -- and why I provided the caveat that I'd be looking at Small-to-Medium-Data, where manual intervention is possible. Plenty are the defendants who still scan redacted paper documents to PDF and present them on CD-Rs.

EDRM Graphic


It should not be forgotten that it was just six years ago that the FRCP were amended to speak to the growth of electronically stored information, and still, it is said that for eDiscovery, judges normally care more about results than methodology. And by the same token, most any eDiscovery tool -- in the proper hands -- can be employed successfully.

As an eDiscovery sys admin with a background in social scientific research, I quickly recognized the intersection of these two paths. In their own way, both seek to collect, process and organize data about human action, and from it divine patterns and present them. So I started this blog with that in mind; I used eDiscovery tools for collection and processing, and communication science tools for analysis. I asked questions that I thought were of interest in both fields. And as it turned out, I was in good company. It was an excellent union. All that was missing was system administration, so by going 'lower in the stack' I closed the loop completely.

The "CapitalToomey Data Center" has undergone some changes since this project's outset. They will be outlined in future posts, but at least now I've got control of more components of the eDiscovery and Social Science research frameworks. I will keep on the lookout for interesting topics to post in this general area. I welcome your input, and thank you for reading.

Finally, if you are interested in sharing your results with the test data from this experiment, I would very much like to hear about it - in comments here or via Twitter.

Tuesday, April 3, 2012

eDiscovery - Lower in the Stack pt.III - FreeEed

This is part 3 of a project I started to explore the underpinnings of the platforms I'd been using (and taking for granted) for eDiscovery and data analysis for the past five-plus years. Starting from scratch, the question I posed was: How to balance cost, effectiveness and reliability in handling this data?

"This data" is about 98 MB of mixed email and edocs that typify the material filling the hard drives and file servers everywhere, all the time. I have most recently used FreeEed to process this data, and here is a flyover - from 30,000 down to maybe 5,000 feet.


Setup


FreeEed is an open source eDiscovery platform currently being developed and growing in features and capabilities. It's designed to run (in the JRE) on Linux, Mac and Windows machines standalone, or in Hadoop clusters, including Amazon S3. So it has a great deal of scalability. I have it running on a Windows 2003 server, where the data also resides, but it certainly has the ability to work across the network.



The UI is sparse. You set up a project with a few options, including target data locations, output delimiters, and data de-NISTing, and save it to a *.project file, which can be used to keep track of these settings across multiple sessions.



From there, your work runs in two phases - staging and processing. Staging copies the target files to zip archives in an output subfolder. Processing pulls from these archives to create a metadata file and native and text archives, in their own output subfolder.




 Job 1001 - My Test Project - one of two runs






Results

For reference, here's a rundown on the source data:
Source Data
Files: 208

Extensions:
doc  41
pdf  73
xls  19
xlsx  1
htm  4
html  18
eml  4
txt  8
csv  1
pst  3
ppt  12
[none]  3
1b  1
exe  1
db  1
zip  5
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg 2

From the processing history log, I found the following results:




Staging
Files: 213





Processing
Files: 2892

Extensions:
doc  45
pdf  76
xls  20
xlsx  1
htm  22
html  18
eml  2665
txt  8
csv  1
pst  0
ppt  12
[none] 3
1b  1
exe  1
db  1
zip 0
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg 2
.DS_Store  5


First off, here is an excellent illustration of my point on the importance and challenges of quality forensics in the eDiscovery pipeline. The growth of the source doc population, which has not otherwise changed since the project's beginning, can be attributed to the five .DS_Store files that most likely appeared when I shared that directory across the network to a Mac workstation. The original 208 files were copied on a flash drive from one Windows machine to another (hence the thumbs.db), and worked on. However, recent system changes apart from the data itself have changed this population to 213. Whether or not such a slip-up would be defensible I can't say, but it's definitely sloppy.


Further, there are other differences in the doc counts between collection, staging and processing, including differences from the Concordance session. The most obvious is the thousands of emails extracted from the pst files that Concordance didn't touch. We also see where containers -- zip and pst files -- themselves are not treated as records, but their contents are. This is correct.


Excel spreadsheets are handled nicely with FreeEed. This example, which has five worksheets, exploded to five records in Concordance, where the FreeEed record looks like this:


HOUSE 1
   
   
   
    ADRESS    ?
    # UNITS    15
    TAX VALUE    ?
    RENT    85000 YEAR
    ASKING PRICE    $450,000.00
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    N/A
    CONTACTS


HOUSE (2)
    ADRESS    5225 S 21 ST
    # UNITS    2 BEDROOM
    Building Size    824 SQ. FT.
    SALES INFO    $24,500 1993
    ASKING PRICE    $42,500.00
    TAX VALUE    $33,000.00
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1    5213 S 21 ST (528 SQ FT)
    ASSESED VALUE    $55,200.00
    SOLD PRICE    $30,000 1997
    HOUSE 2    5240 S 20 ST (1200 SQ FT)
    ASSESED VALUE    $39,300
    SOLD PRICE    $65,000 2002
    HOUSE 3    5229 S 21 ST (529 SQ FT)
    ASSESED VALUE    $31,800
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    N/A
    CONTACTS


HOUSE (3)
    ADRESS    1134 S 32 ST
    # UNITS    5 (8 BEDROOMS)
    Building Size    3023 SQ FT
    SALES INFO    47250 (2001)
    ASKING PRICE    $111,000.00
    TAX VALUE    $123,000.00
    RENT    1800 MONTH
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED


HOUSE (4)
    ADRESS    1003 BERT MURPHY (OLD BELVUE)
    # UNITS    1
    Building Size    ?
    SALES INFO    ?
    ASKING PRICE    $69,000.00
    TAX VALUE    ?
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED    ROOF KITCHEN BATH AND BASEMENT
    CONTACTS


HOUSE (5)
    ADRESS
    # UNITS
    Building Size
    SALES INFO
    ASKING PRICE
    TAX VALUE
    RENT
    TAXES
    UTILITIES
    NEIGHBORS
    HOUSE 1
    ASSESED VALUE
    SOLD PRICE
    HOUSE 2
    ASSESED VALUE
    SOLD PRICE
    HOUSE 3
    ASSESED VALUE
    SOLD PRICE
    HOUSE 4
    ASSESED VALUE
    SOLD PRICE
   
    REPAIRS NEEDED


Here is also a bit of very good news. Recall the PDF I sneakily renamed to *.xls. FreeEed was not fooled, and extracted the text correctly.

Microsoft® Windows® and the Windows logo are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.

Netscape® and Netscape Navigator are registered trademarks of Netscape Communications
Corporation in the United States and other countries. The Netscape logo and the Netscape
product and service names are also trademarks of Netscape Communications Corporation in
the United States and other countries.
This text was copied from FreeEed's text output file for this doc



There is no OCR engine on board FreeEed, so any text in flat PDFs and image files will remain undiscovered. Also, note that these files were not de-NISTed, which would have removed the thumbs.db and .DS_Store files, but would also have skewed the count results even further.
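De-NISTing generally means dropping files whose hashes appear in a reference list of known system and application files, such as the NIST NSRL. Here is a minimal Python sketch of that idea. It is not FreeEed's implementation; the `known_hashes` set, the choice of SHA-1, and the uppercase hex format are assumptions for illustration.

```python
import hashlib


def sha1_of(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA-1 of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def cull_known(paths, known_hashes):
    """Split paths into (kept, culled) by whether their hash is in known_hashes.

    known_hashes is a hypothetical set of uppercase SHA-1 strings that
    would be loaded from a reference list before the run.
    """
    kept, culled = [], []
    for path in paths:
        if sha1_of(path) in known_hashes:
            culled.append(path)
        else:
            kept.append(path)
    return kept, culled
```

Returning the culled list as well as the kept list matters for the doc counts: everything removed can still be accounted for.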

After this initial assessment, this output looks to be ready for review. Going forward, I would refine my ability to tie off doc counts from collection to staging to processing to output, in order to account for everything. However, this is not yet prime time, and for the purposes of this experiment, it's best we move on.
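As a rough sketch of what "tying off" counts could look like, the Python below compares two extension tallies and reports only what changed. The numbers are a subset of the source and processing counts listed above; the function name is my own.

```python
from collections import Counter


def diff_counts(before, after):
    """Return {ext: (before, after)} for every extension whose count changed."""
    return {
        ext: (before.get(ext, 0), after.get(ext, 0))
        for ext in sorted(set(before) | set(after))
        if before.get(ext, 0) != after.get(ext, 0)
    }


# A subset of the counts reported in this post.
source = Counter({"doc": 41, "pdf": 73, "xls": 19, "htm": 4,
                  "eml": 4, "pst": 3, "zip": 5})
processed = Counter({"doc": 45, "pdf": 76, "xls": 20, "htm": 22,
                     "eml": 2665, "pst": 0, "zip": 0, ".DS_Store": 5})

for ext, (b, a) in diff_counts(source, processed).items():
    print(f"{ext}: {b} -> {a} ({a - b:+d})")
```

Each changed line is then something to explain: container contents extracted, system files picked up, and so on.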

I was unable to transform the output from previous versions of FreeEed into neat delimited load files. However, with some prompt and effective support from Mark (thanks, Mark!), we're now pumping out the following fields for each record:

UPI
File Name
Custodian
Source Device
Source Path
Production Path
Modified Date
Modified Time
Time Offset Value
processing_exception
master_duplicate
text
To
From
CC
BCC
Date Sent
Time Sent
Subject
Date Received
Time Received


The UPI links each record to its text file; the native filename, in the native folder, is a concatenation of UPI and File Name. I've successfully loaded portions of the output to SQL Server, but the whole load file was not ready at press time. However, I have shared the data, which I encourage you to run through your eDiscovery platform, and share what you get. A subsequent project that's in the works here will feature more on FreeEed processing in production. Please stay tuned.
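For anyone picking up the shared data, here is a hedged Python sketch of reading a delimited load file and deriving the linked file names. Several things here are assumptions, not FreeEed facts: the tab delimiter, a header row carrying the field names, the `_` separator in the native file name, and the `<UPI>.txt` naming of the text file. Adjust them to match the actual output.

```python
import csv
import io


def read_load_file(handle, delimiter="\t"):
    """Parse a delimited load file whose first row holds the field names."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def native_name(record, sep="_"):
    """Concatenate UPI and File Name; the separator is an assumption."""
    return f"{record['UPI']}{sep}{record['File Name']}"


def text_name(record):
    """Assumed naming of the extracted-text file for a record."""
    return f"{record['UPI']}.txt"


# Made-up sample row for illustration only; not taken from the real output.
SAMPLE = "UPI\tFile Name\tCustodian\n000001\tbudget.xls\tcustodian_a\n"

records = read_load_file(io.StringIO(SAMPLE))
for rec in records:
    print(rec["UPI"], native_name(rec), text_name(rec))
```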


Thoughts


Here are some of my observations following this trial run.

  • When processing completed, I had to pull the old copy-and-paste maneuver from the processing history window into a text file in order to save the log in my output folder. I'd like it to just go there when processing is done.
  • Output folders are automatically placed in FreeEed's home folder. They contain staging copies of the source data, plus metadata, doc text files, plus additional native copies for production. This can quickly add up to a lot of data - especially at the volumes to which this application can scale. It may in fact be an artifact of the product's Hadoop architecture. However, my application server is not necessarily the best place for this stuff -- it's really just the place on my network where Java lives. So, for my standalone instance, I would like the option to configure these output folder locations, by project.
  • In looking through the logs and trying to track down error files, or to compare originals to output versions, I often wished for the original source path information -- full path. This would involve more robust logging at the time of staging, but really, with its ability to NIST cull and recurse folders across a network, I think this can present FreeEed with a more prominent role in the collections process. In the future, it could maybe even connect to the Exchange server like this. But for now at least, I would like each file's original source path logged. 
  • >>>Further, this staging log could potentially become its own metadata file. It could link custodian, source path, parent-attachment information, and collection date, by UPI. A more robust review platform necessarily uses multiple tables. </spitballing>  
  • I do like the exceptions folder, where docs that could not be processed are copied for further investigation. I would also like an exceptions log, which shows what's in there, where it came from and what the error message was. I know most of this information can be found in the metadata and Processing History log already, but it would be convenient to have it separately.
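To make the source-path wish above concrete, here is a small Python sketch of a staging manifest: one row per file with its full original path, written at the time of staging. This is only my illustration of the idea, not anything FreeEed does; the column names and the zero-padded `id` are hypothetical.

```python
import csv
import os
from datetime import datetime, timezone


def write_staging_manifest(source_root, manifest_path):
    """Walk source_root, write one CSV row per file, return the row count.

    Keep manifest_path outside source_root so the log does not list itself.
    """
    fields = ["id", "source_path", "size_bytes", "modified_utc"]
    count = 0
    with open(manifest_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        for dirpath, _dirs, files in os.walk(source_root):
            for name in sorted(files):
                full = os.path.abspath(os.path.join(dirpath, name))
                info = os.stat(full)
                count += 1
                modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                writer.writerow({
                    "id": f"{count:06d}",  # hypothetical record id
                    "source_path": full,
                    "size_bytes": info.st_size,
                    "modified_utc": modified.isoformat(),
                })
    return count
```

The same shape would work for an exceptions log, with an extra column for the error message.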


Next


I think this new phase on CapitalToomey.blogspot.com is off to a good start. I've set up a passable processing environment and tested two tools thereupon. In the fourth and final post to this project, I'll lay out in a bit more detail the setup of my system, and the direction I'm looking to take for future projects.

Thank you very much for reading. Any input would be very much appreciated. Please check back for future posts.

Tuesday, March 27, 2012

eDiscovery - Lower in the Stack - Test Data

Reader Paul C. Easton made the excellent suggestion that I zip up and share the test data being used in this experiment. And I agree for at least three (related) reasons: I too would like to see how other tools would handle this data set; I currently have no access to LAW, and would love to hear from someone who does; and as interesting as this project already is, it can only improve on input from others. So here goes (59.2 MB - be patient).

Please let me know how you make out with this, and thanks, Paul! I will be sharing my FreeEed results ASAP.

Note - inside the zip archive, there is a RENAME_LOG that lists the files whose extensions I intentionally switched. That file itself was not processed in the test runs.

Thursday, March 15, 2012

eDiscovery - Lower in the Stack pt.II - Concordance

This is part 2 of a project I started to look at some data-gathering tools, absent LAW and Clearwell, which I had been using since this blog's outset.

Concordance is best known -- and probably best used -- as a review tool. However, it does have the ability to discover and import edocs, email and transcripts, in addition to the industry standard "Concordance style" TIF/TXT loads of the same.

I have on occasion used Concordance to add edocs to existing document collections in cases. Most often, it works fine for PDFs, with two caveats: Adobe Reader or Acrobat must be installed on the same machine, and any document text needs to be already extracted, as Concordance has no OCR engine. This, coupled with the fact that Acrobat by default will not OCR a doc with any text in it, can leave you with PDFs that show up like this:


[ASDF-000123]


[ASDF-000124]


[ASDF-000125]


[ASDF-000126]


"Yes, I Batch-OCRed everything before importing it. Right after I added the Bates footers."

Actually, Concordance has no on-board document processing tools at all, but uses other installed programs to extract metadata natively. Most of the time, including now, this means MS Office and Acrobat.

Here's the basic procedure:

After creating a blank eDocs database, select Import > Edocs from the menu, and select your population by location and doc type:



You decide how to match gathered metadata to your database fields, and start the import. Logging is less-than-verbose.

But at least you have a list of the error docs for your QC. Here's a shot of the Concordance screen looking at a PDF after import:


The auto-hyperlinking is convenient. Those hyperlinks will launch a doc's native viewer, and they work as long as the data location does not change. If it does, Concordance ships with a pile of scripts, one of which can adjust the links fairly easily.

Here is the list of metadata Concordance pulls:

TITLE
SUBJECT
AUTHOR
COMPANY
CATEGORY
KEYWORDS
PRODUCER
CREATOR
COMMENTS
METADATA
FILEPATH
DATE
MODDATE
CREATIONDATE
PRINTDATE
TEXT01
TEXT02
TEXT03
TEXT04
TEXT05

It will do its best to populate those fields with what it finds in the docs.

Results

I imported files using the *.* method, where any object in the specified path is processed. For reference, here's a rundown on the source data:

Files: 208

Extensions:
doc  41
pdf  73
xls  19
xlsx  1
htm  4
html  18
eml  4
txt  8
csv  1
pst  3
ppt  12
[none]  3
1b  1
exe  1
db  1
zip  5
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg 2

And here is an export from the database of all file extensions.

Files: 253

Extensions:
csv,1
db,1
doc,41
docx,1
exe,1
htm,4
html,14
jpg,6
msg,2
odp,1
odt,1
pdf,73
pps,1
ppt,11
pst,3
txt,8
wav,1
xls,73
xlsx,3
zip,5
1b,1
[blank],1

Emails are processed with a different database module. At least the email files were included in the error log, so the admin knows they're not in the database. There are two reasons the count increased on import:

1. This import method takes every object in the supplied path, including system files like Thumbs.db and .DS_Store, so there's obviously no NIST culling with Concordance.

2. Concordance handles multi-sheet Excel files by splitting each sheet into a separate document, much to the chagrin of reviewers; the hyperlinks on the docs pulled from a 14-sheet Excel file will point to that file 14 times.

Another "Easter Egg" in this dataset is the set of bogus extensions I assigned before copying up to the server. I took a handful of files and, like a user sometimes will, gave them incorrect extensions. For example, here's a bit of text from a PDF I renamed to *.xls:

%PDF-1.2
%âãÏÓ
663 0 obj
<<
/Linearized 1
/O 668
/H [ 3860 1307 ]
/L 718869
/E 36196
/N 58
/T 705490
>>
endobj


And here is part of what this document actually says:

Microsoft® Windows® and the Windows logo are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.
Netscape® and Netscape Navigator are registered trademarks of Netscape Communications Corporation in the United States and other countries. The Netscape logo and the Netscape product and service names are also trademarks of Netscape Communications Corporation in the United States and other countries.

So, furthermore, there is no MIME type discovery prior to processing. And given that some text could be extracted, there was no error for Concordance to report. This document (let's say it's being reviewed in response to a copyright violation complaint) has effectively been rendered non-responsive, because its real text is not searchable. As scale increases, so does the risk involving issues like these.
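A check of this kind does not have to be elaborate. The Python sketch below looks at a file's leading bytes and flags it when a recognized signature does not fit the extension, which would have caught the `%PDF-1.2` header sitting in a file named *.xls. The signature table is deliberately tiny and the extension sets are my own rough groupings, so treat it as an illustration rather than a real type detector.

```python
from pathlib import Path

# (leading bytes, label, extensions that are plausible for that content)
SIGNATURES = [
    (b"%PDF-", "pdf", {"pdf"}),
    (b"PK\x03\x04", "zip container",
     {"zip", "docx", "xlsx", "pptx", "odt", "odp"}),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file",
     {"doc", "xls", "ppt", "pps", "msg", "db"}),
    (b"\xff\xd8\xff", "jpeg", {"jpg", "jpeg"}),
]


def sniff(path):
    """Return (label, plausible_extensions) for the first match, else None."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    for magic, label, extensions in SIGNATURES:
        if head.startswith(magic):
            return label, extensions
    return None


def is_mislabeled(path):
    """True when the content is recognized and the extension does not fit it."""
    hit = sniff(path)
    if hit is None:
        return False  # unknown signature: no opinion either way
    extension = Path(path).suffix.lower().lstrip(".")
    return extension not in hit[1]
```

Running `is_mislabeled` over a collection before import would produce a short list of files to route by content rather than by name.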

In part 3 I will share the results of using FreeEed to process this same dataset. I very much welcome your input. Thank you for reading.

Monday, March 5, 2012

eDiscovery - Lower in the Stack

A recent career upgrade, however welcome, nevertheless cut off my access to LAW Prediscovery and Clearwell platforms for play and experimentation. This presented both the obligation and opportunity to explore other avenues and issues to which I might not have otherwise been introduced. Specifically, I had to get lower in the eDiscovery stack.

A stable, flexible domain with ample storage and powerful workstations was no more my personal playground. Step one was to build a replacement.

Close enough?

It's neater up there now, but I thought this photo funny enough to share. Here's the current rundown on the CapitalToomey Data Center: Atop my family's WiFi, in the attic, I've got a Windows Server 2003 box running Concordance 9.58 and FreeEed 3.5. There's also an Acer Veriton M420 with FreeNAS that's been put through Proof-Of-Concept, but not yet filled with permanent drives (perhaps more on this later). So my small bit of case data is being housed on its own drive in the one work server. But it all works, so I'm happy with that at least. (As a quick side note, I've been using Microsoft's RDP Client for Mac, and it's been great.)

So, having reestablished a "work" environment, my question was one facing thousands of companies, law firms, consultants and litigants, everywhere, right now: How do we balance cost, effectiveness and reliability in handling this data?

eDiscovery for Small-to-Medium Data 
If you have potentially relevant discoverable data that's too big to fit on a CD-R, you're probably in need of technological assistance in collecting and reviewing it. In the coming series of posts, I will review two potential solutions for eDiscovery on this scale: Concordance and FreeEed*. I will use them both to process a fairly standard batch of e-docs and emails, comparing the processes and results, and offering some observations from my own experience along the way.

The Data
The initial dataset for this project is what LexisNexis provided for Concordance certification training in 2007, a bit of the Enron emails and the FreeEed test data. It totals about 98 MB.

One very important aspect of any eDiscovery project that we will not be looking at here, however, is collection. Pulling data from your stores in a thorough and dependable way, without trampling the metadata and potentially invalidating its production...this is a field of expertise and many series of experiments in and of itself.

So, once it was "collected," I used RoboCopy to place the data on to the work server. Here is part of the nice log it creates:

------------------------------------------------------------------------------

                Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :        18        18         0         0         0         0
    Files :       208       208         0         0         0         0
    Bytes :   97.93 m   97.93 m         0         0         0         0
    Times :   0:00:12   0:00:12                       0:00:00   0:00:00

    Speed :             8014316 Bytes/sec.
    Speed :             458.582 MegaBytes/min.

    Ended : Sun Mar 04 21:01:15 2012


And here is a report of the distribution of file extensions in this set.

doc  41
pdf  73
xls  19
xlsx  1
htm  4
html  18
eml  4
txt  8
csv  1
pst  3
ppt  12
[none]  3
1b  1
exe  1
db  1
zip  5
jpg  6
odp  1
odt  1
docx  1
pps  1
wav  1
msg 2

These two logs are important, and will be used to validate our processing results.
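For readers who want to produce a similar extension report for their own data, here is a minimal Python sketch. It is not the tool I used for the list above; the function names and the `[none]` bucket handling are simply one way to do it.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files under root by lowercase extension.

    Files without an extension go under '[none]'; dotfiles such as
    .DS_Store are reported under their own name.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        name = path.name
        if name.startswith(".") and name.count(".") == 1:
            ext = name
        else:
            ext = path.suffix.lower().lstrip(".") or "[none]"
        counts[ext] += 1
    return counts


def print_report(counts):
    """Print a total followed by one 'extension  count' line per extension."""
    print(f"Files: {sum(counts.values())}")
    for ext, n in counts.most_common():
        print(f"{ext}  {n}")
```

Calling `print_report(count_extensions(some_folder))` before and after each stage gives comparable snapshots to check against the RoboCopy totals.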

And now we're ready to start consuming this stuff. I have made the first runs, and will start putting together the results to share.

Thank you for reading. Please check back soon. I will post updates to my twitter feed.


* It's important to note that FreeEed, designed as it is to run on a Hadoop cluster, should be able to scale way way WAY beyond the scope of my experiments. Here's hoping I get to the point soon to test those abilities. For now, I'll focus on usability and reliability.