This is part 3 of a project I started to explore the underpinnings of the platforms I'd been using (and taking for granted) for eDiscovery and data analysis for the past five-plus years. Starting from scratch, the question I posed was: how do you balance cost, effectiveness, and reliability in handling this data?
"This data" is about 98 MB of mixed email and edocs that typify the material filling hard drives and file servers everywhere, all the time. I have most recently used FreeEed to process this data, and here is a flyover - from 30,000 down to maybe 5,000 feet.
Setup
FreeEed is an open source eDiscovery platform under active development and steadily growing in features and capabilities. It's designed to run (in the JRE) standalone on Linux, Mac and Windows machines, or in Hadoop clusters, including Amazon S3, so it offers a great deal of scalability. I have it running on a Windows 2003 server, where the data also resides, but it can certainly work across the network.
The UI is sparse. You set up a project with a few options, including target data locations, output delimiters, and data de-NISTing, and save it to a *.project file, which can be used to keep track of these settings across multiple sessions.
From there, your work runs in two phases - staging and processing. Staging copies the target files to zip archives in an output subfolder. Processing pulls from these archives to create a metadata file and native and text archives, in their own output subfolder.
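FreeEed does its staging in Java under the hood, but to make the workflow concrete, here is a rough Python sketch of the staging idea - walking a target directory and copying everything into a zip archive in an output subfolder. The paths and archive naming are my own placeholders, not FreeEed's.

import os
import zipfile
from pathlib import Path

def stage_directory(source_dir, output_dir, archive_name="staging_0001.zip"):
    """Copy every file under source_dir into a zip archive in output_dir.

    This mimics the idea of FreeEed's staging step; the real tool splits
    data across multiple archives and keeps far more bookkeeping.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    archive_path = out / archive_name
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full = Path(root) / name
                # Store each file relative to the source root so the
                # archive preserves the original folder structure.
                zf.write(full, full.relative_to(source_dir))
    return archive_path

if __name__ == "__main__":
    print(stage_directory(r"C:\data\test_project", r"C:\freeeed\output\staging"))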
Job 1001 - My Test Project - one of two runs
Results
For reference, here's a rundown on the source data:
Source Data
Files: 208
Extensions:
doc 41
pdf 73
xls 19
xlsx 1
htm 4
html 18
eml 4
txt 8
csv 1
pst 3
ppt 12
[none] 3
1b 1
exe 1
db 1
zip 5
jpg 6
odp 1
odt 1
docx 1
pps 1
wav 1
msg 2
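For what it's worth, a rundown like the one above is easy to reproduce (and to spot-check against later counts) with a few lines of Python; the folder path below is just a placeholder.

from collections import Counter
from pathlib import Path

def tally_extensions(source_dir):
    """Count files by lowercase extension; files with no extension get '[none]'."""
    counts = Counter()
    for path in Path(source_dir).rglob("*"):
        if path.is_file():
            ext = path.suffix.lstrip(".").lower() or "[none]"
            counts[ext] += 1
    return counts

if __name__ == "__main__":
    counts = tally_extensions(r"C:\data\test_project")  # placeholder path
    print("Files:", sum(counts.values()))
    for ext, n in counts.most_common():
        print(ext, n)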
From the processing history log, I found the following results:
Staging
Files: 213
Processing
Files: 2892
Extensions:
doc 45
pdf 76
xls 20
xlsx 1
htm 22
html 18
eml 2665
txt 8
csv 1
pst 0
ppt 12
[none] 3
1b 1
exe 1
db 1
zip 0
jpg 6
odp 1
odt 1
docx 1
pps 1
wav 1
msg 2
.DS_Store 5
First off, here is an excellent illustration of my point about the importance and challenges of quality forensics in the eDiscovery pipeline. The source doc population grew from 208 to 213 even though the data itself has not changed since the project began. The original 208 files were copied on a flash drive from one Windows machine to another (hence the thumbs.db) and worked on; the five additional .DS_Store docs most likely appeared when I shared that directory across the network to a Mac workstation. Whether or not such a slip-up would be defensible I can't say, but it's definitely sloppy.
Further, there are other differences in the doc counts between collection, staging and processing, including differences from the Concordance session. The most obvious is the thousands of emails extracted from the pst files, which Concordance didn't touch. We also see that containers -- zip and pst files -- are not themselves treated as records, but their contents are. This is correct.
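To make those shifts easier to eyeball, here is a small sketch that diffs the collection tally against the processing tally. The numbers are the ones reported above, typed in by hand and abridged to the extensions that moved.

from collections import Counter

# Extension counts as reported above (collection vs. processing), abridged.
collection = Counter({"doc": 41, "pdf": 73, "xls": 19, "htm": 4, "eml": 4,
                      "pst": 3, "zip": 5, "ppt": 12})
processing = Counter({"doc": 45, "pdf": 76, "xls": 20, "htm": 22, "eml": 2665,
                      "pst": 0, "zip": 0, "ppt": 12})

for ext in sorted(set(collection) | set(processing)):
    delta = processing[ext] - collection[ext]
    if delta:
        print(f"{ext:5} {collection[ext]:5} -> {processing[ext]:5} ({delta:+d})")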
Excel spreadsheets are handled nicely by FreeEed. This example, which has five worksheets, exploded to five records in Concordance, whereas the single FreeEed record looks like this:
HOUSE 1
ADRESS ?
# UNITS 15
TAX VALUE ?
RENT 85000 YEAR
ASKING PRICE $450,000.00
TAXES
UTILITIES
NEIGHBORS
HOUSE 1
ASSESED VALUE
SOLD PRICE
HOUSE 2
ASSESED VALUE
SOLD PRICE
HOUSE 3
ASSESED VALUE
SOLD PRICE
HOUSE 4
ASSESED VALUE
SOLD PRICE
REPAIRS NEEDED N/A
CONTACTS
HOUSE (2)
ADRESS 5225 S 21 ST
# UNITS 2 BEDROOM
Building Size 824 SQ. FT.
SALES INFO $24,500 1993
ASKING PRICE $42,500.00
TAX VALUE $33,000.00
RENT
TAXES
UTILITIES
NEIGHBORS
HOUSE 1 5213 S 21 ST (528 SQ FT)
ASSESED VALUE $55,200.00
SOLD PRICE $30,000 1997
HOUSE 2 5240 S 20 ST (1200 SQ FT)
ASSESED VALUE $39,300
SOLD PRICE $65,000 2002
HOUSE 3 5229 S 21 ST (529 SQ FT)
ASSESED VALUE $31,800
SOLD PRICE
HOUSE 4
ASSESED VALUE
SOLD PRICE
REPAIRS NEEDED N/A
CONTACTS
HOUSE (3)
ADRESS 1134 S 32 ST
# UNITS 5 (8 BEDROOMS)
Building Size 3023 SQ FT
SALES INFO 47250 (2001)
ASKING PRICE $111,000.00
TAX VALUE $123,000.00
RENT 1800 MONTH
TAXES
UTILITIES
NEIGHBORS
HOUSE 1
ASSESED VALUE
SOLD PRICE
HOUSE 2
ASSESED VALUE
SOLD PRICE
HOUSE 3
ASSESED VALUE
SOLD PRICE
HOUSE 4
ASSESED VALUE
SOLD PRICE
REPAIRS NEEDED
HOUSE (4)
ADRESS 1003 BERT MURPHY (OLD BELVUE)
# UNITS 1
Building Size ?
SALES INFO ?
ASKING PRICE $69,000.00
TAX VALUE ?
RENT
TAXES
UTILITIES
NEIGHBORS
HOUSE 1
ASSESED VALUE
SOLD PRICE
HOUSE 2
ASSESED VALUE
SOLD PRICE
HOUSE 3
ASSESED VALUE
SOLD PRICE
HOUSE 4
ASSESED VALUE
SOLD PRICE
REPAIRS NEEDED ROOF KITCHEN BATH AND BASEMENT
CONTACTS
HOUSE (5)
ADRESS
# UNITS
Building Size
SALES INFO
ASKING PRICE
TAX VALUE
RENT
TAXES
UTILITIES
NEIGHBORS
HOUSE 1
ASSESED VALUE
SOLD PRICE
HOUSE 2
ASSESED VALUE
SOLD PRICE
HOUSE 3
ASSESED VALUE
SOLD PRICE
HOUSE 4
ASSESED VALUE
SOLD PRICE
REPAIRS NEEDED
Here is also a bit of very good news. Recall the PDF I sneakily renamed to *.xls: FreeEed was not fooled, and extracted the text correctly.
Microsoft® Windows® and the Windows logo are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.
Netscape® and Netscape Navigator are registered trademarks of Netscape Communications
Corporation in the United States and other countries. The Netscape logo and the Netscape
product and service names are also trademarks of Netscape Communications Corporation in
the United States and other countries.
This text was copied from FreeEed's text output file for this doc.
There is no OCR engine on board FreeEed, so the text in flat (image-only) PDFs and image files will go undiscovered. Also, note that these files were not de-NISTed, which would have removed the thumbs.db and .DS_Store files but would also have skewed the count results even further.
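Since there is no OCR on board, one low-tech way to find likely flat documents is to flag any record whose extracted text file is empty or nearly so. This sketch assumes a text output folder with one .txt file per record, which is my simplification, and the 25-character threshold is arbitrary.

from pathlib import Path

def flag_ocr_candidates(text_dir, min_chars=25):
    """Return text-output files that are suspiciously short.

    Anything with almost no real content probably came from a flat PDF or
    an image file and is a candidate for a separate OCR pass.
    """
    candidates = []
    for txt in Path(text_dir).glob("*.txt"):
        content = txt.read_text(errors="ignore").strip()
        if len(content) >= min_chars:
            continue
        candidates.append(txt)
    return candidates

if __name__ == "__main__":
    for path in flag_ocr_candidates(r"C:\freeeed\output\text"):  # placeholder path
        print("Possible OCR candidate:", path.name)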
After this initial assessment, this output looks to be ready for review. Going forward, I would refine my ability to tie off doc counts from collection to staging to processing to output, in order to account for everything. However, this is not yet prime time, and for the purposes of this experiment, it's best we move on.
I was unable to transform the output from previous versions of FreeEed into neat delimited load files. However, with some prompt and effective support from Mark (thanks, Mark!), we're now pumping out the following fields for each record:
UPI
File Name
Custodian
Source Device
Source Path
Production Path
Modified Date
Modified Time
Time Offset Value
processing_exception
master_duplicate
text
To
From
CC
BCC
Date Sent
Time Sent
Subject
Date Received
Time Received
The UPI links each record to its text file; the native filename, in the native folder, is a concatenation of UPI and File Name. I've successfully loaded portions of the output to SQL Server, but the whole load file was not ready at press time. However, I have shared the data, which I encourage you to run through your eDiscovery platform, and share what you get. A subsequent project that's in the works here will feature more on FreeEed processing in production. Please stay tuned.
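To illustrate how the UPI stitches the pieces together, here is a hedged sketch that reads a delimited load file into a dict keyed by UPI and builds the expected text and native paths for each record. The delimiter, the folder layout, and the exact way UPI and File Name are concatenated are all assumptions on my part - check your own output before relying on them.

import csv
from pathlib import Path

OUTPUT_DIR = Path(r"C:\freeeed\output")  # assumed output layout
DELIMITER = "\t"                         # whatever delimiter the project was set to

def load_records(load_file):
    """Read the metadata load file into a dict keyed by UPI."""
    with open(load_file, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter=DELIMITER)
        return {row["UPI"]: row for row in reader}

def linked_paths(record):
    """Guess the text and native file paths for one record.

    'UPI_FileName' is my placeholder for the concatenation described above.
    """
    upi = record["UPI"]
    text_path = OUTPUT_DIR / "text" / (upi + ".txt")
    native_path = OUTPUT_DIR / "native" / (upi + "_" + record["File Name"])
    return text_path, native_path

if __name__ == "__main__":
    records = load_records(str(OUTPUT_DIR / "metadata.txt"))
    for upi, rec in list(records.items())[:5]:
        print(upi, *linked_paths(rec))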
Thoughts
Here are some of my observations following this trial run.
- When processing completed, I had to pull the old copy-and-paste maneuver from the processing history window into a text file in order to save the log in my output folder. I'd like it to just go there when processing is done.
- Output folders are automatically placed in FreeEed's home folder. They contain staging copies of the source data, plus metadata, doc text files, and additional native copies for production. This can quickly add up to a lot of data - especially at the volumes to which this application can scale. It may in fact be an artifact of the product's Hadoop architecture. However, my application server is not necessarily the best place for this stuff -- it's really just the place on my network where Java lives. So, for my standalone instance, I would like the option to configure these output folder locations by project.
- In looking through the logs and trying to track down error files, or comparing originals to output versions, I often wished for the original source path information -- the full path. This would require more robust logging at the time of staging, but really, with its ability to NIST-cull and recurse folders across a network, I think this could give FreeEed a more prominent role in the collections process. In the future, it could maybe even connect directly to the Exchange server. But for now at least, I would like each file's original source path logged.
- Further, this staging log could potentially become its own metadata file, linking custodian, source path, parent-attachment information, and collection date by UPI. A more robust review platform necessarily uses multiple tables. </spitballing>
- I do like the exceptions folder, where docs that could not be processed are copied for further investigation. I would also like an exceptions log showing what's in there, where it came from, and what the error message was. I know most of this information can already be found in the metadata and Processing History log, but it would be convenient to have it separately (a rough sketch of what I mean follows this list).
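As a stopgap, an exceptions report could be pulled out of the metadata load file itself. This sketch leans on the processing_exception field listed above; the delimiter and paths are, again, my assumptions.

import csv

def build_exception_log(load_file, out_file, delimiter="\t"):
    """Write a small exceptions report from the metadata load file.

    Keeps only rows where processing_exception is non-empty and records the
    UPI, file name, source path, and error text. Field names follow the list
    above; the delimiter and file layout are assumptions.
    """
    kept = 0
    with open(load_file, newline="", encoding="utf-8") as src, \
         open(out_file, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)
        writer.writerow(["UPI", "File Name", "Source Path", "processing_exception"])
        for row in reader:
            if row.get("processing_exception", "").strip():
                writer.writerow([row["UPI"], row["File Name"],
                                 row["Source Path"], row["processing_exception"]])
                kept += 1
    return kept

if __name__ == "__main__":
    n = build_exception_log(r"C:\freeeed\output\metadata.txt",
                            r"C:\freeeed\output\exceptions_log.txt")
    print(n, "exception record(s) written")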
Next
I think this new phase on CapitalToomey.blogspot.com is off to a good start. I've set up a passable processing environment and tested two tools on it. In the fourth and final post of this project, I'll lay out in a bit more detail the setup of my system and the direction I'm looking to take for future projects.
Thank you very much for reading. Any input would be very much appreciated. Please check back for future posts.