This is part 2 of a project I started to look at some data-gathering tools, absent LAW and Clearwell, which I had been using since this blog's outset.
Concordance is best known -- and probably best used -- as a review tool. However, it can also discover and import edocs, email and transcripts, in addition to the industry-standard "Concordance style" TIF/TXT loads of the same.
I have on occasion used Concordance to add edocs to existing document collections in cases. Most often, it works fine for PDFs, with two caveats: Adobe Reader or Acrobat must be installed on the same machine, and any document text needs to be already extracted, as Concordance has no OCR engine. This, coupled with the fact that Acrobat by default will not OCR a doc that already has any text in it, can leave you with PDFs that show up like this:
[ASDF-000123]
[ASDF-000124]
[ASDF-000125]
[ASDF-000126]
"Yes, I Batch-OCRed everything before importing it. Right after I added the Bates footers."
Actually, Concordance has no on-board document processing tools at all, but uses other installed programs to extract metadata natively. Most of the time, including now, this means MS Office and Acrobat.
Here's the basic procedure:
After creating a blank eDocs database, select Import > Edocs from the menu, and select your population by location and doc type:
You decide how to match gathered metadata to your database fields, and start the import. Logging is less-than-verbose.
But at least you have a list of the error docs for your QC. Here's a shot of the Concordance screen looking at a PDF after import:
The auto-hyperlinking is convenient. Those hyperlinks will launch a doc's native viewer, and they work as long as the data location does not change. If it does change, Concordance ships with a pile of scripts, one of which can adjust the links fairly easily.
Here is the list of metadata Concordance pulls:
TITLE
SUBJECT
AUTHOR
COMPANY
CATEGORY
KEYWORDS
PRODUCER
CREATOR
COMMENTS
METADATA
FILEPATH
DATE
MODDATE
CREATIONDATE
PRINTDATE
TEXT01
TEXT02
TEXT03
TEXT04
TEXT05
It will do its best to populate those fields with what it finds in the docs.
Results
I imported files using the *.* method, where any object in the specified path is processed. For reference, here's a rundown on the source data:
Files: 208
Extensions:
doc 41
pdf 73
xls 19
xlsx 1
htm 4
html 18
eml 4
txt 8
csv 1
pst 3
ppt 12
[none] 3
1b 1
exe 1
db 1
zip 5
jpg 6
odp 1
odt 1
docx 1
pps 1
wav 1
msg 2
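A rundown like the one above can be produced with a few lines of script before anything is imported. This is only a minimal sketch in Python, not how the counts in this post were generated; the function name tally_extensions and the "[none]" label for extensionless files are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def tally_extensions(root):
    """Count every file under root by lowercased extension.

    Files with no extension are grouped under the label "[none]"
    (an assumed label, chosen to match the rundown in this post).
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower().lstrip(".") or "[none]"
            counts[ext] += 1
    return counts
```

Calling tally_extensions on the source folder and summing the values gives the total file count, which is the baseline to hold any import against.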
And here is an export from the database of all file extensions.
Files: 253
Extensions:
csv,1
db,1
doc,41
docx,1
exe,1
htm,4
html,14
jpg,6
msg,2
odp,1
odt,1
pdf,73
pps,1
ppt,11
pst,3
txt,8
wav,1
xls,73
xlsx,3
zip,5
1b,1
[blank],1
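Eyeballing two lists of 20-odd extensions is error-prone, so here is a small sketch that diffs them. The two dictionaries are copied from the tallies in this post; treating the export's "[blank]" row as the same bucket as the source's "[none]" row is my assumption, and compare_tallies is a hypothetical helper name.

```python
# Counts copied from the source rundown and the database export above.
SOURCE = {
    "doc": 41, "pdf": 73, "xls": 19, "xlsx": 1, "htm": 4, "html": 18,
    "eml": 4, "txt": 8, "csv": 1, "pst": 3, "ppt": 12, "[none]": 3,
    "1b": 1, "exe": 1, "db": 1, "zip": 5, "jpg": 6, "odp": 1, "odt": 1,
    "docx": 1, "pps": 1, "wav": 1, "msg": 2,
}
DATABASE = {
    "csv": 1, "db": 1, "doc": 41, "docx": 1, "exe": 1, "htm": 4,
    "html": 14, "jpg": 6, "msg": 2, "odp": 1, "odt": 1, "pdf": 73,
    "pps": 1, "ppt": 11, "pst": 3, "txt": 8, "wav": 1, "xls": 73,
    "xlsx": 3, "zip": 5, "1b": 1,
    "[none]": 1,  # assumption: the export's "[blank]" row is this bucket
}


def compare_tallies(before, after):
    """Return {ext: (before, after)} for every extension whose count changed."""
    diffs = {}
    for ext in sorted(set(before) | set(after)):
        b, a = before.get(ext, 0), after.get(ext, 0)
        if b != a:
            diffs[ext] = (b, a)
    return diffs


for ext, (b, a) in compare_tallies(SOURCE, DATABASE).items():
    print(f"{ext}: {b} -> {a} ({a - b:+d})")
```

With these numbers the diff flags eml, html, ppt, xls, xlsx and the no-extension bucket, and the changes net out to +45, which accounts for the move from 208 to 253.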
Emails are processed with a different database module. At least the email files were included in the error log, so the admin knows they're not in the database. There are two reasons the count increased on import:
1. This import method takes every object in the supplied path, including system files like Thumbs.db and .DS_Store, so there's obviously no NIST culling with Concordance.
2. Concordance handles multi-sheet Excel files by splitting each sheet into a separate document, much to the chagrin of reviewers; the hyperlinks on the docs pulled from a 14-sheet Excel file will point to that file 14 times.
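Since the import itself won't cull anything, the first problem can be handled with a pre-filter run over the folder beforehand. The sketch below is one possible approach in Python, not a Concordance feature: SYSTEM_NAMES is an assumed, incomplete list of junk file names, and known_hashes stands in for a set of uppercase SHA-1 hex digests such as one loaded from a NIST NSRL reference data set (loading that set is not shown).

```python
import hashlib
from pathlib import Path

# Assumed, incomplete list of system file names to drop by name alone.
SYSTEM_NAMES = {"thumbs.db", ".ds_store", "desktop.ini"}


def sha1_of(path, chunk_size=1 << 20):
    """Return the uppercase SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def cull(root, known_hashes=frozenset()):
    """Split the files under root into (keep, dropped) lists.

    A file is dropped if its name is in SYSTEM_NAMES or its SHA-1 is in
    known_hashes (a hypothetical, caller-supplied set of known-file hashes).
    """
    keep, dropped = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.name.lower() in SYSTEM_NAMES or sha1_of(path) in known_hashes:
            dropped.append(path)
        else:
            keep.append(path)
    return keep, dropped
```

Only the keep list would then be copied to the import location, and the dropped list is worth saving alongside the error log for QC.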
Another "Easter Egg" in this dataset is the set of bogus extensions I assigned before copying the files up to the server. I took a handful of files and, like a user sometimes will, gave them incorrect extensions. For example, here's a bit of the text Concordance extracted from a PDF I renamed to *.xls:
%PDF-1.2
%âãÏÓ
663 0 obj
<<
/Linearized 1
/O 668
/H [ 3860 1307 ]
/L 718869
/E 36196
/N 58
/T 705490
>>
endobj
And here is part of what this document actually says:
Microsoft® Windows® and the Windows logo are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.
Netscape® and Netscape Navigator are registered trademarks of Netscape Communications Corporation in the United States and other countries. The Netscape logo and the Netscape product and service names are also trademarks of Netscape Communications Corporation in the United States and other countries.
So, furthermore, there is no MIME type discovery prior to processing. And because some text could be extracted, there was no error for Concordance to report. This document (let's say it's being reviewed in response to a copyright violation complaint) has effectively been rendered non-responsive, since a search for its actual wording will hit nothing but raw PDF syntax. As scale increases, so does the risk from issues like these.
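A crude form of type discovery is to compare a file's leading bytes against its extension before import. The following is a minimal sketch in Python under stated assumptions: the SIGNATURES and EXPECTED tables are deliberately tiny and hand-picked, the names sniff and extension_mismatch are hypothetical, and plain-text formats (txt, csv, htm) have no signature, so this check cannot catch them.

```python
from pathlib import Path

# A few well-known leading byte signatures; nowhere near a complete list.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Office, msg
    (b"PK\x03\x04", "zip"),                          # zip, docx, xlsx, odt
    (b"\xff\xd8\xff", "jpg"),
]

# Which container each extension is expected to use (assumed mapping).
EXPECTED = {
    "pdf": "pdf",
    "doc": "ole2", "xls": "ole2", "ppt": "ole2", "pps": "ole2", "msg": "ole2",
    "zip": "zip", "docx": "zip", "xlsx": "zip", "odt": "zip", "odp": "zip",
    "jpg": "jpg",
}


def sniff(path):
    """Return the container kind suggested by the first bytes, or None."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, kind in SIGNATURES:
        if head.startswith(magic):
            return kind
    return None


def extension_mismatch(path):
    """True when a recognized signature contradicts the file's extension."""
    expected = EXPECTED.get(Path(path).suffix.lower().lstrip("."))
    found = sniff(path)
    return expected is not None and found is not None and found != expected
```

Run against the renamed file above, sniff would see the %PDF- header and extension_mismatch would flag the *.xls name, giving the admin a chance to fix it before the text is indexed.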
In part 3 I will share the results of using FreeEed to process this same dataset. I very much welcome your input. Thank you for reading.