Tuesday, January 18, 2011

EDRM Enron Data and Semantic Network Analysis, pt.1

The Electronic Discovery Reference Model organization is doing a lot to shape standards and practices, and this work is important and impressive. But if all they ever did was make the Enron data available, that'd be enough to make me love them.

More than a million emails and attachments, from a real company in recent history. A standardized, public domain dataset. Available for free... Most excellent.

On this blog I will be using the Enron data for a series of projects and experiments. But first, a few words on what I'm doing and why:

As an eDiscovery support and administration guy, I spend a lot of time taking in data, loading it up for review, and supporting users through the review process. This is where my work here begins. Discussion of other facets of eDiscovery can be found here, here, here, here or via Google. Here, we will be looking for ways to assist document review: cutting through the irrelevant, zeroing in on the relevant, and finding the hot docs, all in ways that are, well, cool...

Semantic Network Analysis
EDRM-Enron-PST-001 contains seven PST files totaling just over 1.5GB. I processed these with LexisNexis LAW PreDiscovery and loaded the resulting 41,022 records, each containing the doc text and some metadata, into Concordance.

Enron1DCBscreenshot

I then ran a few very general search terms against the doc text, namely: Profit, Offshore, Corporate, Regulation, Natural Gas, and Electric, and tagged the documents accordingly. I often run early top-level search-and-tag jobs for reviewers to get a sense of the responsiveness of the docs, and to help divide the work among the review team. Today, I am looking to see how a semantic network analysis tool can help reveal more about document content - without any review! (Unreasonable?)
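For readers who want to try a search-and-tag pass outside a review platform, here is a minimal Python sketch of the idea. It is not how Concordance does it: the matching rules (case-insensitive, whole words or phrases), the `tag_document` helper, and the sample records are all assumptions made for illustration.

```python
import re

# The six search terms from this post. The matching rules below are an
# assumption (case-insensitive, whole word or phrase), not Concordance syntax.
TERMS = ["Profit", "Offshore", "Corporate", "Regulation", "Natural Gas", "Electric"]

def tag_document(text, terms=TERMS):
    """Return the set of search terms that occur in the document text."""
    tags = set()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.add(term)
    return tags

# Hypothetical records standing in for the loaded database.
docs = {
    "DOC-0001": "Natural gas prices rose again this week.",
    "DOC-0002": "The electric desk reported a profit.",
}

tags_by_doc = {doc_id: tag_document(text) for doc_id, text in docs.items()}
```

Because the sketch matches whole words only, "Electric" will not hit "electricity" and "Profit" will not hit "profits". Whether that is what you want depends on how broad the first pass should be.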

First, to pull out doc text by topic, I exported each set of tagged records from Concordance to delimited text files containing the email subject line (where there was one) and the text of the document. For Natural Gas and Electric, I ran the queries exclusively, i.e., I exported only the docs that contained Natural Gas but NOT Electric, and vice versa. Because these two elements make up the core of Enron's business, they're often talked about together, but I wanted to see whether corporate communication was different when handling the two separately.
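The "Natural Gas but NOT Electric" logic is just a filter on the tags, and it can be sketched in a few lines of Python. Everything here is illustrative: the `exclusive_ids` and `export_delimited` helpers, the pipe delimiter, the column names, and the sample records are assumptions, not what Concordance produces.

```python
import csv
import io

def exclusive_ids(tags_by_doc, include, exclude):
    """IDs of docs tagged with `include` but not with `exclude`."""
    return sorted(
        doc_id
        for doc_id, tags in tags_by_doc.items()
        if include in tags and exclude not in tags
    )

def export_delimited(records, doc_ids, out, delimiter="|"):
    """Write doc ID, subject line (if any), and doc text as delimited rows."""
    writer = csv.writer(out, delimiter=delimiter)
    writer.writerow(["doc_id", "subject", "text"])
    for doc_id in doc_ids:
        rec = records[doc_id]
        writer.writerow([doc_id, rec.get("subject", ""), rec["text"]])

# Hypothetical tags and records, for illustration only.
tags_by_doc = {
    "DOC-0001": {"Natural Gas"},
    "DOC-0002": {"Electric"},
    "DOC-0003": {"Natural Gas", "Electric"},
}
records = {
    "DOC-0001": {"subject": "Gas desk", "text": "Natural gas volumes are up."},
    "DOC-0002": {"text": "Electric load forecast attached."},
    "DOC-0003": {"subject": "Both", "text": "Natural gas and electric markets."},
}

gas_only = exclusive_ids(tags_by_doc, "Natural Gas", "Electric")
electric_only = exclusive_ids(tags_by_doc, "Electric", "Natural Gas")

buf = io.StringIO()  # swap in open("gas_only.txt", "w", newline="") for a real file
export_delimited(records, gas_only, buf)
```

DOC-0003 mentions both topics, so it lands in neither export. That is the point of running the two queries exclusively.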

To get an overview of the content of these docs (what do they say?), I wanted to try semantic network analysis, which measures the co-occurrence of words in text. For that, I used WORDij. In addition to importing and analyzing all sorts of text sources and preparing input for several other analysis tools, WORDij has nice visualization on board. Here are some examples.

Profit
ProfitSemanticNetwork

Offshore
OffshoreSemanticNetwork

Corporate
CorporateSemanticNetwork

Regulation
RegulationSemanticNetwork

Electric
ElectricNOTngSemanticNetwork

Natural Gas
NGnotElectricSemanticNetwork
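To give a feel for what sits behind networks like these, here is a rough Python sketch of co-occurrence counting: every pair of words that appear within a few words of each other gets a count, and those counts become the link weights. This is not WORDij's implementation. The `cooccurrence_counts` function, the window size, the tokenizing rule, and the sample sentence are all assumptions for illustration.

```python
import re
from collections import Counter

def cooccurrence_counts(text, window=3, stopwords=frozenset()):
    """Count unordered word pairs that occur within `window` words of each other."""
    # Simple tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
    pairs = Counter()
    for i, word in enumerate(words):
        # Look ahead only, so each pair of positions is counted once.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

sample = "gas trading desk posts gas trading update"
counts = cooccurrence_counts(sample, window=2)

# Strongest links first; each pair is an edge, each count an edge weight.
for (a, b), n in counts.most_common(3):
    print(a, b, n)
```

Pairs are stored in alphabetical order, so "gas trading" and "trading gas" count toward the same link. A real run would also want a stopword list, or words like "the" and "to" will dominate the network.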

On Day One of this analysis, not much is jumping out at me from these results. Do notice, though, that both the Enron Web Site and Kiodex come up in Gas conversations but not in Electric ones. According to Wikipedia, Enron was at this time developing an online trading system, and Kiodex is a risk management group.

The presence of the word "Headquarters" in the Offshore communication is possibly funny.

But that's all I've got for now. I will be taking a second swipe in the coming week. Going forward, I'd like your input, network analysts and eDiscovery gurus alike. Any observations, suggestions, thoughts, etc., would be very much appreciated in the comments. I'll keep going, and let you know what I come up with.
