Wednesday, July 27, 2011

Clearwell Enron 'Topics'

Moving forward with my look at semantic network analysis of the Enron data, I've been exploring Clearwell's 'Topics' feature, which, like a lot of Clearwell's features, does much of the work for you with a few clicks.

For your perusal, here is the entire Topics list of the Enron corpus, sorted as Clearwell supplies it, and here are the top 10, sorted by occurrence frequency:
Topic | Terms | # of Documents
Margin Call | Call, Conference, Conference call, Margin, number, Monday, Margin Call, account, funds, requirements, market, time, amount, equity, Weekly Call, year, Hours, stock, wire, portfolio | 1538
intended recipient | recipient, intended recipient, affiliate, contract, sender, basis, party, sender or reply, Thanks, Message, privileged material, basis of a contract, use, relevant affiliate, sole use, RESUME, reply, Report, opportunity, people | 1238
staff meeting | Staff, Meeting, staff meeting, MONDAY, ENW Staff, location, MORNING, CAO Staff, January, Tuesday, MONDAY MORNING STAFF, CHANGE, ETS STAFF, Reminder, ETS MONDAY MORNING, MORNING STAFF, ETS MONDAY MORNING STAFF, ETS MONDAY, week, Teams | 1204
Enron Stock | employees, company, Enron, Stock, Enron Stock, Fund, millions, retirement, Proceeds, Demand Ken Lay, Demand, Enron Stock Sales, Lay, Sales, Ken Lay, Stock Sales, New York Times, energy crisis last, last, underhanded dealings | 1151
experiences | interview, Schedule, questions, Enron, time, form, date, system, employees, communication, answers, experts, students, conversation, experiences, video, clip, process, position, Guide | 1027
Conference Call | Call, Conf, times, Conference Call, Conference, comments, GIR Conf, decision, Wednesday, Practices, Business, Summary, Beeson Conf, One, BPs, BP Conf, Meeting Summary, Business Practices, Gas Conf, Links Meeting Summary | 993
Dow Jones Index | Person, Unknown Person, Dates, HourAhead hour, hour, Start Date, schedule, Subject, 2001 Subject, file, Message, Index, Kate, Dow Jones Index, Index Prices, Jones Index, djenergy Subject, EPMI Index, EPMI Index Prices, Prices | 896
Repeat parent | parent, Repeat, Repeat parent, Date, Description, ENTRY, CALENDAR ENTRY, CALENDAR, Standard Time, Central Standard Time, Time, INVITATION Description, INVITATION, Meeting Dates, Mtg, Russell, call, OFFICE, Buchanan, Stacey | 807
option premium | OPTION, models, premium, option premium, spread, Index, Insurance, stock, HOUR, library, volatility, Email, baskets, average, Yes, price, tax, Digital Options, Index Option, State | 801
Access Request | Request, Access, Date, Access Request, act, Read, email, Switch, Drop, POLR Request, data, approver, form, ERCOT, data approver, Switch Request, switch date, Customer, period, end | 734

I would like to understand more fully how Clearwell derives these topics, and to compare its process with other tools, such as Gensim.
Here is Clearwell's writeup on its 'Topics' analysis process.
I will continue posting updates to this project as it develops.

Monday, February 7, 2011

Just for fun, a look at Legal Tech NY tweets

Unable to make it to this year's Legal Tech conference, I was at least able to follow some of the action from a distance, via Twitter. While I was at it, I took some notes.

Here is a social network graph from Tuesday and Wednesday of Twitter users who used the #LTNY hashtag and @mentioned other users.


This represents tweets between midnight on Tuesday, 2/1/11 and 5 p.m. on Wednesday, 2/2/11, with some gaps. The 118 isolates (including myself), who neither mentioned anyone nor were mentioned, were removed, as were the 25 dyads and one triad, whose members mentioned only one another. This left six isolated groups (shown in that sea-foam color) and one large connected network, together comprising 178 nodes.

So what we see here is a group that is more connected than not. Of the 349 participants, 231 (66 percent) mentioned or were mentioned by at least one other. But what were they tweeting about?
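The pruning described above (dropping isolates, dyads and the triad, then seeing what is left) amounts to grouping users into connected components and filtering by size. Here is a minimal sketch in Python of that idea. It assumes the mentions are available as simple (user, mentioned user) pairs; the handles u1 to u10 and the function name are made up for illustration, and NodeXL does this work for you.

```python
from collections import defaultdict

def connected_components(users, mentions):
    """Group users into connected components, ignoring mention direction."""
    neighbors = defaultdict(set)
    everyone = set(users)
    for a, b in mentions:
        everyone.update((a, b))
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen = set()
    components = []
    for user in sorted(everyone):
        if user in seen:
            continue
        # Depth-first walk collecting everyone reachable from this user.
        stack = [user]
        seen.add(user)
        component = set()
        while stack:
            current = stack.pop()
            component.add(current)
            for other in neighbors[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(component)
    return components

# Hypothetical sample: one group of four, one dyad, one triad, one isolate.
users = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8", "u9", "u10"]
mentions = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"),
            ("u5", "u6"), ("u7", "u8"), ("u8", "u9")]

components = connected_components(users, mentions)
isolates = [c for c in components if len(c) == 1]
kept = [c for c in components if len(c) > 3]  # drop isolates, dyads, triads
connected_share = (len(users) - len(isolates)) / len(users)
print(len(components), len(isolates), kept, connected_share)
```

With the figures from this post, the same arithmetic is 349 - 118 = 231 connected users, and 231 / 349 is about 66 percent.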

LTNY Tweets

A semantic network graph of tweets between approx. 2 p.m. and 8 p.m. 2/1/11.


A semantic network graph of #LTNY tweets on Wednesday, 2/2.

These graphs show the words that co-occurred most often in the #LTNY tweets. This being a trade show, there were a handful of "chance(s) to win", and among those prizes were Starbucks gift cards. Both "legal tech" and "legal hold" were talked about, 16 and 9 times, respectively. Social media, and the lack of coverage thereof at the conference, also came up a lot: 37 times. This, of course, from a bunch of tweeps.

So there it is - Legal Tech 2011, remote. I hung out with jasnwilsn and had some "caffiene" at Starbucks.

The data were collected with NodeXL, which was also used to draw the social network. The semantic network analysis was done with WORDij.

Tuesday, January 25, 2011

Enron Social Network Analysis, Post eDiscovery (alpha version)

So far, we have looked at the "what" of a portion of the Enron corpus. I often help to develop review plans by crafting queries zeroing in on various topics and providing hit counts for those topics -- in other words, based on the "what." And I think semantic network analysis can help there.

Reviewers are also interested in the "who," and for that -- at least in part -- we have social network analysis. In our increasingly connected world, it is more and more interesting and valuable -- and possible -- to explore and understand those connections and relationships. Recently of note, LinkedIn has started offering that capability. And NodeXL, which we'll be looking at here, does that for several other social networking platforms.

For this project, the questions I'm working to help answer are: Who is talking about this topic? Who should we depose? Whose documents should we request/subpoena? And the clues are coming not from email systems or the Web (today, anyway) but from a set of documents that have been discovered, reviewed and produced to us in TIF/Text format.

Given native docs, or crawling your own network, there are more options open for metadata analysis of "who." Clearwell offers some aspects of social network analysis, and I have heard that Humanizing Technologies will soon launch an eDiscovery tool that may do this. And for the Enron data in particular there's Enronic. However, we're often limited to at best a few fields that give sender and recipient information. And that's where today's post begins.

Just as before, I collected documents based on their content, and exported them to delimited text files. For this exercise, though, I exported only the from, to, cc and bcc fields. For example:

As is often the case, what we see here is what we get. These data are pretty good for this kind of production, i.e., they contain a lot of valid email addresses. Now what we need to do is extract those email addresses, and organize them into a map of who-to-whom. Specifically:
  • Extract from the output all email addresses
  • Make a deduplicated list of email addresses
  • For each pair of addresses, count the number of messages they shared, as sender-recipient or recipient-recipient
  • Create text output listing each pair that shared at least one message, and the number of messages they shared
I couldn't find a tool that did just that, so I made one. Its output is ideal input for NodeXL, a free add-on for Microsoft Excel that makes network analysis very accessible. With NodeXL, anyone with an interest can create a wide range of network analyses on a wide range of sources, including, with my program, Concordance output. This process is now a lot easier than when I was in grad school, and that's to everyone's benefit.
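For the curious, the four steps listed above can be sketched in a few lines of Python. This is not the actual scraper, just an illustration of the idea. The regular expression is a rough approximation of an email address, the function names are mine, and the example.com addresses are placeholders rather than anything from the Enron set.

```python
import re
from collections import Counter
from itertools import combinations

# Rough pattern for an email address; real-world data will need tuning.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+'-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_addresses(text):
    """Return the sorted, deduplicated, lower-cased addresses found in text."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})

def count_pairs(records):
    """Count, for each pair of addresses, how many messages they shared.

    Each record is the from/to/cc/bcc text exported for one message.
    Sender-recipient and recipient-recipient pairs are treated alike.
    """
    pair_counts = Counter()
    for record in records:
        addresses = extract_addresses(record)
        pair_counts.update(combinations(addresses, 2))
    return pair_counts

def to_edge_list(pair_counts):
    """Format the counts as tab-delimited lines: address, address, count."""
    return "\n".join(f"{a}\t{b}\t{n}" for (a, b), n in sorted(pair_counts.items()))

# Hypothetical export, one string per message.
records = [
    "alice@example.com; Bob <bob@example.com>",
    "bob@example.com, carol@example.com, ALICE@example.com",
    "no addresses here",
]
pair_counts = count_pairs(records)
print(to_edge_list(pair_counts))
```

Sorting the addresses within each message before pairing them means a pair is always stored in the same order, so the counts for a given pair accumulate in one place. The tab-delimited output can then be pasted into a spreadsheet as an edge list.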

Here's a screenshot of a NodeXL workup on the Regulation documents.


This is an analysis of 238 emails that contained the word "regulation." We see several unconnected groups, and I highlighted a central node in one of them. When a node is selected on the graph, NodeXL automatically highlights its corresponding entry in the vertices table. There we see that address belonged to Andrew Lewis, who is identified on this list of former Enron employees as a "Director." That makes him a likely target for deposition or document request, and, as an executive, one that would be identified early in the case.

A less likely target, though, would be Mark Whitt, who is identified on the employee list as "N/A". Yet there he is, linking to two branches of a large group that includes executives like VP Barry Tycholiz.


Now, in Concordance (or Summation, etc., whatever review tool we're using) our next query can be something like "(AUTHOR CO Whitt) OR (TO CO Whitt) OR (CC CO Whitt) OR (BCC CO Whitt)" as a way to shape review. In fact, we can sort our network based on the most connected nodes and start from there.
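Sorting by connectedness and turning a name into a query are both easy to automate. The sketch below assumes an edge list of address pairs like the one my scraper produces, counts each node's distinct neighbors, and builds the query string in the form shown above. The node names are placeholders, the function names are mine, and the field names are the ones from my database, so adjust to taste.

```python
def rank_by_degree(edges):
    """Return (node, degree) pairs, most connected first, ties by name."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return sorted(
        ((node, len(linked)) for node, linked in neighbors.items()),
        key=lambda item: (-item[1], item[0]),
    )

def concordance_query(name, fields=("AUTHOR", "TO", "CC", "BCC")):
    """Build a 'contains' query for one name across the address fields."""
    return " OR ".join(f"({field} CO {name})" for field in fields)

# Hypothetical edge list: "hub" is linked to three others.
edges = [("hub", "n1"), ("hub", "n2"), ("hub", "n3"), ("n1", "n2")]
ranking = rank_by_degree(edges)
print(ranking)
print(concordance_query("Whitt"))
```

Degree (the number of distinct contacts) is only one measure of connectedness; NodeXL offers others, such as betweenness, that may suit a bridging figure better.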


So, a few things we can do here: If a person has high connectivity on a topic, i.e., they are talking a lot to the rest of the corpus about it, then they may make a good deponent. If we did not receive their documents/emails, despite our request for all responsive documents, this may be a problem. If a high-ranker is not listed in the initial important-persons list, we may want to request their documents. If a high-ranker is not a person but a service, for example, we may want to ask our deponents about this service. All this, coupled with semantic network analysis, may provide a useful review scheme. And I will work on that...

Speaking of working on it, I know my discovery email scraper has shortcomings. It is an alpha version. The beta version will hopefully:
  • distinguish between sender and recipient, to provide directional ties between nodes,
  • associate names and email addresses, to make use of fields that list a sender or recipient by name only, and
  • export to GraphML, to save the copying-and-pasting into NodeXL.
However, if you know of software that already does what I described above, please let me know.
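As a rough idea of what the first and third wish-list items might look like, here is a sketch that counts directed sender-to-recipient ties and writes them out as GraphML using only the Python standard library. It assumes each message has already been parsed into a sender and a list of recipients, which is exactly the part the alpha version does not do yet. The addresses and function names are placeholders, and I have not tested how NodeXL treats the weight attribute on import.

```python
import xml.etree.ElementTree as ET
from collections import Counter

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def directed_ties(messages):
    """Count sender -> recipient ties. Each message is (sender, [recipients])."""
    ties = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:
                ties[(sender, recipient)] += 1
    return ties

def to_graphml(ties):
    """Serialize the directed, weighted ties as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare an integer edge attribute holding the message count.
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for node in sorted({address for pair in ties for address in pair}):
        ET.SubElement(graph, "node", id=node)
    for (source, target), weight in sorted(ties.items()):
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="weight").text = str(weight)
    return ET.tostring(root, encoding="unicode")

# Hypothetical parsed messages.
messages = [
    ("a@example.com", ["b@example.com", "c@example.com"]),
    ("a@example.com", ["b@example.com"]),
    ("b@example.com", ["a@example.com"]),
]
ties = directed_ties(messages)
graphml = to_graphml(ties)
print(graphml)
```

Because the ties are keyed by (sender, recipient), a message from a to b and a reply from b to a are counted as two separate edges, which is what gives the graph its direction.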

Saturday, January 22, 2011

EDRM Enron Data and Semantic Network Analysis, pt.2

In Part 1, I gave an overview of the general word use in subsets of the Enron corpus based on keywords. So, for example, we saw what words often occurred together in documents that also contained the word "profit." The idea here is to provide a clue into the responsiveness of a set of docs and the direction to take in search and review.

Going further, we can also create text strings based on a target start and end word, and what words often co-occurred in between. In the examples below, the target end word was the most central in each subset, i.e., it co-occurred with the most others.

profit -> loss -> or 
profit -> reports -> kiodex -> or 
profit -> reports -> kiodex -> enron -> or 
profit -> reports -> enron -> or 
profit -> reports -> kiodex -> will -> or 
profit -> reports -> kiodex -> shall -> or 
profit -> reports -> kiodex -> tool -> or 
profit -> reports -> kiodex -> site -> or 
profit -> reports -> enron -> kiodex -> or 
profit -> reports -> kiodex -> in -> or 
profit -> reports -> enron -> data -> or 
profit -> reports -> kiodex -> web -> or 
profit -> reports -> enron -> in -> or 
profit -> reports -> kiodex -> lite -> or 
profit -> reports -> kiodex -> such -> or 
profit -> reports -> enron -> shall -> or

Here we see that the Kiodex tool from our general network graph for the "profit" docs is also associated, through the word "reports," with Enron's discussion of profit. There, without having actually looked at the documents, we have learned with a certain degree of certainty one of the uses of the Kiodex tool.
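To give a feel for how strings like these can come out of a co-occurrence network, here is a small Python sketch. It is not a description of how WORDij builds its strings. It is one simple approach of my own choosing: enumerate the paths from the start word to the end word, and rank them by the strength of their weakest link. The pair counts in the sample are invented for illustration.

```python
def find_paths(cooccurrence, start, end, max_words=5):
    """List paths from start to end as strings, strongest weakest-link first.

    cooccurrence maps (word, word) pairs to counts; pairs are undirected.
    """
    neighbors = {}
    for (a, b), count in cooccurrence.items():
        neighbors.setdefault(a, {})[b] = count
        neighbors.setdefault(b, {})[a] = count
    paths = []

    def walk(path, weakest):
        current = path[-1]
        if current == end:
            paths.append((weakest, path))
            return
        if len(path) == max_words:
            return
        for word, count in neighbors.get(current, {}).items():
            if word not in path:  # never revisit a word
                walk(path + [word], min(weakest, count))

    walk([start], float("inf"))
    # Strongest weakest link first, then shorter paths, then alphabetical.
    paths.sort(key=lambda item: (-item[0], len(item[1]), item[1]))
    return [" -> ".join(path) for _, path in paths]

# Invented counts, for illustration only.
cooccurrence = {
    ("profit", "loss"): 12, ("loss", "or"): 9,
    ("profit", "reports"): 7, ("reports", "kiodex"): 6, ("kiodex", "or"): 8,
    ("reports", "enron"): 5, ("enron", "or"): 10,
}
strings = find_paths(cooccurrence, "profit", "or", max_words=4)
for line in strings:
    print(line)
```

Ranking by the weakest link favors chains in which every step is well attested, rather than chains with one very strong pair and one accidental one. Other scoring rules are just as defensible.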

offshore -> inc 
 offshore -> exploration -> inc 
 offshore -> exploration -> us -> energy -> inc 
 offshore -> exploration -> us -> gasphysical -> inc 
 offshore -> exploration -> us -> gas -> inc 
 offshore -> exploration -> us -> gasfinancial -> inc 
 offshore -> exploration -> us -> inc 
 offshore -> exploration -> company -> gasphysical -> inc 
 offshore -> exploration -> us -> resources -> inc 
 offshore -> exploration -> company -> gas -> inc 
 offshore -> exploration -> company -> us -> inc 
 offshore -> exploration -> company -> inc 
 offshore -> exploration -> company -> gasfinancial -> inc 
 offshore -> exploration -> us -> marketing -> inc 
 offshore -> exploration -> us -> power -> inc 
 offshore -> exploration -> gasphysical -> energy -> inc 

A lot of talk here about gas and offshore exploration.

corporate -> action -> or -> gas 
 corporate -> action -> or -> us -> gas 
 corporate -> action -> or -> company -> gas 
 corporate -> action -> or -> can -> gas 
 corporate -> power -> us -> gas 
 corporate -> power -> us -> natural -> gas 
 corporate -> power -> physical -> gas 
 corporate -> power -> physical -> us -> gas 
 corporate -> power -> physical -> natural -> gas 
 corporate -> power -> us -> physical -> gas 
 corporate -> power -> us -> financial -> gas 
 corporate -> power -> fwd -> us -> gas 
 corporate -> power -> firm -> us -> gas 
 corporate -> power -> us -> fin -> gas 
 corporate -> power -> fwd -> or -> gas 
 corporate -> power -> phy -> firm -> gas

Corporate action and power! That's what I like to hear a company talking about. I think I'll invest in them...

regulation -> or 
 regulation -> under -> agreement -> or 
 regulation -> order -> or 
 regulation -> under -> or 
 regulation -> under -> agreement -> in -> or 
 regulation -> under -> agreement -> shall -> or 
 regulation -> under -> agreement -> enron -> or 
 regulation -> under -> agreement -> such -> or 
 regulation -> under -> securities -> or 
 regulation -> under -> such -> or 
 regulation -> under -> section -> or 
 regulation -> under -> agreement -> party -> or 
 regulation -> under -> agreement -> kiodex -> or 
 regulation -> under -> agreement -> may -> or 
 regulation -> under -> such -> in -> or 
 regulation -> under -> securities -> such -> or 

Kiodex again, with regulation. We can't always tell which meaning of regulation is in use. Nevertheless, as head of a review team, I'd be assigning the Kiodex topic now.

Natural Gas
 gas -> in -> or 
 gas -> mmbtu -> oil -> products -> or 
 gas -> mmbtu -> products -> or 
 gas -> in -> kiodex -> or 
 gas -> in -> such -> or 
 gas -> in -> enron -> or 
 gas -> in -> kiodex -> enron -> or 
 gas -> in -> agreement -> or 
 gas -> financial -> united -> products -> or 
 gas -> in -> enron -> kiodex -> or 
 gas -> financial -> usa -> products -> or 
 gas -> financial -> or 
 gas -> physical -> usa -> products -> or 
 gas -> mmbtu -> products -> kiodex -> or 
 gas -> mmbtu -> usa -> products -> or 
 gas -> in -> kiodex -> tool -> or 

 electric -> company -> in 
 electric -> company -> interested -> in 
 electric -> company -> ca -> in 
 electric -> power -> in 
 electric -> company -> enron -> set -> in 
 electric -> company -> enron -> data -> in 
 electric -> company -> contact -> in 
 electric -> power -> from -> in 
 electric -> company -> enron -> contact -> in 
 electric -> company -> sent -> ca -> in 
 electric -> power -> company -> in 
 electric -> company -> energy -> in 
 electric -> company -> person -> contact -> in 
 electric -> power -> enron -> set -> in 
 electric -> company -> enron -> ca -> in 
 electric -> power -> enron -> data -> in 

Again, this would all be given to the review team before any documents are looked at, or any time is invested, and in that by itself the process has value. I am working to improve the other measures of its value. It would very much help to know more about this case so that I can know whether I'm right in calling a document hot or a search path relevant. So if you know where I can find more of that kind of information on these documents, please share.

There are several services and solutions offering this kind of analysis (examples: TextAnalyst, DiscoverText, and of course Clearwell's "topics"). But this is stuff I thought of around the time I was introduced to Clearwell v3, and I'm now getting around to blogging about it, so I'm taking the long way around here, too - via my educational history. Who knows, the Social Network Analysis sections to come may skip NodeXL and go back to the source: UCINET.

Tuesday, January 18, 2011

EDRM Enron Data and Semantic Network Analysis, pt.1

The Electronic Discovery Reference Model organization is doing a lot to shape standards and practices, and this work is important and impressive. But, if all they ever did was make available the Enron data, that'd be enough to make me love them.

More than a million emails and attachments, from a real company in recent history. A standardized, public domain dataset. Available for free... Most excellent.

On this blog I will be using the Enron data for a series of projects and experiments. But first, a few words on what I'm doing and why:

As an eDiscovery support and administration guy, I spend a lot of time taking in data, loading it up for review and supporting users through the review process. This is where my work here begins. Discussion of other facets of eDiscovery can be found here, here, here, here or via Google. Here, we will be looking for ways to assist document review, cutting through the irrelevant, zeroing in on the relevant, and finding the hot docs, all in ways that are, well, cool...

Semantic Network Analysis
EDRM-Enron-PST-001 contains seven PST files totaling just over 1.5GB. I processed these with LexisNexis LAW Prediscovery, and loaded the resulting 41,022 records, containing the doc text and some metadata, into Concordance.


I then ran a few very general search terms against the doc text, namely: Profit, Offshore, Corporate, Regulation, Natural Gas, and Electric, and tagged the documents accordingly. I often run early top-level search-and-tag jobs for reviewers to get a sense of the responsiveness of the docs, and to help divide the work among the review team. Today, I am looking to see how a semantic network analysis tool can help reveal more about document content - without any review! (Unreasonable?)

First, to pull out doc text by topic, I exported each set of tagged records from Concordance to delimited text files containing any email subject line and the text of the document. Here, I ran the queries for Natural Gas and Electric exclusively, i.e., I exported only the docs that contained Natural Gas, but NOT Electric, and vice-versa. Because these two elements make up the core of Enron's business, they're often talked about together, but I wanted to see whether corporate communication was different when handling the two separately.
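The exclusive split is simple to picture in code. The sketch below uses plain case-insensitive substring matching on a made-up set of three documents. Concordance's own search works on an index and is not reproduced here; the function name and document IDs are placeholders.

```python
def exclusive_hits(docs, include, exclude):
    """Return IDs of docs whose text contains include but not exclude."""
    include, exclude = include.lower(), exclude.lower()
    return [
        doc_id
        for doc_id, text in docs.items()
        if include in text.lower() and exclude not in text.lower()
    ]

# Made-up documents keyed by a placeholder ID.
docs = {
    "D1": "Natural gas prices rose.",
    "D2": "Electric and natural gas desks met.",
    "D3": "Electric load forecast.",
}
gas_only = exclusive_hits(docs, "natural gas", "electric")
electric_only = exclusive_hits(docs, "electric", "natural gas")
print(gas_only, electric_only)
```

Documents that mention both terms, like D2 here, fall out of both sets, which is the point of the exercise.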

To get an overview of the content of these docs -- what do they say? -- I wanted to try semantic network analysis, the measure of co-occurrence of words in text. For that, I used WORDij. In addition to importing and analyzing all sorts of text sources, and preparing input for several other further analysis tools, WORDij has nice visualization on board. Here are some examples.
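As an aside for anyone wondering what "co-occurrence" means in practice, here is a minimal Python sketch of counting word pairs that appear near each other in a text. The window size, the tokenizing rule and the undirected counting are my assumptions for the sake of illustration, not a description of WORDij's settings, and the sample sentence is made up.

```python
import re
from collections import Counter

def word_pairs(text, window=3):
    """Count unordered word pairs that occur within `window` words of each other."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = Counter()
    for i, word in enumerate(words):
        # Look ahead at the next `window` words only, so each pairing is counted once.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

# Made-up sample text.
pairs = word_pairs("Natural gas prices. Gas prices fell.", window=2)
for pair, count in pairs.most_common():
    print(pair, count)
```

The pair counts are the raw material for a network graph: each word is a node, and each pair with a high enough count becomes a link between two nodes.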






Natural Gas

On Day One of this analysis, nothing is jumping out at me from these results. Notice that both the Enron Web Site and Kiodex come up in Gas, but not Electric, conversations. According to Wikipedia, Enron was at this time developing an online trading system, and Kiodex is a risk management group.

The presence of the word "Headquarters" in the Offshore communication is possibly funny.

But that's all I've got for now. I will be taking a second swipe in the coming week. Going forward, I'd like your input, network analysts and eDiscovery gurus alike. Any observations, suggestions, thoughts, etc., would be very much appreciated in the comments. I'll keep going, and let you know what I come up with.