Keyword statistics

Chris · July 2, 2015

Hello all,

We have recently been working on a much-requested new feature: advanced keyword list statistics, automating what is otherwise a very time-consuming manual operation.

The Statistics component has been extended with a Keywords tab. In this tab the user can choose a keyword list, specify a number of calculation criteria and click Calculate. This will produce a table showing the keyword list and several statistics for every query in the list.

All controls are placed on the right-hand side (see attached image). The user can choose a previously uploaded keyword list or add one here. A second drop-down list controls which field is searched. By default all fields are searched, but you can choose to restrict searches to e.g. the document text, email headers, etc.

The four checkboxes determine what columns the table should contain:

Items adds columns indicating the total number of items that contain the keyword, what percentage of the total items this is, and the deduplicated number of items.
Hits counts the number of occurrences of the search term in the texts. For example, when a keyword produces a document that contains the keyword 3 times and another document that contains the keyword 5 times, this column will show 8.
Custodians adds a column for every custodian in the case. Each custodian column indicates how many of the matching items originate from that custodian.
Families adds two columns: "Families" and "Family items". A family is a set consisting of a top-level item (e.g. a mail in a PST file) and all its nested items (e.g. attachments, embedded items, archive entries). The Families column shows in how many families the keyword occurs. For example, if a mail and two of its attachments all contain the keyword, that counts as a single family. The Family Items column shows the total number of items that are contained in these families. This may (and usually will) include items that do not contain the keyword at all; they just belong to a family that has a hit in one of its other items. In cases where you are not directly exporting search results but rather their top-level parents (= the default when exporting to PST), this will tell you how much of the case you are conceptually exporting.
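To make the column definitions above concrete, here is a minimal sketch of how the counts relate to each other. It is not how Intella computes them: the toy data, the `keyword_stats` function and the plain substring matching are all assumptions made for illustration, and the deduplicated count is left out.

```python
from collections import defaultdict

# Hypothetical toy data: item id -> (family id, custodian, text).
ITEMS = {
    "mail-1": ("fam-1", "Alice", "budget review, see budget attached"),
    "att-1a": ("fam-1", "Alice", "budget spreadsheet"),
    "att-1b": ("fam-1", "Alice", "holiday photo"),
    "mail-2": ("fam-2", "Bob", "lunch on friday?"),
    "doc-3": ("fam-3", "Bob", "budget budget budget"),
}


def keyword_stats(keyword, items):
    """Compute the Items, Hits, Custodians and Families columns for one keyword.

    Matching is a naive lowercase substring test, which is an assumption
    of this sketch and not the real full-text search syntax.
    """
    matching = {i for i, (_, _, text) in items.items() if keyword in text.lower()}

    # Hits: total occurrences of the keyword across all matching items.
    hits = sum(items[i][2].lower().count(keyword) for i in matching)

    # Custodians: number of matching items per custodian.
    per_custodian = defaultdict(int)
    for i in matching:
        per_custodian[items[i][1]] += 1

    # Families: families with at least one matching item.
    families = {items[i][0] for i in matching}

    # Family items: every item in those families, matching or not.
    family_items = [i for i, (fam, _, _) in items.items() if fam in families]

    return {
        "items": len(matching),
        "percentage": 100.0 * len(matching) / len(items),
        "hits": hits,
        "custodians": dict(per_custodian),
        "families": len(families),
        "family_items": len(family_items),
    }


stats = keyword_stats("budget", ITEMS)
print(stats)
```

In this toy case three items match, but the two families they belong to contain four items, because the holiday photo attachment is pulled in by its parent mail.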

One can click on a row in the table and see the items from that result set in the Details view beneath the Statistics. The table can be exported to CSV format.

Although we call this functionality "keyword statistics", you can use the complete full-text search syntax here: wildcards, Boolean operators, phrase queries etc. are all available.

This functionality is brand new and is still being fine-tuned. Let us know if you can think of any improvements! To give some ideas: would you expect to see percentages and deduplicated amounts for custodians and families? Would you even expect to see percentages for the "Deduplicated" column? Should we include other information types, e.g. related to tagging and reviewing?

PF1 · July 2, 2015

This looks interesting Christiaan.

In the graphic, you show three names (green columns) and refer to them as "custodians." Where is this "custodian" information being pulled from? I don't recall being able to assign a custodian name to indexed items (although that would be fantastic!).

admin · July 2, 2015

Hi PF1,

Version 1.9 will have the ability to assign data to a custodian. Happy to send you an early version so you can test if you like?

PF1 · July 2, 2015

I appreciate the offer, but don't have much "test" time lately! I will wait for the official release, thanks.

AdamS · July 3, 2015

Looks great, also liking the idea of custodians being introduced.

LitEDD · July 6, 2015

Will reporting also generate a count for “unique” hits per term? In other words, the total number of documents that hit on a particular term and did not receive hits from any other term.

Chris · July 7, 2015

Hello,

Not at the moment, but I find this a very interesting addition! I will see what we can do.

I am wondering how we should label this column. It's not that easy to come up with something that is intuitively clear and does not conflict with the current vocabulary: "hits" is already being used for counting the actual occurrences in the text, "unique" could also be interpreted as related to "duplicates", etc. Perhaps "Exclusive items"?

LitEDD · July 7, 2015

"Exclusive items" or "Exclusive Term Items" works. This level of reporting becomes key when negotiating search terms with opposing counsel during the "Meet and Confer" phase of a litigation.

Hanzelmans · January 9, 2017

If I am right, it is only possible to use this very nice feature with predefined keyword lists or saved searches.

It would be great to see all keywords that were used within the investigation.

If this is not an option, it would be nice if it were possible to hit some button to list all the used keywords (collected search history). That list could then be saved as a keyword list to import within the Keywords tab.

Regards,

Hans

Alex · January 10, 2017

Hans,

It is a good idea to have an option to export the search history to a KW list file. We will consider this in feature planning for future versions.

In the current version, it's possible to extract a list of all used keywords manually from the "case.prefs" file located in the "prefs" subfolder of the case. The entry to search for in this file is "SearchHistory". The keywords in the list are separated by special "_." character sequences, which need to be removed and replaced by line breaks.
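The replacement step can be scripted. Below is a minimal sketch that assumes you have already copied the value of the "SearchHistory" entry out of the file as a single string. The function name and the sample value are made up for illustration, and nothing is assumed about the rest of the "case.prefs" layout.

```python
def history_to_keyword_list(search_history_value: str) -> str:
    """Turn a "_."-separated search history string into one query per line.

    The input is assumed to be the raw value of the SearchHistory entry,
    copied by hand from case.prefs.
    """
    # Split on the "_." separator and drop surrounding whitespace.
    terms = [term.strip() for term in search_history_value.split("_.")]
    # Skip empty pieces, e.g. from a leading or trailing separator.
    return "\n".join(term for term in terms if term)


# Hypothetical sample value, not taken from a real case.
sample = 'budget_.fraud AND invoice_."wire transfer"'
keyword_list = history_to_keyword_list(sample)
print(keyword_list)
```

The resulting text can be saved to a plain text file and loaded as a keyword list.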

Thanks!

todd.cooper · September 26, 2018

I'm following up on this topic as I still don't see the ability to calculate "unique" hits. Is this feature still on the roadmap?

As mentioned above, this is very useful in evaluating the effectiveness of a given search term. Generally, when we find a large number of documents are hitting on only one term in the list, we determine that particular term is not effective at returning potentially responsive documents.

For what it's worth, I know that several other eDiscovery tools use the phrase "Unique Hits" in this situation.

jon.pearse · September 26, 2018

Hi Todd,

When you say 'unique hits', do you mean unique documents that have hits? E.g. if a document contains 5 different keywords from the KW list, the document is counted only once.

jasoncovey · November 7, 2018

I think what Todd is likely referring to is a Relativity-centric concept rooted in the so-called search term report (STR), which calculates hits on search terms differently than Intella. I know I have communicated about this issue in the past via a support ticket, and created such a report manually in Intella, which is at least possible with some additional effort involving keyword lists, exclusion of all other items in the list, and recording the results manually.

What the STR does is communicate the number of documents identified by a particular search term, and no other search term in the list.

It is specifically defined as this:

Unique hits - counts the number of documents in the searchable set returned by only that particular term. If more than one term returns a particular document, that document is not counted as a unique hit. Unique hits reflect the total number of documents returned by a particular term and only that particular term.
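As a rough illustration of that definition, here is a minimal sketch that computes unique hits from per-term result sets. The `unique_hits` function and the sample document ids are hypothetical. It only assumes that each term's matching document ids are available as a set, for example collected by hand from the results of each query.

```python
from collections import Counter


def unique_hits(results):
    """Count, per term, the documents returned by that term and no other.

    results maps each search term to the set of document ids it returned.
    """
    # Number of terms that returned each document.
    terms_per_doc = Counter(doc for docs in results.values() for doc in docs)
    return {
        term: sum(1 for doc in docs if terms_per_doc[doc] == 1)
        for term, docs in results.items()
    }


# Hypothetical result sets for three terms.
RESULTS = {
    "budget": {"d1", "d2", "d3"},
    "invoice": {"d2", "d4"},
    "fraud": {"d2", "d3"},
}

unique = unique_hits(RESULTS)
print(unique)
```

Here "d2" is returned by all three terms, so it is not a unique hit for any of them, and "fraud" ends up with zero unique hits even though it returns two documents.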

I have been aware of this issue for years, and although I strongly disagree regarding the value of such data as presented in the STR (and have written about it extensively to my users), the fact is that, in ediscovery, groupthink is extremely common. The effect is that a kind of "requirement" is created that all practitioners must either use the exact same tools, or that all tools are required to function exactly the same (which I find to be in stark contrast to the forensics world).

I actually found myself in a situation where, in attempting to meet and confer with an opposing "expert," they were literally incapable of interpreting the keyword search results report we had provided because it was NOT in the form of an STR. In fact, they demanded we provide one, to such an extent that we decided the most expedient course of action was just to create a new column that provided those numbers (whether they provided any further insight or not).

So in responding to Jon's question, I believe the answer is NO. In such a case, within the paradigm of the STR, a document that contains 5 different keywords from the KW list would actually be counted ZERO times. Again, what the STR does is communicate the number of documents identified by a particular search term, and no other search term in the list.

I think it's a misleading approach with limited value, and it is a way to communicate information outside of software. Further, and perhaps this is why it actually exists, it sidesteps the issue of hit totals in columns that add up to more documents than the total number identified by all search criteria. In other words, it doesn't address totals for documents that contain more than one keyword. This is in contrast to the reports Intella creates, where I am constantly warning users not to start totaling the columns to arrive at document counts, as real-world search results almost inevitably contain huge numbers of hits for multiple terms per document. Instead, I point them to both a total and unique count, which I manually add to the end of an Intella keyword hit report, and advise them that full document families will increase this number if we proceed to a review set based on these criteria.
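To show why totaling the columns overstates the document count, here is a small sketch with made-up result sets. The `summarize` function is hypothetical, and "distinct documents" is my reading of the unique count mentioned above, meaning the number of different documents returned by at least one term.

```python
def summarize(results):
    """Return (sum of per-term item counts, number of distinct documents)."""
    # Adding up the per-term column counts each document once per matching term.
    column_sum = sum(len(docs) for docs in results.values())
    # The union of all result sets counts each document only once.
    distinct_docs = len(set().union(*results.values()))
    return column_sum, distinct_docs


# Hypothetical result sets: document "d2" matches all three terms.
RESULTS = {
    "budget": {"d1", "d2", "d3"},
    "invoice": {"d2", "d4"},
    "fraud": {"d2", "d3"},
}

column_sum, distinct_docs = summarize(RESULTS)
print(column_sum, distinct_docs)
```

The column sum is 7 while only 4 distinct documents were returned, and that gap grows quickly when many terms hit the same documents.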

Hopefully that clarified the issue and provided a little more context to the situation!

Jason
