
Keyword statistics


Chris


Hello all,
 
We have recently been working on a much-requested new feature: advanced keyword list statistics, automating what is otherwise a very time-consuming manual operation.
 
The Statistics component has been extended with a Keywords tab. In this tab the user can choose a keyword list, specify a number of calculation criteria and click Calculate. This will produce a table showing the keyword list and several statistics for every query in the list.
 
All controls are placed on the right hand side (see attached image). The user can choose a previously uploaded keyword list or add one here. A second drop-down list controls what field is searched. By default all fields are searched, but you can choose to restrict searches to e.g. the document text, email headers, etc.
 
The four checkboxes determine what columns the table should contain:
  • Items adds columns indicating the total number of items that contain the keyword, what percentage of all items this is, and the deduplicated number of items.
     
  • Hits counts the number of occurrences of the search term in the texts. For example, when a keyword produces a document that contains the keyword 3 times and another document that contains the keyword 5 times, this column will show 8.
     
  • Custodians adds a column for every custodian in the case. Each custodian column indicates how many of the matching items originate from that custodian.
     
  • Families adds two columns: "Families" and "Family items". A family is a set consisting of a top-level item (e.g. a mail in a PST file) and all its nested items (e.g. attachments, embedded items, archive entries). The Families column shows in how many families the keyword occurs. For example, if a mail and two of its attachments all contain the keyword, that counts as a single family. The Family Items column shows the total number of items that are contained in these families. This may (and usually will) include items that do not contain the keyword at all; they just belong to a family that has a hit in one of its other items. In cases where you are not directly exporting search results but rather their top-level parents (= the default when exporting to PST), this will tell you how much of the case you are conceptually exporting.
One can click on a row in the table and see the items from that result set in the Details view beneath the Statistics. The table can be exported to CSV format.
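To make the column definitions above concrete, here is a minimal Python sketch of how the Items, Hits, Custodians, Families and Family items figures relate to each other. This is not how Intella computes them internally; the `Item` record, its field names and the sample data are all made up for illustration, and deduplication is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    family_id: str    # hypothetical field: id of the top-level parent item
    custodian: str
    occurrences: int  # how often the keyword occurs in this item's text


def keyword_stats(items):
    """Compute the per-keyword columns for one keyword over all case items."""
    matching = [i for i in items if i.occurrences > 0]
    hit_families = {i.family_id for i in matching}

    # One count per custodian: how many matching items come from them.
    per_custodian = {}
    for i in matching:
        per_custodian[i.custodian] = per_custodian.get(i.custodian, 0) + 1

    return {
        "items": len(matching),
        "items_pct": 100.0 * len(matching) / len(items) if items else 0.0,
        "hits": sum(i.occurrences for i in matching),
        "custodians": per_custodian,
        "families": len(hit_families),
        # Every item in a family with a hit counts, keyword or not.
        "family_items": sum(1 for i in items if i.family_id in hit_families),
    }


# Invented sample: one mail with two attachments, plus an unrelated mail.
case = [
    Item("mail-1", "mail-1", "Alice", 3),
    Item("att-1a", "mail-1", "Alice", 5),
    Item("att-1b", "mail-1", "Alice", 0),
    Item("mail-2", "mail-2", "Bob", 0),
]
stats = keyword_stats(case)
print(stats)
```

With this sample, two items match (3 + 5 = 8 hits), they belong to a single family, and that family has three items, one of which does not contain the keyword at all.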
 
Although we call this functionality "keyword statistics", you can use the complete full-text search syntax here: wildcards, Boolean operators, phrase queries etc. are all available.
 
This functionality is brand new and is still being fine-tuned. Let us know if you can think of any improvements! To give some ideas: would you expect to see percentages and deduplicated counts for custodians and families? Would you even expect to see percentages for the "Deduplicated" column? Should we include other information types, e.g. related to tagging and reviewing?
 
[Attached image: keyword statistics.png]

 


This looks interesting Christiaan.

 

In the graphic, you show three names (green columns) and refer to them as "custodians." Where is this "custodian" information being pulled from? I don't recall being able to assign a custodian name to indexed items (although that would be fantastic!).


Hello,

 

Not at the moment, but I find this a very interesting addition! I will see what we can do.

 

I am wondering how we should label this column. It's not that easy to come up with something that is intuitively clear and does not conflict with the current vocabulary: "hits" is already being used for counting the actual occurrences in the text, "unique" could also be interpreted as related to "duplicates", etc. Perhaps "Exclusive items"?


  • 1 year later...

If I am right, it is only possible to use this very nice feature with predefined keyword lists or saved searches.

It would be great to see all keywords that were used within the investigation.

 

If this is not an option, it would be nice to have a button that lists all the keywords used (the collected search history). That list could then be saved as a keyword list and imported in the Keywords tab.

 

Regards,

Hans


Hans,

 

Having an option to export the search history to a keyword list file is a good idea. We will consider this in feature planning for future versions.

 

In the current version, it's possible to extract a list of all used keywords manually from the "case.prefs" file located in the "prefs" subfolder of the case. The entry to search for in this file is "SearchHistory". The keywords in that entry are separated by the special character sequence "_.", which needs to be replaced by line breaks.
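The replace step described above can be scripted. Below is a minimal Python sketch that turns the value of the "SearchHistory" entry into one keyword per line. It assumes you have already copied that value out of "case.prefs" by hand, since the exact layout of the file is not described here; the function name and the sample string are invented for illustration.

```python
SEPARATOR = "_."  # separator sequence between queries, as described above


def history_to_keyword_list(raw_value: str) -> str:
    """Turn a raw SearchHistory value into one query per line."""
    terms = [term.strip() for term in raw_value.split(SEPARATOR)]
    # Drop empty pieces, e.g. from a leading or trailing separator.
    return "\n".join(term for term in terms if term)


# Invented sample value; a real entry will contain your own queries.
sample = 'invoice_."project x"_.fraud*'
print(history_to_keyword_list(sample))
```

The output can be saved to a text file and loaded as a keyword list. Note that a query which itself contains "_." would be split in two by this sketch.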

 

Thanks!


  • 1 year later...

I'm following up on this topic as I still don't see the ability to calculate "unique" hits. Is this feature still on the roadmap?

As mentioned above, this is very useful in evaluating the effectiveness of a given search term. Generally, when we find a large number of documents are hitting on only one term in the list, we determine that particular term is not effective at returning potentially responsive documents.

For what it's worth, I know that several other eDiscovery tools use the phrase "Unique Hits" in this situation.


  • 1 month later...

I think what Todd is likely referring to is a Relativity-centric concept rooted in the so-called search term report (STR), which calculates hits on search terms differently than Intella.  I know I have communicated about this issue in the past via a support ticket, and created such a report manually in Intella, which is at least possible with some additional effort involving keyword lists, exclusion of all other items in the list, and recording the results manually. 

What the STR does is communicate the number of documents identified by a particular search term, and no other search term in the list.

It is specifically defined as this: 

  • Unique hits - counts the number of documents in the searchable set returned by only that particular term. If more than one term returns a particular document, that document is not counted as a unique hit. Unique hits reflect the total number of documents returned by a particular term and only that particular term.
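As a rough illustration of that definition, the following Python sketch computes unique hits from a mapping of search term to the set of documents it returns. The function name and the sample data are made up; this is neither Relativity's nor Intella's code, just the set arithmetic the definition implies.

```python
def unique_hits(results):
    """results: dict mapping each search term to the set of document ids
    it returns. Returns a dict mapping each term to its unique hit count."""
    unique = {}
    for term, docs in results.items():
        # Collect every document returned by any *other* term in the list.
        others = set()
        for other_term, other_docs in results.items():
            if other_term != term:
                others |= other_docs
        # A unique hit is a document returned by this term and no other.
        unique[term] = len(docs - others)
    return unique


# Invented sample: d3 is returned by all three terms.
results = {
    "invoice": {"d1", "d2", "d3"},
    "fraud": {"d3", "d4"},
    "offshore": {"d3"},
}
counts = unique_hits(results)
print(counts)
```

Here "invoice" has 2 unique hits, "fraud" has 1 and "offshore" has 0. Document d3, which matches all three terms, is not counted as a unique hit for any of them.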

I have been aware of this issue for years, and although I strongly disagree regarding the value of such data as presented in the STR (and have written about it extensively to my users), the fact is that, in ediscovery, groupthink is extremely common. The effect is that a kind of "requirement" is created: either all practitioners must use the exact same tools, or all tools must function exactly the same (which I find to be in stark contrast to the forensics world).

I actually found myself in a situation where, in attempting to meet and confer with an opposing "expert," they were literally incapable of interpreting the keyword search results report we had provided because it was NOT in the form of an STR. In fact, they demanded we provide one, to such an extent that we decided the most expedient course of action was just to create a new column that provided those numbers (whether they provided any further insight or not).

So in responding to Jon's question, I believe the answer is NO.  In such a case, within the paradigm of the STR, a document that contains 5 different keywords from the KW list would actually be counted ZERO times.  Again, what the STR does is communicate the number of documents identified by a particular search term, and no other search term in the list.

I think it's a misleading approach with limited value, and is a way to communicate information outside of software. Further, and perhaps why it actually exists, is that it sidesteps the issue of hit totals in columns that add up to more documents than the total number identified by all search criteria. In other words, it doesn't address totals for documents that contain more than one keyword. This is in contrast to the reports Intella creates, where I am constantly warning users not to start totaling the columns to arrive at document counts, as real world search results almost inevitably contain huge numbers of hits for multiple terms per document. Instead, I point them to both a total and unique count, which I manually add to the end of an Intella keyword hit report, and advise them that full document families will increase this number if we proceed to a review set based on these criteria.
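To show why totaling the columns overstates the document count, here is a small Python sketch that compares the sum of the per-term counts with the number of distinct documents across all terms. The names and sample data are invented, and reading the "total" as the column sum and the "unique count" as the distinct-document count is my assumption.

```python
def summarize(results):
    """results: dict mapping each search term to the set of document ids
    it returns."""
    # Adding up the per-term column counts a document once per matching term.
    column_sum = sum(len(docs) for docs in results.values())
    # The union counts each document once, however many terms it matches.
    distinct_documents = len(set().union(*results.values()))
    return {"column_sum": column_sum, "distinct_documents": distinct_documents}


# Invented sample: d3 matches all three terms.
results = {
    "invoice": {"d1", "d2", "d3"},
    "fraud": {"d3", "d4"},
    "offshore": {"d3"},
}
summary = summarize(results)
print(summary)
```

For this sample the column sum is 6, while only 4 distinct documents were found, because d3 is counted three times in the sum. Neither figure includes family members that do not contain a keyword themselves.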

Hopefully that clarified the issue and provided a little more context to the situation!

 

Jason     

    
