
Additional Deduping


webel


Currently, Intella dedupes based on message hash.  We typically work with PSTs and OSTs received from various custodians in a typical client organization.  We run searches and export the Intella unique results to PST.  When we run those Intella PSTs through another application with different dedupe options, we typically find a duplicate rate of 25-50+ percent.  Most of our clients are attorneys and they are interested in reviewing a particular message only once.  We have found with other apps that the most effective way to get to actual unique content is to compare the "to", "from", "cc", "bcc", "subject", and "date/time (strip milliseconds)" fields.
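
For illustration, the field-based comparison described above could be sketched in Python roughly as follows. This is not how Intella or any other product computes its hash. The `dedupe_key` name, the normalization choices (lowercased and sorted addresses, collapsed whitespace in the subject, UTC conversion) and the use of MD5 are all assumptions made for the example.

```python
import hashlib
from datetime import timezone
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

ADDRESS_FIELDS = ("from", "to", "cc", "bcc")  # address headers that feed the key

def dedupe_key(raw_message: str) -> str:
    """Hypothetical key: hash of From/To/Cc/Bcc/Subject/Date, date cut to whole seconds."""
    msg = message_from_string(raw_message)
    parts = []
    for name in ADDRESS_FIELDS:
        # Keep only the address part, lowercased and sorted, so display names
        # and recipient order do not change the key.
        addrs = sorted(a.lower() for _, a in getaddresses(msg.get_all(name, [])))
        parts.append(",".join(addrs))
    # Collapse runs of whitespace in the subject.
    parts.append(" ".join(str(msg.get("subject") or "").split()))
    date_header = msg.get("date")
    if date_header:
        sent = parsedate_to_datetime(date_header).replace(microsecond=0)
        if sent.tzinfo is not None:
            # Express the same instant in UTC so the zone notation does not matter.
            sent = sent.astimezone(timezone.utc)
        parts.append(sent.isoformat())
    else:
        parts.append("")
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Note that a key like this ignores the body and attachments entirely, so two messages with identical headers but different content would be treated as duplicates.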

 

Support is looking to add this capability in 1.8.4, and I wanted to ask the community whether there are any other deduping options you would like to see.

 

Thank you.


I am very familiar with this situation and also work for attorneys.  They tend to get extremely frustrated by the difference between what is a duplicate from a forensic perspective and an email that obviously contains the exact same text.  I would love to see Intella calculate a hash value for the body text only, and/or introduce the capability to perform email threading in a manner similar to Equivio.  The reduction in review time would be extremely significant.  I don't know how it might integrate with the existing Smart Search functionality and/or the new paragraph analysis features, but closer integration among those features would seem like a logical next step.
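
As a rough illustration of a body-text-only hash, here is a minimal Python sketch. The `body_hash` function and its normalization rules (collapse whitespace, ignore case) are assumptions for the example, not an existing Intella feature.

```python
import hashlib
import re

def body_hash(body: str) -> str:
    """Hypothetical body-only hash: MD5 of the text after whitespace and case normalization."""
    # Collapse every run of whitespace (including CRLF vs LF differences) to one space.
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Two copies of the same text that differ only in line endings or spacing would then hash identically, while any change in wording still produces a different value. How aggressive the normalization should be is a judgment call.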

 

I wonder if the dialogue for a feature like this might include something that looks like the current Smart Search functionality, which could accomplish something like the OP describes (and with which I have achieved the most accurate results, as well), but with user-selected options, perhaps including a body text field or hash value created during indexing.

 

This is a really interesting subject to me because, although we can now handle almost any task with Intella in-house, we are beholden to vendors for global threading.  The Show Conversation feature is great on an individual-item basis in an investigation context, but is just not practical for a larger-scale review.  If, however, this functionality were included such that it could be leveraged in Connect, giving reviewers the ability to review related email in the context in which it was created, it would be a gigantic leap forward.


Hi,

 

I would quite like to see Smart Search (which to me is the near-dup detection) be more automated during processing.  At the moment, Connect doesn't support Smart Search.  If it did, and you had a large population of emails to review, you would effectively be asking the end users not only to review each email, but also to manually review any possible near dups, which can seriously slow down your review.

 

It would be better if Intella did near-dup detection automatically, perhaps with a near-dup hash, so that those who process the data can exclude near dups from the population that gets reviewed, and attorneys don't tell you that they have already seen this document and that document.
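
To make the idea concrete, here is a minimal Python sketch of automatic near-dup grouping. It is not Smart Search and not a true "near dup hash": it compares overlapping three-word shingles with Jaccard similarity, and the function names and the 0.8 threshold are arbitrary assumptions.

```python
def shingles(text: str, size: int = 3) -> set:
    """Set of overlapping word n-grams for a document."""
    words = text.lower().split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a: set, b: set) -> float:
    """Share of shingles two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pick_representatives(docs: dict, threshold: float = 0.8) -> dict:
    """Map each document id to the id of the representative it would be reviewed under."""
    reps = {}        # representative id -> its shingle set
    assignment = {}  # document id -> representative id
    for doc_id, text in docs.items():
        sh = shingles(text)
        for rep_id, rep_sh in reps.items():
            if jaccard(sh, rep_sh) >= threshold:
                assignment[doc_id] = rep_id  # near dup of an earlier document
                break
        else:
            reps[doc_id] = sh                # first of its kind: becomes a representative
            assignment[doc_id] = doc_id
    return assignment
```

Only documents that map to themselves would go to review. The comparison against every representative is quadratic in the worst case, so a large case would need something smarter (for example MinHash with locality-sensitive hashing); this sketch only shows the principle.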

 

Regards


Hello all,
 
Thank you all for your insights! Some ideas I am taking from this:
  • It is certainly possible to make the message hash algorithm configurable, e.g. as a list of "ingredients" that you can choose from: From/Sender/To/Cc/Bcc/Date/Subject/body/attachments. This would probably be our first step towards better deduplication. We can allow for optionally reducing the precision of the date. It's interesting to see that some want to leave the body out while others want to base it only on the body - I have heard both variants before. Someone also once suggested to me to only use the Message-ID header for the hashing.
     
  • I like the idea of Smart Search using the new paragraph search functionality. Below the surface, paragraph analysis essentially calculates a hash for each paragraph. These go into a database, enabling quick searching for other occurrences of that paragraph in the case. Smart Search could look at documents with the same set of paragraphs (= same hashes), ordered by how many paragraphs are in common, or even apply a tf.idf-like weighting mechanism: a paragraph that occurs less often (e.g. the core topic discussed in an email thread) is more likely to be important than paragraphs occurring more often (e.g. email signatures).
     
  • Using Smart Search and/or Show Conversation to partition the case into logical subsets could indeed reduce the workload. This is not entirely trivial. I believe that Show Conversation essentially turns the case into buckets (each email is part of at most one conversation, or is part of the "others" category), but Smart Search produces many overlapping sets: a smart search on item A may produce item B, item B may produce item C, but item A does not necessarily produce C as well. This makes using Smart Search for partitioning the case upfront tricky - but not impossible.
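
The paragraph-hash idea with idf-style weighting from the second point could be sketched like this in Python. It is an illustration only, not Intella's paragraph analysis: splitting on blank lines, the MD5 hash, the in-memory index and the `log(N / df)` weight are all assumptions.

```python
import hashlib
import math
from collections import defaultdict

def paragraph_hashes(text: str) -> set:
    """Hash each non-empty, blank-line-separated paragraph after normalization."""
    hashes = set()
    for para in text.split("\n\n"):
        norm = " ".join(para.split()).lower()
        if norm:
            hashes.add(hashlib.md5(norm.encode("utf-8")).hexdigest())
    return hashes

def build_index(docs: dict) -> dict:
    """Map paragraph hash -> set of ids of the documents containing that paragraph."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for h in paragraph_hashes(text):
            index[h].add(doc_id)
    return index

def related(doc_id: str, docs: dict, index: dict) -> list:
    """Rank other documents by the idf-weighted sum of paragraphs shared with doc_id."""
    n = len(docs)
    scores = defaultdict(float)
    for h in paragraph_hashes(docs[doc_id]):
        holders = index[h]
        weight = math.log(n / len(holders))  # rare paragraphs weigh more than common ones
        for other in holders:
            if other != doc_id:
                scores[other] += weight
    # Highest score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With this weighting, a signature block shared by most emails contributes little, while a paragraph found in only two messages dominates the ranking.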

 "date/time (strip milliseconds)". 

 

Can you elaborate on what stripping the milliseconds achieves? I can understand comparing date attributes and allowing for some minor variances, but the Date header is set by the sending side and should be transmitted as-is to the receiving side. I.e., we're not comparing Date and Received headers here. Also, all mail formats that we support store the SMTP headers in full; the Date is not recreated from some database-internal value. Therefore, a different Date header looks to me like a strong indication that something or someone has altered the message during transmission or in storage.
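
For reference, the Date header syntax in RFC 5322 only carries whole seconds, so a value parsed from the header itself has no sub-second part. A small Python check (the header value is made up for the example):

```python
from email.utils import parsedate_to_datetime

# A Date header as it would appear in the stored SMTP headers (example value).
sent = parsedate_to_datetime("Tue, 14 Feb 2012 09:30:05 -0500")
print(sent.isoformat())   # 2012-02-14T09:30:05-05:00
print(sent.microsecond)   # 0
```

Any millisecond differences would therefore have to come from some other timestamp field, not from the Date header.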


  • 6 months later...

So was the de-duplicating feature discussed here added in 1.8.4? If so, how does it work? 

 

When I use the regular de-duplication option in Intella, it apparently still de-dups on hash. I have two identical emails, one in the OST and one in the PST container, both of which came off the same laptop. The message IDs are the same for both messages but the hashes are different. 

 

The messages look identical to the naked eye. If in fact they are identical, what causes different MD5 hashes? Is the file path part of the hash? If so, what's the rationale for including the file path in calculating the hash? 

 

Please let me know. Thanks. 

 

Best regards, Phil 


  • 1 year later...

 

Hello all,
 
Thank you all for your insights! Some ideas I am taking from this:
  • It is certainly possible to make the message hash algorithm configurable, e.g. as a list of "ingredients" that you can choose from: From/Sender/To/Cc/Bcc/Date/Subject/body/attachments. This would probably be our first step towards better deduplication. We can allow for optionally reducing the precision of the date. It's interesting to see that some want to leave the body out while others want to base it only on the body - I have heard both variants before. Someone also once suggested to me to only use the Message-ID header for the hashing.

 

Coming back to this topic a year and some change later, has anything been added (or is anything planned) similar to what is mentioned here? Have there been any changes to deduplication in general between the 1.8 versions and the 1.9 versions (particularly 1.9.2 and/or 1.9.3)? A configurable hashing formula to fit specific needs for deduplication would be phenomenal on many fronts, allowing the user to fine-tune how deduping is performed.
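
To show what such a configurable formula might look like, here is a hypothetical Python sketch built around the "ingredients" list quoted above. The `message_hash` function, the dictionary item format and the `date_precision` option are assumptions for illustration and do not describe any Intella version.

```python
import hashlib
from datetime import datetime

INGREDIENTS = ("from", "sender", "to", "cc", "bcc", "date", "subject", "body", "attachments")
DATE_FORMATS = {"second": "%Y-%m-%dT%H:%M:%S", "minute": "%Y-%m-%dT%H:%M", "day": "%Y-%m-%d"}

def message_hash(item: dict,
                 ingredients=("from", "to", "cc", "bcc", "date", "subject"),
                 date_precision: str = "second") -> str:
    """Hypothetical configurable hash over the selected fields of one item."""
    unknown = set(ingredients) - set(INGREDIENTS)
    if unknown:
        raise ValueError(f"unknown ingredients: {sorted(unknown)}")
    h = hashlib.md5()
    for name in INGREDIENTS:  # fixed order, regardless of how the caller listed them
        if name not in ingredients:
            continue
        value = item.get(name, "")
        if name == "date" and isinstance(value, datetime):
            # Reduce the date to the requested precision before hashing.
            value = value.strftime(DATE_FORMATS[date_precision])
        elif isinstance(value, (list, tuple)):
            # Multi-valued fields: order and case should not matter.
            value = ",".join(sorted(str(v).lower() for v in value))
        else:
            value = " ".join(str(value).split()).lower()
        h.update(f"{name}={value}\x1e".encode("utf-8"))
    return h.hexdigest()
```

Leaving the body out, hashing only the body, or coarsening the date then becomes a matter of passing different `ingredients` and `date_precision` values rather than changing the algorithm.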

 

I apologize for bringing an old topic back to life, but this problem apparently still persists for my company (same as OP).


I would certainly vote for a configurable message hash algorithm.

 

We've just conducted a test this week: we exported all of the emails from an Intella case (455,000) and ran a Python script which essentially read the first file and then looked at the other raw email files where the subject was the same; if it was, the script then checked whether the sender and sent time matched as well. Any 'duplicates' had their MD5 recorded in a text file that we took back into Intella and used as an Exclude list. It took 24 hours to run but will have saved the lawyers days of reviewer time.
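
A minimal sketch of that kind of script, for anyone who wants to try the same approach. This is not the actual script described above. The `find_duplicates` name is made up, and hashing the raw exported bytes with MD5 is an assumption: check that it matches the MD5 value Intella expects in an exclude list before relying on it.

```python
import hashlib
from email import message_from_bytes
from email.utils import parseaddr

def find_duplicates(raw_messages):
    """Return MD5s of messages whose subject, sender and sent time repeat an earlier one.

    raw_messages: iterable of bytes objects, one exported email each.
    """
    seen = set()
    duplicate_md5s = []
    for raw in raw_messages:
        msg = message_from_bytes(raw)
        key = (
            " ".join(str(msg.get("subject") or "").split()).lower(),  # normalized subject
            parseaddr(str(msg.get("from") or ""))[1].lower(),         # sender address only
            str(msg.get("date") or "").strip(),                       # Date header as written
        )
        if key in seen:
            # Assumption: the MD5 of the raw bytes identifies the item to exclude.
            duplicate_md5s.append(hashlib.md5(raw).hexdigest())
        else:
            seen.add(key)  # first occurrence is kept for review
    return duplicate_md5s
```

The returned list can be written to a text file, one MD5 per line. Looking each key up in a set means every message is read once instead of being compared against all the others, which may help with run time on a case of this size.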
