My apologies if this has already been addressed, but I could not find it through search.  I am dealing with MST Exchange emails.  The emails contain a mix of standard SMTP email addresses as well as Exchange X.400-style addresses.  De-duplication becomes a big problem here.  Emails that are otherwise identical have different message hashes when one email has the SMTP address and another email has an X.400-style address.  Is there any way currently to de-duplicate these?

I know that as of the latest version of Intella, you can configure Message Hash to ignore certain attributes (including headers and recipients).  This should work, but I'd really like to have more fine-tuned control than this.  Ideally, it would be amazing if Intella could intelligently recognize that two emails are identical even if they use a mix of SMTP and X.400-style addresses.  From my experience, this issue is very common in dealing with Exchange exports.

Any thoughts would be greatly appreciated.

Thank you!


As an update, here is the technique I used to deal with this.  I don't think this is quite ideal, but it seems to have worked reasonably well in this case:

For all Top-Level Parent emails, I exported the following fields to a CSV file:

  • DocID
  • Subject
  • Sent
  • Attachments
  • From
  • To
  • CC
  • BCC
  • Conversation Index

Using Excel, I identified all emails that contained the exact same values for ALL of the fields in red above.  Using a spot check, I confirmed that the resulting documents indeed appeared to all be duplicates.
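
For anyone who would rather script the comparison than do it in Excel, here is a minimal sketch of the same idea in Python. It assumes the CSV header names match the exported field names listed above. The function name `find_duplicate_groups` and the choice of key columns in the usage note are illustrative assumptions only, not something Intella provides.

```python
import csv
from collections import defaultdict


def find_duplicate_groups(csv_file, key_fields, id_field="DocID"):
    """Group rows whose values match exactly on every column in key_fields.

    csv_file   -- an open text file (or any iterable of CSV lines) with a header row
    key_fields -- column names that must all be identical (assumed to exist in the CSV)
    id_field   -- column that identifies each email (assumed to be "DocID")

    Returns a list of DocID lists, one per group of two or more matching emails.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # The key is the exact text of the chosen columns; no normalization is applied.
        key = tuple(row[field] for field in key_fields)
        groups[key].append(row[id_field])
    # Keep only keys shared by more than one email.
    return [ids for ids in groups.values() if len(ids) > 1]
```

A call might look like `find_duplicate_groups(open("export.csv", newline="", encoding="utf-8"), ["Subject", "Sent", "Attachments", "Conversation Index"])`, where the file name and the column subset are placeholders. As with the Excel approach, the resulting groups are only candidates and still deserve a spot check.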

Note that this technique does not actually compare the email bodies.  A better technique would certainly consider the bodies as well.
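
One possible way to bring the bodies into the comparison is to add a hash of each body to the set of values being matched. This assumes the body text can be exported alongside the other fields, which I have not confirmed. The sketch below is only an illustration: `body_fingerprint` is a hypothetical helper, and collapsing whitespace before hashing is my own assumption about what differences should be ignored.

```python
import hashlib
import re


def body_fingerprint(body_text):
    """Return a SHA-256 hex digest of the body with whitespace runs collapsed.

    Collapsing whitespace (an assumption, not a rule from any tool) keeps
    differences in line endings or wrapping from making two bodies look distinct.
    """
    normalized = re.sub(r"\s+", " ", body_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two emails would then count as duplicate candidates only if their fingerprints match in addition to the other fields. Bodies that differ in anything other than whitespace will still produce different fingerprints.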


