
De-duping new data for review against data already reviewed




I tend to receive data in batches. For example, I might receive batch 1 on Monday, process it, run keywords and review the search hits, and then receive batch 2 on, say, Thursday. More often than not, batch 2 contains duplicates of items from batch 1 that have already been reviewed.


At the moment, I run my keyword list as a search, include the batch 1 and batch 2 locations, and then sort by message hash for emails and MD5 hash for files. I then have to manually remove the duplicates that were already reviewed by tagging them for exclusion.


Is there an easy way, or a method using tags, to automatically deduplicate items (an email, for example) in batch 2 against batch 1?




Hi Mark,


What I would do is select all items/emails that have already been reviewed, export their MD5 and message hashes to CSV files, and then import those CSVs back in using the 'MD5 and Message Hash' facet. These hash lists can then be used as an exclude filter.
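
For anyone who wants to see the logic behind a hash-list exclude filter outside the tool, here is a minimal Python sketch. It is not the application's API or export format: the CSV column name `hash` and the item fields `md5` and `message_hash` are hypothetical names chosen for illustration. It simply loads the hashes of reviewed items into a set and drops any batch 2 item whose hash is already in that set.

```python
import csv
import io


def load_reviewed_hashes(csv_file, column="hash"):
    """Read one hash per row from a CSV export and return a normalized set.

    The column name "hash" is an assumption for this sketch.
    """
    reader = csv.DictReader(csv_file)
    # Normalize case and whitespace so "ABC " and "abc" compare equal.
    return {row[column].strip().lower() for row in reader if row.get(column)}


def exclude_reviewed(items, reviewed_hashes):
    """Keep only items whose message hash or MD5 is not in the reviewed set.

    Each item is a dict; "message_hash" (emails) and "md5" (files) are
    hypothetical field names.
    """
    kept = []
    for item in items:
        # Emails are matched on message hash, files on MD5.
        key = item.get("message_hash") or item.get("md5")
        if key is None or key.strip().lower() not in reviewed_hashes:
            kept.append(item)
    return kept
```

To use it, open the exported CSV (for example `with open("reviewed.csv", newline="") as f:`), pass the file object to `load_reviewed_hashes`, and then pass the batch 2 items and the resulting set to `exclude_reviewed`. Items with no hash at all are kept, so they still get reviewed rather than silently dropped.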
