De-duping new data for review against data already reviewed

markjrouse · October 29, 2015

Hi,

I tend to receive data in batches. For example, I might receive batch 1 on Monday, process it, run keywords and review the search hits, and then receive batch 2 on, let's say, Thursday. More often than not, batch 2 contains duplicates of items from batch 1 that have already been reviewed.

At the moment, I run my keyword list as a search with both the batch 1 and batch 2 locations included, then sort the results by message hash for emails and MD5 hash for files. I then have to remove the already-reviewed duplicates manually by adding a tag to exclude them.

Is there an easy way, or a method using tags, to automatically deduplicate an email, for example, from batch 2 against batch 1?

arjohn · October 30, 2015

Hi Mark,

What I would do is select all items/emails that have already been reviewed, export their MD5 and message hashes to CSV files, and then import those CSV files back in using the 'MD5 and Message Hash' facet. These hash lists can then be used as an exclude filter.
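
For anyone who wants to sanity-check the result outside the tool, the same idea (build a set of already-reviewed hashes, then exclude anything in the new batch that matches) can be sketched in a few lines of Python. This is only an illustration of the logic, not how the facet works internally. The column names "MD5" and "Message Hash", the "Name" column, and the sample rows are assumptions about what an export might look like, so adjust them to match your own CSV files.

```python
import csv
import io

# Assumed column names; change these to match the headers in your export.
HASH_COLUMNS = ("MD5", "Message Hash")


def item_hashes(row, columns=HASH_COLUMNS):
    """Return the set of non-empty, lower-cased hash values found in one row."""
    values = ((row.get(col) or "").strip().lower() for col in columns)
    return {value for value in values if value}


def load_hashes(csv_file, columns=HASH_COLUMNS):
    """Collect every hash from an export of the already-reviewed items."""
    hashes = set()
    for row in csv.DictReader(csv_file):
        hashes |= item_hashes(row, columns)
    return hashes


def split_new_items(batch_rows, reviewed_hashes, columns=HASH_COLUMNS):
    """Split a batch into (new, duplicates) by comparing against reviewed hashes."""
    new, duplicates = [], []
    for row in batch_rows:
        if item_hashes(row, columns) & reviewed_hashes:
            duplicates.append(row)
        else:
            new.append(row)
    return new, duplicates


# Hypothetical sample data standing in for the two exported CSV files.
reviewed_csv = "MD5,Message Hash\nabc123,\n,ffee99\n"
batch2_csv = (
    "Name,MD5,Message Hash\n"
    "report.docx,ABC123,\n"
    "mail1.eml,,ffee99\n"
    "new.pdf,def456,\n"
)

reviewed = load_hashes(io.StringIO(reviewed_csv))
new_items, dupes = split_new_items(csv.DictReader(io.StringIO(batch2_csv)), reviewed)

print("Still to review:", [row["Name"] for row in new_items])
print("Already reviewed:", [row["Name"] for row in dupes])
```

Hashes are lower-cased before comparison so that differences in case between exports do not hide a match. With real data you would pass open file handles instead of the `io.StringIO` samples.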
