
Managing Duplicates



One of the biggest issues I've had to deal with to date is managing duplicates across a case and across a custodian.


For example, in a current case the entire company has a policy of emailing their logo as an attachment. I now have 100,000+ copies of this image in the case. If one of my reviewers marks an email as Relevant and another user marks a different email as Non-Relevant, and both have Tag duplicates and Tag families turned on, the shared attachment and all of its duplicates pick up both tags. This leaves me with a lot of files tagged both Relevant and Non-Relevant.


On the other hand, if I don't tag with duplicates and my search includes something like "Exclude tagged items," then the untagged duplicates of tagged items still appear to populate the list after one of the copies has been tagged, so the count of untagged items remains confusingly high. For example:


Main search: 6,438 documents

Deduplicated: 4,011 documents

Tagged: ~3,000 documents

Remaining untagged items: 2,271


But there should be only ~1,000 untagged documents remaining.
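
The expected figure is simply 4,011 - ~3,000 ≈ 1,000. One possible explanation for the higher count, consistent with the behaviour described above, is that a tagged document reappears through an untagged duplicate. The sketch below illustrates that effect with plain Python and made-up data. It is not the review tool's API: the function name, the `(item_id, content_hash)` layout and the sample items are all assumptions for illustration.

```python
from collections import defaultdict

def remaining_unique_untagged(items, tagged_ids, propagate_to_duplicates):
    """Count unique documents left after 'exclude tagged items' + deduplication.

    items: list of (item_id, content_hash) pairs (hypothetical layout).
    tagged_ids: ids the reviewers tagged directly.
    propagate_to_duplicates: mimics a 'tag duplicates' style option.
    """
    ids_by_hash = defaultdict(set)
    for item_id, content_hash in items:
        ids_by_hash[content_hash].add(item_id)

    tagged = set(tagged_ids)
    if propagate_to_duplicates:
        # Every copy of a tagged document is treated as tagged too.
        tagged_hashes = {h for i, h in items if i in tagged}
        tagged = {i for h in tagged_hashes for i in ids_by_hash[h]}

    # Exclude tagged items first, then deduplicate what is left by hash.
    return len({h for i, h in items if i not in tagged})

# Five items, three unique documents ("a", "b", "c"); two get tagged.
sample = [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "c")]
without_propagation = remaining_unique_untagged(sample, [1, 3], False)
with_propagation = remaining_unique_untagged(sample, [1, 3], True)
```

In this toy data, document "a" is still counted as untagged when tags are not propagated, because copy 2 was never tagged. Whether the tool evaluates the search in this order is an assumption.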


I'm not sure what the right answer is, but it seems like there should be some way of de-duplicating the dataset (at the custodian level at least) and then operating on that deduplicated set.
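
As a rough illustration of what "de-duplicating at the custodian level" could mean, here is a minimal Python sketch that keeps one item per (custodian, content hash) pair. The field names `custodian` and `hash` and the sample records are hypothetical, and this says nothing about how the review tool itself implements deduplication.

```python
def dedupe_per_custodian(items):
    """Keep the first item seen for each (custodian, content hash) pair.

    items: list of dicts with hypothetical keys 'id', 'custodian', 'hash'.
    Copies held by different custodians are kept, one per custodian.
    """
    seen = set()
    unique = []
    for item in items:
        key = (item["custodian"], item["hash"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

records = [
    {"id": 1, "custodian": "alice", "hash": "logo"},
    {"id": 2, "custodian": "alice", "hash": "logo"},   # duplicate within alice
    {"id": 3, "custodian": "bob", "hash": "logo"},     # same file, other custodian
    {"id": 4, "custodian": "bob", "hash": "contract"},
]
deduped_ids = [r["id"] for r in dedupe_per_custodian(records)]
```

Reviewers would then tag only the deduplicated set, and tags could be copied back to the dropped duplicates afterwards by matching on the same key.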


I've been giving this some thought lately, and sadly as it happens it kept me from sleeping last night.


If you are reviewing emails only, then simply turn off the tag family option. That way you are not tagging email signatures, attachments, etc. In the final export you will still retain any children of the tagged emails, as they are embedded items and always come out with the parent in the export.


With regard to duplicates, I decide on a case-by-case basis. Generally, unless the client wants to know how many other people have seen or been copied into an email chain, I leave out tagging duplicates.


If you are reviewing a mix of documents and emails, I would still leave the tag family option disabled. Once you have finished the review, go back to the tagged items, highlight them all and select 'show parents'. You can then tag any parent emails that are applicable for the documents (if they were attachments).


You may also want to consider unchecking the 'index content embedded in documents' option. If I understand correctly, this won't affect emails: the attachments, and indeed the documents themselves, are still indexed, so the text remains searchable. It does mean that the pictures will not be indexed individually.


I'm finding myself using this option more and more when dealing with documents, as there is generally a lot of clutter that I'm not necessarily interested in.


It appears to me that Adam's workflow is indeed the best solution for making sure that the Relevant and Non-Relevant sets don't overlap: only tag duplicates but not families during review, and use a combination of Show top-level parent and Show Children to determine the family items of one of these sets at a later point in time.


Doing this Show Parent + Show Children procedure on both the Relevant and Non-Relevant sets will still result in them overlapping though. I think that "Relevant beats Non-Relevant", so you can use the Cluster Map to see the overlap between these two tags and remove the Non-Relevant tag from those items that have both tags.
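
The overlap and the "Relevant beats Non-Relevant" rule can be pictured with plain set operations. This is only a sketch in Python with invented item names and a hypothetical child-to-parent mapping (assumed to contain no cycles). The Show top-level parent and Show Children steps are emulated here, not called through any real API.

```python
def top_level_parent(item, parent_of):
    """Walk up the child -> parent map until an item has no parent."""
    while item in parent_of:
        item = parent_of[item]
    return item

def expand_to_families(tagged, parent_of):
    """Emulate 'show top-level parent' followed by 'show children'."""
    roots = {top_level_parent(i, parent_of) for i in tagged}
    all_items = set(parent_of) | set(parent_of.values()) | set(tagged)
    return {i for i in all_items if top_level_parent(i, parent_of) in roots}

def resolve_overlap(relevant, non_relevant):
    """Relevant beats Non-Relevant: drop Non-Relevant from items with both tags."""
    return relevant, non_relevant - relevant

# Hypothetical families: two emails with attachments.
parent_of = {"att1": "mail1", "att2": "mail1", "att3": "mail2"}
relevant = expand_to_families({"att1"}, parent_of)
non_relevant = expand_to_families({"att2", "att3"}, parent_of)
overlap = relevant & non_relevant
relevant, non_relevant = resolve_overlap(relevant, non_relevant)
```

Here both attachments of mail1 end up in both family sets, so the whole mail1 family is the overlap. After resolving, only the mail2 family keeps the Non-Relevant tag.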
