Showing results for tags 'near duplicates'.
Intella does paragraph-level deduplication. What we'd like to propose here is the identification of near-duplicate items (and paragraphs). This could be done using shingles: calculate the ratio of shared shingles between items (the share of item A's shingles that also occur in item B, and vice versa). See also "Jaccard similarity."
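To make the suggestion a bit more concrete, here is a rough sketch of what shingle-based comparison could look like. It is not how Intella works internally; the function names (`shingles`, `jaccard`, `containment`), the choice of word-level shingles, and the shingle size `k=3` are all assumptions made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (hypothetical helper)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def containment(a, b):
    """Share of a's shingles that also occur in b (the 'A in B' ratio)."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


item_a = shingles("the quick brown fox jumps over the lazy dog")
item_b = shingles("the quick brown fox jumps over the lazy cat")

print(jaccard(item_a, item_b))      # symmetric similarity
print(containment(item_a, item_b))  # share of A found in B
print(containment(item_b, item_a))  # share of B found in A
```

The two containment ratios are what "A in B and vice versa" would give you. They differ when one item is much longer than the other, for example a short email quoted inside a long thread, whereas Jaccard gives a single symmetric score. Items or paragraphs scoring above some chosen threshold would then be flagged as near-duplicates.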
In the ediscovery world, we are bombarded by both vendors and developers heralding the promise of advanced text analytics capabilities to effectively and intelligently reduce review volumes. First it was called predictive coding, then CAR, then TAR, then CAL, and now it's AI. Although Google and Facebook and Amazon and Apple and Samsung all admit to having major hurdles ahead in perfecting AI, in ediscovery, magical marketing tells us that everyone but me now has it, that it's completely amazing and accurate and that we are Neanderthals if we do not immediately institute and trust it. And a