Search the Community
Showing results for tags 'near duplicates'.
Intella does paragraph-level deduplication. What we'd like to stipulate here is the identification of near-duplicate items (and paragraphs). This could be done using shingles, calculating the ratio of shared shingles amongst items (shingles from item A contained in item B and vice-versa). See also "Jaccard Similarity."
In the ediscovery world, we are bombarded by both vendors and developers heralding the promise of advanced text analytics capabilities to effectively and intelligently reduce review volumes. First it was called predictive coding, then CAR, then TAR, then CAL, and now it's AI. Although Google and Facebook and Amazon and Apple and Samsung all admit to having major hurdles ahead in perfecting AI, in ediscovery, magical marketing tells us that everyone but me now has it, that it's completely amazing and accurate and that we are Neanderthals if we do not immediately institute and trust it. And all this happened in a matter of months. It totally didn't exist, and now it apparently does, and these relatively tiny developers have mastered it when the tech giants have not. Back in reality, I have routinely achieved with Intella that which I'm told is completely impossible. As Intella has evolved its own capabilities, I have been able to continually evolve my processes and workflows to take full advantage of its new capabilities. As a single user, and with Intella Pro, I have been able to effectively cull data in data sets up to 500 GB into digestible review sets, from which only a far smaller number of documents are actually produced. PSTs, OSTs, NSFs, file share content, DropBox, 365, Gmail, forensic images - literally anything made up of 1s and 0s. These same vendors claim I can not and should not be doing this, it's not possible, not defensible, I need their help, etc. My response is always, in using Intella with messy, real-world data in at least 300 litigation matters, why has there not been a single circumstance where a key document in my firm's possession has ever been produced by an opposing party, that was also in our possession, in Intella, but that we were unaware of? Of course, the answer is that, used to its fullest, with effectively designed, iterative workflows and QC and competent reviewers, Intella is THAT GOOD. In the process, I have made believers out of others, who had no choice but to reverse course and accept that what they had written off as impossible was in fact very possible, when they sat there and watched me do it, firsthand. However, where I see the greatest need for expanded capabilities with Intella is in the area of more advanced text analytics, to further leverage both its existing feature set, and the quality of Connect as a review platform. Over time, I have seen email deduplication become less effective, with the presence of functional/near duplicates plaguing review sets and frustrating reviewers. After all, the ediscovery marketing tells them they should never see a near duplicate document, so what's wrong with Intella? You told us how great it is! The ability to intelligently rank and categorize documents is also badly needed. I realize these are the tallest of orders, but after hanging around as Intella matured from version 1.5.2 to the nearly unrecognizable state of affairs today (and I literally just received an email touting AI for law firms as I'm writing this), I think that some gradual steps toward these types of features is no longer magical thinking. Email threading was a great start, but I need improved near duplicate detection. From there, the ability to identify and rank documents based on similarity of content is needed, but intelligently - simple metadata comparison is no longer adequate with ever-growing data volumes (which Intella can now process with previously unimaginable ease). So that's my highest priority wishlist contribution request for the desktop software, which we see and use as the "administrative" component of Intella, with Connect being its review-optimized counterpart. And with more marketing materials touting the "innovative" time saving of processing and review in a unified platform, I can't help but think to respond, "Oh - you mean like Intella has been from very first release of Connect?" Would love to hear others share their opinions on this subject, as I see very little of this type of thing discussed here. Jason