Showing results for tags 'near-duplicate'.
Intella does paragraph-level deduplication. What we'd like to propose here is the identification of near-duplicate items (and paragraphs). This could be done using shingles, by calculating the ratio of shared shingles between items (the shingles from item A that are contained in item B, and vice versa). See also "Jaccard similarity."
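To make the idea concrete, here is a rough sketch of shingle-based comparison in Python. It is not how Intella works internally; the function names, the choice of word-level shingles, and the shingle size `k = 3` are all assumptions made for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text.

    Word-level shingles and k=3 are illustrative assumptions.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(a, b):
    """Fraction of the shingles in a that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


item_a = shingles("the quick brown fox jumps over the lazy dog")
item_b = shingles("the quick brown fox jumps over the sleepy dog")

print(containment(item_a, item_b))  # share of A's shingles found in B
print(containment(item_b, item_a))  # share of B's shingles found in A
print(jaccard(item_a, item_b))      # symmetric similarity
```

The two containment values answer the "A in B and vice versa" question separately, which is useful when one item is an excerpt of the other. Jaccard gives a single symmetric score.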
We are in a situation where we would like to identify the near-duplicates of files of varying types, based on each file's content alone. Intella's Smart Search feature will allow us to do this one file at a time, but not en masse. For example, we would like to compare all files in Set A to all files in Set B and identify which files are near-duplicates of one another. Has anyone successfully tackled a problem like this using Intella?
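For what it's worth, a Set A versus Set B comparison like this can be sketched outside Intella once the text of each file is available. The snippet below is only an illustration under that assumption: the dictionaries of file name to extracted text, the function names, the 3-word shingles, and the 0.8 threshold are all hypothetical choices, not an Intella feature.

```python
from itertools import product


def word_shingles(text, k=3):
    """Set of k-word shingles; a text shorter than k words yields one shingle."""
    words = text.lower().split()
    if not words:
        return set()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def near_duplicate_pairs(set_a, set_b, threshold=0.8, k=3):
    """Compare every file in set_a with every file in set_b.

    set_a and set_b map a file name to its already-extracted text
    (an assumption; text extraction is not shown here).
    Returns (name_a, name_b, jaccard_score) tuples, best match first.
    """
    shingles_a = {name: word_shingles(text, k) for name, text in set_a.items()}
    shingles_b = {name: word_shingles(text, k) for name, text in set_b.items()}

    pairs = []
    for (name_a, sa), (name_b, sb) in product(shingles_a.items(), shingles_b.items()):
        union = sa | sb
        score = len(sa & sb) / len(union) if union else 0.0
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


set_a = {"a1.txt": "the quick brown fox jumps over the lazy dog"}
set_b = {
    "b1.txt": "the quick brown fox jumps over the sleepy dog",
    "b2.txt": "the quick brown fox jumps over the lazy dog",
}
for name_a, name_b, score in near_duplicate_pairs(set_a, set_b, threshold=0.5):
    print(name_a, name_b, round(score, 2))
```

This compares every pair, so the work grows with the product of the two set sizes. For large collections, approximate techniques such as MinHash with locality-sensitive hashing are commonly used to avoid the full pairwise comparison.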