dale Posted June 8, 2020
Intella does paragraph-level deduplication. What we'd like to request here is the identification of near-duplicate items (and paragraphs). This could be done using shingles, calculating the ratio of shared shingles between items (shingles from item A contained in item B and vice versa). See also "Jaccard similarity."
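For readers unfamiliar with the shingle approach mentioned above, here is a minimal sketch of the idea. It is not how Intella works internally; the word-level shingles, the shingle size of 3, and the function names `shingles`, `jaccard`, and `containment` are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored).

    k=3 is an illustrative choice, not a recommended setting.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0  # two empty items are treated as identical
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of the shingles of a that also occur in b (asymmetric)."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


item_a = "The quick brown fox jumps over the lazy dog"
item_b = "The quick brown fox leaps over the lazy dog"

sh_a, sh_b = shingles(item_a), shingles(item_b)
print(round(jaccard(sh_a, sh_b), 3))      # 0.4
print(round(containment(sh_a, sh_b), 3))  # 0.571
print(round(containment(sh_b, sh_a), 3))  # 0.571
```

Jaccard similarity gives one symmetric score, while the two containment values correspond to "shingles from item A contained in item B and vice versa", which helps when one item is a short excerpt of a much longer one. Note how changing a single word out of nine already drops the Jaccard score to 0.4 with 3-word shingles, so the threshold for "near-duplicate" would need tuning against real data.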
ngreenou Posted June 8, 2020
+1
SebastianMeszyński Posted June 9, 2020
Hello, we are working on this topic and plan to add this functionality to Intella in the next release. Thank you for your suggestion about Jaccard similarity; it is one of the metrics we are testing to improve our near-duplicates analyzer.