dale

Near-duplicate identification using shingles


Intella does paragraph-level deduplication. What we'd like to request here is the identification of near-duplicate items (and paragraphs). This could be done with shingles, by calculating the ratio of shared shingles between items (the shingles from item A that are also contained in item B, and vice versa). See also "Jaccard similarity."
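To make the idea concrete, here is a minimal sketch of shingle-based comparison. It is not how Intella works internally; the function names (`shingles`, `containment`, `jaccard`), the choice of word-level shingles, and the shingle size `k=3` are all assumptions made for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word sequences) in text.

    Word-level shingles and k=3 are illustrative assumptions.
    """
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(a, b):
    """Fraction of the shingles in a that also occur in b (directional)."""
    return len(a & b) / len(a) if a else 0.0


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
sa, sb = shingles(doc_a), shingles(doc_b)

print(f"A in B:  {containment(sa, sb):.2f}")  # 0.57
print(f"B in A:  {containment(sb, sa):.2f}")  # 0.57
print(f"Jaccard: {jaccard(sa, sb):.2f}")      # 0.40
```

The two containment ratios correspond to "shingles from A contained in B and vice versa"; they differ when one item is much longer than the other, which Jaccard similarity alone would not show. An item pair would be flagged as near-duplicate when the score exceeds some threshold, which would have to be tuned on real data.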

 


Hello,

We are working on this topic and plan to add this functionality to Intella in the next release. Thank you for suggesting Jaccard similarity; it is one of the metrics we are testing to improve our near-duplicate analyzer.

