
New feature - Paragraph level deduplication a.k.a. "Shingles"



Hi all!


Our team has been working on a new feature that should be very helpful to everyone who uses Intella primarily for reviewing large data sets. We are calling it "paragraph level deduplication", or sometimes "shingles", and today I'd like to show you one possible use case for it. But first things first: let's start with a short introduction to what we are after here.


Up until now, the text extracted from various items was a flat data structure. We preserved line breaks and separators where possible to make the text easier to read, but in reality it was one continuous flow of data. Now, we all know that during a review it's not only the raw text that matters. The structure of the text also carries information, so we should present it to reviewers so that they can assess its value. For instance, a unique weekly report written by an employee is much more important than yet another LOL-cats joke sent across the organization. A forwarded email that you already saw and marked as non-relevant should not get more of your attention. It's the new stuff that matters and deserves most of your focus.

This is something that we want to start addressing in the next releases.


The idea is to dive one level deeper when analyzing an item's contents and focus on every single paragraph. As a starting point, we want to compute the uniqueness of each paragraph and allow you to hide duplicated content from plain sight. This implies that an item's textual contents will no longer be presented as one flow of data; instead, you will be able to see each paragraph and interact with it. That opens up a lot of new doors for implementing features like:

  • finding items containing the same paragraphs
  • expanding/collapsing paragraphs
  • marking paragraphs as "seen" which can greatly aid your workflow of finding relevant content
  • colorizing paragraphs (again, faster review)
  • tagging / flagging paragraphs
  • commenting on certain paragraphs
  • searching for phrases inside paragraphs
  • showing statistics including paragraph counts (like the % of unique paragraphs in a case)
  • etc.
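To make the "uniqueness of each paragraph" idea more concrete, here is a minimal sketch in Python. It is not how Intella implements the feature; it simply assumes that paragraphs are separated by blank lines and that two paragraphs count as duplicates when they match after lowercasing and collapsing whitespace. The function names and the sample items are made up for illustration.

```python
import hashlib
import re
from collections import defaultdict


def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks (assumed paragraph rule)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def paragraph_key(paragraph):
    """Hash a case- and whitespace-normalized paragraph."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def index_paragraphs(items):
    """Map each paragraph hash to the set of item ids that contain it."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for paragraph in split_paragraphs(text):
            index[paragraph_key(paragraph)].add(item_id)
    return index


def duplicated_paragraphs(items):
    """Return the hashes of paragraphs that occur in more than one item."""
    return {key for key, ids in index_paragraphs(items).items() if len(ids) > 1}


# Hypothetical sample data: two emails sharing a greeting and a signature.
items = {
    "mail-1": "Hi team,\n\nThe weekly report is attached.\n\nRegards, Ann",
    "mail-2": "Hi team,\n\nPlease review the budget.\n\nRegards,  ann",
}
shared = duplicated_paragraphs(items)
```

The same index also answers "which items contain this paragraph", which is the first feature in the list above. Exact matching after normalization is the simplest possible choice here; a fuzzier comparison would catch near-duplicates at the cost of more computation.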

Not all of those features are ready yet or even exist on our roadmap, but these are the things we are thinking of doing. There might be more! So please share your thoughts with us - let us know if and how you are planning to use it. The more feedback we get from the community, the more impact it will have on our priorities for upcoming releases! 


Attached is a little showcase of how it currently looks inside the Previewer. It shows a set of 5 emails sent by the same person.

  • The left column shows the original contents with the structured text extracted (you can use the arrows to toggle each paragraph).
  • We marked 6 paragraphs in this set as "Seen" and turned on the "Automatically hide seen paragraphs" option.
  • The right column shows the same set when we browsed it again. The result: clearly less text to review.
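The "seen" workflow from the showcase can be sketched roughly as follows. This is only an illustration of the idea, not the Previewer's actual code: the `SeenTracker` class, its method names, and the blank-line paragraph rule are all assumptions.

```python
def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks (assumed paragraph rule)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def normalize(paragraph):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(paragraph.lower().split())


class SeenTracker:
    """Remembers paragraphs the reviewer marked as seen (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def mark_seen(self, paragraph):
        self._seen.add(normalize(paragraph))

    def visible_paragraphs(self, text, hide_seen=True):
        """Return the paragraphs to display, skipping seen ones if requested."""
        paragraphs = split_paragraphs(text)
        if not hide_seen:
            return paragraphs
        return [p for p in paragraphs if normalize(p) not in self._seen]


tracker = SeenTracker()
tracker.mark_seen("Regards, Ann")

# A later email with the same signature now shows only its new content.
remaining = tracker.visible_paragraphs("Budget is approved.\n\nRegards,  ann")
```

Here `hide_seen` plays the role of the "Automatically hide seen paragraphs" option: with it switched off, every paragraph is returned unchanged.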



Thanks for the information on the upcoming features, it looks really good. I have one question: can you use this expanded analysis to provide a summary tab of the contents of the documents/emails? Could the shingles be used to create a separate metadata field or tab holding the summary or theme of the document contents? Thanks.


@Dougee, we want to add more statistics that would give reviewers a better overview of how unique the content really is (looking at the whole case). See the attached image for an overview.

We are also considering showing a ratio of unique content within the item (for instance: 4/10, meaning that only four of the item's ten paragraphs are unique).
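As a rough illustration of such a ratio, the sketch below counts a paragraph as unique when no other item in the case contains it. That definition, the blank-line paragraph rule, and the function names are assumptions for the example, not a description of what will ship.

```python
from collections import Counter


def split_paragraphs(text):
    """Split text on blank lines, normalizing case and whitespace."""
    chunks = text.split("\n\n")
    return [" ".join(c.lower().split()) for c in chunks if c.strip()]


def unique_ratio(item_id, items):
    """Return (unique, total) paragraph counts for one item.

    A paragraph counts as unique when no other item in the case contains it.
    """
    # Count in how many items each paragraph appears (at most once per item).
    item_frequency = Counter()
    for text in items.values():
        item_frequency.update(set(split_paragraphs(text)))

    paragraphs = split_paragraphs(items[item_id])
    unique = sum(1 for p in paragraphs if item_frequency[p] == 1)
    return unique, len(paragraphs)


# Hypothetical two-item case: a report and a forward that quotes most of it.
case = {
    "report": "Weekly status\n\nSales rose in March.\n\nRegards, Ann",
    "forward": "FYI\n\nSales rose in March.\n\nRegards, Ann",
}
unique, total = unique_ratio("report", case)
print(f"{unique}/{total}")  # prints 1/3
```

Summing the unique and total counts over all items would give a case-wide percentage along the lines of the statistics mentioned above.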


