
Search vs Compare


AdamS


I have a curly one and am having trouble getting my head around a way I might be able to do this.

 

The scenario is this: my client thinks someone at their business has been double invoicing subcontractors. This was identified from invoices that have duplicate descriptions of the work but different numbers, dates, etc.

 

I have been asked if I can examine all the invoices and identify the ones that have duplicate descriptions. 

 

If this was simply one description that had been used over and over it would be easy, but there are potentially many different descriptions that each may have been used multiple times.

 

I don't know if I'm looking at Word, Excel or PDF documents, and I don't know if all the invoices are templated or free-form.

 

Hopefully some sort of template is used and I can do a field-based search and comparison, but they may all be non-searchable PDFs, in which case I'll have to OCR and import them.

 

Any ideas on how I might use Intella to identify duplicate fields or text to assist here?

 

The process I'm thinking of is some way to identify which documents share duplicate words within a short proximity, say 30 duplicate words within 100 bytes (or something like that).

 

I have a headache :(
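
For anyone wanting to try the "shared run of words" idea outside Intella, here is a minimal sketch in Python. It is not an Intella feature. It assumes the OCR text has already been exported as plain text keyed by document ID, and the names `shingles` and `shared_passages` and the parameters `size`, `min_shared` and `max_docs` are made up for illustration. Each document is broken into overlapping runs of words, and pairs of documents that share several runs are reported. Runs that occur in many documents are skipped on the assumption that they are template text.

```python
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, size=5):
    """Return the set of overlapping runs of `size` words found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def shared_passages(docs, size=5, min_shared=3, max_docs=20):
    """Map pairs of document IDs to the word runs they have in common.

    docs: dict of document ID -> OCR text (hypothetical input format).
    Runs found in more than max_docs documents are treated as template
    boilerplate and ignored. Only pairs sharing at least min_shared
    runs are returned.
    """
    index = defaultdict(set)  # word run -> IDs of documents containing it
    for doc_id, text in docs.items():
        for run in shingles(text, size):
            index[run].add(doc_id)

    pairs = defaultdict(set)  # (id_a, id_b) -> word runs shared by the pair
    for run, ids in index.items():
        if len(ids) > max_docs:
            continue
        for a, b in combinations(sorted(ids), 2):
            pairs[(a, b)].add(run)
    return {pair: runs for pair, runs in pairs.items() if len(runs) >= min_shared}
```

OCR errors will break some runs, so a smaller `size` or a lower `min_shared` may be needed on poor scans, at the cost of more false matches.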


Hi

 

I would recommend taking a look at Intella's Smart Search feature.

 

If at least one relevant document is known, Smart Search can help you identify which other documents are potentially relevant. Unlike duplicate identification, Smart Search ignores document format and minor textual differences, returning documents that share only significant keywords with the sample.

 

By repeating this procedure for each further document found (and perhaps combining it with other search tools, such as proximity or wildcard search), it should be possible to identify the entire set of relevant documents.

 

However, this will not work for non-searchable PDFs; they need to be OCRed beforehand.

 

HTH


Unfortunately Smart Search doesn't really help in this instance. The invoices in question were all scanned paper documents, so I had to OCR them, which means that without spending many hours manually coding them there are no 'fields' as such. And because most invoices tend to follow very similar patterns, Smart Search treats them all as potential matches for each other.

 

Looks like there might not be any way around this other than a manual look and compare, or maybe ultra compare... but 8,000 invoices... sheesh :\


I had another thought for a possible way to do this but I'm not sure if Intella can do what I need.

 

Can I use a proximity search or a Boolean string to show me all instances of 'total' and the next, say, 20 characters, and then export a spreadsheet showing nothing more than the document ID, 'total' and the next 20 characters? I could use this to very quickly identify duplicate dollar amounts, which would go a long way towards accomplishing what I need.

 

I'm just not sure whether I can export a spreadsheet with only those results. Can I?
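
As a rough illustration of the spreadsheet described above, here is a minimal Python sketch that works on exported text rather than inside Intella. It assumes the OCR text is available as a dict of document ID to plain text. The function names `total_snippets`, `write_snippets_csv` and `duplicate_amounts` are hypothetical, and the 20-character window is simply the figure from this post. It pulls out each 'total' with the characters that follow, writes them to CSV, and groups documents by the first amount found after 'total'.

```python
import csv
import re
from collections import defaultdict
from decimal import Decimal

# 'total' as a whole word (so 'Subtotal' is skipped) plus up to 20 characters
TOTAL_RE = re.compile(r"\btotal\b.{0,20}", re.IGNORECASE | re.DOTALL)
# First number in the snippet, e.g. 4,500 or 4,500.00
AMOUNT_RE = re.compile(r"\d[\d,]*(?:\.\d{2})?")


def total_snippets(docs):
    """Yield (doc_id, snippet) for every 'total' and the text right after it."""
    for doc_id, text in docs.items():
        for match in TOTAL_RE.finditer(text):
            # Collapse line breaks and repeated spaces left over from OCR
            yield doc_id, " ".join(match.group(0).split())


def write_snippets_csv(docs, out):
    """Write one row per snippet to an open text file or buffer."""
    writer = csv.writer(out)
    writer.writerow(["document_id", "snippet"])
    writer.writerows(total_snippets(docs))


def duplicate_amounts(docs):
    """Return {amount: [doc IDs]} for amounts found in more than one document."""
    by_amount = defaultdict(set)
    for doc_id, snippet in total_snippets(docs):
        match = AMOUNT_RE.search(snippet[len("total"):])
        if match:
            amount = Decimal(match.group(0).replace(",", ""))
            by_amount[amount].add(doc_id)
    return {amt: sorted(ids) for amt, ids in by_amount.items() if len(ids) > 1}
```

Matching amounts only narrow the list of candidates. OCR misreads of digits, and amounts pushed beyond the 20-character window, would still need a manual check.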


Hi Adam,

 

Proximity search doesn't work at the character level. You could perhaps use a phrase or proximity search with wildcards to find instances of "total" and the word next to it:

 

"total *" - any word

"total $*" - any word beginning with "$" (e.g. "total $1000")

"total $???" - "$" followed by exactly 3 characters (e.g. "total $999")

"total $???"~2 - same as above but possibly separated by another word (e.g. "total USD $999")

 

and so on.

 

Unfortunately, there is no functionality to export the results in a form that would preserve the word hits.
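
If the text can be got out of Intella, the four queries above can be roughly imitated with regular expressions, and these do return the matched word. This is only a sketch: the patterns are my approximation for plain text and say nothing about how Intella actually tokenises or matches. The `PATTERNS` table and `word_after_total` are hypothetical names, and the last pattern allows a single intervening word simply because that is the example given above.

```python
import re

# Regex approximations of the queries, keyed by the query as written above.
# Each pattern captures the word that the wildcard stands for.
PATTERNS = {
    '"total *"': r"\btotal\s+(\S+)",
    '"total $*"': r"\btotal\s+(\$\S*)",
    '"total $???"': r"\btotal\s+(\$\S{3})(?!\S)",
    '"total $???"~2': r"\btotal\s+(?:\S+\s+)?(\$\S{3})(?!\S)",
}


def word_after_total(query, text):
    """Return the word captured by the regex analogue of query, or None."""
    match = re.search(PATTERNS[query], text, re.IGNORECASE)
    return match.group(1) if match else None
```

For example, `word_after_total('"total $???"~2', "Total USD $999")` gives `"$999"`, while `word_after_total('"total $???"', "Total $1000")` gives `None` because four characters follow the `$`.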


Thanks Alex, I think I have my next request for the wish list regarding exporting keyword results :)

 

Admittedly this is a rare case, as it's more of a data analytics problem and there is software that can deal with that, PROVIDED the documents are coded with fields. That is of course an option here, but an expensive one for the client, so if I can find a way to get the results without coding the documents I will explore it.


Hello Adam,

 

How about just searching for "total" and using the List view to see the context in which it appears? Since version 1.7.2 we show search snippets there (see the attached image). Unfortunately you can't copy or export them, but it may be a step closer towards what you need.

 

From there you can easily flag them as well, and later export the flagged items.

list view.png

