
Smart filtering options for selecting files for OCR

Questa Integrity


I was wondering whether there is a smart way of filtering files other than PDFs and TIFFs for an external OCR engine.

Some scanners save files as regular pictures (e.g. JPG/PNG) rather than PDFs or TIFFs. You can also take a picture of text with your camera.
The problem is that ESI usually contains thousands of pictures, so how do you select only the ones that potentially contain text? Do you have to OCR all of them to ensure (reasonable) completeness?
With a skin-tone-style analysis, you could filter on the brightest tones to pick out files with a white or bright background, but I understand that Intella doesn't support this feature.
The only filter that comes to my mind is file size, say >50kB. This would ignore numerous minor graphical objects (email footers, website banners, thumbnails, etc.) and thus reduce the number of files processed by the OCR engine.
Any other ideas on how to reduce the number of pictures for OCR?
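
As a rough illustration of the file-size idea, here is a minimal Python sketch that works on a folder of exported pictures outside Intella. The function name, the extension list and the exact byte value used for "50kB" are assumptions made for the example, not anything Intella provides.

```python
from pathlib import Path

# Assumed set of picture types; adjust to whatever the export contains.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}
# The ">50kB" threshold from the post, taken here as 50 * 1024 bytes.
MIN_SIZE_BYTES = 50 * 1024


def ocr_candidates_by_size(folder, min_size=MIN_SIZE_BYTES):
    """Return picture files under `folder` larger than `min_size`, largest first."""
    candidates = []
    for path in Path(folder).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            size = path.stat().st_size
            if size > min_size:
                candidates.append((size, path))
    # Sort on size only, so paths are never compared with each other.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [path for _size, path in candidates]
```

A size cut-off like this only removes the obvious noise (footers, banners, thumbnails). Large photographs without any text still pass, so it reduces the OCR workload rather than identifying text.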



It would certainly be a great addition if there were a way to reliably scan for these types of pictures, maybe some sort of reduced OCR functionality that simply looks for a large percentage of 'text' within pictures, much the same way skin tone analysis works. The results would then be graded by percentage, with a final manual review to select the documents for full OCR.
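
The grading idea can be sketched in Python, under loud assumptions: this is not OCR and not an Intella feature, only a crude proxy that scores how much a picture looks like dark marks on a bright background. It expects each picture as a flat sequence of grayscale values (0 to 255), which in practice would come from an image library. The thresholds and function names are made up for the example.

```python
def page_likeness(gray_pixels, bright_at=200, dark_at=80, min_dark=0.01):
    """Score 0..1 for 'mostly bright with some dark marks'; a proxy, not text detection."""
    total = len(gray_pixels)
    if total == 0:
        return 0.0
    bright_frac = sum(1 for p in gray_pixels if p >= bright_at) / total
    dark_frac = sum(1 for p in gray_pixels if p <= dark_at) / total
    # A blank bright picture has no dark marks, so it gets no score.
    if dark_frac < min_dark:
        return 0.0
    return bright_frac


def rank_for_review(images):
    """images: dict of name -> grayscale pixel sequence. Highest score first."""
    scored = [(name, page_likeness(pixels)) for name, pixels in images.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The ranked list is meant to feed the manual review step, not replace it: screenshots and line drawings score highly too, and dark-background scans score low.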


Hi Adam,


The problem with this is one of pricing. If we added even basic OCR we would then need to charge more for Intella to cover the OEM OCR package.  Most people won't like that as they don't use it. 


I would suggest getting an ABBYY Recognition Server. Then it is one click to export and process all images in case they have text. 


That doesn't seem feasible to me, either, without great expense.  I've never run across more than a tiny smattering of actual documents that were JPG or PNG file types.  Perhaps if you did find some you could maybe isolate the source (a particular custodian or organization) and use that as a point of departure to export a subset?  Maybe in conjunction with the thumbnail view?


On a related note, in recently looking into some other options for OCR, I came across CVision Technologies.  I demo'd their Maestro Recognition Server, which outperforms ABBYY at a lower cost.  It was extremely robust, but pricey if you want to take advantage of multiple cores (I love being punished by software companies because I might happen to have a more powerful computer than the next guy).  I believe it's still less expensive than ABBYY, though.  They also offer another product with a per-page pricing model but no core limitation.  However, I was asked, "How many pages do you need OCRd in a year?"  Not exactly a question that can be answered in e-discovery...


With regard to OCR in conjunction with Intella, here's the problem I see.  We need high-volume, high-speed OCR capabilities.  If Intella can index 350GB in 24 hours or less, it's no help at all if it then requires 3 weeks for a single-threaded app to OCR 75K PDFs.  Thus, the only practical options are enterprise OCR solutions, whose price is completely out of line for most Intella users, or e-discovery vendors, who really don't want to do only OCR work, and certainly don't appreciate your daring to use Intella in the first place.  When you need the OCR processing, you really need it, but there will also be weeks or months where it isn't used at all and you still have to pay for it.  Some of these tools would exceed the cost of the entire Connect/Pro package.  With per-page pricing, you certainly don't want to eat up processing on docs that don't have any text, already have embedded text, etc.  I don't know if there is a powerful tool that would only count files that actually had text and were successfully OCRd. 


It's a real problem, as paper hasn't exactly gone away despite everything that's said (I'm sitting in a giant law firm with millions of pages in boxes onsite and probably 10s of millions offsite).  I don't have any very good answers, but that's the scope of the problem as I see it!   


I looked at Maestro a while back too and was very impressed. From memory, the unlimited single-core package was either $7k or $12k for a 12-month license. I'm not sure what the renewal cost was, but I think it was SMS-type rather than full cost.


Next big job I get requiring OCR I will definitely be purchasing Maestro :)


@admin - I fully understand, and you are correct: having it as standard within Intella would be expensive and, given the standard of OCR software that is out there, not really needed.


@Jasoncovey - I thought that myself until recently, when I started looking more closely at the pictures. In my last few jobs, I would estimate that several hundred to more than a thousand scanned pictures of documents were present. It's very common in the corporate world to scan important documents for archiving, and scanning to JPG or TIFF seems to be as common as scanning to PDF, if not more so. So while it may have limited value for some people, it's certainly something I would be interested in. Right now, though, all the software I can find is about OCR itself, so unless you OCR thousands of pictures, exporting the genuine documents is still a manual process. Having the ability to rate pictures based on the amount of text content would be extremely useful.


  • 3 weeks later...

I think the OCR Export and Import feature, with the MD5 hash as file names, is a good one, and it is also useful for graphical images.


In several cases scanned documents are very important. As said, PDF and TIFF files contain text in most cases, so whether to OCR them is not in question (in my opinion, you should always do this).


JPG is also a file format used for scanner output. OCRing all the pictures in a case won't work, because the process is very time-consuming.


We do a little handwork by exporting all the images (except TIFF, because we already handle TIFF in our main process) and naming them by their MD5 hash (same as the OCR export). After exporting, we do a manual analysis by viewing the images in a thumbnail view, e.g. with Microsoft Office Picture Manager, ACDSee, IrfanView or others. First sort the pictures by file size. Smaller files are mostly not scanned documents.


Now scroll through all the pictures and move 'white pictures with black lines on them' to another folder. When this job is done, we OCR these files and import all the OCRed 'white files' by using the "Import OCRed Files" option within Intella. While importing, get a cup of coffee and give your eyes some rest :)
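
For anyone doing the export step with a script rather than through Intella's own export, the MD5-naming part could look roughly like the Python sketch below. The helper names are hypothetical, and the assumption that the file name is the hash plus the original extension should be checked against what "Import OCRed Files" actually expects.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_by_md5(files, export_dir):
    """Copy files into export_dir as <md5><ext>; return the copies, largest first."""
    export_dir = Path(export_dir)
    export_dir.mkdir(parents=True, exist_ok=True)
    exported = []
    for src in map(Path, files):
        dest = export_dir / f"{md5_of_file(src)}{src.suffix.lower()}"
        if not dest.exists():  # identical content hashes to the same name
            shutil.copy2(src, dest)
            exported.append(dest)
    # Largest first, since smaller files are mostly not scanned documents.
    return sorted(exported, key=lambda p: p.stat().st_size, reverse=True)
```

Because the name is derived from the content, duplicate pictures collapse into a single file, which also cuts down the number of thumbnails to review.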


Analysing the pictures by viewing them is time-consuming, but it can be very valuable. We had a case in a foreign country where we used Intella to index a large PST file. We did not have an OCR tool on the laptop, but by using the thumbnail view we discovered a whole bunch of very important scanned documents (JPG and PDF). Keyword searching is not always necessary to find your evidence.





  • 3 years later...

I searched for an OCR candidate procedure and ran across this older conversation (as well as the "Sample checklist for users" post). I still am wondering ...

Using Intella, how best to identify the files I should OCR?

Perhaps, use the Images "Type" facet and preview all of them?

Preview all of the "Empty Documents" in the "Features" facet?

Are the above 2 steps alone satisfactory? Any guidance on a procedure(s) to locate files needing OCR would be very welcome :)


Hi llanowar

Ultimately, which files to OCR depends on the customer's requirements. This should be discussed with the customer and agreed prior to running the OCR process. In my experience, different customers have asked for different settings. E.g. some customers want just empty PDFs and top-level TIFF files. Others want those types, plus other image formats.  

When you know which files to search for, you can search manually, e.g. search for PDFs and Empty files etc., then tag those items and run them through OCR. Or, you can use the Tasks feature (File - Tasks) to select the OCR candidates and automatically OCR the items. 




  • 4 years later...

On this topic, I have an interesting edge-case that I assume someone has probably already looked at before:

I have a PDF which is a scanned copy of a contract that has been printed, signed and re-scanned.

However, someone has then opened the PDF in an editor and added the names of the signatories under their signatures on the final page.

This means that Intella does not consider this an empty document, so if you choose the "Empty Only" PDF flag, it won't OCR.

No problem I thought to myself, I'll leave it ticked, but tick everything on the Images side, assuming "Include embedded images" makes Intella consider the pages of a PDF embedded?

I also selected "PNG" as for this specific document the pages were PNG format.

I re-ran OCR and the file was still not processed - Does this mean that Intella first evaluates the PDF as "not empty" and completely discards it and the embedded images from further OCR consideration?

I hoped it would take a "second pass" on the embedded images and OCR them as individual images in their own right?

What's the safest option? Untick "Empty Only" and tick all on the right under "Images" and let it possibly process things that don't need to be?



Hi Shaun,

The embedded images option should have OCRed the embedded images. We are not sure why that did not occur. Possibly the images were not .png, or ABBYY did not find anything in them? If you want us to take a look at the PDF then you can open a support ticket and send it to us for testing. 

Note that the PDF document will not show any OCR if the image was OCRed. Only the embedded image will have OCR text attached to it. To OCR that PDF you will need to remove the 'empty only' filter as this PDF does have text and it is not empty.  
