OCR Configuration


wmfiske

I would like to open a community discussion on OCR settings and programs as I have been doing some performance testing recently.

 

There are two versions of ABBYY that I have been testing: FineReader Corporate (4 core) and Recognition Server (RS v4).

 

My first assumption was that RS v4 would be faster since it is 4-5x the cost of the 4-core Corporate version. I was using an unlimited core version and I liked the idea that I could export/import files directly from Intella v1.9.

 

In one test, I sent 100 non-searchable PDF files to RS using the Intella interface. I preconfigured a workflow in RS to export to Text format. The PDF files were of random sizes, 4 were corrupted and produced errors, and the set totaled 1,067 pages.

 

TEST #1 (Good):

 

RS, which was running on a different server from Intella, completed the task in 26 minutes. (Note: one downside to using the Intella interface to export/import to RS was that I could not use Intella while it was processing.)

 

TEST #2 (Better):

 

Corporate, which was running the Hot Folder function on a separate server, completed the task in less than 19 minutes. The output and other settings were equivalent to the RS workflow.

 

TEST #3 (Best):

 

I then wanted to figure out a way to squeeze more performance from Corporate Hot Folder. I created a batch file that split my PDF files into 4 subfolders. I did this based on the first character of the MD5 filename (16 possible hex values split 4 ways). Of course that will not equally balance the workload, but it was good enough for testing.
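
The split described above can be sketched in Python rather than a batch file. This is only an illustration, not the original batch file: the grouping of hex characters (0-3, 4-7, 8-b, c-f), the `job1`..`job4` folder names, and the function names are all assumptions.

```python
import shutil
from pathlib import Path


def bucket_for(filename: str, buckets: int = 4) -> int:
    """Map the first hex character of an MD5-style filename to a bucket index.

    With 4 buckets: 0-3 -> 0, 4-7 -> 1, 8-b -> 2, c-f -> 3.
    Raises ValueError if the first character is not a hex digit.
    """
    value = int(filename[0], 16)
    return value * buckets // 16


def split_into_subfolders(source: Path, buckets: int = 4) -> dict[int, int]:
    """Move each PDF in `source` into a job subfolder; return files per bucket."""
    counts = {i: 0 for i in range(buckets)}
    # sorted() materializes the listing before any file is moved
    for pdf in sorted(source.glob("*.pdf")):
        idx = bucket_for(pdf.name, buckets)
        target = source / f"job{idx + 1}"  # hypothetical folder naming
        target.mkdir(exist_ok=True)
        shutil.move(str(pdf), str(target / pdf.name))
        counts[idx] += 1
    return counts
```

Each resulting subfolder can then be pointed at its own Hot Folder job. The returned counts show how uneven the split is by file count, although page counts per file will vary as well.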

 

I started the 4 jobs on the Hot Folder interface at the same time (one job per subfolder). Although it was still limited to 4 cores, the split did make a difference. All jobs were completed in less than 10 minutes.
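
For reference, the three timings can be turned into rough throughput figures. A minimal sketch in Python: the page count and minutes come from the tests above, while the labels and function names are just for illustration. Tests 2 and 3 were reported as "less than" 19 and 10 minutes, so their rates are lower bounds.

```python
PAGES = 1067  # total pages in the 100-PDF test set

# Wall-clock minutes per test; the values for tests 2 and 3 are upper bounds
RUNS = {
    "Test 1: RS v4 via Intella": 26,
    "Test 2: Corporate Hot Folder, 1 job": 19,
    "Test 3: Corporate Hot Folder, 4 jobs": 10,
}


def pages_per_minute(pages: int, minutes: float) -> float:
    """Average OCR throughput over the whole run."""
    return pages / minutes


def speedup(baseline_minutes: float, minutes: float) -> float:
    """How many times faster a run is than the baseline run."""
    return baseline_minutes / minutes


if __name__ == "__main__":
    baseline = RUNS["Test 1: RS v4 via Intella"]
    for label, minutes in RUNS.items():
        rate = pages_per_minute(PAGES, minutes)
        print(f"{label}: {rate:.1f} pages/min, {speedup(baseline, minutes):.2f}x")
```

That works out to roughly 41 pages/min for RS, at least 56 pages/min for a single Hot Folder job, and at least 107 pages/min for the four parallel jobs.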

 

This made me consider the option of buying two Corporate 4-core licenses running on separate servers instead of using RS. If you wait, ABBYY often sells the 4-core version at a 40% discount for $359/license. So two licenses come to roughly $700 for unlimited OCR, compared to RS pricing.

 

Questions for the community:

 

1) What do you use for OCR? Has it been a good ROI?

 

2) What OCR settings do you use? What works best for an eD environment?

 

Thanks for reading,

 

Wm

 


  • 2 weeks later...

Hi Wm,

 

Thank you for taking the time to share your testing results with the group. These are indeed interesting results which show that performance can be affected by configuration settings.

 

Being from a corporate environment, I have been involved in a lot of ED work. We used ABBYY RS v4 for OCR processing; however, we did not do this automatically from Intella. Instead, we exported the documents to be OCRed to a Hot Folder on the ABBYY system, where they were OCRed automatically, and then we imported the text back into the case. The reason (more than anything) is that we had not got around to linking the two systems into one workflow. I guess the advantage in our case is that the Intella case could still be used while documents were being OCRed in ABBYY.

 

One advantage of RS v4 is that you can add processing stations which share the workload. We were OCRing large volumes of documents, so we had the RS plus 4 or so processing stations. This really cuts down the time when processing large volumes of documents.

 

In terms of cost, we purchased bulk page licenses, e.g. a 1 million page license. This cost would be disbursed across the many jobs which required OCR work. There was no initial outlay to purchase RS v4.

 

I can say that the quality from RS v4 is very good. Unfortunately, I have not used any other OCR tools, so I can't comment on other products.

 

Regards

 

Jon


  • 3 weeks later...

Hi,

 

We evaluated ABBYY but for HPC it was a little pricey. We also looked at an eDiscovery tool (Vound competitor) but it failed some basic tests. Finally we pulled the trigger on Aquaforest Autobahn DX. For $5500 +yr2 SMS you get unlimited pages on unlimited cores. We run Intella Pro and Intella Connect on two monster hosts with 32 cores each, and Autobahn runs on a third VM. If you'd like to create a sample data set for this thread, maybe we can get some benchmarks going... I'd be curious what a comparable system to Jon's would do (it would be easy to shut down and specify 4 cores).

 

As far as ROI, we haven't done any analysis. To be honest, I think we're getting terrible use out of it. It makes sense to use the tool for large jobs, and otherwise people tend to let Acrobat at the data. The reason we got it, though, is that when a big job does come in and you need everything OCRed now, this is really the way to do it. Outside vendors took a day or two just to spec out the job. Typically I OCR the files before processing in Intella, since it's just so easy to do now. Previously, I had to worry about processing first and OCRing later as time permitted.

 

Regards,

 

Gabriel


Hi Gabriel,

 

Thanks for your post. I'll look at getting a dataset together for benchmark testing; this is a good idea.

 

It is interesting that you mention that you OCR before processing in Intella. Does this also save time? I'm also curious to know how you determine whether a file (or an attachment to an email) needs to be OCRed?

 

Regards

 

Jon



It saves time because most of our cases are PDF production and not email.  For those email cases I've been using FTK but it lacks any sort of email deduplication, although it OCRs as part of processing.  The option of email dedupe far outweighs the additional time required so I'll be phasing Intella in now that I have a handle on 1.9.

 

I sample a couple of files to determine whether they are already OCRed. In the odd instance where only some of the production is OCRed, I just OCR all of it. With 32 cores doing a page each, it's really quick.
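
The sampling check above could be automated along these lines. This is a rough sketch under assumptions, not Gabriel's actual process: `has_text` is a hypothetical caller-supplied check (for example, one that extracts text from a PDF with whatever library is available and tests for non-empty output), and the sample size of 5 is arbitrary.

```python
import random
from typing import Callable, Optional, Sequence


def needs_full_ocr(
    files: Sequence[str],
    has_text: Callable[[str], bool],  # hypothetical text-layer check
    sample_size: int = 5,
    seed: Optional[int] = None,
) -> bool:
    """Return True if any sampled file lacks text, meaning OCR the whole set."""
    if not files:
        return False
    rng = random.Random(seed)
    # Never ask for more samples than there are files
    sample = rng.sample(list(files), min(sample_size, len(files)))
    return not all(has_text(path) for path in sample)
```

A small sample can miss a partially OCRed production, which is one reason to fall back to OCRing everything when there is any doubt.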

 

Autobahn does have an option for what to do with Non-Image PDFs: OCR, Raise Error, Skip, and Pass-through.  We are currently using OCR and see no adverse effects, but Pass-through is probably the most elegant way of dealing with these.  We only see a mixture of 'native' PDFs and scanned PDFs in rare cases.

 

Cheers,

 

Gabriel


Hi all,

 

I have created some sample datasets for testing OCR processing. These are from the Enron dataset.

 

There are two datasets (one of scanned PDFs and one of TIFF files). The content of the documents is the same; only the file format differs.

 

They are approx. 170MB each and can be downloaded from these links.

http://vound-software.com/files/OCR%20dataset%201%20-%20TIFF.zip

http://vound-software.com/files/OCR%20dataset%202%20-%20PDF.zip

 

Regards

 

Jon
