I would like to open a community discussion on OCR settings and programs as I have been doing some performance testing recently.
There are two versions of ABBYY that I have been testing: FineReader Corporate (4 core) and Recognition Server (RS v4).
My first assumption was that RS v4 would be faster since it is 4-5x the cost of the 4-core Corporate version. I was using an unlimited core version and I liked the idea that I could export/import files directly from Intella v1.9.
In one test, I sent 100 non-searchable PDF files to RS using the Intella interface. I preconfigured a workflow in RS to export to Text format. The PDF files were random sizes, 4 had errors (corrupted) and they totaled 1,067 pages.
TEST #1 (Good):
RS server, which was running on a separate server than Intella, completed the task in 26 minutes. (Note: One downside to using the Intella interface to export/import to RS was I could not use Intella while it was processing)
TEST #2 (Better):
Corporate, which was running the Hot Folder function on a separate server, completed the task in less than 19 minutes. The output and other settings was equivalent to the RS workflow.
TEST #3 (Best):
I then wanted to figure out a way to squeeze more performance from Corporate Hot Folder. I created a batch file that split my PDF files into 4 subfolders. I did this based on the starting value of the MD5 filename (16 variables split 4 ways). Of course that will not equally balance the workload but it was good enough for testing.
I started the 4 jobs on the Hot Folder interface at the same time (one job per subfolder). Although it was still limited to 4 cores, the split did make a difference. All jobs were completed in less than 10 minutes.
This made me consider the option of buying two Corporate 4-core licenses running on separate servers instead of using RS. If you wait, ABBYY often sells 4-core at a 40% discount for $359/license. So roughly $700 for unlimited OCR compared to RS pricing.
Questions for the community:
1) What do you use for OCR? Has it been a good ROI?
2) What OCR settings do you use? What works best for an eD environment?
Thanks for reading,
Wm