Showing results for tags 'ocr'.
Hi! I have two questions regarding OCR. First, is there an easy way to keep track of progress and see how many docs remain to be OCRed? I usually OCR via a command-line script immediately after processing (using a Task file). The command output simply says "Post-processing", so I don't know how many OCR candidates were identified. I see that files are being created in the following folder in my current case: .\tmp\ocr-service9057841158103284728. It looks like the final OCR results are placed in the "ocr-results" folder there, so the file count in that folder seems to be a good indicator of how many files have been OCRed so far. I just don't know how many files are still waiting to be processed. Second, I notice that when OCR finishes, this "ocr-results" folder is immediately deleted. Is there any way to prevent this? I like to keep OCR results for future use. Sometimes we need to ingest new data that contains a lot of duplicates of files already OCRed. It would be fantastic to just import the existing OCR results for these rather than having to OCR them all over again. I'd appreciate any ideas for the above. Thank you! Bryan
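The folder-watching idea in this post can be sketched in a few lines of Python. This is only a rough sketch under assumptions: it relies on the folder layout described above (an "ocr-results" folder below an "ocr-service..." folder in the case's tmp directory), the function names are made up for illustration, and nothing here is an Intella feature or API. It can only report how many results exist so far, not how many candidates remain, and a file copied while it is still being written may be incomplete.

```python
import shutil
from pathlib import Path


def find_results_dirs(tmp_dir):
    """Return every 'ocr-results' folder directly below an 'ocr-service*' folder."""
    # Assumed layout, taken from the post: <case>/tmp/ocr-service<digits>/ocr-results
    return [d for d in Path(tmp_dir).glob("ocr-service*/ocr-results") if d.is_dir()]


def count_ocr_results(tmp_dir):
    """Count the files currently present in the OCR result folders."""
    return sum(
        1
        for results in find_results_dirs(tmp_dir)
        for path in results.rglob("*")
        if path.is_file()
    )


def backup_ocr_results(tmp_dir, backup_dir):
    """Copy result files that are not in backup_dir yet; return how many were copied."""
    copied = 0
    for results in find_results_dirs(tmp_dir):
        for path in results.rglob("*"):
            if not path.is_file():
                continue
            # Keep one subfolder per OCR service folder so runs do not collide.
            target = Path(backup_dir) / results.parent.name / path.relative_to(results)
            if target.exists():
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied += 1
    return copied
```

Calling both functions periodically from a separate script while OCR runs (for example once a minute, pointed at the case's tmp folder) would give a running count and keep a copy of the results before the folder is deleted.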
Hello, I was wondering whether there is a smart way of selecting files for an external OCR engine other than PDFs and TIFFs. Some scanners save files as regular pictures (e.g. JPG/PNG) rather than PDFs/TIFFs. You can also take a picture of text with your camera. The problem is that ESI usually contains thousands of pictures, so how do you select only the ones that potentially contain text? Do you have to OCR all of them to ensure (reasonable) completeness? If you had a skin-tone analysis, you could apply the brightest tone in order to filter for files with a white/bright background, but I understand that Intella doesn't support this feature. The only filter that comes to my mind is file size, say >50 kB - this would ignore numerous minor graphical objects (email footers, website banners, thumbnails, etc.) and thus reduce the number of files processed by the OCR engine. Any other ideas on how to reduce the number of pictures for OCR? Thanks
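For pictures that have already been exported to disk, the file-size filter suggested in this post could look roughly like the sketch below. The function name, the extension list and the reading of "50 kB" as 50,000 bytes are assumptions made for illustration, and this is plain Python, not an Intella feature. A size threshold is only a heuristic: a small but legible scan would be skipped, and a large photo without any text would still be sent to OCR.

```python
from pathlib import Path

# Assumed set of picture formats; extend as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def select_ocr_candidates(root, min_bytes=50_000, extensions=IMAGE_EXTENSIONS):
    """Return image files under root that are larger than min_bytes, sorted by path."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file()
        and path.suffix.lower() in extensions  # case-insensitive extension match
        and path.stat().st_size > min_bytes    # skips footers, banners, thumbnails
    )
```

Only the returned paths would then be handed to the OCR engine, which keeps small graphical objects such as email footers out of the run.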
Hello Intella-Team, as far as we know, performing OCR on an item works as follows:
1. Export the item for OCR, e.g. an image or PDF
2. Perform OCR externally, e.g. using OmniPage or ABBYY
3. Import the OCR'd items
4. Extract textual info from the OCR'd items
5. Index the extracted textual data within Intella
There are at least two disadvantages we have noticed:
- Opening an "original" PDF will not include the text data from the OCR process
- Reindexing the case means losing all imported OCR texts
A possible enhancement could be this approach:
- Append the OCR'd files as an attachment to the item
- Add an option to open the attachment instead of the original item
What do you think about this? Thanks Stephan