OCR Progress & Temp Files

Bryan La Rock · February 28, 2019

Hi!

I have two questions regarding OCR. First, is there any easy way to keep track of progress and see how many docs remain to be OCRed? I am usually OCRing via a command-line script immediately after processing (using a Task file). The command output simply says "Post-processing", so I don't know how many OCR candidates were identified. I see that files are being created in the following folder in my current case: .\tmp\ocr-service9057841158103284728. It looks like the final OCR results are being placed in the "ocr-results" folder here, so that seems to be a good number as to how many files have been OCRed thus far. I just don't know how many files are still going to be processed.

Also, I notice that when OCR finishes, this "ocr-results" folder is immediately deleted. Is there any way to prevent this? I like to keep OCR results for future use. Sometimes we need to ingest new data that contains a lot of duplicates of files already OCRed. It would be fantastic to be able to import the existing OCR results for these rather than OCR them all over again.

I'd appreciate any ideas for the above.

Thank you!

Bryan

ŁukaszBachman · March 28, 2019

Hi Bryan,

It's true that the console output when processing tasks could be improved, but there is another option available. Instead of analyzing the output in the console, you might prefer to open the case logs and monitor the progress there. Here is a snippet showing OCR starting, progressing and finishing:

[INFO ] 2019-03-28 13:40:07,100 [CrawlThread] Total page count: 101
[INFO ] 2019-03-28 13:40:07,109 [CrawlThread] Started OCRing 101 items. Using: ABBYY FineReader Engine
[INFO ] 2019-03-28 13:40:07,109 [CrawlThread] Settings:
    Profile: Accuracy
    Export format: Plain text
    Languages: English
    Number of workers: 10
    Detect page orientation: true
    Correct inverted images: true
    Skip OCRed: true
[WARN ] 2019-03-28 13:40:07,115 [CrawlThread] Skipped encrypted content item: 1373
[INFO ] 2019-03-28 13:40:07,116 [OcrServiceProcessor1] OCRing item: 1243
[INFO ] 2019-03-28 13:40:07,116 [OcrServiceProcessor2] OCRing item: 1244
...
[INFO ] 2019-03-28 13:40:32,470 [CrawlThread] Collecting OCR crawl results
[INFO ] 2019-03-28 13:40:32,619 [CrawlThread] Collected 0 records.
[INFO ] 2019-03-28 13:40:32,620 [CrawlThread] Importing OCRed text and extracted entities
[INFO ] 2019-03-28 13:40:32,889 [CrawlThread] Imported OCR text into 150 items.
[INFO ] 2019-03-28 13:40:32,938 [CrawlThread] Updating OCR database
[INFO ] 2019-03-28 13:40:33,182 [CrawlThread] Finished OCR. Total time: 0:26. Items processed: 99

You could of course monitor the entire log, or use a command-line program to grep its contents live for regular expressions of your choosing. That way you get only the information about the OCR process itself.
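As a rough illustration of that idea, here is a minimal Python sketch that follows a log file and keeps a running count. It assumes the log lines look exactly like the snippet above ("Started OCRing N items", "OCRing item: ID", "Skipped ... item: ID", "Finished OCR."). The names `OcrProgress`, `follow` and `watch` are made up for this example, and the path to the case log is something you would need to supply yourself. Note that it counts items that have started OCR, not items that have completed.

```python
import re
import time
from dataclasses import dataclass

# Patterns taken from the log snippet above; other versions may log differently.
STARTED = re.compile(r"Started OCRing (\d+) items")
ITEM = re.compile(r"OCRing item: \d+")
SKIPPED = re.compile(r"Skipped .* item: \d+")
FINISHED = re.compile(r"Finished OCR\.")


@dataclass
class OcrProgress:
    total: int = 0       # candidates announced by "Started OCRing N items"
    started: int = 0     # "OCRing item:" lines seen so far
    skipped: int = 0     # "Skipped ... item:" lines seen so far
    finished: bool = False

    def feed(self, line: str) -> None:
        """Update the counters from a single log line."""
        match = STARTED.search(line)
        if match:
            self.total = int(match.group(1))
        elif SKIPPED.search(line):
            self.skipped += 1
        elif ITEM.search(line):
            self.started += 1
        elif FINISHED.search(line):
            self.finished = True

    def remaining(self) -> int:
        """Items that have neither started OCR nor been skipped."""
        return max(self.total - self.started - self.skipped, 0)

    def summary(self) -> str:
        return (f"{self.started}/{self.total} started, "
                f"{self.skipped} skipped, {self.remaining()} not yet started")


def follow(path: str, poll_seconds: float = 1.0):
    """Yield lines from the file as they are appended (similar to tail -f)."""
    with open(path, encoding="utf-8", errors="replace") as log:
        while True:
            line = log.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)


def watch(path: str) -> None:
    """Print a progress line on every change until OCR reports it is finished."""
    progress = OcrProgress()
    for line in follow(path):
        before = progress.summary()
        progress.feed(line)
        if progress.summary() != before:
            print(progress.summary())
        if progress.finished:
            print("OCR finished")
            break
```

Calling `watch(...)` with the path to your case log would print a new line each time the counters change. It reads the file from the beginning, so it also works if you start it while an OCR job is already running.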

As for the second question, about preserving the temporary files generated during OCRing: it looks like a risky operation to me, and if one is not careful enough, it may produce errors that would be very hard to find. Fortunately, it shouldn't be needed once we extend Intella so that it re-applies OCRed text to duplicate items discovered when new sources are added. This is already on our radar.

Bryan La Rock · April 19, 2019

Thanks so much, Lukasz!

In my head, I had replied to this. My apologies that I apparently failed to reply in reality. This is great info and extremely helpful. I'm running an OCR job now and this is giving me the info I need.

Thanks again!

Bryan
