OCR Progress & Temp Files


Bryan La Rock

Hi!

I have two questions regarding OCR. First, is there an easy way to track progress and see how many documents remain to be OCRed? I usually run OCR via a command-line script immediately after processing (using a Task file). The command output simply says "Post-processing", so I don't know how many OCR candidates were identified. I see that files are being created in the following folder in my current case: .\tmp\ocr-service9057841158103284728. It looks like the final OCR results are placed in the "ocr-results" folder there, so the number of files in it seems to be a good indication of how many files have been OCRed so far. I just don't know how many files are still waiting to be processed.

Also, I notice that when OCR finishes, this "ocr-results" folder is immediately deleted. Is there any way to prevent this? I like to keep OCR results for future use. Sometimes we need to ingest new data that contains a lot of duplicates of files that have already been OCRed. It would be fantastic to be able to import the existing OCR results for these rather than having to OCR them all over again.

I'd appreciate any ideas for the above.

Thank you!

Bryan


  • 1 month later...

Hi Bryan,

It's true that the console output when processing tasks could be improved; however, there is another option available. Instead of analyzing the console output, you might prefer to open the case logs and monitor the progress there. Here is a snippet showing OCR starting, progressing, and finishing:

[INFO ] 2019-03-28 13:40:07,100 [CrawlThread] Total page count: 101
[INFO ] 2019-03-28 13:40:07,109 [CrawlThread] Started OCRing 101 items. Using: ABBYY FineReader Engine
[INFO ] 2019-03-28 13:40:07,109 [CrawlThread] Settings:
    Profile: Accuracy
    Export format: Plain text
    Languages: English
    Number of workers: 10
    Detect page orientation: true
    Correct inverted images: true
    Skip OCRed: true
[WARN ] 2019-03-28 13:40:07,115 [CrawlThread] Skipped encrypted content item: 1373
[INFO ] 2019-03-28 13:40:07,116 [OcrServiceProcessor1] OCRing item: 1243
[INFO ] 2019-03-28 13:40:07,116 [OcrServiceProcessor2] OCRing item: 1244
...
[INFO ] 2019-03-28 13:40:32,470 [CrawlThread] Collecting OCR crawl results
[INFO ] 2019-03-28 13:40:32,619 [CrawlThread] Collected 0 records.
[INFO ] 2019-03-28 13:40:32,620 [CrawlThread] Importing OCRed text and extracted entities
[INFO ] 2019-03-28 13:40:32,889 [CrawlThread] Imported OCR text into 150 items.
[INFO ] 2019-03-28 13:40:32,938 [CrawlThread] Updating OCR database
[INFO ] 2019-03-28 13:40:33,182 [CrawlThread] Finished OCR. Total time: 0:26. Items processed: 99

You could of course monitor the entire log, or use a command-line tool such as grep to filter it live with regular expressions of your choosing. That way you see only the information about the OCR process itself.
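If you would rather have a number than a stream of matching lines, here is a minimal Python sketch of the same idea. It is only an illustration, not an Intella tool: the patterns are taken from the log snippet above and may need adjusting for other versions, the function name summarize_ocr_progress is made up for this example, and the path to the case log is something you have to supply yourself. Note that an "OCRing item" line is written when a worker picks an item up, not when it finishes, so the remaining count is only an estimate.

```python
import re
import sys

# Patterns derived from the log snippet in this thread (assumption: other
# versions or OCR engines may word these lines differently).
STARTED = re.compile(r"Started OCRing (\d+) items")
ITEM = re.compile(r"\[OcrServiceProcessor\d+\] OCRing item: (\d+)")
SKIPPED = re.compile(r"Skipped .*item: (\d+)")
FINISHED = re.compile(r"Finished OCR\..*Items processed: (\d+)")


def summarize_ocr_progress(lines):
    """Summarize the most recent OCR run found in an iterable of log lines."""
    total = None
    picked_up = set()   # item ids that a worker has started on
    skipped = set()     # item ids the log reports as skipped
    processed = None    # final count, known only once the run has finished

    for line in lines:
        match = STARTED.search(line)
        if match:
            # A new OCR run begins: forget counts from any earlier run.
            total = int(match.group(1))
            picked_up.clear()
            skipped.clear()
            processed = None
            continue
        match = ITEM.search(line)
        if match:
            picked_up.add(int(match.group(1)))
            continue
        match = SKIPPED.search(line)
        if match:
            skipped.add(int(match.group(1)))
            continue
        match = FINISHED.search(line)
        if match:
            processed = int(match.group(1))

    not_yet_started = None
    if total is not None:
        not_yet_started = max(total - len(picked_up) - len(skipped), 0)

    return {
        "total": total,
        "picked_up": len(picked_up),
        "skipped": len(skipped),
        "not_yet_started": not_yet_started,
        "finished": processed is not None,
        "processed": processed,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python ocr_progress.py <path-to-case-log>  (path is hypothetical)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        print(summarize_ocr_progress(log_file))
```

Running it repeatedly while OCR is in progress gives a rough "picked up so far versus total" figure. Once the "Finished OCR" line appears, the processed count from the log itself is the number to trust.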

As for the second question, about preserving the temporary files generated during OCR: that looks like a risky operation to me, and if one is not careful enough, it may produce errors that would be very hard to track down. Fortunately, it shouldn't be needed once we extend Intella so that it re-applies OCRed text to duplicate items discovered when new sources are added. This is already on our radar.
