Bryan La Rock Posted February 28, 2019

Hi! I have two questions regarding OCR.

First, is there any easy way to keep track of progress and see how many docs remain to be OCRed? I usually run OCR via a command-line script immediately after processing (using a Task file). The command output simply says "Post-processing", so I don't know how many OCR candidates were identified. I see that files are being created in the following folder in my current case: .\tmp\ocr-service9057841158103284728. It looks like the final OCR results are being placed in the "ocr-results" folder there, so the file count in that folder seems to be a good indication of how many files have been OCRed so far. I just don't know how many files are still waiting to be processed.

Also, I notice that when OCR finishes, this "ocr-results" folder is immediately deleted. Is there any way to prevent this? I like to keep OCR results for future use. Sometimes we need to ingest new data that contains a lot of duplicates of files already OCRed. It would be fantastic to be able to import the existing OCR results for these rather than OCR them all over again.

I'd appreciate any ideas for the above. Thank you!

Bryan
ŁukaszBachman Posted March 28, 2019

Hi Bryan,

It's true that the command-line output when processing tasks could be improved. However, there is another option available. Instead of analyzing the output in the console, you might prefer to open the case logs and monitor the progress there. Here is a snippet showing OCR starting, progressing and finishing:

[INFO ] 2019-03-28 13:40:07,100 [CrawlThread] Total page count: 101
[INFO ] 2019-03-28 13:40:07,109 [CrawlThread] Started OCRing 101 items. Using: ABBYY FineReader Engine
[INFO ] 2019-03-28 13:40:07,109 [CrawlThread] Settings:
    Profile: Accuracy
    Export format: Plain text
    Languages: English
    Number of workers: 10
    Detect page orientation: true
    Correct inverted images: true
    Skip OCRed: true
[WARN ] 2019-03-28 13:40:07,115 [CrawlThread] Skipped encrypted content item: 1373
[INFO ] 2019-03-28 13:40:07,116 [OcrServiceProcessor1] OCRing item: 1243
[INFO ] 2019-03-28 13:40:07,116 [OcrServiceProcessor2] OCRing item: 1244
...
[INFO ] 2019-03-28 13:40:32,470 [CrawlThread] Collecting OCR crawl results
[INFO ] 2019-03-28 13:40:32,619 [CrawlThread] Collected 0 records.
[INFO ] 2019-03-28 13:40:32,620 [CrawlThread] Importing OCRed text and extracted entities
[INFO ] 2019-03-28 13:40:32,889 [CrawlThread] Imported OCR text into 150 items.
[INFO ] 2019-03-28 13:40:32,938 [CrawlThread] Updating OCR database
[INFO ] 2019-03-28 13:40:33,182 [CrawlThread] Finished OCR. Total time: 0:26. Items processed: 99

You could of course monitor the entire log, or use a command-line program to grep its contents live for regular expressions of your choosing. That way you see only the information about the OCR process itself.

As for the second question, about preserving the temporary files generated during OCRing: it looks like a risky operation to me, and if one is not careful enough it may produce errors that would be very hard to find.
Fortunately, it shouldn't be needed once we extend Intella so that it re-applies OCRed text to duplicated items discovered when new sources are being added. This is already on our radar.
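
For anyone who would rather script the log monitoring suggested above than grep by hand, here is a minimal Python sketch. It is not part of Intella. It assumes the log lines look exactly like the snippet in this thread; the names `OcrProgress`, `update` and `summarize` are made up for illustration. Note that an "OCRing item" line only shows that a worker has picked an item up, so the remaining count is an estimate.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set

# Patterns are taken from the log snippet in this thread (an assumption:
# other Intella versions may word these messages differently).
STARTED = re.compile(r"Started OCRing (\d+) items")
ITEM = re.compile(r"\bOCRing item: (\d+)")
SKIPPED = re.compile(r"Skipped encrypted content item: (\d+)")
FINISHED = re.compile(r"Finished OCR\..*Items processed: (\d+)")


@dataclass
class OcrProgress:
    total: Optional[int] = None        # from "Started OCRing N items"
    started: Set[int] = field(default_factory=set)  # item ids picked up by workers
    skipped: int = 0                   # encrypted items that were skipped
    processed: Optional[int] = None    # from the final "Finished OCR" line

    def remaining(self) -> Optional[int]:
        """Rough estimate of items not yet picked up; None until the total is known."""
        if self.total is None:
            return None
        return max(self.total - len(self.started) - self.skipped, 0)


def update(progress: OcrProgress, line: str) -> None:
    """Fold one log line into the running progress state."""
    if m := STARTED.search(line):
        progress.total = int(m.group(1))
    elif m := ITEM.search(line):
        progress.started.add(int(m.group(1)))
    elif SKIPPED.search(line):
        progress.skipped += 1
    elif m := FINISHED.search(line):
        progress.processed = int(m.group(1))


def summarize(lines: Iterable[str]) -> OcrProgress:
    """Scan an iterable of log lines and return the accumulated progress."""
    progress = OcrProgress()
    for line in lines:
        update(progress, line)
    return progress
```

To use it, open the case log (wherever it lives for your case) and pass the file object to `summarize`, then print `remaining()`; rerunning it periodically gives a rough progress readout.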
Bryan La Rock Posted April 19, 2019

Thanks so much, Lukasz! In my head, I had replied to this. My apologies that I apparently failed to reply in reality. This is great info and extremely helpful. I'm running an OCR job now and this is giving me the info I need. Thanks again!

Bryan