
Attach PDFs with embedded OCR texts


heffner


Hello Intella-Team,

As far as we know, performing OCR on an item works as follows:

  • Export item for OCR, e.g. an image or PDF...
  • Perform OCR externally, e.g. using OmniPage or ABBYY
  • Import OCR'd items
    • Extract textual info from OCR'd items
    • Index extracted textual data within Intella

We have noticed at least two disadvantages:

  • Opening an "original" PDF will not include the text data from the OCR process
  • Reindexing the case means losing all imported OCR texts

A possible enhancement could be this approach:

  • Append the OCR'd files as an attachment to the item
  • Add an option to open the Attachment instead of the original item

What do you think about this?

Thanks
Stephan


The OCR'd and non-OCR'd documents are already linked within Intella, as they both have the same MD5 value (used when exporting for the OCR process). If you want to identify them, you can simply 'show duplicates' of any OCR'd file and you will see the original non-searchable version.

 

Just curious: why would you open the original non-searchable PDF document when you have the OCR'd version in the case as well?

 

Any searches you run will ignore the non-searchable duplicate but will cover the OCR'd version, so any hits you get will be relevant. If you want to include the original version in any final reports, you can do so by including the duplicates using the method above.


  • 3 weeks later...

Hi heffner, AdamS,

 

I tried to follow some of the comments you made, but I'm struggling to understand what you want to achieve. Let me outline a simple test scenario here; please tell me where my reasoning does not match your expectations.

 

Let's say that my case is composed of a single data source which is a folder with a set of loose PDFs, like:

D:\evidence\sample1.pdf -> Item with ID 2

D:\evidence\sample2.pdf -> Item with ID 4

 

Both contain scanned invoices, so I want to OCR them to make them searchable. I index my case and end up with a few items in it (5, if I'm counting right). I then export my two PDFs (item #2 and item #4) to another folder and use external OCR software to get their textual content in separate TXT files.

Then I use the "Import OCRed text" feature (NOT indexing the TXT files!) and end up with the same number of items in my case, but items #2 and #4 now have additional text content that is visible and searchable.

 

A few things to note here:

  1. The original evidence files (PDFs) have not been modified.
  2. OCRed text is appended to the existing text of those items. My PDFs did not contain any text before (they are scanned invoices), so the OCRed text becomes the entire content of these items. Had they contained text, the previous text and the OCRed text would have been merged.
  3. No additional items are added to my case, so we still end up with 5 items in it.
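The three points above can be pictured with a small toy model. This is only a sketch of the behaviour described in this post, not Intella's actual implementation: the `Item` class and the `import_ocr_text` function are hypothetical names made up for illustration, and only the two PDF items from the example are modelled.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a case item; not an Intella class."""
    item_id: int
    path: str
    text: str = ""


def import_ocr_text(items: dict[int, Item], ocr_texts: dict[int, str]) -> None:
    """Append OCRed text to existing items without creating new items."""
    for item_id, ocr_text in ocr_texts.items():
        item = items[item_id]
        if item.text:
            # Existing text is kept and the OCRed text is merged in after it.
            item.text = item.text + "\n" + ocr_text
        else:
            # A scanned PDF had no text, so the OCRed text becomes its content.
            item.text = ocr_text


# Only the two PDFs from the example; the other items of the case are omitted.
case = {
    2: Item(2, r"D:\evidence\sample1.pdf"),
    4: Item(4, r"D:\evidence\sample2.pdf"),
}
count_before = len(case)

# Placeholder OCR output, invented for this sketch.
import_ocr_text(case, {2: "placeholder OCR text A", 4: "placeholder OCR text B"})
count_after = len(case)
```

In this model the item count is the same before and after the import, and the files on disk are never touched; only the text associated with items #2 and #4 changes.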

 

@heffner, let me address the two disadvantages you brought up.

 

> Opening an "original" PDF will not include the text data from the OCR process

I'm not sure what you mean by "Opening" here. Opening in Intella? If so, you will see the OCRed text in the Previewer. It is also searchable, so it seems to me that you have everything you need.

Indeed, the original PDFs in the D:\evidence folder have not been modified, but I doubt that any software used in legal cases would modify them.

 

> Reindexing the Case means losing all imported OCR texts

Yes, that is correct. We designed the process this way for a mix of technical and non-technical reasons. Is this a serious challenge for you? Please note that tags survive reindexing, so if you ever need to reindex a case containing OCRed items, you can use tags to achieve consistent results.

 

@AdamS, I couldn't understand your remarks about duplicates. Going back to my example, if you had 2 PDFs in the evidence folder, then after importing the OCRed text you still have two. No additional items are created, so there are no multiple representations of the same item. If you were to index the OCRed TXT documents, or even PDFs with embedded OCRed text produced by OCR software, the hashes would most likely not match, because the binary contents of these files are different. I wasn't sure what your intention was here, hence my request for clarification.
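To illustrate the point about hashes, here is a minimal Python sketch using the standard `hashlib` module. The byte strings are made-up stand-ins for file contents, not real PDFs, and `md5_hex` is a helper name chosen for this example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


# Invented stand-ins for the binary contents of two files.
original_pdf = b"%PDF-1.4 scanned image only"
ocr_pdf = original_pdf + b" plus an embedded text layer"

# Identical bytes always give the same hash.
same_content_matches = md5_hex(original_pdf) == md5_hex(bytes(original_pdf))

# Any change to the bytes, such as an added text layer, changes the hash.
ocr_version_matches = md5_hex(original_pdf) == md5_hex(ocr_pdf)
```

Here `same_content_matches` is true and `ocr_version_matches` is false, which is why a PDF rewritten by OCR software would not normally be detected as a duplicate of the original on the basis of its MD5 value.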

 

Thanks,

Łukasz
