Chinese Documents


Hi Guys,


We are just about to start a case where a significant number of the documents are in Mandarin as well as English. This will include electronic documents and about 27,000 sheets of paper, which we plan to OCR into PDFs with the text overlaid on the page images.


We are going to be provided with two sets of search terms, one in English and the other stated to be in Simplified Chinese.


Is any of the above likely to cause Intella any problems, and/or do we need to configure our systems differently to handle the search and review (the plan is Team for indexing, then Connect 1.8.2 for the review)?


Any help or advice would be appreciated. 



Hi Jason!


I don't foresee any issues with this case. As I'm sure you already know, Intella supports both OCRed text and CJK languages, so I think it is a perfect match for your case!

I believe that most of the time you will be working with Keyword Lists, so you'll be glad to know that any type of query supported by Intella will also work for CJK languages.


Before you start, though, I would like to point you to a few relevant parts of our Intella User Manual, which contains everything you need to avoid any bumps in the road.


Chapter 11: Optical Character Recognition


It will guide you through the process of importing OCRed documents back into Intella or setting up an OCR server. Most of our users use an external tool to OCR documents and then import the text back, which I understand is your preferred way of doing it as well. Just make sure that your tool produces valid UTF-8 output and one file per document it processes.
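If you want to check the UTF-8 requirement before importing, a quick script along these lines could help. This is only a minimal sketch and not part of Intella: the function name `find_invalid_utf8` and the `*.txt` pattern are assumptions for illustration, so adjust them to whatever your OCR tool actually writes.

```python
from pathlib import Path


def find_invalid_utf8(folder, pattern="*.txt"):
    """Return (file name, error message) for each file that is not valid UTF-8.

    `folder` and `pattern` are illustrative assumptions; point them at
    the directory and file extension your OCR tool produces.
    """
    problems = []
    for path in sorted(Path(folder).glob(pattern)):
        try:
            # Strict decoding raises UnicodeDecodeError on any invalid byte.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append((path.name, str(exc)))
    return problems
```

Calling `find_invalid_utf8("ocr_output")` (a hypothetical folder name) returns an empty list when every text file decodes cleanly. Files saved in a legacy Chinese encoding such as GBK will usually show up in the list, since their bytes rarely form valid UTF-8 sequences.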


5 Frequently Asked Questions


The FAQ contains two important entries regarding CJK support. Specifically:

  1. Why do Chinese/Japanese/Korean queries give imprecise search results?
    Explains in detail how Intella handles CJK languages by using bi-grams. This knowledge can be very handy if you have any doubts about whether the results of your queries are correct.


  2. How can I print and export PDF reports with characters of my language?
    Explains how to supply custom fonts to Intella so that CJK characters are properly rendered when exporting.
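To give a feel for why bi-grams can produce imprecise hits (the first FAQ entry above), here is a small sketch of bi-gram tokenization in general. It is not Intella's actual implementation: `cjk_bigrams` and `bigram_match` are hypothetical helpers, and the "all bi-grams must be present" rule is a deliberately naive assumption. The FAQ remains the reference for how Intella really treats these queries.

```python
def cjk_bigrams(text):
    """Split a string into overlapping two-character tokens (bi-grams)."""
    if len(text) < 2:
        # A single character (or empty string) cannot form a bi-gram.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_match(query, document):
    """Naive matcher (an assumption for illustration only): a hit means
    every bi-gram of the query appears somewhere in the document."""
    doc_tokens = set(cjk_bigrams(document))
    return all(token in doc_tokens for token in cjk_bigrams(query))


# The query is split into the tokens 中国 and 国人.
query = "中国人"
# The document contains 中国 and, inside 外国人, also 国人,
# but it never contains the exact string 中国人.
document = "中国很大，外国人很多"

print(cjk_bigrams(query))
print(query in document)
print(bigram_match(query, document))
```

In this example the exact phrase is absent from the document, yet the naive bi-gram matcher still reports a hit because both tokens occur separately. That is the kind of effect to keep in mind when a Chinese keyword returns more results than expected.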


