Neil G

CLI processing of text data


Hi. I'm aware that Intella v2.2+ allows users to export all items as text files using the -exportText parameter. Is there a way to add a further parameter so that only text in a specific language is exported? For example, I may have a case with 10,000,000 text items of which only 5,000 are Spanish, and I want to export just those 5,000 using -exportText and have them translated by a third-party provider. Is it possible to add this extra layer specifying a language before running -exportText?

Thanks


Hi Neil,

The user manual has more details about using the CLI feature. You could try some of the following options mentioned in the manual. They would allow you to filter on any facet, including the Language facet:
> 27.2 Command-line arguments
> -et, -exportText – Export the extracted texts to a folder. The options -matchQuery, -savedSearch, -deduplicate and -exportDir can be used to control this operation. The resulting files will be named based on their item ID, e.g. 123.txt.
> -ss, -savedSearch [File] – Can be used to limit the exported items to those that match the specified saved search. The argument is the path to an XML file holding the saved search. Such a file can be exported from the Saved Searches facet. This allows for using other facets, such as the Date and Type facets, and to combine queries.
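
As a rough illustration of how the options quoted above might be combined, here is a minimal Python sketch that only assembles an argument list. The flags `-exportText`, `-savedSearch` and `-exportDir` come from the manual excerpt; everything else is an assumption. In particular, `base_command` is a hypothetical placeholder for however you normally launch Intella against your case (the excerpt does not show that part), and the sketch assumes `-exportDir` takes a folder path as its argument.

```python
from typing import List


def build_export_args(base_command: List[str],
                      saved_search_xml: str,
                      export_dir: str) -> List[str]:
    """Append the text-export options to an existing Intella command line.

    base_command is hypothetical: it stands for the executable plus whatever
    arguments you already use to open the case. Only the three flags below
    are taken from the manual excerpt.
    """
    return [
        *base_command,
        "-exportText",                      # export extracted text, one <itemID>.txt per item
        "-savedSearch", saved_search_xml,   # XML exported from the Saved Searches facet
        "-exportDir", export_dir,           # assumed to take the target folder as its argument
    ]
```

The saved search would be one built in the GUI on the Language facet (for example, Spanish only) and exported to XML, so the language filter lives in that file rather than in a separate command-line switch.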


Neil, it would be a simple matter to use Intella's built-in ability to detect foreign-language documents, isolate the 5,000 Spanish documents, then export only those documents into the load file.


@AdamS and @jon.pearse, first of all, many thanks for responding to my query. I'd neglected this project a little, but I'm looking at it again and I'm almost there. Just one question, though.

I have used an AI translation provider to translate the documents and have the translated files named by their ItemID, ready for importing via the --importText CLI option. The import itself works fine; however, the content isn't what I expected when I verify it in Intella. One of the .txt files that has been imported contains the phrase below:

"Dans les années 2000, la société pharmaceutique"

However, when this is imported via the --importText CLI option, it reads as follows in the 'Imported Text' tab:

"Dans les ann es 2000, la soci t pharmaceutique"

It would appear that accented characters such as 'é' aren't being imported correctly and are being replaced by whitespace. I'd imagine this is an encoding issue. Is there anything I can do to address this? I'm conscious that if a reviewer searches on terms containing these characters, the search may not return hits. For example, a search on 'société pharmaceutique' would return nothing, which is accurate for the text as imported but wrong for the actual content of the document.
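
One way to test the encoding theory is to normalise the translated files to UTF-8 before importing them. The Python sketch below is only that, a sketch under assumptions: it guesses that the translation provider delivered files in a single-byte Windows encoding (`cp1252`) and that the importer expects UTF-8. Neither is confirmed anywhere in this thread. The function names `to_utf8` and `convert_folder` are made up for illustration.

```python
from pathlib import Path


def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the text re-encoded as UTF-8.

    Bytes that already decode as UTF-8 are returned unchanged. Anything else
    is decoded with the fallback encoding (an assumption, adjust as needed).
    Decoding is strict, so a wrong fallback raises instead of silently
    damaging the text.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


def convert_folder(folder: str, fallback: str = "cp1252") -> int:
    """Rewrite every .txt file in folder as UTF-8; return how many changed."""
    changed = 0
    for path in Path(folder).glob("*.txt"):
        raw = path.read_bytes()
        fixed = to_utf8(raw, fallback)
        if fixed != raw:
            path.write_bytes(fixed)
            changed += 1
    return changed
```

Running `convert_folder` on a copy of the translated folder and re-importing a single item should show quickly whether 'é' then survives; if it does not, the files were probably not the problem and the fallback guess can be ruled out.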

