CLI processing of text data

ngreenou · October 4, 2019

Hi. I'm aware that Intella v2.2+ allows users to export all items as text files using the -exportText parameter. Is there a way to add a further parameter prior to this so that only text of a specific language is exported? For example, I may have a case with 10,000,000 text items but only 5000 are Spanish and I want to selectively export these 5000 using -exportText and translate them using a third-party provider. Is it possible to add this extra layer specifying a language before running -exportText?

Thanks

jon.pearse · October 8, 2019

Hi Neil,

The user manual has more details about using the CLI feature. You could try some of the following options mentioned in the manual. They would allow you to use any facet, including the Language facet:
> 27.2 Command-line arguments
> -et, -exportText – Export the extracted texts to a folder. The options -matchQuery, -savedSearch, -deduplicate and -exportDir can be used to control this operation. The resulting files will be named based on their item ID, e.g. 123.txt.
> -ss, -savedSearch [File] – Can be used to limit the exported items to those that match the specified saved search. The argument is the path to an XML file holding the saved search. Such a file can be exported from the Saved Searches facet. This allows for using other facets, such as the Date and Type facets, and to combine queries.

AdamS · October 11, 2019

Neil, it would be a simple matter to use Intella's built-in ability to detect foreign-language documents, isolate the 5000 Spanish documents, then export only those documents into the load file.

ngreenou · December 2, 2019

@AdamS and @jon.pearse, firstly, many thanks for responding to my query above. I'd neglected this project a little, but I'm looking at it again and I'm almost there. Just one query, though.

I have used an AI translation provider to translate the documents and have the translated files named by their ItemID, ready for importing via the --importText CLI option. The import itself works fine; however, the content isn't what I expected when I verify it in Intella. One of the .txt files that was imported contains the phrase below:

"Dans les années 2000, la société pharmaceutique"

However, once this file is imported via the --importText CLI option, the 'Imported Text' tab shows it as:

"Dans les ann es 2000, la soci t pharmaceutique"

It appears that accented characters such as 'é' aren't being imported correctly and are being replaced by whitespace. I'd imagine this is an encoding issue. Is there anything I can do to address this? I'm conscious that if a reviewer searches on a term containing such characters, it may not return hits. For example, a search for 'société pharmaceutique' would return nothing, which is technically correct for the text as imported but wrong with respect to the actual document (if that makes sense).
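
One way to rule out an encoding mismatch before re-importing is to normalise every translated .txt file to a single known encoding. The snippet below is only a rough sketch under assumptions that are not confirmed anywhere in this thread: that the import reads files as UTF-8, and that the translation provider delivered files in Windows-1252 (a common default for Western European text). The helper names `to_utf8` and `reencode_folder` and the folder name in the usage note are made up for illustration.

```python
from pathlib import Path


def to_utf8(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes as UTF-8 when valid, otherwise with the fallback encoding.

    ASSUMPTION: non-UTF-8 files are Windows-1252; change `fallback` if the
    translation provider uses something else.
    """
    try:
        # "utf-8-sig" also accepts (and strips) a leading byte-order mark.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Bytes that are undefined in the fallback code page become U+FFFD.
        return raw.decode(fallback, errors="replace")


def reencode_folder(folder: str, fallback: str = "cp1252") -> int:
    """Rewrite every *.txt file in `folder` as UTF-8; return how many changed."""
    changed = 0
    for path in Path(folder).glob("*.txt"):
        raw = path.read_bytes()
        encoded = to_utf8(raw, fallback).encode("utf-8")
        if encoded != raw:
            path.write_bytes(encoded)
            changed += 1
    return changed
```

Running something like `reencode_folder("translated_txt")` on a copy of the translated files (the folder name is hypothetical) and then repeating the import on a single item should show whether 'é' now survives. If it does not, the import may expect a different encoding, which is worth checking against the user manual.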
