We have a case in which we've identified PII was stolen.  I have the e-mails and documents in Intella and I have used the (experimental) regular expression searches to identify social security numbers and potential credit card numbers contained within the items.


My next step is to dedupe the results.  Not deduplicate items, but regular expression hits.  For instance, the same social security number may appear in multiple different e-mails, but that's only one person to notify that their information has been breached.


My thought is to simply export a list of the text that matches the regular expression, but I do not see a way to do this.  I would then remove duplicate SSNs using an external tool, like Excel.  Is it possible to export the words that hit on a regular expression search?


My fallback option is to export the entire index of words and use grep (or something similar) to pull out the SSNs and credit card numbers and then dedupe that using another tool.  I have not been able to determine a way to do this.  Does anyone know how to pull the words out of the index?


Thanks for your help.

Hello Buldawg,


You can use the Content Analysis facet, which contains all credit card and Social Security numbers found in this case. These values are extracted automatically during indexing, so there is no need to run the Content Analysis process to identify them.


The values can be exported to a CSV file using the "Export values" option in the context menu.


Hope this helps.

Jon and Alex,

Thank you both.  I cannot believe I missed the export words option.


I did look at the Content Analysis facet, but the problem I have with it is that it does not take into account the OCR'd files I've imported.  It showed one fewer SSN hit than I had before I imported the OCR'd files, and it still shows that same count after the import, which is about 200 fewer than the regular expression search.
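
For anyone taking the fallback route of exporting text (including the OCR'd text) and deduplicating outside Intella, here is a minimal Python sketch of that step.  It is not an Intella feature.  The two patterns and the function names `luhn_ok` and `unique_hits` are my own assumptions, so swap in the regular expressions you actually searched with.  It normalizes each hit before deduplicating, so "123 45 6789" and "123-45-6789" count as one person, and it uses the standard Luhn checksum to drop digit runs that cannot be card numbers.

```python
import re

# Assumed, deliberately loose patterns; replace with the expressions used in the case.
SSN_RE = re.compile(r"\b(\d{3})[- ]?(\d{2})[- ]?(\d{4})\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def unique_hits(text: str):
    """Return (unique SSN-like values, unique card-like values) found in text."""
    # Normalize every SSN hit to NNN-NN-NNNN so formatting variants collapse.
    ssns = {"-".join(m.groups()) for m in SSN_RE.finditer(text)}
    cards = set()
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())  # strip spaces and dashes
        if luhn_ok(digits):
            cards.add(digits)
    return ssns, cards
```

Run `unique_hits` over the exported text and write the two sets out for review.  The SSN pattern will also match any bare nine-digit number, and two numbers separated only by a space can run together in the card pattern, so treat the output as candidates to verify rather than a final notification list.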

