Export word list or search hits

Bulldawg · October 11, 2016

We have a case in which we've identified that PII was stolen. I have the e-mails and documents in Intella, and I have used the (experimental) regular expression search to identify Social Security numbers and potential credit card numbers contained within the items.

My next step is to dedupe the results: not the items, but the regular expression hits. For instance, the same Social Security number may appear in several different e-mails, but that is still only one person to notify that their information has been breached.

My thought is to simply export a list of the text that matches the regular expression, but I do not see a way to do this. I would then remove duplicate SSNs using an external tool, like Excel. Is it possible to export the words that hit on a regular expression search?

My fallback option is to export the entire index of words, use grep (or something similar) to pull out the SSNs and credit card numbers, and then dedupe the result using another tool. I have not been able to determine a way to do this. Does anyone know how to pull the words out of the index?

Thanks for your help.
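
The grep-and-dedupe fallback described in this post could be sketched in Python roughly as follows. This is only an illustration: the two patterns are assumptions (they are not the expressions Intella uses), and the sample lines are made up in place of real exported text.

```python
import re

# Assumed patterns, not Intella's built-in ones: SSNs written as 123-45-6789
# and 13 to 16 digit card numbers with optional spaces or dashes between digits.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def unique_hits(lines, pattern):
    """Return the distinct matches of `pattern`, ignoring spaces and dashes."""
    seen = set()
    for line in lines:
        for match in pattern.findall(line):
            # Strip separators so differently formatted copies collapse to one value.
            seen.add(re.sub(r"[ -]", "", match))
    return sorted(seen)


# Made-up sample text standing in for exported content.
sample = [
    "Employee SSN 123-45-6789 on file",
    "Re: SSN 123-45-6789 and card 4111 1111 1111 1111",
    "Second employee: 987-65-4321",
]
ssns = unique_hits(sample, SSN_RE)
cards = unique_hits(sample, CARD_RE)
print(ssns)
print(cards)
```

Collecting the normalized matches in a set means the same number found in several e-mails is reported once, which is the per-person list needed for notification.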

jon.pearse · October 11, 2016

Hi Bulldawg,

All words can be exported from a case by selecting Export from the file menu and then the Words option. You will have the option to export all of the words in the case.

Jon

Alex · October 12, 2016

Hello Bulldawg,

You can use the Content Analysis facet, which contains all credit card and Social Security numbers found in the case. These values are extracted automatically during indexing, so there is no need to run the Content Analysis process to identify them.

The values can be exported to a CSV file using the "Export values" option in the context menu.

Hope this helps.

Bulldawg · October 12, 2016

Jon and Alex,

Thank you both. I cannot believe I missed the export words option.

I did look at the Content Analysis facet, but the problem I have with it is that it does not take into account the OCR'd files I've imported. Before I imported the OCR'd files, it showed one fewer SSN hit than my regular expression search. After the import it still shows that same count, which is now about 200 fewer than the regular expression search.
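
One way to see which values account for a gap like this is to compare the two result lists as sets. The sketch below assumes both lists have already been pulled into Python as plain strings (for example from the facet's CSV export and from a list of regular expression matches); the function names and sample values are hypothetical.

```python
def normalize(value):
    """Keep digits only so '123-45-6789' and '123 45 6789' compare equal."""
    return "".join(ch for ch in value if ch.isdigit())


def compare_hits(facet_values, regex_values):
    """Return (only_in_facet, only_in_regex) as sorted lists of digit strings."""
    facet = {normalize(v) for v in facet_values}
    regex = {normalize(v) for v in regex_values}
    return sorted(facet - regex), sorted(regex - facet)


# Made-up sample values standing in for the two exports.
facet_values = ["123-45-6789", "222-33-4444"]
regex_values = ["123 45 6789", "987-65-4321", "555-12-3456"]
only_facet, only_regex = compare_hits(facet_values, regex_values)
print("Only in facet export:", only_facet)
print("Only in regex hits:", only_regex)
```

Values that appear only in the regular expression list are the ones to trace back to their source items, for instance to confirm that they come from the OCR'd text.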

GeorgeP · March 14, 2022

Hello all,

I'm having a similar issue. I'm looking for PII in a large data set using Content Analysis, but I'm getting tons of false positives on SSNs and credit card numbers. It looks like most of them come from raw data or metadata. Is there a way to modify the search query in Content Analysis to cut down on the noise? Or, if you use a separate query, can you share it?

Thanks,
George
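
One common way to cut this kind of noise outside the tool is to post-filter exported candidates with structural checks. The sketch below is not a Content Analysis setting; it assumes the candidate values have been exported as strings, and it applies the Luhn checksum to card numbers and the basic SSN numbering rules (area not 000, 666, or 900-999; group not 00; serial not 0000). The function names are made up for illustration.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ssn_plausible(ssn):
    """Return True if `ssn` has nine digits and no never-issued parts."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"


# Made-up candidates standing in for exported hits.
card_candidates = ["4111 1111 1111 1111", "1234 5678 9012 3456"]
ssn_candidates = ["123-45-6789", "000-12-3456", "123-45-0000"]
likely_cards = [c for c in card_candidates if luhn_valid(c)]
likely_ssns = [s for s in ssn_candidates if ssn_plausible(s)]
print(likely_cards)
print(likely_ssns)
```

These checks only discard values that cannot be real. A number that passes may still be a false positive, so the surviving hits still need review.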
