Export word list or search hits


Bulldawg

We have a case in which we've identified PII was stolen.  I have the e-mails and documents in Intella and I have used the (experimental) regular expression searches to identify social security numbers and potential credit card numbers contained within the items.

 

My next step is to dedupe the results: not the items themselves, but the regular expression hits.  For instance, the same social security number may appear in multiple different e-mails, but that's only one person to notify that their information has been breached.

 

My thought is to simply export a list of the text that matches the regular expression, but I do not see a way to do this.  I would then remove duplicate SSNs using an external tool, like Excel.  Is it possible to export the words that hit on a regular expression search?

 

My fallback option is to export the entire index of words and use grep (or something similar) to pull out the SSNs and credit card numbers and then dedupe that using another tool.  I have not been able to determine a way to do this.  Does anyone know how to pull the words out of the index?
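For reference, the grep-and-dedupe step could be sketched roughly like this in Python. This is only a sketch under assumptions: the input is plain text lines from some export, the pattern covers only the common 123-45-6789, 123 45 6789 and 123456789 forms, and `unique_ssns` is a made-up helper name, not anything from Intella. If the word index splits numbers at the hyphens, the full SSN will not appear as a single word and this would need to run on exported text instead.

```python
import re

# Assumed pattern: nine digits, optionally separated as 3-2-4 by a hyphen
# or a space, and not embedded in a longer run of digits.
SSN_RE = re.compile(r"(?<!\d)(\d{3})[- ]?(\d{2})[- ]?(\d{4})(?!\d)")


def unique_ssns(lines):
    """Return the distinct SSN-like values in `lines`, normalized to NNN-NN-NNNN."""
    seen = set()
    for line in lines:
        for area, group, serial in SSN_RE.findall(line):
            # Normalizing first means 123456789 and 123-45-6789 count once.
            seen.add(f"{area}-{group}-{serial}")
    return sorted(seen)


# Hypothetical sample lines, standing in for exported text.
sample = [
    "SSN on file: 123-45-6789",
    "applicant 123456789 resubmitted",
    "ssn 987 65 4321, phone 555-1234",
]
print(unique_ssns(sample))
```

The same SSN written with and without separators collapses to one entry, which is the count that matters for notifications.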

 

Thanks for your help.

Hello Bulldawg,

 

You can use the Content Analysis facet, which contains all credit card and social security numbers found in the case. These values are extracted automatically during indexing, so there is no need to run the Content Analysis process to identify them.

 

The values can be exported to a CSV file using the "Export values" option in the context menu.

 

Hope this helps.

Jon and Alex,

Thank you both.  I cannot believe I missed the export words option.

 

I did look at the Content Analysis facet, but the problem I have with it is that it does not take into account the OCR'd files I've imported.  It shows one fewer SSN hit than I had before I imported the OCR'd files, and it still shows that same count after the import, which is about 200 fewer than the regular expression search.

  • 5 years later...
On 10/12/2016 at 6:16 AM, Alex said:

Hello Bulldawg,

 

You can use the Content Analysis facet, which contains all credit card and social security numbers found in the case. These values are extracted automatically during indexing, so there is no need to run the Content Analysis process to identify them.

 

The values can be exported to a CSV file using the "Export values" option in the context menu.

 

Hope this helps.

 

On 10/11/2016 at 8:20 PM, jon.pearse said:

Hi Bulldawg,

 

All words can be exported from a case by selecting Export from the file menu and then the Words option. You will have the option to export all of the words in the case.

 

Jon

Hello all,

I'm having a similar issue. I'm looking for PII in a large data set using Content Analysis, but I'm getting tons of false positives on social security and credit card numbers. Most of them seem to come from raw data or metadata. Is there a way to modify the query that Content Analysis uses to cut down on the noise? Or, if you use a separate query, could you share it?

Thanks,
George
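One way to cut down this kind of noise, independent of how Content Analysis finds its hits, is to post-filter the exported values with structural checks: the Luhn checksum for card numbers, and the never-issued ranges for SSNs (area 000, 666 or 900-999, group 00, serial 0000). The sketch below assumes the values are already available as plain strings. The helper names and the 13 to 19 digit length window are illustrative assumptions, not part of Intella. Passing these checks only means a value is well formed, not that it belongs to a real person or account.

```python
import re


def luhn_ok(number: str) -> bool:
    """True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:  # assumed card-number length window
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ssn_plausible(ssn: str) -> bool:
    """True if `ssn` has nine digits and avoids never-issued ranges."""
    digits = re.sub(r"\D", "", ssn)
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"


# Hypothetical hits, standing in for exported facet values.
cards = ["4111 1111 1111 1111", "4111111111111112"]
ssns = ["123-45-6789", "000-12-3456", "987-65-4321"]
print([c for c in cards if luhn_ok(c)])
print([s for s in ssns if ssn_plausible(s)])
```

Random digit runs from raw data or metadata fail the Luhn check about nine times out of ten, so this tends to remove most card-number noise. The SSN rules are weaker and mainly drop obviously invalid values.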
