Bulldawg, posted October 11, 2016
We have a case in which we've identified that PII was stolen. I have the e-mails and documents in Intella, and I have used the (experimental) regular expression searches to identify social security numbers and potential credit card numbers contained within the items.
My next step is to dedupe the results: not the items, but the regular expression hits. For instance, the same social security number may appear in several different e-mails, but that is still only one person to notify that their information has been breached.
My thought is to simply export a list of the text that matches the regular expression, but I do not see a way to do this. I would then remove duplicate SSNs using an external tool, like Excel. Is it possible to export the words that hit on a regular expression search?
My fallback option is to export the entire index of words, use grep (or something similar) to pull out the SSNs and credit card numbers, and then dedupe that using another tool. I have not been able to determine a way to do this. Does anyone know how to pull the words out of the index?
Thanks for your help.
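
The fallback described above (pull the numbers out of exported text, then dedupe) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the two patterns are hypothetical stand-ins, not the expressions Intella uses, and `unique_hits` is a made-up helper name. Whether an exported word list keeps a hyphenated number such as 123-45-6789 as a single token is not confirmed here, so the sketch works on plain text.

```python
import re

# Assumed patterns (not Intella's own): SSNs written with dashes, spaces,
# or no separator, and 13-16 digit runs that may be card numbers.
SSN_RE = re.compile(r"\b(\d{3})[- ]?(\d{2})[- ]?(\d{4})\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def unique_hits(text):
    """Return (ssns, cards) as sorted lists of distinct, normalized hits."""
    # Normalize every SSN to NNN-NN-NNNN so different spellings collapse.
    ssns = {"-".join(m.groups()) for m in SSN_RE.finditer(text)}
    # Strip separators from card candidates before deduplicating.
    cards = {re.sub(r"[ -]", "", m.group()) for m in CARD_RE.finditer(text)}
    return sorted(ssns), sorted(cards)

if __name__ == "__main__":
    sample = "SSN 123-45-6789 here, again as 123 45 6789."
    print(unique_hits(sample))
```

Normalizing before deduplicating matters because the same number written with and without separators would otherwise count as different people to notify.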
jon.pearse, posted October 11, 2016
Hi Bulldawg,
All words can be exported from a case by selecting Export from the File menu and then the Words option. You will have the option to export all of the words in the case.
Jon
Alex, posted October 12, 2016
Hello Bulldawg,
You can use the Content Analysis facet, which contains all credit card and social security numbers found in this case. These values are extracted automatically during indexing, so there is no need to run the Content Analysis process to identify them. The values can be exported to a CSV file using the "Export values" option in the context menu.
Hope this helps.
Bulldawg (author), posted October 12, 2016
Jon and Alex,
Thank you both. I cannot believe I missed the export words option.
I did look at the Content Analysis facet, but the problem I have with it is that it does not take the OCR'd files I've imported into consideration. It is showing one fewer SSN hit than I had before I imported the OCR'd files, and it still shows that same count after the OCR'd files import, which is about 200 fewer than the regular expression search.
GeorgeP, posted March 14, 2022
(Quoting Alex's post of 10/12/2016, 6:16 AM, and jon.pearse's post of 10/11/2016, 8:20 PM.)
Hello all,
I'm having a similar issue. I'm looking for PII in a large data set using Content Analysis, but I'm getting tons of false positives on SSNs and credit card numbers. It looks like most of them are coming from raw data or metadata. Is there a way to modify the search query in Content Analysis to cut down on the noise? Or, if you use a separate query, can you share it?
Thanks,
George
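
One way to cut down this kind of noise outside Intella is to post-filter the exported values. The sketch below is an assumption-laden illustration, not a Content Analysis setting: `luhn_ok` and `ssn_plausible` are hypothetical helper names. It applies the Luhn checksum to card candidates and rejects SSNs with ranges the SSA has never assigned (area 000, 666, or 900-999; group 00; serial 0000).

```python
import re

def luhn_ok(number: str) -> bool:
    """True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in re.sub(r"\D", "", number)]
    if not 13 <= len(digits) <= 19:  # assumed plausible card lengths
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def ssn_plausible(ssn: str) -> bool:
    """True if `ssn` has nine digits and avoids never-assigned ranges."""
    digits = re.sub(r"\D", "", ssn)
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"

if __name__ == "__main__":
    print(luhn_ok("4111 1111 1111 1111"), ssn_plausible("000-12-3456"))
```

Neither check proves a value is real. They only discard strings that cannot be valid, such as arbitrary digit runs from raw data or metadata, so the remaining hits still need review.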