Everything posted by ShaunC

  1. No worries, Marco - good to know the options. I'll need to speak to our case management system provider to see if they can watch folders - I suspect not at this stage. Thanks
  2. Hi all,

     My organisation is not yet using Connect, but it's our planned future state once we have an ICT environment that allows me to use it. Our investigators and legal team use a case management system that is also accessed via web browser.

     My understanding (from having a look at the Connect Reviewer manual) is that exports are handled similarly to Pro in terms of how you want the output files to be treated, and the output is a zip file that is handed off to the browser's download functionality?

     Our case management system supports drag-and-drop from the local filesystem, which leads me to wonder whether it would be possible to drag from Connect and drop directly into the case management system. If so, this would make the task of saving relevant documents into the case management system a bit quicker/easier for individual files.

     I'm very far from a web developer - is this even possible? I just opened a OneDrive tab and an online Outlook tab and tried to drag/drop a file from OneDrive into an email and it didn't work, so maybe it's just not technically possible?
  3. That's odd - I've just done a test and I get all that metadata in the PDF render (v2.6.0.3). Is your "Type" listed as "Calendar/vCalendar File"? My settings are pretty close to default, I'd think (I clicked all the "Configure" boxes and in all of them everything was selected).
  4. If you're only looking at email data, you might consider "agent:" as well. You will get more results, as it also returns hits in document metadata, but you can just limit the search to Type > Communication > Email as well?
  5. This is directly related to my recent post, as it would give me a pretty decent option in that situation to accomplish what I need, but I can see how it would be useful in other situations too. I would love another grouping in the facet which groups the email addresses by domain:

     gmail.com (10,000)
     hotmail.com (7,500)
     hotmail.co.uk (2,222)

     Cheers!
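     In the meantime, a rough way to get the same numbers outside Intella is to export the address list and count domains yourself. A minimal Python sketch, assuming you've exported the facet values to a text file with one email address per line ("receivers.txt" is a made-up name):

        from collections import Counter

        # Count receiver addresses per domain from an exported address list.
        domains = Counter()
        with open("receivers.txt", encoding="utf-8") as f:
            for line in f:
                addr = line.strip().lower()
                if "@" in addr:
                    domains[addr.rsplit("@", 1)[1]] += 1

        # Print domains in the same style as the proposed facet grouping.
        for domain, count in domains.most_common():
            print(f"{domain} ({count:,})")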
  6. If it's simpler to implement than making changes to the main GUI, you could maybe expose the options in their own config file which is read on program execution, like how you can go in and change the debug level in logback.xml (a sketch of what that could look like is below). I daresay most people wouldn't want to play around with it too much - either purely alphabetical, or favourites at the top and then alphabetical, etc. - they'd probably figure out what works best for them in a couple of iterations and never have to touch it again. The other place that could make sense (without having to change the main GUI) would be a tab in Preferences where you could drag and drop the order.
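     Purely for illustration, a minimal Python sketch of the config-file idea - the file name, format and facet names here are all made up, not anything Intella actually reads:

        # Read a user-editable facet order from a simple config file at startup.
        DEFAULT_ORDER = ["Date", "Type", "Email Address", "Tags"]

        def load_facet_order(path="facet-order.conf"):
            try:
                with open(path, encoding="utf-8") as f:
                    order = [ln.strip() for ln in f
                             if ln.strip() and not ln.startswith("#")]
            except FileNotFoundError:
                return DEFAULT_ORDER
            # Anything the user didn't list keeps its default position at the end.
            return order + [x for x in DEFAULT_ORDER if x not in order]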
  7. Hi Brains Trust - just throwing this out there in case someone smarter than I can figure this out, because I've been going over it for an hour and am giving up!

     I have a case with primarily PST files as the source material. We have 6 "subgroups of interest" based on email domains:

     1. Emails To @domain1
     2. Emails From @domain1
     3. Emails To @domain2
     4. Emails From @domain2
     5. Emails To all other domains
     6. Emails From all other domains

     2, 4 and 6 are simple - restrict to Type > Email and use the check boxes in the search "Options" to filter From/Sender on domain1 and tag #2, filter on domain2 and tag #4, then exclude those two tags and the remainder is #6. Likewise 1 and 3 are simple - use the check boxes in the search "Options" to filter To/CC/BCC on domain1 and tag #1, filter on domain2 and tag #3.

     Where it gets tricky is that any given email can be To/CC/BCC multiple domains. Tag groups #1 and #3 will already include emails that were also sent to any other domain as well as the domain of interest (and this is OK, we're happy with this - if we want only that domain we can mess with recipient count, To only, etc.). I can't just exclude #1 and #3 and tag the remainder as #5, as there are emails in #1 and #3 that have also been sent to "other domains".

     The only way I can think of to do this is to use the "Email Address" facet, expand "All Receivers" and manually ctrl-click to select the "all other" addresses. There are two filtering options down the bottom of that pane where I know I can put in, say, @domain1, but that filters to include that domain; I tried "!@domain1" and "not:domain1" to do a negative filter on the list, but that didn't work. There are some 553k addresses in "Email Address" > "All Receivers", so the manual approach is not, shall we say, "appealing" or "tenable".

     Has anyone had to solve this problem before, or can think of a way to do this? Cheers!
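     For what it's worth, one possible workaround outside Intella: export an overview table of the emails with their recipient columns and compute group #5 from that. A rough Python sketch - the column names ("To", "Cc", "Bcc", "Item ID"), the address separator and the domains are all assumptions about the export format, not something I've verified:

        import csv

        INTEREST = {"domain1.com", "domain2.com"}  # hypothetical domains of interest

        def recipient_domains(row):
            """Collect the domains of all To/Cc/Bcc addresses in one CSV row."""
            addrs = []
            for col in ("To", "Cc", "Bcc"):  # assumed column names
                addrs += [a.strip().lower() for a in (row.get(col) or "").split(";")
                          if "@" in a]
            return {a.rsplit("@", 1)[1] for a in addrs}

        # Group #5: emails with at least one receiver outside the domains of interest.
        with open("emails.csv", encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f):
                if recipient_domains(row) - INTEREST:
                    print(row.get("Item ID", ""))  # assumed ID column, for re-tagging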
  8. Actually, I'm getting a 404 error for the EXE link on the downloads page.
  9. I re-indexed the sources in both source cases (remembering I'm using a compound case here). I also re-OCR'd the items in both source cases; however, I believe it was set to skip items already OCR'd.

     I then opened the compound case and searched the phrase again, and it still did not return the PDF I expected. I then re-OCR'd the item itself directly (while in the compound case), and that did enable the PDF to be returned in the phrase search.

     Is this intended (and should it be an extra step in the case conversion steps), or is this a bug and the PDF should have been responsive without having to re-OCR it?
  10. Hi David, I found a PDF in one of my test cases - a DJI drone user guide - that has the % symbol present many times (where it is talking about battery life). When I check the "Words" tab, just the numbers are present - i.e. the manual shows "88%", but the index just has "88" recorded. I suspect the index discards the symbols entirely.
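      That would match how a typical full-text tokenizer behaves: punctuation and symbols act as token separators and never reach the index. A quick Python illustration (a simplification of standard tokenizer behaviour, not Intella's actual analyzer):

        import re

        # Keep word characters only; symbols like "%" are dropped as separators.
        def tokenize(text):
            return re.findall(r"\w+", text)

        print(tokenize("Battery at 88% remaining"))
        # -> ['Battery', 'at', '88', 'remaining']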
  11. Hi all, I have a compound case that I have upgraded to 2.6 (I'm using Intella Pro). I have yet to re-index the sources as it's hundreds of GB and we're actively using the case.

      I've performed a keyword search for a phrase, and I get matches in several Word documents. There is a PDF document (that was OCR'd prior to case conversion) in which I can see the phrase in the OCR tab, but the PDF is "unresponsive" to the search. I tried copying the text out of the OCR tab of the PDF and pasting it into the search box in case there's something funny going on with a character being substituted (like a lower-case l for a 1 or something), and it still doesn't get returned. I don't have anything de-selected in the searching options drop-down.

      Checking the "Words" tab for the document just shows the words from the metadata of the file (this could well be normal behaviour - I've never looked to see if the OCR words get added to the "Words" tab before, to be honest). Something a bit interesting/weird is that once I search the phrase (with my "unresponsive" PDF already open in a preview window), the phrase gets highlighted in the document. I then tried other OCR'd PDF files and the same thing happens - they are unresponsive but when previewed the phrase is highlighted anyway.

      I'll kick off the re-index overnight and see if that helps. Can anyone else replicate this, out of curiosity? Cheers!
  12. No worries at all and I agree; it would be best solved with that sort of mechanism - the permutations of what you would come across would be way too complex to script effectively.
  13. I wonder if you could script it as part of initial processing? It would be pretty unintelligent, but I wonder if you could do something like (100% pseudo-code):

      if item.encrypted:
          # every whitespace-separated word in the parent item's text
          wordlist = get_content(item.parent).split()
          for word in wordlist:
              try:
                  item.decrypt(word)
              except WrongPassword:
                  continue

      You could build your wordlist in a way that makes sense. The above is hoping the parent is an email and they've supplied the password in the email, for example.
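      Outside Intella, a runnable version of the same idea for encrypted Office attachments might look like this, using the msoffcrypto-tool package (the file paths and the wordlist source are made up for illustration):

        # pip install msoffcrypto-tool
        import msoffcrypto

        def try_decrypt(doc_path, out_path, candidate_words):
            """Try each candidate password against an encrypted Office file."""
            for word in candidate_words:
                try:
                    with open(doc_path, "rb") as f:
                        office = msoffcrypto.OfficeFile(f)
                        office.load_key(password=word)
                        with open(out_path, "wb") as out:
                            office.decrypt(out)
                    return word  # this password worked
                except Exception:
                    continue  # wrong password (or not decryptable), keep trying
            return None

        # e.g. build the wordlist from the parent email's body text
        words = open("parent_email_body.txt", encoding="utf-8").read().split()
        print(try_decrypt("attachment.docx", "decrypted.docx", words))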
  14. Personally, I would just use the facet to filter to the encrypted items, then export them and add them back as a new source.
  15. That makes sense - thank you very much 🙂
  16. Apologies if this has been requested previously and my search did not locate it. I am doing a lot of reviewing recently with multiple keywords. When previewing an item on the Contents (or Preview) tab, it shows me the located keywords and the count down the bottom, along with the controls to cycle through the matches.

      What I'm finding is that I'll get a document that has 150 hits, and 149 of them are for one keyword (which I can't just exclude), and what I really want to see is the 1 hit for one of the other keywords. I can work around this by flagging/tagging the items, clearing the search, bringing the flagged/tagged items back and then searching for the keyword(s) that wasn't the 149 hits. What would be nice is if the interface down the bottom would allow you to somehow jump to specific keywords. In a simplistic sense, imagine those keywords shown down the bottom left were actually clickable, and clicking one would "filter" the controls on the right to just that keyword. There's likely a better way to do this - if I haven't thought of something, please let me know.

      While I'm here (and I have a vague recollection of maybe asking this somewhere before, so forgive me if I have): one feature I miss from Nuix (about the only one, to be honest) is the ability to select items and "exclude" them from the case entirely. When dealing with Legally Privileged material, what I'm currently doing is tagging them and then excluding that tag every time I search. I know it probably gets tricky with families of items that would be "excluded" - how does that work with exports? If an email is not excluded but an attachment is, and the email is exported, can you tell it to not include that attachment? Likely not, as that is "changing" the parent item, right?
  17. On this topic, I have an interesting edge case that I assume someone has probably already looked at before. I have a PDF which is a scanned copy of a contract that has been printed, signed and re-scanned. However, someone has then opened the PDF in an editor and added the names of the signatories under their signatures on the final page. This means that Intella does not consider this an empty document, so if you choose the "Empty Only" PDF flag, it won't OCR.

      No problem, I thought to myself: I'll leave it ticked but tick everything on the Images side, assuming "Include embedded images" makes Intella consider the pages of a PDF embedded. I also selected "PNG", as for this specific document the pages were PNG format. I re-ran OCR and the file was still not processed. Does this mean that Intella first evaluates the PDF as "not empty" and completely discards it and its embedded images from further OCR consideration? I had hoped it would take a second pass on the embedded images and OCR them as individual images in their own right.

      What's the safest option? Untick "Empty Only" and tick everything on the right under "Images", and let it possibly process things that don't need to be? Cheers!
  18. Thanks Chris, now I know what [Reactor#] means in the logs! I'm seeing it noting 16, so I will open a support ticket as requested - posting below for anyone else interested in what I'm seeing.

      Here are the relevant bits from the case-main log for the run where it only used 2 crawlers:

        [INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Detected OS: Windows 10
        [INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Detected JVM: OpenJDK 64-Bit Server VM 11.0.12
        [INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Detected CPU cores: 32
        [INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Detected RAM: 127 GB
        [INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] *
        [INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Max heap size: 13 GB (MANUAL)
        [INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Services max heap size: 6,144 MB (MANUAL)
        [INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Crawlers count: 16 (MANUAL)
        [INFO ] 2022-06-28 15:27:42,825 [SwingWorker-pool-3-thread-1] * Crawler timeout: 2h 0m (AUTO)

      and

        [INFO ] 2022-06-28 15:28:30,372 [CrawlThread] Using 16 crawlers as specified by the CrawlersCount system property
        [INFO ] 2022-06-28 15:28:30,372 [CrawlThread] Using 6,144 MB memory per crawler as specified by the ServiceMaxHeap system property

      and then in the rest of the logs I just see crawler1 and crawler2 alternating, with a couple of lines between batches per crawler. A bunch of crawler2 lines, ending with:

        [INFO ] 2022-06-28 15:40:20,759 [crawler2_3d91b8cf-log] [INFO ] 2022-06-28 15:40:20,759 [reactor2] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/35_ [md5:aa858b4f3b5aba2977fb262b817d5bdc, id:325999]
        [INFO ] 2022-06-28 15:40:20,759 [crawler2_3d91b8cf-log] [INFO ] 2022-06-28 15:40:20,759 [reactor1] Completed processing item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/32_ (message/rfc822)
        [INFO ] 2022-06-28 15:40:20,759 [crawler2_3d91b8cf-log] [INFO ] 2022-06-28 15:40:20,759 [reactor3] Completed processing item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/31_ (message/rfc822)
        [INFO ] 2022-06-28 15:40:20,759 [crawler2_3d91b8cf-log] [INFO ] 2022-06-28 15:40:20,759 [reactor2] Completed processing item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/35_ (message/rfc822)
        [INFO ] 2022-06-28 15:40:21,091 [CrawlThread-12] [crawler2] Committed: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/martin-t/sent_items/130_
        [INFO ] 2022-06-28 15:40:21,709 [CrawlThread-14] [crawler1] Processing: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/36_
        [INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [CrawlServer-worker-1] Processing path collection: 500 files, Size: 2 MB
        [INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [reactor1] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/39_ [md5:8144a2316c7079a829c9a152e90644cf, id:326000]
        [INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [reactor3] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/37_ [md5:774d49c2d95f1c0f7ad0441f8c695081, id:326001]
        [INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [reactor2] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/36_ [md5:0717b2decb7cc7011b01666116ceb762, id:326002]
        [INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [reactor4] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/38_ [md5:f6a0d962234f13f41c2beee5637d065d, id:326003]

      and then the crawler1 items continue, so it seems to alternate between them.

      Interestingly, when 16 crawlers are used, it's all mixed up:

        [INFO ] 2022-06-28 16:02:20,360 [crawler7_4295ae71-log] [INFO ] 2022-06-28 16:02:20,360 [reactor3] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/personal/75_ [md5:a2e8b890d50f97aa2f193f4de06f2566, id:175461]
        [INFO ] 2022-06-28 16:02:20,360 [crawler3_577469a5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor3] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1509_ [md5:a3ccc006b4971c0f4735bf434aefa369, id:175462]
        [INFO ] 2022-06-28 16:02:20,360 [crawler7_4295ae71-log] [INFO ] 2022-06-28 16:02:20,360 [reactor1] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/personal/74_ [md5:225bab5c8436788652da4a3690516e7c, id:175464]
        [INFO ] 2022-06-28 16:02:20,360 [crawler1_7e8f59fd-log] [INFO ] 2022-06-28 16:02:20,360 [reactor3] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/sent_items/78_ [md5:155fb61298fb258e1d87fcf0d5a6d057, id:175463]
        [INFO ] 2022-06-28 16:02:20,360 [crawler1_7e8f59fd-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/sent_items/77_ [md5:a761f57f331cdc9125952a8df67375da, id:175465]
        [INFO ] 2022-06-28 16:02:20,360 [crawler3_577469a5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1506_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler14_2c2481c5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor2] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/deleted_items/212_ [md5:0b5471fcf0f4649b10491f251b2b5963, id:175466]
        [INFO ] 2022-06-28 16:02:20,360 [crawler11_6b96d80f-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/deleted_items/_sent_mail/7_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler12_735bb3c2-log] [INFO ] 2022-06-28 16:02:20,360 [reactor1] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/sent_items/270_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler1_7e8f59fd-log] [INFO ] 2022-06-28 16:02:20,360 [reactor1] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/sent_items/76_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler15_49c14435-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1047_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler14_2c2481c5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor3] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/deleted_items/20_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler3_577469a5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor2] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1508_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler15_49c14435-log] [INFO ] 2022-06-28 16:02:20,360 [reactor2] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1046_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler12_735bb3c2-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/sent_items/271_ (message/rfc822)
        [INFO ] 2022-06-28 16:02:20,360 [crawler6_73f59153-log] [INFO ] 2022-06-28 16:02:20,360 [reactor1] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/inbox/31_ (message/rfc822)

      So when more than "x" crawlers are in use it truly multi-threads everything, but when 2 are in use it alternates batches?
  19. Nope - all 16 crawlers were utilised when 10 direct child items were present in the top-level source.

      Another question/observation (because why not?): I have noticed previously that there tends to be an equivalent number of "Console Window Host" processes spawned along with the crawler processes. Typically I take this into consideration when ingesting and select the number of crawlers based on half the number of threads the CPU has, as the other half would in theory be utilised by these "Console Window Host" processes. I don't believe they use anywhere near as much CPU % as the crawlers do, so I'm likely being conservative and could maybe go to something like 24 crawlers on a 32T CPU and leave 8T for the "Console Window Host" processes.

      When doing the above testing today, when I got Intella to use 16 crawlers for the first time (in the third test), there was only one "Console Window Host" process running instead of the expected 16. When I did the final test with only 10 sub-folders in the top-level source, it again spawned 16 "Console Window Host" processes - one per crawler. Any ideas there?
  20. Hi all, I'm doing a bit of performance testing/benchmarking on open source data so I can compare between different machines and, eventually, software offerings. I'm currently using Intella 2.5.0.1 to process the Enron mail dataset - https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz - the file is 422MB and contains over 500,000 files. I'm aware 2.5.1 is out, but reviewing earlier benchmarking I did last year with the same datasets, the behaviour has not changed. Happy to re-test all the scenarios below with 2.5.1 if requested.

      I am using a machine with 16C/32T, 128GB RAM and NVMe SSDs (separate evidence, case and temp drives). I elected to set the number of crawlers to 16, with 6GB of RAM per crawler, and left the default of 15GB for Intella itself. Intella processed the file in 8m 43s. Reviewing the log shows that while 16 crawlers were allocated, it only used one. There is likely a simple explanation for this; I would assume that as the source is "one file", it gets assigned to one crawler, and as it gets processed and split apart into the child items within, the same crawler "owns" the child items and is therefore automatically assigned to process them.

      I then used 7zip to decompress the file. This resulted in a folder structure of ../enronmail/maildir/ and then an individual sub-folder for each user - 150 of them. Each of those has 10 or so sub-folders, which finally contain the email files. I created a new case, used the same settings, and selected the root - "enronmail" - as my folder to be processed (it is the top-level folder of the archive and has 1 direct child item, the sub-folder "maildir"). This time Intella utilised 2 crawlers and took 22m 29s, which to me is very strange: utilising twice the number of crawlers, with data that doesn't need to be decompressed, took almost 3 times as long to process.

      While typing this up I had a suspicion and created a new case with the same settings again, but this time I selected the "maildir" sub-folder, which has the 150 direct child items. Intella utilised all 16 crawlers and processed the content in 6m 31s.

      So this would tend to indicate to me that Intella assigns crawlers based on the number of child items of the source's top-level parent? Is that somewhere in the ballpark? To test, I could delete all bar 10 of the user folders (out of the 150) and see if only 10 crawlers are utilised, I'd imagine.
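      If that hypothesis is right, the effective parallelism would look something like the toy model below. This is purely my guess at the behaviour, not Intella's actual scheduler - and it doesn't quite explain the 2 crawlers seen on the single-child "enronmail" run, so treat it as a rough approximation:

        # Toy model: parallelism appears capped by the number of direct
        # child items under the node selected as the source.
        def effective_crawlers(configured: int, direct_children: int) -> int:
            return min(configured, direct_children)

        print(effective_crawlers(16, 1))    # the .tar.gz as a single file -> 1
        print(effective_crawlers(16, 150))  # selecting "maildir" -> 16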
  21. Hi all, I've checked the release notes from 2.4.2 through to 2.5.1 in an attempt to answer this for myself, but I didn't see anything obvious, so apologies if I've missed something.

      I have two cases that both have the same PST files as the source data. One was created in 2.4.2 and the other in 2.5.0.1. When comparing the two cases, the email MD5 checksums do not match for the same items. Reviewing the properties, the calculated sizes are different between the two cases for the emails. Has the processing of emails changed in some way which would perhaps result in the size of the email being different, and therefore explain this difference in MD5 value?

      *Edit* This seems to be for MIME type "message/rfc822" (EML?) as opposed to "application/vnd.ms-outlook" (MSG?) - the latter seem to be fine so far.
  22. In a competitor's product, the history pane that showed actions taken in the case (effectively the "Export > Event Log" function in Intella) allowed each search to be directly recalled from that tab. So, for example, if you searched and it returned 1000 items, from memory you could double-click that line in the log and it would perform the same search and give you the same 1000 items.

      The main reason I want something similar is that I recently had a support case open for a variance between searching and tagged items. If the event log could record some sort of hash for each action, so that it could be recalled exactly in the future, that could be helpful. Happy to elaborate further - I feel like I haven't really explained it well, so there's a rough sketch of what I mean below.
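      Purely as an illustration of the "hash for that action" idea (the field names are made up, this is not Intella's API):

        import hashlib
        import json

        def search_fingerprint(query, facets, item_ids):
            """A stable hash of a search action plus its result set, so an
            event-log entry could later be verified or recalled exactly."""
            payload = json.dumps(
                {"query": query, "facets": sorted(facets), "results": sorted(item_ids)},
                separators=(",", ":"),
            )
            return hashlib.sha256(payload.encode("utf-8")).hexdigest()

        # e.g. stored alongside the event-log line for the search
        print(search_fingerprint("contract AND signed", ["Type>Email"], [101, 205, 999]))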