Jump to content

"Unexpected" Crawler behaviour


Recommended Posts

Hi all,

I'm doing a bit of performance testing/bench-marking on open source data so I can compare between different machines and eventually software offerings.

I'm currently using Intella 2.5.0.1 to process the Enron mail dataset - https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz

The file is 422MB and contains over 500,000 files.

I'm aware 2.5.1 is out, but reviewing earlier bench-marking I did last year with the same datasets, the behavior has not changed. Happy to re-test all below scenarios with 2.5.1 if requested.

I am using a machine with 16C/32T, 128GB RAM and NVMe SSDs (separate evidence, case and temp drives).

I elected to set the number of crawlers to 16, with 6GB of RAM per crawler and left the default of 15GB for Intella itself.

Intella processed the file in 8m 43s. Reviewing the log shows that while 16 crawlers were allocated, it only used one.

There is likely a simple explanation for this; I would assume that as the source is "one file", it gets assigned to one crawler, and as it gets processed and split apart to the child items within, the same crawler "owns" the child items and is therefore automatically assigned to process them.

I then used 7zip to decompress the file. This resulted in a folder structure: ../enronmail/maildir/ and then an individual sub-folder for each user - 150 of them. Each of those has 10 or so sub-folders, which finally contain the email files.

I created a new case and used the same settings and selected the root - "enronmail" - as my folder to be processed (it is the top-level folder of the archive - it has 1 direct child item, the sub-folder "maildir").

This time Intella utilised 2 crawlers and took 22m 29s, which to me is very strange that utilising twice the number of crawlers, with data that doesn't need to be decompressed, took almost 3 times as long to process.

While typing this up I had a suspicion and created a new case with the same settings again, but this time I selected the "maildir" sub-folder, which has the 150 direct child items.

Intella utilised all 16 crawlers and processed the content in 6m 31s.

So this would tend to indicate to me that Intella assigns crawlers based on the number of child items of the source's top-level parent? Is that somewhere in the ball-park?

To test, I could delete all bar 10 of the user folders (out of the 150) and see if only 10 crawlers are utilised I'd imagine.

Link to comment
Share on other sites

Nope - all 16 crawlers were utilised when 10 direct child items were present in the top-level source.

Another question/observation (because why not? :))

I have noticed previously that there tends to be an equivalent number of "Console Window Host" processes spawned along with the crawler processes.
Typically I take this into consideration when ingesting and select the number of crawlers based on half of the number of threads the CPU has, as the other half would in theory be utilised by these "Console Window Host" processes.

I don't believe they use anywhere near as much CPU % as the crawlers do, so I'm likely being conservative and could maybe go to something like 24 crawlers on a 32T CPU and leave 8T for the "Console Window Host" processes.

When doing the above testing today, when I got Intella to use 16 crawlers for the first time (in the third test), there was only one "Console Window Host" process running instead of the expected 16.

When I did the final test with only 10 sub-folders in the top-level source, it again spawned 16 "Console Window Host" processes - one per crawler.

Any ideas there?

Link to comment
Share on other sites

Hello Shaun,

Thank you for this detailed analysis! This may also help other users with finding out how to optimally utilize their hardware.

I believe I can provide most of the answers.

When processing a single archive such as a .tar.gz file, a single crawler is indeed assigned to it. There are a few exceptions, e.g. when a PST or disk image is encountered in it: it will be extracted and assigned to a separate crawler.

Note though that a single crawler still typically runs 4 crawl threads internally, so a decent amount of multi-core processing should still be seen. I tested the enron_mail_20150507.tar.gz file using Intella 2.5.1 on my 4C/8T machine and often had somewhat above 50% CPU utilization with only 1 crawler. I suspect it could have been higher if I had used an SSD for the case drive - the disk seemed to be a bottleneck here.

I could not replicate your observation that only 2 crawlers are used when specifying the enronmail folder, and 16 crawlers when specifying the underlying enronmail/maildir folder. I also cannot think of a reason why this would happen. The number of immediately nested items is not a factor in determining how many crawlers are launched. Can you double-check the "Crawlers count" line in the case-main-[date].log file of that test run? It that turns out to be 16 or "AUTO", and still only 2 crawlers are used, I would invite you to open a support ticket and share the log files with us.

With this particular data set (100Ks of EML files), extracting all files will mean that accessing the evidence will incur quite some overhead on the file system, compared to when reading (streaming) everything from the same large file. Which setup is faster can't always be predicted up-front, but every call to a different file in the file system has an impact on performance. With lots of small files this becomes noticeable, and keeping everything in an archive may in fact be faster.

Your memory setup looks OK, though probably on the high side for the crawlers. Note that "left-over RAM" will be utilized by Windows for disk caching. This helps a lot with indexing large cases, where the various case indices can grow to considerable sizes.

The Windows Console Host processes are the result of Intella launching new sub-processes. Nothing to worry about.

Thank you again for your info! Let us know how we can be of further help.

Link to comment
Share on other sites

Thanks Chris, now I know what [Reactor#] means in the logs! :)

I'm seeing it noting 16 so I will open a support ticket as requested - posting below for anyone else interested in what I'm seeing.

Here's the relevant bits from the case-main log for the run where it only used 2 crawlers.

[INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Detected OS:               Windows 10
[INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Detected JVM:              OpenJDK 64-Bit Server VM 11.0.12
[INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Detected CPU cores:        32
[INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Detected RAM:              127 GB
[INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] *
[INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Max heap size:             13 GB (MANUAL)
[INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Services max heap size:    6,144 MB (MANUAL)
[INFO ] 2022-06-28 15:27:42,809 [SwingWorker-pool-3-thread-1] * Crawlers count:            16 (MANUAL)
[INFO ] 2022-06-28 15:27:42,825 [SwingWorker-pool-3-thread-1] * Crawler timeout:           2h 0m (AUTO)

and

[INFO ] 2022-06-28 15:28:30,372 [CrawlThread] Using 16 crawlers as specified by the CrawlersCount system property
[INFO ] 2022-06-28 15:28:30,372 [CrawlThread] Using 6,144 MB memory per crawler as specified by the ServiceMaxHeap system property

 

and then in the rest of the logs I just see crawler1 and crawler 2 alternating with a couple of lines between batches per crawler

 

A bunch of crawler2 lines, ending with

[INFO ] 2022-06-28 15:40:20,759 [crawler2_3d91b8cf-log] [INFO ] 2022-06-28 15:40:20,759 [reactor2] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/35_ [md5:aa858b4f3b5aba2977fb262b817d5bdc, id:325999]
[INFO ] 2022-06-28 15:40:20,759 [crawler2_3d91b8cf-log] [INFO ] 2022-06-28 15:40:20,759 [reactor1] Completed processing item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/32_ (message/rfc822)
[INFO ] 2022-06-28 15:40:20,759 [crawler2_3d91b8cf-log] [INFO ] 2022-06-28 15:40:20,759 [reactor3] Completed processing item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/31_ (message/rfc822)
[INFO ] 2022-06-28 15:40:20,759 [crawler2_3d91b8cf-log] [INFO ] 2022-06-28 15:40:20,759 [reactor2] Completed processing item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/35_ (message/rfc822)
[INFO ] 2022-06-28 15:40:21,091 [CrawlThread-12] [crawler2] Committed: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/martin-t/sent_items/130_
[INFO ] 2022-06-28 15:40:21,709 [CrawlThread-14] [crawler1] Processing: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/36_
[INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [CrawlServer-worker-1] Processing path collection: 500 files, Size: 2 MB
[INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [reactor1] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/39_ [md5:8144a2316c7079a829c9a152e90644cf, id:326000]
[INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [reactor3] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/37_ [md5:774d49c2d95f1c0f7ad0441f8c695081, id:326001]
[INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [reactor2] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/36_ [md5:0717b2decb7cc7011b01666116ceb762, id:326002]
[INFO ] 2022-06-28 15:40:21,709 [crawler1_7fbaea2e-log] [INFO ] 2022-06-28 15:40:21,709 [reactor4] Found new: item://2b909744-aa50-4e61-bcd6-0c2aed953da7/maildir/may-l/discussion_threads/38_ [md5:f6a0d962234f13f41c2beee5637d065d, id:326003]

and then the crawler1 items continue, so it seems to alternate between them

 

Interestingly when 16 crawlers are used, it's all mixed up:

[INFO ] 2022-06-28 16:02:20,360 [crawler7_4295ae71-log] [INFO ] 2022-06-28 16:02:20,360 [reactor3] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/personal/75_ [md5:a2e8b890d50f97aa2f193f4de06f2566, id:175461]
[INFO ] 2022-06-28 16:02:20,360 [crawler3_577469a5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor3] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1509_ [md5:a3ccc006b4971c0f4735bf434aefa369, id:175462]
[INFO ] 2022-06-28 16:02:20,360 [crawler7_4295ae71-log] [INFO ] 2022-06-28 16:02:20,360 [reactor1] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/personal/74_ [md5:225bab5c8436788652da4a3690516e7c, id:175464]
[INFO ] 2022-06-28 16:02:20,360 [crawler1_7e8f59fd-log] [INFO ] 2022-06-28 16:02:20,360 [reactor3] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/sent_items/78_ [md5:155fb61298fb258e1d87fcf0d5a6d057, id:175463]
[INFO ] 2022-06-28 16:02:20,360 [crawler1_7e8f59fd-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/sent_items/77_ [md5:a761f57f331cdc9125952a8df67375da, id:175465]
[INFO ] 2022-06-28 16:02:20,360 [crawler3_577469a5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1506_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler14_2c2481c5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor2] Found new: item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/deleted_items/212_ [md5:0b5471fcf0f4649b10491f251b2b5963, id:175466]
[INFO ] 2022-06-28 16:02:20,360 [crawler11_6b96d80f-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/deleted_items/_sent_mail/7_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler12_735bb3c2-log] [INFO ] 2022-06-28 16:02:20,360 [reactor1] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/sent_items/270_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler1_7e8f59fd-log] [INFO ] 2022-06-28 16:02:20,360 [reactor1] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/sent_items/76_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler15_49c14435-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1047_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler14_2c2481c5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor3] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/deleted_items/20_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler3_577469a5-log] [INFO ] 2022-06-28 16:02:20,360 [reactor2] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1508_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler15_49c14435-log] [INFO ] 2022-06-28 16:02:20,360 [reactor2] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyvl-d/all_documents/1046_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler12_735bb3c2-log] [INFO ] 2022-06-28 16:02:20,360 [reactor4] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/hyatt-k/sent_items/271_ (message/rfc822)
[INFO ] 2022-06-28 16:02:20,360 [crawler6_73f59153-log] [INFO ] 2022-06-28 16:02:20,360 [reactor1] Completed processing item://2f038731-c7b6-4eb3-80bd-3bfc5fe23011/horton-s/inbox/31_ (message/rfc822)

So with more than "x" crawlers are used it truly multi-threads everything but when 2 are in use it alternates batches?

Link to comment
Share on other sites

  • 3 weeks later...

Hi Shaun,

Thanks for submitting a support ticket and providing additional information.

The reason Intella didn't use all the crawlers in some cases was because the files are so small and quick to process so that the time needed to discover them in the file system was comparable to the time needed to index them. In other words, Intella simply didn't need all 16 crawlers because it was enough to use just 2-3 depending on how fast new files are being discovered. We suspect that an antivirus might have been slowing down I/O access to the disk.

Note that this is sort of a corner case. Normally, when indexing larger files such as PSTs or disk images, it should never happen.

In this particular case, a folder with large number of very small files, it would be better to index a container such as ZIP or logical disk image (L01, AD1, etc).

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
 Share

×
×
  • Create New...