ShaunC, posted June 28, 2022

Hi all,

I'm doing some performance testing/benchmarking on open source data so I can compare different machines and, eventually, software offerings. I'm currently using Intella 126.96.36.199 to process the Enron mail dataset: https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz

The file is 422MB and contains over 500,000 files. I'm aware 2.5.1 is out, but going by the benchmarking I did last year with the same datasets, the behavior has not changed. Happy to re-test all of the scenarios below with 2.5.1 if requested.

The machine has 16C/32T, 128GB RAM and NVMe SSDs (separate evidence, case and temp drives). I set the number of crawlers to 16, with 6GB of RAM per crawler, and left the default of 15GB for Intella itself.

Intella processed the file in 8m 43s. The log shows that while 16 crawlers were allocated, only one was used. There is likely a simple explanation: I assume that because the source is "one file", it gets assigned to one crawler, and as it is processed and split into its child items, the same crawler "owns" those child items and is automatically assigned to process them.

I then used 7zip to decompress the file. This resulted in the folder structure ../enronmail/maildir/ followed by an individual sub-folder for each user, 150 of them. Each of those has 10 or so sub-folders, which finally contain the email files. I created a new case with the same settings and selected the root, "enronmail", as the folder to be processed (it is the top-level folder of the archive and has 1 direct child item, the sub-folder "maildir"). This time Intella utilised 2 crawlers and took 22m 29s. That seems very strange to me: twice the number of crawlers, on data that doesn't need to be decompressed, took more than two and a half times as long to process.
While typing this up I had a suspicion, so I created another new case with the same settings, but this time selected the "maildir" sub-folder, which has the 150 direct child items. Intella utilised all 16 crawlers and processed the content in 6m 31s.

This suggests to me that Intella assigns crawlers based on the number of direct child items under the top-level folder of the source. Is that somewhere in the ballpark? To test, I could delete all but 10 of the 150 user folders and see whether only 10 crawlers are utilised.
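
For anyone wanting to repeat that last test without deleting anything from the original evidence, a minimal Python sketch along these lines could build a trimmed copy of "maildir" containing only the first N user folders. The function names are placeholders of my own, nothing here is part of Intella, and it makes no claim about how Intella actually assigns crawlers; it only prepares the input folder and reports how many direct child folders it has.

```python
import shutil
from pathlib import Path


def direct_children(folder):
    """Return the sorted names of the immediate sub-folders of `folder`."""
    return sorted(p.name for p in Path(folder).iterdir() if p.is_dir())


def make_trimmed_copy(maildir, dest, keep):
    """Copy only the first `keep` user folders of `maildir` into `dest`.

    The source folder is left untouched. `dest` must not exist yet,
    so an earlier test copy is never silently overwritten.
    """
    dest = Path(dest)
    dest.mkdir(parents=True)  # raises FileExistsError if dest already exists
    kept = direct_children(maildir)[:keep]
    for name in kept:
        # copytree recreates the whole sub-tree (user folder -> mail folders -> emails)
        shutil.copytree(Path(maildir) / name, dest / name)
    return kept
```

With hypothetical paths, `make_trimmed_copy("E:/enronmail/maildir", "E:/enronmail_10/maildir", keep=10)` would give a source folder with exactly 10 direct children to point a new case at, and `len(direct_children(...))` can be noted alongside the crawler count seen in the log for each run.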