
Intella 1.8 beta 2


Chris


Today we have completed the work on Intella 1.8 beta 2, following last month's first beta release.

 

Changes in this release are:

  • Merged the Cluster Map and Sets visualizations into a single visualization.
  • Added the ability to exclude certain paragraphs in keyword search, e.g. email signatures and other disclaimers.
  • Many fixes for crawling and extraction results.
  • Merged the Viewer and TEAM Reviewer products into a single Viewer that is capable of handling shared cases (i.e. all of the old TEAM Reviewer functionality).
  • Added support for using Lotus Notes 9 for indexing NSF files.
  • Load file import improvements, e.g. tags are now imported, metadata extracted from binary items and from the load file is now merged.
  • Load file export improvements, e.g. fixed missing parent and child IDs when in-between items such as folders are skipped.
  • Improvements to indexing progress reporting.
  • Fixed: the "Index embedded items", "Index archives" and "Index mail containers" indexing options were ignored.
  • Improvements to the Type facet hierarchy.
  • Brought back the Empty Documents category.
  • Improved display and searching of hyperlinks.

Send me a private message or reply to this topic when you want to try out the new release.


The following is a screenshot of my new indexing results from the same 129 GB data set I had processed in 1.8 Beta 1, with "index embedded items" setting ignored, and as posted in the Beta 1 thread.  The processing time in Beta 2 for this job was approximately 53% of the prior job, with total time reduced from 598 minutes (9.97 hours) to 314 minutes (5.23 hours).  This extrapolates to nearly 600 GB in 24 hours!
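
The extrapolation above is simple arithmetic. A minimal sketch in Python, assuming throughput stays linear over a full 24 hours (the function and variable names are only illustrative):

```python
def throughput_gb_per_day(size_gb: float, minutes: float) -> float:
    """Scale an observed indexing run linearly to a 24-hour period."""
    return size_gb / minutes * 60 * 24

size_gb = 129        # data set size from the post
beta1_minutes = 598  # Beta 1 elapsed time
beta2_minutes = 314  # Beta 2 elapsed time

# Fraction of the Beta 1 time that Beta 2 needed
ratio = beta2_minutes / beta1_minutes

print(f"Beta 2 took {ratio:.1%} of the Beta 1 time")
print(f"Beta 1: {throughput_gb_per_day(size_gb, beta1_minutes):.0f} GB per 24 hours")
print(f"Beta 2: {throughput_gb_per_day(size_gb, beta2_minutes):.0f} GB per 24 hours")
```

Real jobs will not scale perfectly linearly, since data composition and competing load on the host both affect the rate, so this is only a rough yardstick.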

 

For the sake of perspective vs. 3 years ago, the idea that Intella could ever process 129 GB at ALL would have sounded fantastical.  Over (not that much) time, with future versions and better hardware, I would still have expected no less than 80 hours for such a job.  Maybe more.  Thus, to be able to process 129 GB within a single working day, with a few hours left for searching, on a single machine, is completely astonishing in my book.

 

My main observation was that CPU usage was noticeably lower in Beta 2.  Whereas Beta 1 had the CPU slammed at near 100% for the entire crawling process, Beta 2 hovered closer to 50% the entire time.  I have no idea if that was a resource issue with our host system, or if it's a design tweak.  Maybe Chris can shed some light.  There were also some periods of near zero items processed, which I know can be completely normal.

 

Apparent RAM usage was always well below 8 GB, which I understand is not the entire story.  In addition, the momentary spikes of item counts were not as high, and the average items per minute was significantly reduced (although those massive spikes of 100K+ items in a single clock cycle certainly skewed those numbers).

 

All in all, just looking at elapsed time in the absence of any other metrics, the results are pretty amazing in my book!

 

post-572-0-01996700-1412270642_thumb.png   


Hello Jason,

 

Those are fantastic results!

 

The reduced indexing time will mostly be caused by the embedded items no longer being processed, although beta 2 also has some extra performance tweaks compared to beta 1. Processing embedded items mostly means extra demand on the CPU, so skipping them means that the disk becomes a relatively bigger bottleneck, though there are many other factors at play here.


Got it.  I certainly understand that there are endless variables at play on every job. 

 

I have one final example to share.  This is from the v2 Enron data set with attachments.  53 GB, and completed in well under 3 hours.  None of the new indexing features were enabled, and index embedded items was again disabled, so it's apples to apples in terms of settings in Intella. 

 

What was different with this job is that I was back to the "new normal" speeds of 1.8 seen on prior tests, and very high, sustained CPU usage.  I happen to know that a large backup job was running on our SAN while the 129 GB job mentioned above was running, so it's possible there was some competition for resources on the hosts.  Even still, both those results and these extrapolate to the 600-700 GB ranges in 24 hours.

 

Any way you slice this, the numbers speak for themselves! 

 

 

post-572-0-64312600-1412349842_thumb.png
