Posts posted by jasoncovey

  1. I have a feature request concerning Intella's export functionality.  Being in the legal field, we use Intella for document productions, which can be very large and are often delivered in successive waves (aka rolling productions).  Although the arrival of the Export ID field was a welcome addition, it doesn't fully address our needs.  In litigation, the production phase is typically followed by depositions and subsequent court filings, all of which require the produced, Bates-numbered version of the documents.  ESI world or not, those become the bottom-line "evidence" in the litigation.

     

    So we use the Load File export option (even when we're only producing searchable PDFs, that's the only way to achieve page-level numbering during export), and while it's great that a starting Bates number is recorded as the Export ID, when it comes time to take the next steps, end users cannot quickly retrieve and print documents by Bates number UNLESS I re-load the production documents back into the database.  That's doable, either as PDFs or with greater complexity via a load file, but it's unnecessarily difficult and time-consuming.  In addition, the best practice has always been to keep additional indexing to a minimum to avoid potential database corruption (which has never been a significant problem for me with Intella).  Still, if I have a 250 GB database with 100 tags and I'm adding a few hundred documents, I always gulp before commencing the process - even with a backup in place.  I often don't have the cushion of time to deal with something going wrong, and a snapshot from 24 hours earlier could be missing a LOT!

     

    That being the case, I would love to see Intella (and specifically Connect!) expand upon the Export ID feature to include automatic linking to the production images of documents in the output location.  It would be great for it to accommodate multiple such images per database item, as it's not uncommon for one database to be used in support of separate but related cases, each of which would have its own uniquely numbered production document set.

     

    As it stands now, although it's fantastic to be able to search by Export ID (essentially the beginning Bates number, unless the production is 100% single-page images) in the desktop product, it would be great if that feature could be duplicated in Connect.  Even better would be hyperlinking to the production document "images" (regardless of file type), so that legal personnel could work by Bates number, as the legal world is accustomed to.  My users understand the Item ID, and unique database identifiers in general, but there comes a point where paper documents have to be compiled to prepare a witness for a deposition.  For better or worse, those will be printed for inclusion in a 3-ring binder, of which multiple copies will usually be made, and the documents will be entered into the record as evidence.

     

    As it stands now, without giving end users access to the back-end data locations, I would need to create a spreadsheet that cross-references Item ID and Export ID, with hyperlinks to the production images, to permit easy access and printing.  When you have multiple productions, you have to do multiple CSV exports to capture all of the different Export IDs, as only the currently displayed Export ID is exported.  Then you have to combine all the CSVs into a single spreadsheet to enable user access.  (A rough sketch of that merge step is below.)
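
    For illustration, here is a minimal sketch (in Python) of that cross-reference merge.  The column headers ("Item ID", "Export ID"), the CSV file names, and the production image layout are my assumptions for illustration, not Intella's actual output format - adjust them to match your own exports.

    ```python
    # Minimal sketch: merge multiple per-production CSV exports into one
    # cross-reference sheet with clickable links to the production images.
    # Column names, file names, and the image path layout are assumptions.
    import csv
    import glob
    import os

    EXPORT_DIR = r"C:\Productions"               # hypothetical image root
    merged = {}                                  # Item ID -> [(Export ID, image path)]

    for path in glob.glob("production_*.csv"):   # hypothetical export names
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                item_id, export_id = row["Item ID"], row["Export ID"]
                image = os.path.join(EXPORT_DIR, export_id + ".pdf")  # assumed layout
                # One item can appear in several productions, so keep them all.
                merged.setdefault(item_id, []).append((export_id, image))

    with open("cross_reference.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["Item ID", "Export ID", "Production Image"])
        for item_id in sorted(merged):
            for export_id, image in merged[item_id]:
                # Excel renders =HYPERLINK(...) as a clickable link on open.
                w.writerow([item_id, export_id,
                            f'=HYPERLINK("{image}", "{export_id}")'])
    ```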

     

    Anyway, I wanted to put that out there in the event others thought it would be a beneficial feature.  This would create a link between the underlying native file, Intella's contents view, preview, redaction view (if applicable) and production image (presumably under a new tab).

  2. To provide a quick update, I was able to get the issue resolved with Kernel for Outlook PST Repair.  Unfortunately, these PSTs contain highly confidential information that is under a court's protective order, so I can't share them, but I did want to report back that the repair software worked.  So at this point, I'll have to revise my earlier statement to read: I have never had a PST that opened correctly in Outlook, then failed to index correctly in Intella, that wasn't subsequently corrected with Outlook PST Repair.  Your mileage may vary...

     

     

    Jason

  3. I have just encountered a strange situation indexing a PST, one that I have never seen before.  To this point, I have never had a PST that opened correctly in Outlook then failed to index correctly in Intella.  Today, I have two small PSTs that repeatedly caused what looks like a memory leak, resulting in 100% RAM usage and forcing me to kill the main Java process.  From the outset, I believe that the origin of these PSTs is an Exchange-side search result across several custodians, with subfolders representing their source locations in the custodian mailboxes.

     

    I ran them both through ScanPST, which found minor errors and repaired them.  So I indexed again, even in a new case, with the same result.  The larger of the two is only about 95 MB, so this isn't any kind of size issue.  I have yet to throw a more powerful repair utility at them, but since they open correctly in Outlook, I'm not convinced that will even help.  I could go back to the client at this point, but without a specific problem to describe, I'm not sure what that would accomplish.  And I seriously doubt he'll have any further insight on the subject.  I hate to pull MSGs out of the PST since there is an extensive folder structure, but I guess that should be the next step (a rough sketch of how is below)?  Maybe export from Outlook to a new PST?  Fortunately, this is not a forensic situation, so if I had to do something like that, I could, but I always prefer not to if there is a less intrusive option.  Any ideas?
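
    In case it helps anyone weighing the same fallback, here is a hedged sketch of pulling the MSGs out while preserving the folder structure, via Outlook automation rather than a forensic tool.  It assumes Outlook and pywin32 are installed; the paths are hypothetical, and the assumption that AddStore attaches the PST as the last store is worth verifying before trusting the output.

    ```python
    # Hedged sketch: extract every item from a PST as .msg files while
    # mirroring the PST's folder structure on disk. Assumes Outlook and
    # pywin32 are installed; paths are hypothetical.
    import os
    import re
    import win32com.client

    PST_PATH = r"C:\data\problem.pst"    # hypothetical
    OUT_ROOT = r"C:\data\extracted"      # hypothetical

    ns = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    ns.AddStore(PST_PATH)
    store = ns.Folders.Item(ns.Folders.Count)   # assumes the PST attaches last

    def walk(folder, rel_path):
        out_dir = os.path.join(OUT_ROOT, rel_path)
        os.makedirs(out_dir, exist_ok=True)
        for i, item in enumerate(folder.Items, 1):
            subject = getattr(item, "Subject", "") or "no_subject"
            safe = re.sub(r'[\\/:*?"<>|]', "_", subject)[:80]  # filesystem-safe name
            item.SaveAs(os.path.join(out_dir, f"{i:05d}_{safe}.msg"), 3)  # 3 = olMSG
        for sub in folder.Folders:
            walk(sub, os.path.join(rel_path, sub.Name))

    for top in store.Folders:
        walk(top, top.Name)
    ```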

     

    Thanks!

     

    Jason Covey 

  4. Got it.  I certainly understand that there are endless variables at play on every job. 

     

    I have one final example to share.  This is from the v2 Enron data set with attachments: 53 GB, completed in well under 3 hours.  None of the new indexing features were enabled, and "index embedded items" was again disabled, so it's apples to apples in terms of settings in Intella.

     

    What was different with this job is that I was back to the "new normal" 1.8 speeds seen on prior tests, with very high, sustained CPU usage.  I happen to know that a large backup job was running on our SAN while the 129 GB job mentioned above was running, so it's possible there was some competition for resources on the hosts.  Even so, both those results and these extrapolate to the 600-700 GB range in 24 hours.

     

    Any way you slice this, the numbers speak for themselves! 

     

     

    [Screenshot attachment: post-572-0-64312600-1412349842_thumb.png]

  5. The following is a screenshot of my new indexing results from the same 129 GB data set I had processed in 1.8 Beta 1, with the "index embedded items" setting ignored, as posted in the Beta 1 thread.  The processing time in Beta 2 for this job was approximately 53% of the prior job's, with total time reduced from 598 minutes (9.96 hours) to 314 minutes (5.23 hours).  This extrapolates to nearly 600 GB in 24 hours!
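
    For anyone who wants to reproduce the arithmetic, the extrapolation is just straight elapsed-time scaling:

    ```python
    # Quick check of the figures quoted above.
    gb, beta1_min, beta2_min = 129, 598, 314
    print(f"{beta2_min / beta1_min:.0%} of the Beta 1 time")   # -> 53%
    print(f"{gb * 24 * 60 / beta2_min:.0f} GB per 24 hours")   # -> 592
    ```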

     

    For the sake of perspective: three years ago, the idea that Intella could process 129 GB at ALL would have sounded fantastical.  Even over (not that much) time, with newer versions and better hardware, I would still have expected no less than 80 hours for such a job.  Maybe more.  Thus, to be able to process 129 GB within a single working day, with a few hours left over for searching, on a single machine, is completely astonishing in my book.

     

    My main observation was that CPU usage was noticeably lower in Beta 2.  Whereas Beta 1 had the CPU slammed at near 100% for the entire crawling process, Beta 2 hovered closer to 50% the entire time.  I have no idea if that was a resource issue with our host system, or if it's a design tweak.  Maybe Chris can shed some light.  There were also some periods of near-zero items processed, which I know can be completely normal.

     

    Apparent RAM usage was always well below 8 GB, which I understand is not the entire story.  In addition, the momentary spikes in item counts were not as high, and the average items per minute was significantly lower (although those massive spikes of 100K+ items in a single clock cycle certainly skewed those numbers).

     

    All in all, just looking at elapsed time in the absence of any other metrics, the results are pretty amazing in my book!

     

    [Screenshot attachment: post-572-0-01996700-1412270642_thumb.png]

  6. That doesn't seem feasible to me either, without great expense.  I've never run across more than a tiny smattering of actual documents that were JPG or PNG file types.  If you did find some, perhaps you could isolate the source (a particular custodian or organization) and use that as a point of departure to export a subset?  Maybe in conjunction with the thumbnail view?

     

    On a related note, while recently looking into some other options for OCR, I came across CVision Technologies.  I demo'd their Maestro Recognition Server, which outperforms Abbyy at a lower cost.  It was extremely robust, but pricey if you want to take advantage of multiple cores (I love being punished by software companies because I might happen to have a more powerful computer than the next guy).  I believe it's still less expensive than Abbyy, though.  They also offer another product that's priced per page, with no core limitation.  However, I was asked, "How many pages do you need OCR'd in a year?"  Not exactly a question that can be answered in e-discovery...

     

    With regard to OCR in conjunction with Intella, here's the problem I see.  We need high-volume, high-speed OCR capabilities.  If Intella can index 350 GB in 24 hours or less, it's no help at all if it then requires 3 weeks for a single-threaded app to OCR 75K PDFs.  Thus, the only practical options are enterprise OCR solutions, whose prices are completely out of line for most Intella users, or e-discovery vendors, who really don't want to do OCR-only work, and certainly don't appreciate your daring to use Intella in the first place.  When you need the OCR processing, you really need it, but there will also be weeks or months where it isn't used at all, yet you still have to pay for it.  Some of these tools would exceed the cost of the Connect/Pro package entirely.  With per-page pricing, you certainly don't want to eat up processing on docs that have no text at all, or that already have embedded text, etc.  I don't know if there is a tool that would only count files that actually had text and were successfully OCR'd.  (A rough pre-filter sketch is below.)
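
    On the "don't pay per-page OCR for docs that already have text" point, here is a rough pre-filter sketch using the pypdf library (not part of Intella or any vendor's tooling): skip any PDF where a page already yields a meaningful amount of embedded text.  The 50-character threshold is an arbitrary heuristic of mine, not a vendor rule.

    ```python
    # Rough pre-filter: only send PDFs without an embedded text layer to OCR.
    # Uses pypdf (pip install pypdf); the threshold is an arbitrary heuristic.
    from pathlib import Path
    from pypdf import PdfReader

    def needs_ocr(pdf_path, min_chars=50):
        """True if no page yields a meaningful amount of extractable text."""
        try:
            reader = PdfReader(pdf_path)
        except Exception:
            return True   # unreadable/corrupt: let the OCR tool decide
        for page in reader.pages:
            if len((page.extract_text() or "").strip()) >= min_chars:
                return False
        return True

    to_ocr = [p for p in Path(r"C:\exports").rglob("*.pdf") if needs_ocr(p)]
    print(f"{len(to_ocr)} PDFs appear to need OCR")
    ```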

     

    It's a real problem, as paper hasn't exactly gone away despite everything that's said (I'm sitting in a giant law firm with millions of pages in boxes onsite and probably tens of millions offsite).  I don't have any good answers, but that's the scope of the problem as I see it!

  7. I just ran a new job last night and achieved similarly spectacular results, so I wanted to share.  This was for a 129 GB data set, culled from a 1.4 TB collection by file extension, maintaining source folder structures: 60 GB of PSTs, 43 GB of PDFs, and the rest divided among common document and loose email file types.

     

    The job completed in just under 10 hours.  This includes the extraction of over 1.1 million TIFF images from the PDFs, which is never desired in the contexts in which I use Intella.  So, like Chris said, it will probably be faster when not doing all the (in my case) unnecessary processing.

     

    Looking forward to the official release!

     

     

    [Screenshot attachment: post-572-0-38380900-1411753854_thumb.png]

     

  8. Now that I have had a chance to get everything up and running with the 1.8 beta, and to get all of my systems configured at a new employer, I wanted to report my results.  Having used Intella since 2011 at my prior job, I believe they are on the spectacular side.

     

    We have Intella Pro and Connect running in a 100% virtual environment, which is a major change for me.  Based on my prior knowledge and experience with Intella and its system requirements, I assumed this would be a complete disaster.  Not the case!

     

    We built a Windows 7 Enterprise VM with a 250 GB system drive and 500 GB drives for Case Index and Case Data.  We later added a 500 GB drive for optimization, per Vound's specs.  The drives are segregated for use only with this VM and are located on a SAN, connected via 8 Gb Fibre Channel.  These drives are all expandable, so the sizes were just specified for testing.  The VM lives on a Fujitsu server with massive RAM (256 GB?), to which it is specifically tethered, and which is the least used of a bank of 4.  The VM has 16 GB RAM assigned.  The dongle is mapped to the VM from this server.

     

    Initially, we only had a single 4-core Xeon processor.  We went from 1 to 2 and then to 4 cores when we saw extremely high CPU usage; the thinking was that we didn't want to ask VMware to do too much.  Ultimately, we added a second 4-core Xeon processor.  However, we still see very high system usage, which we have since been advised is a very good thing in 1.8.  We have also seen apparent RAM usage that is much lower than what I was used to, which Chris advised us is only part of the story, as the remainder of the RAM is being used for disk caching, which increases indexing performance.

     

    Although I haven't been able to perform a perfect apples to apples comparison in every single respect, I ran the same indexing job on the same data set.  It consisted of a relatively small amount of foldered data, as well as approx. 397 PSTs of varying size from tiny to 4 GB.  There were probably 80+ that were 1 GB or larger. 

     

    The results are posted below.  Note that 1.8 is indexing items within items, even though I would prefer it not to, while 1.7.3 was not.  Still, I think the results speak for themselves.

     

    Before:

     

    [Screenshot attachment: post-572-0-49108100-1409069729_thumb.jpg]

     

    After:

     

    [Screenshot attachment: post-572-0-13784300-1409069731_thumb.jpg]

     

    That's 72.5% faster by my math. 

     

    Previously, on a Win 2008 R2 physical server with 16 GB RAM and a single, slower processor but dedicated internal drives, I was thrilled for any clock cycle that maintained 2K items per minute, and frequently had to settle for 1K.

     

    Hopefully that will give some of you a better idea of real-world expectations for indexing with 1.8.

     

  9. "Is Attachment" is a badly needed field in Intella, which is very standard with dedicated e-discovery tools.  If you have a mixed data set of, say, a group of PSTs as well as a set of foldered data, there are definitely scenarios where you might want to isolate the content that either IS or IS NOT an email attachment.  And usually the latter. 

     

    In my experience, it most often comes up when you are performing exclusion filtering on a search result, and you want to isolate the non-attachment content.  In a perfect world, we would always receive data that is neatly segregated and organized.  It's possible to get there in the scenarios I'm describing with a combination of steps and some tagging, but it would be a welcome addition to have the flexibility to get there more directly.
