
Indexing PDF Portfolios



You are correct that Intella cannot process PDF Portfolios. It can extract neither the individual PDFs that make up the Portfolio nor the native file attachments to the individual PDF-converted emails (if that's the manner in which the PDFs were created). There are some workarounds, but they get pretty complicated depending on how far you want to take things to restore proper functionality. Before you set off on such a journey: I don't know the context of the production, but if metadata was supposed to be provided with it, you would certainly be better off going back to the producing party and asking them to produce again in a more accessible format.

Assuming that's not an option, you'll want to check out these two Adobe Acrobat plug-ins from EverMap:  http://evermap.com/AutoPortfolio.asp and http://evermap.com/AutoSplit.asp

The former provides the most advanced functionality for working with PDF Portfolios. The latter's Portfolio support is more limited, but it also includes a number of other features.

The main problem you're going to run into involves metadata. If you need to transform the production into a fully functional ESI data set in Intella, that requires the tedious creation of a custom load file. I've done it a few times, but if you don't have extensive experience building load files from scratch, it isn't realistic to go down that road. Nonetheless, with enough effort, some creative RegEx searching and data manipulation, it IS possible.
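To give a rough idea of what the "RegEx searching and data manipulation" step can look like, here is a minimal Python sketch. It assumes the text of each PDF-converted email has already been extracted to a string and that the header block uses one "Field: value" pair per line, which will not hold for every production. The function names and the column names (DocID, From, To, Cc, Sent, Subject) are hypothetical placeholders, not Intella's load file specification, so check the Intella documentation for the fields and delimiters it actually expects.

```python
import csv
import io
import re

# Assumed header layout: one "Field: value" pair per line.
# [ \t]* (not \s*) keeps an empty value from swallowing the next line.
HEADER_RE = re.compile(r"^(From|To|Cc|Sent|Subject):[ \t]*(.*)$", re.MULTILINE)

# Hypothetical column set; adjust to whatever the load file format requires.
COLUMNS = ["DocID", "From", "To", "Cc", "Sent", "Subject"]


def parse_email_headers(text):
    """Return the first occurrence of each header field found in the text.

    Keeping only the first occurrence avoids picking up headers that are
    quoted further down in a reply chain.
    """
    fields = {}
    for name, value in HEADER_RE.findall(text):
        fields.setdefault(name, value.strip())
    return fields


def write_load_file(rows, out):
    """Write one delimited row per document; missing fields are left empty."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    sample = (
        "From: Alice <alice@example.com>\n"
        "To: Bob <bob@example.com>\n"
        "Sent: Monday, January 5, 2015 9:14 AM\n"
        "Subject: Status\n"
        "\n"
        "Body text here.\n"
    )
    row = dict(parse_email_headers(sample), DocID="0001")
    buffer = io.StringIO()
    write_load_file([row], buffer)
    print(buffer.getvalue())
```

In practice the hard part is the variation between emails (multi-line recipient lists, different date formats, missing fields), so expect to refine the pattern against the actual production and spot-check the output.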

A middle ground approach might be this: use one of the two aforementioned tools to extract the PDF-converted emails and any native attachments to a folder. The file naming options are limited, but you can achieve something that retains the document order/hierarchy with numeric prefixes. Hopefully the producing party was kind enough to create the portfolio in some kind of chronological order, which would then be preserved by this process. With that done, you could then just process the resulting files into Intella as a folder source, where proper sorting will be achieved by file name. Of course, this won't give you accurate family dates or file types, or permit full functionality of Intella's Tree view or parent-child tracking, all of which would require the load file route.
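One thing to watch with numeric prefixes is that a plain text sort puts "10_..." ahead of "2_...". If the extraction tool does not zero-pad the prefixes for you, a small script can do it afterwards. The sketch below is only an illustration: the function name, the padding width of 5, and the example file names are my own assumptions, and it is unnecessary if the plug-in already pads the numbers.

```python
import re

# Leading run of digits, followed by the rest of the file name.
PREFIX_RE = re.compile(r"^(\d+)(.*)$", re.DOTALL)


def pad_numeric_prefix(filename, width=5):
    """Zero-pad a leading numeric prefix so text order matches numeric order.

    Names without a leading number are returned unchanged.
    """
    match = PREFIX_RE.match(filename)
    if match is None:
        return filename
    number, rest = match.groups()
    return number.zfill(width) + rest


if __name__ == "__main__":
    # Hypothetical names as they might come out of an extraction run.
    names = ["10_Re Budget.pdf", "2_Budget.pdf", "1_Kickoff.pdf"]
    for name in sorted(pad_numeric_prefix(n) for n in names):
        print(name)
```

To apply it to real files you would loop over the export folder and rename each file to the padded name, ideally on a copy of the export so the original extraction stays untouched.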

In a perfect world, Intella would support every possible file type. In this case, however, I'm really on the fence about whether this is worth the effort, given that: (1) it's a very rare production "format"; and (2) it's arguably not a legitimate production format in that it makes essential metadata inaccessible. That being the case, I would rather see the dev team working on what I think are some much higher priority features.

Still, in your case, in light of the amount of work that's required to make a PDF Portfolio production of email functional within an ESI platform, as well as the lack of accessible metadata (you basically have to extract it from the body text of each individual "email"), you would be in a strong position to ask the opposing party to re-produce the data in a format that is reasonably accessible. And the larger the Portfolio or email volume, the stronger that position would be.

Hope that helps!


Jason Covey   



Thanks, Jason. I used AutoPortfolio years ago but had forgotten about it. Looks like that's the route I'm going to have to take. I'm comfortable creating custom load files, so I'm doing that as well (although that's not without its headaches!).

Unfortunately, I can't go back to the producing party and demand more: the portfolios were produced by third parties to the government several years ago, and now the government is producing them to us.

Thanks again.

