
PDF processing


Jacques B


It would be great if Intella processed prior versions of a PDF, and extracted images from each version. I've written a BASH script to run different processes on a PDF, including looking for prior versions within the PDF and extracting them. You can find the script here: https://github.com/jjrboucher/PDF-Processing

I also provide a sample PDF in a subfolder for testing. A subfolder of that folder holds a copy of the PDF as it was at each edit step, so you can compare what is extracted from the final PDF with what each version actually looked like. You will note the hashes will match. This won't work in every case. But when it does, it's great.
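For anyone who wants to see the idea without reading the whole script, here is a minimal Python sketch of carving prior versions. It is not the code from the repository. It assumes the PDF was saved with incremental updates, where each save appends to the file and ends with its own `%%EOF` marker, so every prefix of the file that ends at a marker is a candidate earlier version. The names `carve_versions` and `sha256_hex` are made up for this example. Linearized PDFs and files where `%%EOF` happens to appear inside stream data will produce false candidates, which is one reason this won't work in every case.

```python
import hashlib

EOF_MARKER = b"%%EOF"


def carve_versions(data: bytes) -> list[bytes]:
    """Return one candidate version per %%EOF marker, oldest first.

    Each candidate runs from the start of the file to just past the marker,
    including a single trailing line ending if one is present.
    """
    versions = []
    pos = data.find(EOF_MARKER)
    while pos != -1:
        end = pos + len(EOF_MARKER)
        # Keep one trailing line ending so the carved bytes can match
        # a file that was saved normally by the editing application.
        if data[end:end + 2] == b"\r\n":
            end += 2
        elif data[end:end + 1] in (b"\n", b"\r"):
            end += 1
        versions.append(data[:end])
        pos = data.find(EOF_MARKER, end)
    return versions


def sha256_hex(blob: bytes) -> str:
    """SHA-256 of a carved version, for comparison with a reference copy."""
    return hashlib.sha256(blob).hexdigest()
```

To try it, read a PDF with `open(path, "rb").read()`, pass the bytes to `carve_versions`, and compare `sha256_hex` of each result against the hashes of the per-step copies in the sample folder. The last item in the list is the file as it currently stands.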

Extracting images from only the latest version of the PDF can miss content. In the sample file I provided, you will note that the images extracted from the current version of the PDF do not include an image that is extracted from a prior version. That is why it's important to extract images from every version available.
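A rough Python sketch of that comparison, assuming the images from each version have already been extracted by whatever tool you prefer and loaded as bytes. The function name `images_missing_from_latest` and the input layout (a list ordered oldest to newest, each entry a list of image byte strings) are assumptions made for this example, not something taken from the script.

```python
import hashlib


def images_missing_from_latest(images_by_version: list[list[bytes]]) -> dict[str, int]:
    """Map the SHA-256 of each image absent from the latest version
    to the 1-based number of the earliest version that contains it."""
    if not images_by_version:
        return {}
    # Hashes of every image present in the newest version.
    latest = {hashlib.sha256(img).hexdigest() for img in images_by_version[-1]}
    missing: dict[str, int] = {}
    # Walk the older versions, oldest first.
    for number, images in enumerate(images_by_version[:-1], start=1):
        for img in images:
            digest = hashlib.sha256(img).hexdigest()
            if digest not in latest:
                # setdefault keeps the earliest version number seen.
                missing.setdefault(digest, number)
    return missing
```

Comparing by hash only catches byte-identical images. An image that was re-encoded between versions would show up as a different image.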

In Intella, the prior versions could perhaps be children of the actual PDF, and the images children of their respective version. Each prior version carries its own metadata as it stood at that point, so that is additional info you can get when extracting prior versions.


Hey Jacques, 

First off, thanks for putting all this together to bring to us. We always appreciate the hard work our end users do. We will take a look into the samples you provided and the BASH script. 

We might even look at putting this into a future release. 

Out of curiosity, could you provide some real-world examples of how this would help your workflow, or times you've already used something like this in your work?

As always you can check out Intella Latest Version for the release notes on newly released versions. 

Thanks!

Support Team

 


Hi Chris,

Up until recently, we didn't know this was possible. I had heard about it some years back, but at that time I was working in a law enforcement environment, so document forensics was not the main focus. In my current role, it's more prevalent. I already had the script to automate my PDF processing. I added the feature to carve prior versions recently.

An investigator in another agency successfully used it on a PDF they collected as part of their investigation relating to a fraudulent expense claim by a staff member. They were able to recover two prior versions of the document, showing how the user edited the document, and the date/time of those edits (because each version contains the metadata from that version, including the modified date/time).
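To illustrate the date/time part, here is a small Python sketch that pulls `/ModDate` values out of the raw bytes of a carved version. It is only an illustration and not how the script does it. It assumes the document information dictionary is stored uncompressed and unencrypted, and that the date is a literal string in the usual `D:YYYYMMDDHHmmSS` form. The name `mod_dates` is made up, and the timezone suffix is ignored to keep it short.

```python
import re
from datetime import datetime

# Matches e.g. /ModDate (D:20230102030405-05'00') and captures the 14 digits.
MOD_DATE = re.compile(rb"/ModDate\s*\(D:(\d{14})")


def mod_dates(version: bytes) -> list[datetime]:
    """Return every /ModDate found in the bytes, in file order.

    The timezone offset after the seconds is not parsed, so the
    results are naive datetimes in whatever zone the writer used.
    """
    return [
        datetime.strptime(m.group(1).decode("ascii"), "%Y%m%d%H%M%S")
        for m in MOD_DATE.finditer(version)
    ]
```

Because a carved version contains all the bytes of the versions before it, the last value returned is normally the one written by that version's save.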

For images, I know of a case where images were extracted from a PDF. A reverse image search (OSINT) of the signature on a medical invoice revealed that it had been copied from a children's website. It wasn't a doctor's signature.

 

As a workflow, extracting prior versions of a PDF and all embedded images from each version would allow an investigator to see (and search) across these prior versions, and see the images in the gallery. If you are looking at a PDF in Intella, it would be nice to have a tab with prior versions, or a link on the left to apply a filter to see prior versions, much like you can see a parent item. So when the investigator is reviewing a relevant PDF in an email attachment, or from a computer forensic image, they would have a visual indicator that prior versions of the document exist. Those prior versions (if available) become compelling evidence of tampering/fraud.

Jacques


  • 3 weeks later...

Good afternoon Jacques,

Do you mind if we share your script on our new forum (below)? We know others would find it useful, and we are trying to grow our script library and people's awareness of how it can be used. Of course, you are welcome to share it yourself, or we could add it for you.

https://community.vound-software.com/forum/44-scripting-share-your-scripts/
 
Thanks!

  • 4 weeks later...

Hi Chris,

Sorry for the late reply, I hadn't logged into the forum in a while. Most definitely, please share it with whomever. It's public on GitHub.

I've since also shared a Python script that I use to extract info from DOCX files as I do some research on them.

https://github.com/jjrboucher/MS-Word-Parser

That is also a public repository you can freely share.

I actually went ahead and posted both to that forum. Great to see others contributing as well!

Best,

Jacques
