MS Word Python Parsing Script

Jacques B

I wrote a Python script that parses various artifacts from an MS Word document and dumps them to 4 different worksheets in an Excel file. A few colleagues and I are using the script to test scenarios in MS Word and see how the artifacts are affected by different actions.


For example, did you know that if you upload a DOCX to Google Docs and later download it back to your computer, Google Docs strips out core.xml and app.xml, so you lose the author and created date, among other metadata? If you subsequently edit that newly downloaded document with MS Word, Word adds core.xml and app.xml back and sets the created date to the date of that edit, because that is when the new core.xml is written.
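If you want to check a document for this yourself, here is a minimal sketch using only the Python standard library. It is not the script described in this post. It assumes the two parts sit at their usual locations inside the DOCX zip package (docProps/core.xml and docProps/app.xml), and the function name docx_property_summary is just something made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: Word stores the property parts at these conventional paths.
CORE_PART = "docProps/core.xml"
APP_PART = "docProps/app.xml"

# Namespaces used by the core properties part.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def docx_property_summary(source):
    """Report whether the property parts exist and, if so, key core values.

    `source` can be a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    summary = {
        "has_core": False,
        "has_app": False,
        "creator": None,
        "created": None,
        "modified": None,
    }
    with zipfile.ZipFile(source) as package:
        names = set(package.namelist())
        summary["has_core"] = CORE_PART in names
        summary["has_app"] = APP_PART in names
        if summary["has_core"]:
            root = ET.fromstring(package.read(CORE_PART))
            # findtext returns None when the element is absent.
            summary["creator"] = root.findtext("dc:creator", namespaces=NS)
            summary["created"] = root.findtext("dcterms:created", namespaces=NS)
            summary["modified"] = root.findtext("dcterms:modified", namespaces=NS)
    return summary
```

Running it against the same document before the upload, after the download, and after the first edit in Word should show has_core going from True to False and back to True, with a different created value at the end if the behavior is as described above.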

I know of someone dealing with a situation where a LNK file shows a created date in June for a DOCX on a USB drive. The document was no longer on the drive, but they were able to recover it. The metadata of the recovered document shows a created date in July. After I explained the above scenario to them, they said it made perfect sense based on their knowledge of the case.


