Hi Markjrouse, creating load files from search results is not as easy as it sounds.
I have seen people make mistakes when selecting the data for load files. One mistake is creating the load file from direct search hits only which, as you know, do not include all family items. (In NZ, the discovery rules state that all family items need to be included with any relevant items.)
I have also seen load files where all child items are returned and included. This adds every embedded item to the load file as an individual item. It can really blow out the number of unnecessary TIFF images produced for the load file (one PDF file can result in hundreds of TIFF files) and the time it takes to create the load file. The overpopulated load file also produces a separate line for each item in the review system, which is not ideal.
After the searching phase is complete, care must be taken when selecting the documents for the load file so that you don't miss anything and don't end up with a whole lot of junk files in the load file. This is how we select these files and exclude the embedded items.
Show all of the top parent items from the search results (some items in the search results may already be top level items).
Using the top level items and all of the search hit items, we show all of the children and tag them. We do not select 'show direct children' as relevant files could be several levels deep (e.g. a Word doc in a zip file that is attached to an email which is in another zip file). This step helps to identify all pure parent items.
To locate the pure parent items, run a search on the search results and parent items, and also run an exclude search on the children tag. This will display only the pure parent items. Tag these pure parents and deduplicate them, as the client does not want to review duplicates.
Show all children items for the pure parent items and tag them.
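As a rough illustration only, the selection steps above boil down to some set arithmetic on the item tree. The sketch below is plain Python, not anything the review tool exposes: the `PARENT` map, the item names and the function names are all made up for the example, and deduplication is left out.

```python
from collections import defaultdict

# Hypothetical item tree: item -> parent (None marks a top level item).
PARENT = {
    "outer.zip": None,
    "mail.msg": "outer.zip",
    "inner.zip": "mail.msg",
    "report.docx": "inner.zip",
    "logo.png": "report.docx",
    "memo.pdf": None,
}

def build_children(parent_of):
    """Invert the parent map into parent -> set of direct children."""
    kids = defaultdict(set)
    for item, parent in parent_of.items():
        if parent is not None:
            kids[parent].add(item)
    return kids

def top_parent(item, parent_of):
    """Walk up the tree until an item with no parent is reached."""
    while parent_of[item] is not None:
        item = parent_of[item]
    return item

def all_descendants(items, kids):
    """All children at any depth, not just direct children."""
    found, stack = set(), list(items)
    while stack:
        for child in kids.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def select_family(hits, parent_of):
    kids = build_children(parent_of)
    tops = {top_parent(h, parent_of) for h in hits}      # top parents of the hits
    children_tag = all_descendants(tops | hits, kids)    # the 'children' tag
    pure_parents = (tops | hits) - children_tag          # the exclude search
    family_children = all_descendants(pure_parents, kids)
    return pure_parents, family_children

# The hit is a Word doc several levels deep, plus a loose PDF.
pure, children = select_family({"report.docx", "memo.pdf"}, PARENT)
```

Here the deep Word doc hit is not a pure parent itself; the outer zip file is, and the Word doc comes back in as one of its children.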
Now we need to 'clean' the child item tag. We want to keep all email attachments and anything that is in the body of an email (e.g. screenshots that have been pasted into an email).
Show the children tag and sort by the URI field. The first things you can remove from this tag are the image files marked 'PDF:Aperture' in the URI column. We will have the original PDF, so we don't need all of these bits and pieces. This will remove a large percentage of the embedded items. Do the same for WORD:Aperture, POWERPOINT:Aperture, OPENXML:Aperture etc.
Make sure you don't remove the items in ZIP:Aperture or MSG:Aperture. The MSG:Aperture items will include all of the LinkedIn and Twitter logos that we don't want; however, there could be screenshots in the body of emails that are relevant, and we don't want to get rid of these.
Clear the search and search again on the children tag. In facets, view by 'type' then 'image'. Now view in thumbnail view and manually remove items that are clearly 'junk' images.
This should clean up most of the embedded items.
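If you export the children tag to a list, the URI clean-up can be sketched roughly like this in Python. This is an illustration under assumptions: the record layout (`name`, `uri`, `type`) and the sample URI strings are invented placeholders, and the rule is a simple substring match on the markers named above. Anything under ZIP:Aperture or MSG:Aperture is kept, and the remaining images still need the manual thumbnail check.

```python
# Markers named in the post; extend this with whatever else shows up in your URI column.
DROP_MARKERS = ("PDF:Aperture", "WORD:Aperture", "POWERPOINT:Aperture", "OPENXML:Aperture")

def clean_children(children):
    """Split child items into those to keep and those embedded in a document."""
    kept, removed = [], []
    for item in children:
        if any(marker in item["uri"] for marker in DROP_MARKERS):
            removed.append(item)   # the original document is already in the load file
        else:
            kept.append(item)      # includes ZIP:Aperture and MSG:Aperture items
    return kept, removed

def images_for_review(items):
    """Images that survive the URI pass and still need a manual look."""
    return [item for item in items if item["type"] == "image"]

# Invented sample records; real URI values will look different.
sample = [
    {"name": "page1.png", "uri": "PDF:Aperture/report.pdf/page1.png", "type": "image"},
    {"name": "chart.emf", "uri": "WORD:Aperture/notes.docx/chart.emf", "type": "image"},
    {"name": "contract.pdf", "uri": "MSG:Aperture/mail.msg/contract.pdf", "type": "pdf"},
    {"name": "screenshot.png", "uri": "MSG:Aperture/mail.msg/screenshot.png", "type": "image"},
    {"name": "budget.xlsx", "uri": "ZIP:Aperture/archive.zip/budget.xlsx", "type": "spreadsheet"},
]

kept, removed = clean_children(sample)
to_review = images_for_review(kept)
```

In this sample the two document-embedded images are dropped, the email attachment and the zipped spreadsheet stay, and the pasted screenshot is left for the thumbnail review.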
Regards
Jon