
Reducing case size by filtering out items based on type and size


igor_r


Hello everyone,

 

Today we will look into a very simple yet powerful crawler script. It lets one exclude items from processing based on their type and size. Depending on the data, the script could:

  1. Reduce the case size by many gigabytes by excluding items that are irrelevant.
  2. Potentially reduce the indexing time by skipping very large and complex documents.
  3. Prevent memory and timeout errors by skipping items known to cause problems (e.g. large PDFs).

 

Preparations

The script doesn’t use any dependencies. Therefore, you can use it with any Intella installation.

 

Which items to skip

First of all, let’s look at the list of potential item types we could skip and why we would do that. Let’s put all types into the table below:

Type                     | Size            | Top-Level Parent | Reason
-------------------------|-----------------|------------------|---------------------------------------------
Unknown Binary Files     | Any             | Yes              | Unsupported
Executables (EXE, DLL)   | Any             | Yes              | Unsupported
MS Access Database (mdb) | Any             | Any              | Unsupported
Video Files              | 10 MB or larger | Any              | Reduce case size
PDF Documents            | 20 MB or larger | Any              | Known to cause problems
Excel Spreadsheets       | 30 MB or larger | Any              | Irrelevant to the case
Image Files              | 10 KB or less   | Children         | Small embedded images are usually irrelevant

The above table is just an example. The list of types and their sizes could be different depending on your needs.
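
As a rough illustration, the table above can also be written down as data, which makes the thresholds easy to adjust. This is only a sketch: the names SkipRule, RULES and rule_applies are made up for this example and are not part of Intella, the units assume 1 KB = 1024 bytes, and the "Top-Level Parent" column is interpreted as "Yes" = top-level items only, "Children" = non-top-level items only.

```python
from typing import NamedTuple, Optional

KB = 1024          # assumption: binary units
MB = 1024 * KB


class SkipRule(NamedTuple):
    """One row of the table (hypothetical structure, not an Intella type)."""
    type_name: str
    min_size: Optional[int]  # skip when size >= min_size (None = no lower bound)
    max_size: Optional[int]  # skip when size <= max_size (None = no upper bound)
    scope: str               # "top-level", "children" or "any"
    reason: str


RULES = [
    SkipRule("Unknown Binary Files", None, None, "top-level", "Unsupported"),
    SkipRule("Executables (EXE, DLL)", None, None, "top-level", "Unsupported"),
    SkipRule("MS Access Database (mdb)", None, None, "any", "Unsupported"),
    SkipRule("Video Files", 10 * MB, None, "any", "Reduce case size"),
    SkipRule("PDF Documents", 20 * MB, None, "any", "Known to cause problems"),
    SkipRule("Excel Spreadsheets", 30 * MB, None, "any", "Irrelevant to the case"),
    SkipRule("Image Files", None, 10 * KB, "children",
             "Small embedded images are usually irrelevant"),
]


def rule_applies(rule, size, is_top_level_parent):
    """Check only the Size and Top-Level Parent columns of a rule."""
    if rule.min_size is not None and size < rule.min_size:
        return False
    if rule.max_size is not None and size > rule.max_size:
        return False
    if rule.scope == "top-level":
        return is_top_level_parent
    if rule.scope == "children":
        return not is_top_level_parent
    return True
```

Matching the Type column still requires mapping each type name to one or more media types, which is covered further down in this post.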

 

Creating a crawler script

The easiest way to start is to get this empty crawler script that does nothing, but contains the required definitions and imports. We’ll be using the itemFound function because we don’t need access to the extracted text and metadata and want to exclude items as soon as possible to reduce the indexing time.

 

To exclude a specific type from processing, we check its mediaType and return Action.Stub. That means Intella will not process the item and will not cache its content in the case folder. The item will still be present in the case with a minimal amount of metadata (name, type, location). For example, to exclude PDF documents that are larger than 20 MB we can use this code:

size = get_item_size(item)
if item.mediaType == "application/pdf" and size > 20 * MB:
	return FoundItemResult(action=Action.Stub)
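
The size check in the snippet above can be tried outside Intella as a plain function. This is a sketch under assumptions: MB is taken to be 1024 * 1024 bytes (the real script may define it differently), and is_oversized_pdf is a hypothetical helper name. Note that the snippet uses a strict ">" comparison, so a PDF of exactly 20 MB is still processed; to match the "20 MB or larger" wording of the table, ">=" would be needed.

```python
MB = 1024 * 1024        # assumption: binary megabytes
PDF_LIMIT = 20 * MB


def is_oversized_pdf(media_type, size, inclusive=False):
    """Return True when the item is a PDF above the limit.

    inclusive=False mirrors the strict ">" used in the snippet above;
    inclusive=True matches the "20 MB or larger" wording of the table.
    """
    if media_type != "application/pdf" or size is None:
        return False
    return size >= PDF_LIMIT if inclusive else size > PDF_LIMIT
```

In the crawler script itself, a True result would correspond to returning FoundItemResult(action=Action.Stub).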

If we wanted to exclude all embedded images that are smaller than 10 KB, we could use the code below. To identify an image, we check that its media type starts with the “image/” prefix (e.g. image/png, image/jpeg and so on). Additionally, we check that the item is not a top-level parent, so that standalone images are still processed.

if item.mediaType is not None and item.mediaType.startswith("image/") and size < 10*KB:
	if not item.isTopLevelParent:
		return FoundItemResult(action=Action.Stub)
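
To see how the image rule behaves on different inputs, here is a small dry run that does not need Intella. It is only a sketch: the items are fake objects built with SimpleNamespace that carry the same mediaType and isTopLevelParent attributes used above, the sizes are passed in directly, and KB is assumed to be 1024 bytes.

```python
from types import SimpleNamespace

KB = 1024  # assumption: binary kilobytes


def is_small_embedded_image(item, size, limit=10 * KB):
    """Mirror the checks from the snippet above as a single predicate."""
    media_type = item.mediaType
    if media_type is None:  # unknown type: startswith() would fail on None
        return False
    return (media_type.startswith("image/")
            and size < limit
            and not item.isTopLevelParent)


# Fake items standing in for what the crawler would pass to itemFound.
samples = [
    (SimpleNamespace(mediaType="image/png", isTopLevelParent=False), 4 * KB),
    (SimpleNamespace(mediaType="image/jpeg", isTopLevelParent=True), 4 * KB),
    (SimpleNamespace(mediaType="image/png", isTopLevelParent=False), 50 * KB),
    (SimpleNamespace(mediaType=None, isTopLevelParent=False), 1 * KB),
]

results = [is_small_embedded_image(item, size) for item, size in samples]
print(results)  # [True, False, False, False]
```

Only the first sample is stubbed: the second is a top-level image, the third is too large, and the fourth has no media type.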

 

The resulting script can be found here: filter_by_type_and_size.py
 

Testing the script

Download this sample disk image that contains a few files of different types and sizes: Crawler Script Type and Size Filter Test.ad1 (64 MB). Create a new case, add the script and index the image.

Here are some results:

  • Case folder size without the script: 165 MB (127 items)
  • Case folder size with the script: 4 MB (16 items)
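
To put the two measurements above in relative terms, the reduction can be computed with a few lines of Python. The variable names are only for illustration; the numbers are the ones listed above.

```python
size_without_mb, size_with_mb = 165, 4
items_without, items_with = 127, 16

# Relative reduction in percent: (before - after) / before * 100
size_reduction = (size_without_mb - size_with_mb) / size_without_mb * 100
item_reduction = (items_without - items_with) / items_without * 100

print(f"Case folder size reduced by {size_reduction:.1f}%")  # 97.6%
print(f"Item count reduced by {item_reduction:.1f}%")        # 87.4%
```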

 

After indexing the image with the script, we can see that the items we wanted to exclude were not processed:

image.png

 

List of supported media types

 

You may notice that we need to use media types like “application/pdf” in the script instead of the full type names like “PDF Document”. There is a way to get a list of all media types supported by Intella. The list is located in the Intella program folder:

c:\Program Files\Vound\Intella 2.6.1\languages\mimetype-descriptions_en.properties

You can open this file in a text editor and search for a specific name to find out the corresponding media type. For example, if we search for “PDF Document” we find this line that contains its media type:

application/pdf=PDF Document
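
Instead of searching the file by hand, the lookup can be scripted. The sketch below assumes the file consists of simple key=value lines like the one shown above; it does not handle the escape sequences or line continuations that Java .properties files allow, and the function name load_media_types is made up for this example. Apart from the PDF line, the sample lines are illustrative and not copied from the real file.

```python
def load_media_types(lines):
    """Map each type description to the list of media types that use it."""
    by_description = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and .properties comments ("#" or "!").
        if not line or line.startswith(("#", "!")):
            continue
        media_type, sep, description = line.partition("=")
        if not sep:
            continue  # not a key=value line
        by_description.setdefault(description.strip(), []).append(media_type.strip())
    return by_description


# Sample content; only the PDF line is taken from the post.
sample = [
    "# illustrative sample",
    "application/pdf=PDF Document",
    "image/png=PNG Image",
    "",
]
types = load_media_types(sample)
print(types["PDF Document"])  # ['application/pdf']

# With the real file (path from the post; the encoding is an assumption):
# path = r"c:\Program Files\Vound\Intella 2.6.1\languages\mimetype-descriptions_en.properties"
# with open(path, encoding="utf-8") as f:
#     types = load_media_types(f)
```

Because several media types can share one description, the values are lists rather than single strings.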