Reducing case size by filtering out items based on type and size

igor_r · November 10, 2023

Hello everyone,

Today we will look into a very simple yet powerful crawler script. It lets you exclude items from processing based on their type and size. Depending on the data, the script could:

Reduce the case size by many gigabytes by excluding items that are irrelevant.
Potentially reduce the indexing time by skipping very large and complex documents.
Prevent memory and timeout errors by skipping items known to cause problems (e.g. large PDFs).

Preparations

The script doesn’t use any dependencies. Therefore, you can use it with any Intella installation.

Which items to skip

First of all, let’s look at the list of potential item types we could skip and why we would do that. Let’s put all types into the table below:

Type	Size	Top-Level Parent	Reason
Unknown Binary Files	Any	Yes	Unsupported
Executables (EXE, DLL)	Any	Yes	Unsupported
MS Access Database (mdb)	Any	Any	Unsupported
Video Files	10 MB or larger	Any	Reduce case size
PDF Documents	20 MB or larger	Any	Known to cause problems
Excel Spreadsheets	30 MB or larger	Any	Irrelevant to the case
Image Files	10 KB or less	Children	Small embedded images are usually irrelevant

The above table is just an example. The list of types and their sizes could be different depending on your needs.
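
One way to keep such a table easy to adjust is to express it as data instead of a chain of if statements. The following is only a sketch in plain Python and is not part of the published script: Rule, RULES and matching_rule are hypothetical names, KB and MB are assumed to be binary units (1024-based), and only the rows whose media types are clear from this post are included (PDF, plus the standard "video/" and "image/" prefixes). The other rows would follow the same pattern once their media types are looked up.

```python
from dataclasses import dataclass
from typing import Optional

KB = 1024        # assumption: binary units
MB = 1024 * KB


@dataclass(frozen=True)
class Rule:
    """One row of the table (hypothetical structure, not an Intella API)."""
    type_prefix: str                 # full media type, or a prefix such as "image/"
    min_size: Optional[int] = None   # skip when size >= min_size
    max_size: Optional[int] = None   # skip when size <= max_size
    children_only: bool = False      # apply only to items that are not top-level parents
    reason: str = ""


RULES = [
    Rule("video/", min_size=10 * MB, reason="Reduce case size"),
    Rule("application/pdf", min_size=20 * MB, reason="Known to cause problems"),
    Rule("image/", max_size=10 * KB, children_only=True,
         reason="Small embedded images are usually irrelevant"),
]


def matching_rule(media_type, size, is_top_level_parent):
    """Return the first rule that says the item should be skipped, else None."""
    if media_type is None:
        return None
    for rule in RULES:
        if not media_type.startswith(rule.type_prefix):
            continue
        if rule.children_only and is_top_level_parent:
            continue
        if rule.min_size is not None and size < rule.min_size:
            continue
        if rule.max_size is not None and size > rule.max_size:
            continue
        return rule
    return None
```

Inside a crawler script, a non-None result would be the point at which to return the stub action, and the rule's reason could be logged for later review.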

Creating a crawler script

The easiest way to start is to get this empty crawler script that does nothing, but contains the required definitions and imports. We’ll be using the itemFound function because we don’t need access to the extracted text and metadata and want to exclude items as soon as possible to reduce the indexing time.

To exclude a specific type from processing, we check the item’s mediaType and return Action.Stub. Intella will then neither process the item nor cache its content in the case folder. The item will still be present in the case with a minimal amount of metadata (name, type, location). For example, to exclude PDF documents that are larger than 20 MB we can use this code:

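# get_item_size() and MB are assumed to be defined elsewhere in the script
size = get_item_size(item)
# Stub PDFs over 20 MB: the item stays in the case but is not processed
if item.mediaType == "application/pdf" and size > 20 * MB:
	return FoundItemResult(action=Action.Stub)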

If we wanted to exclude all embedded images that are smaller than 10 KB, we could use this code. To identify an image, we check that its media type starts with the “image/” prefix (e.g. image/png, image/jpeg and so on). Additionally, we check that the item is not a top-level parent.

if item.mediaType is not None and item.mediaType.startswith("image/") and size < 10 * KB:
	if not item.isTopLevelParent:
		return FoundItemResult(action=Action.Stub)
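
Because the two checks only look at the media type, the size and the top-level flag, they can be tried outside Intella before indexing real data. The sketch below is a stand-alone approximation: should_stub is a hypothetical helper that returns True where the crawler script would return Action.Stub, the items are simple stand-ins that model only mediaType and isTopLevelParent, and KB and MB are assumed to be 1024-based.

```python
from types import SimpleNamespace

KB = 1024        # assumption: binary units
MB = 1024 * KB


def should_stub(item, size):
    """Mirror the two checks from the post; True means 'return Action.Stub'."""
    media_type = item.mediaType
    if media_type == "application/pdf" and size > 20 * MB:
        return True
    if media_type is not None and media_type.startswith("image/") and size < 10 * KB:
        if not item.isTopLevelParent:
            return True
    return False


# Stand-in items: only the two attributes used above are modelled.
big_pdf = SimpleNamespace(mediaType="application/pdf", isTopLevelParent=True)
inline_icon = SimpleNamespace(mediaType="image/png", isTopLevelParent=False)
loose_icon = SimpleNamespace(mediaType="image/png", isTopLevelParent=True)

print(should_stub(big_pdf, 25 * MB))      # True
print(should_stub(inline_icon, 4 * KB))   # True
print(should_stub(loose_icon, 4 * KB))    # False
```

The last case shows why the top-level check matters: a small image that was added to the case as a file in its own right is kept, while the same image embedded in an email or document is stubbed.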

The resulting script can be found here: filter_by_type_and_size.py

Testing the script

Download this sample disk image that contains a few files of different types and sizes: Crawler Script Type and Size Filter Test.ad1 (64 MB). Create a new case, add the script and index the image.

Here are some results:

Case folder size without the script: 165 MB (127 items)
Case folder size with the script: 4 MB (16 items)

After indexing the image with the script, we can see that the items we wanted to exclude were not processed.

List of supported media types

You may notice that we need to use media types like “application/pdf” in the script instead of the full type names like “PDF Document”. There is a way to get a list of all media types supported by Intella. The list is located in the Intella program folder:

c:\Program Files\Vound\Intella 2.6.1\languages\mimetype-descriptions_en.properties

You can open this file in a text editor and search for a specific name to find out the corresponding media type. For example, if we search for “PDF Document” we find this line that contains its media type:

application/pdf=PDF Document
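
If you need to look up several types, the search can be scripted. The following is a rough sketch that assumes every relevant line has the simple media/type=Description form shown above. load_media_types and find_by_description are hypothetical helper names, and the parser deliberately ignores the escape sequences and line continuations that the properties format allows.

```python
def load_media_types(lines):
    """Parse 'media/type=Description' lines into a dict keyed by media type."""
    media_types = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments and anything without a key/value separator.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        media_type, description = line.split("=", 1)
        media_types[media_type.strip()] = description.strip()
    return media_types


def find_by_description(media_types, text):
    """Return the media types whose description contains text (case-insensitive)."""
    needle = text.lower()
    return sorted(mt for mt, desc in media_types.items() if needle in desc.lower())


sample = ["# comment", "application/pdf=PDF Document"]
print(find_by_description(load_media_types(sample), "pdf document"))
# ['application/pdf']
```

To run it against the real file, pass the opened file object to load_media_types instead of the sample list. The file encoding is not stated in this post, so it may need to be specified explicitly when opening the file.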
