igor_r, posted November 10

Hello everyone,

Today we will look into a very simple yet powerful crawler script. It lets you exclude items from processing based on their type and size. Depending on the data, the script could:

- Reduce the case size by many gigabytes by excluding items that are irrelevant.
- Potentially reduce the indexing time by skipping very large and complex documents.
- Prevent memory and timeout errors by skipping items known to cause problems (e.g. large PDFs).

Preparations

The script doesn’t use any dependencies. Therefore, you can use it with any Intella installation.

Which items to skip

First of all, let’s look at the list of item types we could skip and why we would do that. Let’s put all the types into the table below:

| Type | Size | Top-Level Parent | Reason |
|---|---|---|---|
| Unknown Binary Files | Any | Yes | Unsupported |
| Executables (EXE, DLL) | Any | Yes | Unsupported |
| MS Access Database (mdb) | Any | Any | Unsupported |
| Video Files | 10 MB or larger | Any | Reduce case size |
| PDF Documents | 20 MB or larger | Any | Known to cause problems |
| Excel Spreadsheets | 30 MB or larger | Any | Irrelevant to the case |
| Image Files | 10 KB or less | Children | Small embedded images are usually irrelevant |

The above table is just an example. The list of types and their sizes could be different depending on your needs.

Creating a crawler script

The easiest way to start is to get this empty crawler script that does nothing, but contains the required definitions and imports. We’ll be using the itemFound function because we don’t need access to the extracted text and metadata, and we want to exclude items as early as possible to reduce the indexing time.

To exclude a specific type from processing, we check its mediaType and return Action.Stub. That means Intella will not process the item and will not cache its content in the case folder. The item will still be present in the case with a minimal amount of metadata (name, type, location).
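Before looking at the Intella-specific snippets, the example table can be written down as plain data, which makes the rules easy to review and change. The following is only a minimal sketch in plain Python and is not part of the script from this post: `Rule`, `RULES` and `skip_reason` are hypothetical names, the `KB` and `MB` values are assumed to be 1024-based, and the `video/` prefix is an assumption about how video files are reported. Only three rows of the table are included, because the media types for the other rows would first have to be looked up. The comparisons follow the table wording ("or larger", "or less") and are therefore inclusive, while the snippets in the post use strict comparisons; the difference only matters at the exact boundary.

```python
from typing import NamedTuple, Optional

KB = 1024          # assumption: 1024-based units
MB = 1024 * KB


class Rule(NamedTuple):
    prefix: str               # media type, matched as a prefix
    min_size: Optional[int]   # skip when size >= min_size
    max_size: Optional[int]   # skip when size <= max_size
    children_only: bool       # rule does not apply to top-level parents
    reason: str


# Three rows of the example table; the "video/" prefix is an assumption.
RULES = [
    Rule("video/", 10 * MB, None, False, "Reduce case size"),
    Rule("application/pdf", 20 * MB, None, False, "Known to cause problems"),
    Rule("image/", None, 10 * KB, True,
         "Small embedded images are usually irrelevant"),
]


def skip_reason(media_type, size, is_top_level_parent):
    """Return the reason to skip an item, or None if it should be processed."""
    if media_type is None:
        return None
    for rule in RULES:
        if not media_type.startswith(rule.prefix):
            continue
        if rule.children_only and is_top_level_parent:
            continue
        if rule.min_size is not None and size < rule.min_size:
            continue
        if rule.max_size is not None and size > rule.max_size:
            continue
        return rule.reason
    return None


print(skip_reason("application/pdf", 25 * MB, True))
print(skip_reason("image/png", 4 * KB, True))
```

Inside a crawler script, a non-None result from a function like this would correspond to returning Action.Stub for the item.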
To exclude PDF documents that are larger than 20 MB, for example, we can use this code:

    # get_item_size, MB and KB are not shown in this snippet;
    # they are assumed to be defined elsewhere in the script.
    size = get_item_size(item)
    if item.mediaType == "application/pdf" and size > 20 * MB:
        return FoundItemResult(action=Action.Stub)

If we wanted to exclude all embedded images that are smaller than 10 KB, we could use the code below. To identify an image, we check that its media type starts with the “image/” prefix (e.g. image/png, image/jpeg and so on). Additionally, we check that the item is not a top-level parent.

    if item.mediaType is not None and item.mediaType.startswith("image/") and size < 10 * KB:
        # Only skip embedded (child) images, never top-level ones.
        if not item.isTopLevelParent:
            return FoundItemResult(action=Action.Stub)

The resulting script can be found here: filter_by_type_and_size.py

Testing the script

Download this sample disk image that contains a few files of different types and sizes: Crawler Script Type and Size Filter Test.ad1 (64 MB). Create a new case, add the script and index the image. Here are some results:

- Case folder size without the script: 165 MB (127 items)
- Case folder size with the script: 4 MB (16 items)

After indexing the image with the script, we can see that the items we wanted to exclude were not processed.

List of supported media types

You may notice that we need to use media types like “application/pdf” in the script instead of the full type names like “PDF Document”. There is a way to get a list of all media types supported by Intella. The list is located in the Intella program folder:

c:\Program Files\Vound\Intella 2.6.1\languages\mimetype-descriptions_en.properties

You can open this file in a text editor and search for a specific name to find the corresponding media type. For example, if we search for “PDF Document”, we find this line that contains its media type:

application/pdf=PDF Document
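The lookup by type name can also be done programmatically instead of searching in a text editor. The following is a minimal sketch, not something shipped with Intella: `parse_properties` and `find_media_types` are hypothetical helper names, the parser only handles plain `key=value` lines (no escapes or line continuations), and the sample text is a made-up excerpt in which only the application/pdf line is taken from this post.

```python
def parse_properties(text):
    """Parse simple 'key=value' lines, skipping blanks and '#' or '!' comments."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            mapping[key.strip()] = value.strip()
    return mapping


def find_media_types(mapping, description):
    """Return media types whose description contains the text (case-insensitive)."""
    needle = description.lower()
    return sorted(mt for mt, desc in mapping.items() if needle in desc.lower())


# Hypothetical excerpt; only the application/pdf line comes from the post.
SAMPLE = """\
# sample media type descriptions
application/pdf=PDF Document
image/png=PNG Image
image/jpeg=JPEG Image
"""

media_types = parse_properties(SAMPLE)
print(find_media_types(media_types, "PDF Document"))
```

To use it with a real installation, the contents of the mimetype-descriptions_en.properties file would be read into a string and passed to `parse_properties` in place of `SAMPLE`.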