igor_r · Administrators · Posts: 60 · Days Won: 8
Everything posted by igor_r

  1. Hi Geo, Which category level do you want to see exactly? The top-level one (Communication) or the one below (Email)? Or as two fields?
  2. Hello Geo, Thanks for posting the suggestion. I would like to clarify a few things: 1. Are you referencing the "Duplicate custodians and locations" feature as the one this should be based on? https://www.vound-software.com/docs/intella/2.6.1/#_duplicate_custodians_and_locations 2. Can you provide a bit more detail on how you envision this feature working? Ideally, a step-by-step process to determine Near Duplicate Custodians.
  3. Hello Edgar, Did you try moving the "Load File Image" content type to the top of the list in the export settings? (see below)
  4. Hi Jung, Thanks for posting this question here. Let me give you some pointers on where you could start.

    First of all, take a look at the GitHub repo: https://github.com/vound-software/intella-crawler-scripts. It has extensive documentation and a lot of samples that cover many scenarios.

    File type filtering (or signature filtering) can be done via the UI. So, if you know for sure that you are only interested in certain types, you could use that instead of scripting. The only caveat is that it can only stub the unwanted items, not remove them completely. If you want to remove them from the case, you would need to use scripting. See: https://www.vound-software.com/docs/intella/2.6.1/#_file_type_settings In the upcoming 2.7 release, Intella will have a new feature where you can exclude items by their name or extension.

    To filter items by date, please see this example: https://github.com/vound-software/intella-crawler-scripts/blob/main/samples/advanced/filter_date_toplevel.py. There is also an article that shows how to filter items by type and size.

    The only problematic part of your request is filtering by location (path). That is not currently fully supported. If you index disk images, you can filter the top-level files in the disk image by path, but that is the only option at the moment. Please see an example of how to do that here: https://github.com/vound-software/intella-crawler-scripts/blob/main/samples/advanced/filter_fs_path.py

    I would recommend first looking at the individual scripts above and doing some testing to make sure they work as you expect. Then you can combine it all into a final single script (a rough sketch of how the pieces could fit together follows below). Also, please take a look at the current limitations, specifically the handling of duplicates: https://github.com/vound-software/intella-crawler-scripts#current-limitations I hope this helps.
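    As a rough, untested structural sketch of how the individual checks could be combined into one script: the two placeholder helpers are meant to be filled in from the linked date and path samples, the example media type is just an illustration, and the ScriptService/FoundItemResult definitions are assumed to come from the standard imports in the GitHub sample scripts.

        # Skeleton of a combined crawler script: each check returns Action.Stub
        # to exclude an item, otherwise the item falls through to normal processing.
        # Assumes the standard crawler-script definitions/imports from the GitHub samples.

        def is_unwanted_type(item):
            # Example only: stub unknown binaries; adjust the types to your needs.
            return item.mediaType == "application/octet-stream"

        def is_outside_date_range(item):
            # Placeholder: copy the logic from filter_date_toplevel.py here.
            return False

        def is_excluded_path(item):
            # Placeholder: copy the logic from filter_fs_path.py here.
            return False

        class ScriptHandler(ScriptService.Iface):
            def itemFound(self, item):
                if is_unwanted_type(item) or is_outside_date_range(item) or is_excluded_path(item):
                    return FoundItemResult(action=Action.Stub)
                return FoundItemResult(action=Action.Include)

            def itemProcessed(self, item):
                return ProcessedItemResult(action=Action.Include)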
  5. Hi Michael, I don't have any experience with PhotoDNA or pyPhotoDNA. But if you can show me a Python script that generates a hash, I can help you translate it into a crawler script. It should be somewhat similar to the grayscale detection script that I posted here. The idea is that you take the image content, which is stored in item.binaryFile, call the PhotoDNA library to calculate the hash, and then store it in a custom column (a rough sketch of that shape is below).
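    Purely as an untested sketch of that shape: the compute_photodna_hash function and the "PhotoDNA Hash" column name below are placeholders (the stand-in uses a plain MD5 so the sketch runs), and the ScriptService/CustomColumn definitions are assumed to come from the standard imports in the GitHub sample scripts.

        import hashlib

        def compute_photodna_hash(path):
            # Placeholder: call the actual PhotoDNA / pyPhotoDNA routine here.
            # As a stand-in this returns a plain MD5 of the file content.
            with open(path, "rb") as f:
                return hashlib.md5(f.read()).hexdigest()

        class ScriptHandler(ScriptService.Iface):
            def itemFound(self, item):
                return FoundItemResult(action=Action.Include)

            def itemProcessed(self, item):
                custom_columns = []
                if item.binaryFile is not None and item.mediaType is not None and item.mediaType.startswith("image/"):
                    hash_value = compute_photodna_hash(item.binaryFile)
                    custom_columns.append(CustomColumn("PhotoDNA Hash", CustomColumnType.String,
                                                       CustomColumnValue(value=hash_value)))
                return ProcessedItemResult(action=Action.Include, customColumns=custom_columns)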
  6. Hi Jacques, Yes, you could definitely write such a script. Crawler scripts can be written in Python, Groovy or Java; you can choose any of them depending on your preferences and needs. You should first write the script and test that it works correctly. Then you can add the script to the source definition in Connect on the Options page, see https://www.vound-software.com/docs/connect/2.6.1/Intella Connect Administrator Manual.html#server-indexing-options. Please take a look at the documentation and examples on the GitHub page: https://github.com/vound-software/intella-crawler-scripts The grayscale detection script is structured very similarly to what you want (https://github.com/vound-software/intella-crawler-scripts/blob/main/samples/advanced/detect_grayscale.py). You would need to load the PDF binary and check for the presence of %%EOF tokens (a rough sketch is below).
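    As an untested sketch of that idea: the tag names and the simple "look for an %%EOF marker" heuristic are just illustrations, not a definitive check for PDF completeness, and the ScriptService definitions are assumed to come from the standard imports in the GitHub sample scripts.

        # Sketch only: tags PDF items that do not contain an %%EOF end-of-file marker.
        # Assumes the standard crawler-script definitions/imports from the GitHub samples.

        class ScriptHandler(ScriptService.Iface):
            def itemFound(self, item):
                return FoundItemResult(action=Action.Include)

            def itemProcessed(self, item):
                tags = set()
                if item.binaryFile is not None and item.mediaType == "application/pdf":
                    with open(item.binaryFile, "rb") as f:
                        data = f.read()
                    if data.count(b"%%EOF") == 0:
                        tags.add("PDF Check/Missing EOF")
                    else:
                        tags.add("PDF Check/Has EOF")
                return ProcessedItemResult(action=Action.Include, tags=tags)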
  7. Hello everyone, Today we will look into a very simple yet powerful crawler script. It lets one exclude items from processing based on their type and size. Depending on the data, the script could:

    - Reduce the case size by many gigabytes by excluding items that are irrelevant.
    - Potentially reduce the indexing time by skipping very large and complex documents.
    - Prevent memory and timeout errors by skipping items known to cause problems (e.g. large PDFs).

    Preparations

    The script doesn't use any dependencies. Therefore, you can use it with any Intella installation.

    Which items to skip

    First of all, let's look at the list of potential item types we could skip and why we would do that. Let's put all types into the table below:

    Type                       Size             Top-Level Parent  Reason
    Unknown Binary Files       Any              Yes               Unsupported
    Executables (EXE, DLL)     Any              Yes               Unsupported
    MS Access Database (mdb)   Any              Any               Unsupported
    Video Files                10 MB or larger  Any               Reduce case size
    PDF Documents              20 MB or larger  Any               Known to cause problems
    Excel Spreadsheets         30 MB or larger  Any               Irrelevant to the case
    Image Files                10 KB or less    Children          Small embedded images are usually irrelevant

    The above table is just an example. The list of types and their sizes could be different depending on your needs.

    Creating a crawler script

    The easiest way to start is to get this empty crawler script that does nothing but contains the required definitions and imports. We'll be using the itemFound function because we don't need access to the extracted text and metadata, and we want to exclude items as soon as possible to reduce the indexing time.

    To exclude a specific type from processing, we check its mediaType and return Action.Stub. That means Intella will not process the item and will not cache its content in the case folder. The item will still be present in the case with a minimal amount of metadata (name, type, location). For example, to exclude PDF documents that are larger than 20 MB we can use this code:

        size = get_item_size(item)
        if item.mediaType == "application/pdf" and size > 20 * MB:
            return FoundItemResult(action=Action.Stub)

    If we wanted to exclude all embedded images that are smaller than 10 KB, we could use the code below. In order to identify an image, we check that its media type starts with the "image/" prefix (e.g. image/png, image/jpeg and so on). Additionally, we check that the item is not a top-level parent.

        if item.mediaType is not None and item.mediaType.startswith("image/") and size < 10 * KB:
            if not item.isTopLevelParent:
                return FoundItemResult(action=Action.Stub)

    The resulting script can be found here: filter_by_type_and_size.py

    Testing the script

    Download this sample disk image that contains a few files of different types and sizes: Crawler Script Type and Size Filter Test.ad1 (64 MB). Create a new case, add the script and index the image. Here are some results:

    - Case folder size without the script: 165 MB (127 items)
    - Case folder size with the script: 4 MB (16 items)

    After indexing the image with the script, we can see that the items that we wanted to exclude were not processed.

    List of supported media types

    You may notice that we need to use media types like "application/pdf" in the script instead of the full type names like "PDF Document". There is a way to get a list of all media types supported by Intella. The list is located in the Intella program folder: c:\Program Files\Vound\Intella 2.6.1\languages\mimetype-descriptions_en.properties

    You can open this file in a text editor and search for a specific name to find out the corresponding media type. For example, if we search for "PDF Document" we find this line, which contains its media type:

        application/pdf=PDF Document
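    For reference, here is an untested sketch of how the snippets above could be combined into a single itemFound function. The real script is the filter_by_type_and_size.py linked above; the get_item_size helper here is just an assumption that the item size can be read from its binary file, and the rules shown are only a subset of the example table.

        import os

        KB = 1024
        MB = 1024 * KB

        def get_item_size(item):
            # Assumption: approximate the item size by the size of its binary file.
            return os.path.getsize(item.binaryFile) if item.binaryFile else 0

        class ScriptHandler(ScriptService.Iface):
            def itemFound(self, item):
                size = get_item_size(item)
                # Large PDFs: known to cause problems.
                if item.mediaType == "application/pdf" and size > 20 * MB:
                    return FoundItemResult(action=Action.Stub)
                # Large video files: reduce case size.
                if item.mediaType is not None and item.mediaType.startswith("video/") and size > 10 * MB:
                    return FoundItemResult(action=Action.Stub)
                # Small embedded images: usually irrelevant.
                if item.mediaType is not None and item.mediaType.startswith("image/") and size < 10 * KB:
                    if not item.isTopLevelParent:
                        return FoundItemResult(action=Action.Stub)
                return FoundItemResult(action=Action.Include)

            def itemProcessed(self, item):
                return ProcessedItemResult(action=Action.Include)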
  8. Hello everyone, Intella can extract web browser history from all modern browsers (Chrome, Firefox, IE, Safari). But the URLs often contain additional information about what happened when the user clicked on a link or searched for a keyword. This metadata is encoded in URL parameters. Today we will look into creating a crawler script that can extract various pieces of information from Google search URLs. We will be using this open source tool: https://github.com/obsidianforensics/unfurl. It's written in Python, so we can use it directly in a crawler script.

    Preparations

    Before we begin, we need to add the library to the Python bundled with Intella. To do that, open the command line prompt as Administrator and change the directory to c:\Program Files\Vound\Intella 2.6.1\python. Due to limitations of the embedded Python, we can't install this module directly. Instead, we will create a "venv" (virtual environment) and install unfurl into it. First, let's install the venv module, create a new environment (called "unfurl") and then activate it:

        > .\python.exe -m pip install virtualenv
        > .\python.exe -m virtualenv unfurl
        > .\unfurl\Scripts\activate

    Now we are in the "unfurl" environment. Let's install the library and its dependencies:

        > pip install flask==2.3.3
        > pip install dfir-unfurl
        > deactivate

    At the time of writing this article, Intella did not have support for virtual environments in Python scripts. So let's just copy the content of the "unfurl" folder c:\Program Files\Vound\Intella 2.6.1\python\unfurl back to the parent "python" folder (one level up), replacing any existing files: c:\Program Files\Vound\Intella 2.6.1\python

    Testing the library

    Before creating the script, let's first check that the library actually works and can extract information from URLs. Download this test script: unfurl_test.py and save it to a folder on your computer. Then run it using the command prompt from the previous step:

        > .\python.exe C:\temp\unfurl_test.py

    It should produce output like this:

        [1] https://www.google.com/search?q=google&rlz=1C1YTUH_en-GBAU1048AU1048
         ├─(u)─[2] Scheme: https
         ├─(u)─[3] www.google.com
         |  ├─(u)─[6] Subdomain: www
         |  ├─(u)─[7] Domain Name: google.com
         |  |  ├─(u)─[11] Domain is on list of known Google domains
         |  |  └─(u)─[12] Domain is extremely popular (found in "Top 1000" lists)
         |  └─(u)─[8] TLD: com
         ├─(u)─[4] /search
         └─(u)─[5] q=google&rlz=1C1YTUH_en-GBAU1048AU1048
            ├─(u)─[9] q: google
            |  └─(G)─[13] Search Query: google
            └─(u)─[10] rlz: 1C1YTUH_en-GBAU1048AU1048
               ├─(G)─[14] RLZ used for grouping promotion event signals and anonymous user cohorts
               ├─(G)─[15] RLZ version: 1
               ├─(G)─[16] Application: C1
               |  └─(G)─[22] Search performed using Chrome Omnibox
               ├─(G)─[17] Brand Code: YTUH
               ├─(G)─[18] Cannibalized: No
               ├─(G)─[19] Language: English (en-GB)
               ├─(G)─[20] Install Cohort: Installed in Australia the week of 2023-03-06
               └─(G)─[21] Search Cohort: First search was in Australia the week of 2023-03-06

    That means the library works fine and we can proceed to the next step.

    Creating a crawler script

    To use the library in a script, we first need to create an instance of the Unfurl object, add a URL to its queue and then call the parse_queue function:

        instance = core.Unfurl(remote_lookups=False)
        instance.add_to_queue(data_type='url', key=None, value=url)
        instance.parse_queue()

    After that, the instance object is ready to use. All of the parsed properties are stored in the "nodes" field, which is a dictionary. Let's create a function that extracts a property from the object. The function loops over all entries in the dictionary and selects the one that satisfies the criteria (key, data type and hover):

        def get_node_value(unfurl, data_type, key=None, hover=None):
            for node in unfurl.nodes.values():
                key_matched = key is None or node.key == key
                hover_matched = hover is None or (node.hover is not None and hover in node.hover)
                if node.data_type == data_type and key_matched and hover_matched:
                    return node.value
            return None

    Then we can extract a property like this:

        # Query
        query = get_node_value(instance, "url.query.pair", "q")
        # Source
        source = get_node_value(instance, "google.source")
        # Sxsrf timestamp
        ts_sxsrf = get_node_value(instance, "epoch-milliseconds", 2, "sxsrf")

    Then we can create a tag for the extracted property:

        tags.add("Google Source/" + source)

    Or create a custom column:

        custom_columns.append(CustomColumn("Google Query", CustomColumnType.String, CustomColumnValue(value=query)))

    In the script we use the itemProcessed function because we need access to the metadata fields, specifically to the extracted browser history URLs. In Intella, extracted URLs are stored in the "Subject" field. We use item.mediaType to distinguish between browser history and other items:

        if item.subject is not None and item.mediaType is not None and "history-entry" in item.mediaType:

    The final script can be found here: unfurl_crawler_script.py

    Testing the script

    Let's test the script on this small Chrome history database (History.7z) that I have prepared. Create a new case in Intella, add a new File source and select the unzipped History file. Add the crawler script and index the data. We can see that the script worked fine: there are new tags and custom columns related to Google search metadata. You can easily extend the script by adding new fields. The library supports a lot of websites such as Google, YouTube, Facebook, Twitter and many others.
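    To show how the pieces above fit together, here is a condensed, untested sketch of the itemProcessed part. The unfurl_crawler_script.py linked above is the authoritative version; the import of core from unfurl is my assumption about how the library is loaded, and the sketch reuses the get_node_value helper defined earlier in this post.

        # Condensed sketch of the itemProcessed function described above.
        # Assumes the standard crawler-script definitions/imports, plus (assumption):
        # from unfurl import core

        class ScriptHandler(ScriptService.Iface):
            def itemFound(self, item):
                return FoundItemResult(action=Action.Include)

            def itemProcessed(self, item):
                custom_columns = []
                tags = set()
                if item.subject is not None and item.mediaType is not None and "history-entry" in item.mediaType:
                    instance = core.Unfurl(remote_lookups=False)
                    instance.add_to_queue(data_type='url', key=None, value=item.subject)
                    instance.parse_queue()

                    query = get_node_value(instance, "url.query.pair", "q")
                    source = get_node_value(instance, "google.source")

                    if query is not None:
                        custom_columns.append(CustomColumn("Google Query", CustomColumnType.String,
                                                           CustomColumnValue(value=query)))
                    if source is not None:
                        tags.add("Google Source/" + source)
                return ProcessedItemResult(action=Action.Include, customColumns=custom_columns, tags=tags)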
  9. Hello everyone, In this article I will demonstrate how to create a crawler script for detecting grayscale images. It might be useful to know whether an image is grayscale for various reasons. One of them is that such images are often scanned or faxed documents. Therefore, one may choose to OCR grayscale images only.

    Preparations

    We will be using Python and the OpenCV library. Before we begin, we need to add the OpenCV library to the Python bundled with Intella. To do that, open the command line prompt as Administrator, change the directory to c:\Program Files\Vound\Intella 2.6.1\python and execute this command:

        > cd c:\Program Files\Vound\Intella 2.6.1\python
        > .\python.exe -m pip install opencv-python-headless
        Collecting opencv-python-headless
          Downloading opencv_python_headless-4.8.1.78-cp37-abi3-win_amd64.whl (38.0 MB)
        Collecting numpy>=1.21.2
          Downloading numpy-1.26.1-cp310-cp310-win_amd64.whl (15.8 MB)
        Successfully installed numpy-1.26.1 opencv-python-headless-4.8.1.78

    If you see the message "Successfully installed", that means it worked fine. Now Intella can use OpenCV.

    Testing the algorithm

    Before we begin, let's create a simple app that we can use for testing the algorithm. The idea is very simple:

    - First, we load the image file using the OpenCV library.
    - Then, we convert the color space from RGB to HSV (hue, saturation and value). That allows us to access the saturation of each pixel; pixels with low saturation appear gray.
    - Split the image data into three different arrays corresponding to H, S and V.
    - Use the maximum value of the S array to detect the maximum saturation of the image.
    - If the max saturation is lower than the threshold, the image is grayscale.

    The following diagram demonstrates how colors with low saturation look (the middle of the cylinder). Here is the code that we can use:

        import sys, cv2

        img_file = sys.argv[1]
        img = cv2.imread(img_file)
        hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
        h, s, v = cv2.split(hsv)
        print('Max saturation: ' + str(s.max()))

    Copy this text to a text file and save it as "detect_grayscale_test.py" (or download from here). Let's try to run it to analyze the picture of London from Wikipedia. Download the picture and save it to a folder on disk. Let's use the command line prompt from the previous step. Run this command:

        > .\python.exe C:\temp\detect_grayscale_test.py C:\temp\London_Skyline_Color.jpg
        Max saturation: 255

    The script returned the maximum value, which is 255. That means that the picture is not grayscale, which is correct. Now, download this grayscale sample from Wikipedia: Grayscale_8bits_palette_sample_image.png. And run the app again:

        > .\python.exe C:\temp\detect_grayscale_test.py C:\temp\Grayscale_8bits_palette_sample_image.png
        Max saturation: 0

    The result is zero now, which is again correct. That indicates a grayscale image.

    Creating a crawler script

    Now let's create a crawler script. First, let's transform our test code into a function, so that we can use it in the script. The function will return an integer value that represents the max saturation of the image:

        def get_max_saturation(img_file):
            img = cv2.imread(img_file)
            hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
            h, s, v = cv2.split(hsv)
            return s.max()

    In order to create a crawler script, we need to define a class that implements ScriptService.Iface and two functions, itemFound and itemProcessed. See a detailed description of how crawler scripts work on the GitHub page: intella-crawler-scripts. Let's just use this empty crawler script as a starting point:

        class ScriptHandler(ScriptService.Iface):
            def itemFound(self, item):
                return FoundItemResult(action=Action.Include)

            def itemProcessed(self, item):
                return ProcessedItemResult(action=Action.Include)

    For our script we can use either function. Both have access to the item type and binary content, and that is all we need. Let's use itemProcessed. We can use the item.mediaType property to determine whether the item is an image. All image types start with the "image/" prefix (examples: image/jpeg, image/png and so on). Then we calculate the max saturation using the function that we have just created and store the value in a custom column "Color Saturation". This will help to check how the algorithm works. If the max saturation is less than a threshold, we tag the item as "Grayscale". The threshold in this example is set to 20, but it can be adjusted if needed.

        def itemProcessed(self, item):
            custom_columns = []
            tags = set()
            if item.binaryFile is not None and item.mediaType is not None and item.mediaType.startswith("image/"):
                max_saturation = get_max_saturation(item.binaryFile)
                sat_column = CustomColumn("Color Saturation", CustomColumnType.Integer, CustomColumnValue(intValue=max_saturation))
                custom_columns = [sat_column]
                if max_saturation <= 20:
                    tags.add("Color Detection/Grayscale")
                else:
                    tags.add("Color Detection/Color")
            return ProcessedItemResult(action=Action.Include, customColumns=custom_columns, tags=tags)

    The complete script can be downloaded from here: detect_grayscale.py

    Now let's test the script. Launch Intella, create a new case and add the folder with the two images that we have just downloaded. Add the crawler script on the Options page and click "Validate" to validate the script. Index the source. When the indexing is finished, we can see that the script worked correctly: one image was detected as grayscale and the other one was detected as colored, and we can see the new custom column and tags.
  10. W4 Latest Version

    Vound is pleased to announce the official release of W4 1.1.6. For current W4 customers, W4 1.1.6 is available from the Downloads section of our website. You will need your dongle ID to download this update. More information can be found here: https://www.vound-software.com/software-downloads Users with a W4 1.1.x license on their dongle can use this version. If your dongle does not have this version, you will need to update your dongle using 'Dongle Manager.exe', which is located in the folder where W4 is installed on your system. Please read the Release Notes before installing or upgrading, to ensure you do not affect any active cases.

    Highlights: Stability improvements across all components.

    Release Notes: W4-1.1.6-Release-Notes.pdf

    For additional information, please visit our W4 website.
  11. Hello Edgar, You can use export sets for that. When you create a load file export and add it to an export set, a new column called "Export: XYZ" will appear that shows you the Beg Doc field. There is no way to show the End Doc field, though. https://www.vound-software.com/docs/intella/2.6/#_export_sets_2
  12. W4 Latest Version

    Vound is pleased to announce the official release of W4 1.1.5. For current W4 customers, W4 1.1.5 is available from the Downloads section of our website. You will need your dongle ID to download this update. More information can be found here: https://www.vound-software.com/software-downloads Users with a W4 1.1.x license on their dongle can use this version. If your dongle does not have this version, you will need to update your dongle using 'Dongle Manager.exe', which is located in the folder where W4 is installed on your system. Please read the Release Notes before installing or upgrading, to ensure you do not affect any active cases.

    Highlights: Stability improvements across all components.

    Release Notes: W4-1.1.5-Release-Notes.pdf

    For additional information, please visit our W4 website.
  13. Hi Shaun, Thanks for submitting a support ticket and providing additional information. The reason Intella didn't use all the crawlers in some cases is that the files were so small and quick to process that the time needed to discover them in the file system was comparable to the time needed to index them. In other words, Intella simply didn't need all 16 crawlers; it was enough to use just 2-3, depending on how fast new files were being discovered. We suspect that an antivirus might have been slowing down I/O access to the disk. Note that this is sort of a corner case. Normally, when indexing larger files such as PSTs or disk images, it should never happen. In this particular case, a folder with a large number of very small files, it would be better to index a container such as a ZIP or a logical disk image (L01, AD1, etc.).
  14. Recipe Creation

    Thanks Greg, That definitely makes sense. I have added it to our internal wish list database.
  15. Hi fuzed, That's correct. Intella doesn't support folders in Google Takeout exports at the moment. I don't know whether using Thunderbird will help; it's better to try it on a small dataset first. The best option might be to use Outlook, downloading the data to a PST file and then indexing the PST in Intella.
  16. Hi Calaman, Can you provide a bit more detail on what exactly you are doing and how this Notes error affects it? Screenshots might help. With regard to this Notes error, it is normal to see it if you don't have IBM Notes installed, but it shouldn't affect anything other than the ability to index NSF files.
  17. Hi Patrice, If you're talking about exporting items from the Search tab, you can use the Reporting feature to do that. Just add the required items to a report and then turn on the "Export original format files" option in the section settings.
  18. What is W4

    Hi llanowar, Thanks for testing W4! At the moment, all timestamps in the "Events" view are always shown in your current timezone. The source timezone setting only applies to timestamps in the table and the Properties tab. We did it this way because the Events view can display data from multiple sources, which might come from different timezones. In the next version we'll add an option to select the timezone for the Events view.
  19. Hi Adam, Thanks for your suggestion! I agree that having such a button would be a good idea. We might add it in one of the future versions.
  20. Hi BenW, According to this article http://id-check.artega.biz/info-ca.php, the first digit of a SIN indicates the province in which it was registered. Therefore, any SIN that starts with a zero would be fictitious. The regex that I provided earlier would ignore all SINs that start with a zero. That would also filter out a string of zeros.
  21. Hi BenW, Can you try this: ([1-9]\d{2}-\d{3}-\d{3})|([1-9]\d{2}\s\d{3}\s\d{3})
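    For anyone who wants to sanity-check the pattern outside Intella, here is a small illustrative Python snippet; the sample values below are made up for testing, not real SINs.

        import re

        # The SIN pattern from the post above: dash- or space-separated, first digit 1-9.
        SIN_PATTERN = re.compile(r"([1-9]\d{2}-\d{3}-\d{3})|([1-9]\d{2}\s\d{3}\s\d{3})")

        samples = [
            "123-456-789",   # expected: match (dash-separated)
            "123 456 789",   # expected: match (space-separated)
            "012-345-678",   # expected: no match (starts with zero)
            "000 000 000",   # expected: no match (all zeros)
        ]

        for value in samples:
            print(value, "->", "match" if SIN_PATTERN.search(value) else "no match")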
  22. What is W4

    Hi Adam, Yes, it depends on what you included in the report. If you report items as a table (one item per row), then it may be OK. I've just tried to create a report that contained 177K items as a table and it worked fine. It took 5 minutes to produce. The resulting PDF file was 28 MB and 12,500 pages. But if you report items as a list (1-3 items per page), it might produce an enormous PDF with 200-300K+ pages. I don't think you could even open such a file in Acrobat Reader. So, at the moment, reporting is indeed designed for a smaller number of items or pages. What you could do is change the presentation of certain items to "Table". That might dramatically reduce the number of pages in the report. If you really need to report everything in the case, then exporting to a CSV file might be a better option for now. We'll think about how to handle such huge reports better in a future version. Thanks for your feedback!
  23. What is W4

    Hi Adam, Thanks for the feedback! Did you add both images under a single source? That could explain the issue. The "Files" section in the source panel is for adding parts of the same image. If you need to index two different images, you need to create a separate source for each image. I hope that answers your question.
  24. New Beta Available

    Hi Rio, Thanks for your feedback! Yes, we plan to improve the link functionality in future versions. Could you clarify what you mean exactly by extending the link attributes of a file to Registry entries where they exist? A specific example would also help.
  25. Hello Bryan, At the moment there is no way to configure the family definition for the "Family Date" column. We will add this feature in a future version. I'm afraid the only option for now is to unzip the evidence.