Jung Son Posted October 1

Hi there,

I am developing a crawler script that filters out file paths and extensions we don't need, such as dll files and the windows\help folder. Once the data is processed, I get a CSV report showing which files were included and which were skipped.

If some items are skipped during filtering and we later decide to process them, is there a way to use a URI or ID to reprocess and include those items? For example, if I want to include two items under the prefetch folder, is there a way to re-index the case and include certain items based on their IDs or URIs, assuming the URIs won't change?

Any help or a sample script would be greatly appreciated.

Thanks!
igor_r Posted October 2

Hello Jung,

It is definitely possible. I don't have a ready-to-use script at the moment, but the idea is the following:

First, parse the so-called "Script Log" produced by Intella. This is a CSV file that lists all items that were skipped. After parsing the CSV, collect the IDs or URIs of the items you want and save them to a separate file. Let's call this file "items-to-include.csv".

Next, modify your script to add a new condition: if the item's ID or URI is in that list, the item is always included. This check must run before all other checks. Use the "item.id" and "item.uri" attributes.

Then re-index the source with the modified script, and it should include the previously skipped items. Keep in mind that item IDs and URIs do not change when you re-index an existing case.

Here is a useful link if you need to parse a CSV in Python: https://www.digitalocean.com/community/tutorials/parse-csv-files-in-python
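The first step described above (parse the Script Log, keep the URIs you want, save them to a separate file) could look roughly like the sketch below. It uses only Python's standard `csv` module. The function names, the substring filter, and the default column index are illustrative assumptions, not part of Intella; the column index follows the later post in this thread that puts the URI in the second column, so check it against your own log file.

```python
import csv


def collect_uris(script_log_path, wanted_substring, uri_column=1):
    """Return URIs from the script log that contain wanted_substring.

    uri_column=1 (the second column) is an assumption about the log
    layout; adjust it if your Script Log is laid out differently.
    """
    uris = []
    with open(script_log_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) <= uri_column:
                continue  # skip short or malformed rows
            uri = row[uri_column].strip()
            # Case-insensitive match, e.g. "prefetch" matches "Prefetch"
            if wanted_substring.lower() in uri.lower():
                uris.append(uri)
    return uris


def save_uris(uris, out_path):
    """Write one URI per line to out_path."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        for uri in uris:
            f.write(uri + "\n")
```

For example, `save_uris(collect_uris("script-log.csv", "prefetch"), "items-to-include.csv")` would keep only the skipped items whose URI mentions the prefetch folder. The file names here are placeholders.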
Jung Son Posted October 3

Great, thanks Igor. If you could share a sample script for this, that would be great.
igor_r Posted November 13

Hello Jung,

Here is a sample script that you can use to reprocess items that you want to explicitly include.

Let's say we used a script to filter out certain items and later decided that some of those items need to be included.

If those items were stubbed by the script and therefore exist in the case, you can simply find them and use the "Export URI list" option.

If those items were excluded by the script and the Script Log was turned on, open the script log file, which is a CSV, and manually extract the second column (URI) to a separate file.

Now that we have a file with the URIs of the items we want to include, we can load it in the script and use it as an "always include" filter. Make sure this check runs before any other filters.

Sample code: https://github.com/vound-software/intella-crawler-scripts/blob/main/samples/advanced/reprocess_excluded_items.py
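To illustrate only the ordering of the checks, here is a minimal, generic sketch; it is not the linked sample and does not use the Intella crawler script API. The functions `load_include_uris` and `decide`, the "include"/"skip" return values, and the two example filters (dll files and the windows\help folder, taken from the original question) are all hypothetical. The only thing assumed about an item is that it has a `uri` attribute, as mentioned earlier in this thread.

```python
def load_include_uris(path):
    """Read one URI per line into a set for fast membership checks."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def decide(item, include_uris):
    """Return "include" or "skip" for an object with a `uri` attribute.

    The return values are placeholders for whatever your crawler
    script does to include or skip an item.
    """
    # 1. Explicit include list is checked first, before any other filter.
    if item.uri in include_uris:
        return "include"

    # 2. Regular filters follow (examples from the original question).
    normalized = item.uri.replace("\\", "/").lower()
    if normalized.endswith(".dll"):
        return "skip"
    if "/windows/help/" in normalized:
        return "skip"

    # 3. Everything else is processed as usual.
    return "include"
```

Because the include-list check comes first, an item whose URI is listed is kept even when a later filter (for example the dll rule) would otherwise skip it. The match on the include list is exact, so the URIs in the file must be written exactly as they appear in the case.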