
Extracting metadata from Google search URLs


igor_r


Hello everyone,

 

Intella can extract web browser history from all modern browsers (Chrome, Firefox, IE, Safari). But the URLs often contain additional information about what happened when the user clicked on a link or searched for a keyword. This metadata is encoded in URL parameters.
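To get a feel for how much is in there, you can already peek at the query parameters with nothing more than the Python standard library (a quick illustration, using the same sample URL that appears later in this post):

from urllib.parse import urlparse, parse_qs

# The query string of a Google search URL carries the search term
# and other parameters, such as the RLZ tracking value.
url = "https://www.google.com/search?q=google&rlz=1C1YTUH_en-GBAU1048AU1048"
print(parse_qs(urlparse(url).query))
# {'q': ['google'], 'rlz': ['1C1YTUH_en-GBAU1048AU1048']}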

 

Today we will look into creating a crawler script that can extract various information from Google search URLs. We will be using this open source tool: https://github.com/obsidianforensics/unfurl. It’s written in Python, so we can use it directly in a crawler script.

 

 

Preparations

 

Before we begin, we need to add the library to the Python distribution bundled with Intella. To do that, open a Command Prompt as Administrator and change the directory to c:\Program Files\Vound\Intella 2.6.1\python. Due to limitations of the embedded Python, we can’t install the module directly into it. Instead, we will create a “venv” (virtual environment) and install unfurl into that.

 

First, let’s install the virtualenv module, create a new environment (called “unfurl”) and then activate it:

> .\python.exe -m pip install virtualenv
> .\python.exe -m virtualenv unfurl
> .\unfurl\Scripts\activate

Now, we are in the “unfurl” environment. Let’s install the library and its dependencies:

> pip install flask==2.3.3
> pip install dfir-unfurl
> deactivate

At the time of writing this article, Intella did not support virtual environments in Python scripts. So let’s simply copy the contents of the “unfurl” folder

c:\Program Files\Vound\Intella 2.6.1\python\unfurl

back into the parent “python” folder (one level up), replacing any existing files:

c:\Program Files\Vound\Intella 2.6.1\python

 

Testing the library

 

Before creating the script, let’s first check that the library actually works and can extract information from URLs. Download this test script: unfurl_test.py and save it to a folder on your computer. Then run it using the command prompt from the previous step.
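If you just want to see what such a test script does, here is a minimal sketch along the same lines (the actual unfurl_test.py may differ; generate_text_tree is the helper that unfurl's own command-line tool uses to print the tree shown below):

from unfurl import core

# Parse a sample Google search URL and print the resulting tree of nodes.
test_url = 'https://www.google.com/search?q=google&rlz=1C1YTUH_en-GBAU1048AU1048'

instance = core.Unfurl(remote_lookups=False)
instance.add_to_queue(data_type='url', key=None, value=test_url)
instance.parse_queue()

print(instance.generate_text_tree())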

 

.\python.exe C:\temp\unfurl_test.py

It should produce output like this:

[1] https://www.google.com/search?q=google&rlz=1C1YTUH_en-GBAU1048AU1048
 ├─(u)─[2] Scheme: https
 ├─(u)─[3] www.google.com
 |  ├─(u)─[6] Subdomain: www
 |  ├─(u)─[7] Domain Name: google.com
 |  |  ├─(u)─[11] Domain is on list of known Google domains
 |  |  └─(u)─[12] Domain is extremely popular (found in "Top 1000" lists)
 |  └─(u)─[8] TLD: com
 ├─(u)─[4] /search
 └─(u)─[5] q=google&rlz=1C1YTUH_en-GBAU1048AU1048
    ├─(u)─[9] q: google
    |  └─(G)─[13] Search Query: google
    └─(u)─[10] rlz: 1C1YTUH_en-GBAU1048AU1048
       ├─(G)─[14] RLZ used for grouping promotion event signals and anonymous user cohorts
       ├─(G)─[15] RLZ version: 1
       ├─(G)─[16] Application: C1
       |  └─(G)─[22] Search performed using Chrome Omnibox
       ├─(G)─[17] Brand Code: YTUH
       ├─(G)─[18] Cannibalized: No
       ├─(G)─[19] Language: English (en-GB)
       ├─(G)─[20] Install Cohort: Installed in Australia the week of 2023-03-06
       └─(G)─[21] Search Cohort: First search was in Australia the week of 2023-03-06

This means that the library works correctly and we can proceed to the next step.

 

Creating a crawler script

To use the library in a script, we first import its core module, create an instance of the Unfurl object, add a URL to its queue, and then call the “parse” function:

# The Unfurl class lives in the library's core module
from unfurl import core

instance = core.Unfurl(remote_lookups=False)
instance.add_to_queue(data_type='url', key=None, value=url)
instance.parse_queue()

After that, the instance object is ready to use. All of the parsed properties are stored in the “nodes” field, which is a dictionary. Let’s create a function that extracts a property from the object: it loops over all entries in the dictionary and returns the first one that satisfies the given criteria (data type, key and hover).

def get_node_value(unfurl, data_type, key=None, hover=None):
    for node in unfurl.nodes.values():
        key_matched = key is None or node.key == key
        hover_matched = hover is None or (node.hover is not None and hover in node.hover)
        if node.data_type == data_type and key_matched and hover_matched:
            return node.value

    return None

Then, we can extract a property like this:

# Query
query = get_node_value(instance, "url.query.pair", "q")

# Source
source = get_node_value(instance, "google.source")

# Sxsrf timestamp
ts_sxsrf = get_node_value(instance, "epoch-milliseconds", 2, "sxsrf")
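If you are not sure which data type, key and hover values a particular property uses, you can simply dump all parsed nodes and inspect them. A small helper sketch that uses only the node fields we already rely on above:

def dump_nodes(unfurl):
    # Print every parsed node so you can find the data_type / key / hover
    # combination to pass to get_node_value.
    for node in unfurl.nodes.values():
        print(node.data_type, node.key, node.value, node.hover)

dump_nodes(instance)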

 

Then we can create a tag for the extracted property:

 

tags.add("Google Source/" + source)

Or create a custom column:

 

custom_columns.append(CustomColumn("Google Query", CustomColumnType.String, CustomColumnValue(value=query)))

In the script we use the “itemProcessed” function because we need access to the metadata fields, specifically to the extracted browser history URLs. In Intella, extracted URLs are stored in the “Subject” field. We use “item.mediaType” to distinguish between browser history and other items:

 

if item.subject is not None and item.mediaType is not None and "history-entry" in item.mediaType:
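Putting these pieces together, the per-item logic can be wrapped in a small helper that “itemProcessed” calls for every history entry. This is only a sketch: the helper name is mine, and how the collected tags and custom columns are handed back to Intella is shown in the final script below.

def extract_google_metadata(url):
    # Parse the URL with unfurl...
    instance = core.Unfurl(remote_lookups=False)
    instance.add_to_queue(data_type='url', key=None, value=url)
    instance.parse_queue()

    tags = set()
    custom_columns = []

    # ...and turn selected properties into tags and custom columns.
    source = get_node_value(instance, "google.source")
    if source is not None:
        tags.add("Google Source/" + source)

    query = get_node_value(instance, "url.query.pair", "q")
    if query is not None:
        custom_columns.append(CustomColumn("Google Query", CustomColumnType.String,
                                           CustomColumnValue(value=query)))

    return tags, custom_columns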

The final script can be found here: unfurl_crawler_script.py

 

Testing the script

Let’s test the script on this small Chrome history database (History.7z) that I have prepared. Create a new case in Intella, add a new File source and select the unzipped History file. Add the crawler script and index the data.

We can see that the script worked fine. There are new tags and custom columns related to Google search metadata:

[Screenshot: the new tags and custom columns with the extracted Google search metadata]

 

You can easily extend the script by adding new fields. The library supports a large number of websites, including Google, YouTube, Facebook, Twitter and many others.

2 weeks later...

This is really interesting, and I can think of other use cases where you call upon a Python script to enrich the output from Intella. For example, a script that searches for "%%EOF" in a PDF and adds a custom column denoting the count. That would alert the reviewer to the existence of prior versions of the PDF. They could copy it out and run the BASH script I've shared in this section to parse that out.

What I'm not clear on from the above is: do I write a Python script and somehow call that script from within a crawler script? Or are crawler scripts themselves written in Python and added to Intella Connect?


Hi Jacques,

Yes, you could definitely write such a script. Crawler scripts can be written in Python, Groovy or Java; you can choose any of them depending on your preferences and needs.

You should first write the script and test that it works correctly. Then you can add the script to the source definition in Connect on the Options page, see https://www.vound-software.com/docs/connect/2.6.1/Intella Connect Administrator Manual.html#server-indexing-options.

Please take a look at the documentation and examples at the GitHub page: https://github.com/vound-software/intella-crawler-scripts

The grayscale detection script would look very similar to what you want (https://github.com/vound-software/intella-crawler-scripts/blob/main/samples/advanced/detect_grayscale.py). You would need to load the PDF file's binary content and check for the presence of %%EOF tokens.
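For the %%EOF part specifically, the counting itself is straightforward once you have the file's bytes; how the item's binary is read inside a crawler script is shown in the grayscale example. A minimal sketch of the check (the helper name and file path are hypothetical):

def count_eof_markers(pdf_bytes):
    # Each %%EOF marker normally closes one (incremental) save of a PDF,
    # so a count greater than 1 suggests earlier revisions may still be present.
    return pdf_bytes.count(b"%%EOF")

# Hypothetical usage with a file on disk:
# with open("document.pdf", "rb") as f:
#     print(count_eof_markers(f.read()))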

