Introduction

We receive numerous support tickets from customers asking for advice on using proximity searches. The user manual provides the basic syntax, and there is additional information in these forum posts:

http://community.vound-software.com/index.php?/topic/245-proximity-search-using-more-than-two-words/?hl=prox%2A
http://community.vound-software.com/index.php?/topic/359-proximity-search-with-a-phrase-search/?hl=proximity

There is also a webinar on using proximity searches in Intella.

In most cases we are provided with examples of the syntax that the customer has used. In some cases the syntax is very complex, and often it is incorrect. Some customers ask us whether their syntax is correct, or why their proximity search is not working. This is something that we cannot answer on an individual basis. The point of this document is to provide examples that help our customers better understand proximity search syntax, so that they can create the correct syntax for the search they want to perform.

Note: Most of this information applies to all versions of Intella that support proximity searching. There is a known issue with hit highlighting in versions prior to 1.9.1. We recommend that you update to version 1.9.1 if you encounter this issue.

What is a proximity search?

A proximity search is a query crafted to find items in which the given words occur within a specified maximum distance of each other in the item's text. For example, if I wanted to find all items that have the words 'desktop' and 'application' within 10 words of each other, I would use the following:

"desktop application"~10

A proximity search differs from a phrase search in that it does not matter whether 'desktop' comes before or after 'application' in the text. For example, documents containing either of the passages below will be responsive to the proximity search above.

"You must turn on your desktop computer before you can open an application."
"I have copied the shortcut for the application onto the desktop."

Using the Correct Proximity Syntax

As mentioned above, customers send us the proximity search syntax they have used. Much of the time we see search strings such as these:

(Baxter Jason) ~20 (article) OR (paper) OR (presentation) OR (public) OR (report)
"national OR fire OR service"~30 (truck) OR (department)

These examples have been sanitized and shortened; the original search strings contained several lines of OR statements. This makes a search string complex, cumbersome, prone to errors and difficult to troubleshoot.

Example 1

If we look at the first example above, we can see immediately that there are several issues which make this syntax incorrect. One issue is that the terms to be searched are not encased in double quotes. Another is that the number of words to be within (~20 in this case) is not at the end of the proximity search syntax, as several OR statements follow it.

The user manual shows a basic example of the syntax: "desktop application"~10. Note that the structure is two (or more) search terms encased in double quotes, followed by the number of words that the terms must be within.

The proximity string can be made more useful for larger queries by adding more search terms. The additional search terms need to be separated by the OR operator and encased in parentheses. For example, the first example above could be rewritten this way:

"(Baxter OR Jason) (article OR paper OR presentation OR public OR report)"~20

Because the user is looking for one of two terms within 20 words of one of several other terms, we have grouped the keywords by placing them in parentheses and separating the terms with the OR operator, e.g. (Baxter OR Jason) and (article OR paper OR presentation OR public OR report).
Note: All of the search terms are still encased in double quotes, followed by the number of words that the terms must be within. This syntax will return any items where Baxter or Jason is within 20 words of article, paper, presentation, public or report.

Example 2

Again we see that there are issues with the search syntax in example 2. This time double quotes are used; however, they do not encase all of the search terms. We also see a similar pattern to example 1, with several search terms placed in separate parentheses and joined by the OR operator. We see a lot of samples like this and wonder whether this format of proximity search has come from another tool.

The way I read this example is as follows: find all items that have national, fire or service within 30 words of truck or department. The syntax can be rewritten this way:

"(national OR fire OR service) (truck OR department)"~30

Again we use parentheses to group the search terms into two groups, and make sure that all terms are encased in double quotes.

Limitations

Because the double quotes need to encase all of the search terms, you cannot have a search phrase within a proximity search. A search phrase would require its own double quotes, and double quotes cannot be nested within a proximity search. That said, you can use phrases in keyword lists (see below).

In the past we have been provided with proximity search strings where the syntax contained over 40 words separated by the OR operator. As mentioned above, this format is not correct. Even if we corrected the syntax, 40 words in a proximity search makes the search string complex, cumbersome, prone to errors and difficult to troubleshoot.

We have also received extremely long search syntax where all search terms contained wildcards. Such complex queries with many wildcards are known to have very poor performance, especially for hit highlighting in the Previewer window.
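The grouped form used in Examples 1 and 2 can be assembled mechanically, which helps avoid misplaced quotes or a misplaced ~N. Below is a minimal Python sketch; the function name build_proximity_query is hypothetical and not part of Intella. It only builds the query text, and it rejects phrases because of the nested-quote limitation described above.

```python
def build_proximity_query(left_terms, right_terms, distance):
    """Return a query string of the form "(a OR b) (c OR d)"~N.

    Hypothetical helper for illustration; it only formats text and
    does not talk to Intella.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")

    def group(terms):
        if not terms:
            raise ValueError("each group needs at least one term")
        for term in terms:
            # A phrase would need its own double quotes, which cannot be
            # nested inside the outer quotes of a proximity search.
            if '"' in term or len(term.split()) != 1:
                raise ValueError(f"not a single word: {term!r}")
        # A single term needs no parentheses or OR operator.
        if len(terms) == 1:
            return terms[0]
        return "(" + " OR ".join(terms) + ")"

    return f'"{group(left_terms)} {group(right_terms)}"~{distance}'


print(build_proximity_query(
    ["national", "fire", "service"], ["truck", "department"], 30))
```

With the terms from Example 2 this prints the rewritten string from that example. Passing a phrase such as "national fire" raises a ValueError instead of producing a query with nested quotes.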
Workarounds

There are a couple of methods one could use to manage complex proximity searches that contain a large number of search terms separated by the OR operator. One is to break down the search string; the other is to use keyword lists.

Breaking down the search string

A complex search string can be broken down into several shorter proximity search strings. The shorter search strings are then placed into a keyword list. E.g.

"Baxter article"~20
"Baxter paper"~20
"Baxter presentation"~20
"Baxter public"~20
"Baxter report"~20

Intella will be able to process the list of shorter proximity searches more efficiently than one large complex search string. With a small amount of Excel work you can create a keyword list that includes all of your shortened proximity searches in a single list.

Using keyword lists

The idea behind using keyword lists is to reduce the number of items that your proximity search needs to search across. Two keyword lists can be created: one list which contains the search terms in the left group of a proximity search, and a second list which contains all the other terms in the right group, e.g.

Keyword list 1    Keyword list 2
Baxter            article
Jason             paper
                  presentation
                  public
                  report

Next, run the two keyword lists and Tag the overlapping cluster. This cluster will contain the items that have search terms from both keyword lists. Set this Tag as an 'Include' search and run the proximity search. This provides faster searching, as you are not searching over the entire dataset. However, be aware that hit highlighting can still be slow or hang Intella if the proximity search is complex and contains wildcards.

The advantage of using keyword lists is that you can use the following types of searches and operators:

- Wildcards (article*, paper* etc.)
- Phrases ("national fire", "fire service" etc.)
- Other search operators
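The "Breaking down the search string" step can also be scripted instead of done in Excel. The Python sketch below is one possible approach; the function name split_proximity_search is hypothetical. It pairs every term in the left group with every term in the right group, so with the two groups from Example 1 it also produces the Jason lines that the shortened listing above omits. It assumes the result is pasted or saved as a keyword list with one search per line.

```python
from itertools import product


def split_proximity_search(left_terms, right_terms, distance):
    """Expand two term groups into one short two-word proximity search per pair."""
    return [
        f'"{left} {right}"~{distance}'
        for left, right in product(left_terms, right_terms)
    ]


lines = split_proximity_search(
    ["Baxter", "Jason"],
    ["article", "paper", "presentation", "public", "report"],
    20,
)

# One search per line, ready to be copied into a keyword list.
print("\n".join(lines))
```

Two terms on the left and five on the right give ten short searches. The number of lines grows as the product of the group sizes, so very large groups are better handled with the two-keyword-list method.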
We got an interesting question from one of our customers regarding proximity searches that use more than two words. As this might be valuable information for others too, I decided to publish a recap of our answer:

Proximity is actually the number of other words permitted between the words in the query phrase. If it is zero, then this is an exact phrase search. Please note that ordering doesn't matter.

Let's look at an example:

"vound connect intella"~3

Will match:

"vound intella connect" (words in between: 0)
"vound extra words here connect intella" (words in between: 3)
"vound some words connect separated intella" (words in between: 3)
"intella vound connect" (words in between: 0)

Will not match:

"vound too many extra words here connect intella" (words in between: 5)
"vound some words connect further separated intella" (words in between: 4)
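The counting rule described above can be modelled in a few lines of Python. This is only a sketch of the rule as stated in this post (count the other words inside the shortest stretch of text that contains every query word, in any order). It is not Intella's actual matching code, which may differ in edge cases, and the function names are hypothetical.

```python
import re


def words_in_between(text, query_words):
    """Fewest other words inside any stretch of text containing every query word.

    Returns None when at least one query word does not occur in the text.
    """
    tokens = re.findall(r"\w+", text.lower())
    wanted = {word.lower() for word in query_words}
    best = None
    for start, first in enumerate(tokens):
        if first not in wanted:
            continue  # a shortest stretch always starts on a query word
        seen = set()
        extra = 0
        for token in tokens[start:]:
            if token in wanted:
                seen.add(token)
            else:
                extra += 1
            if seen == wanted:
                if best is None or extra < best:
                    best = extra
                break
    return best


def matches(text, query_words, distance):
    """True if the text satisfies the simplified proximity rule."""
    gap = words_in_between(text, query_words)
    return gap is not None and gap <= distance


query = ["vound", "connect", "intella"]
print(matches("vound extra words here connect intella", query, 3))  # True
print(matches("vound too many extra words here connect intella", query, 3))  # False
```

Run against the six sample texts above, this model reproduces the listed "words in between" counts and the same match or no-match outcome for ~3.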