Processing Number v Statistics

markjrouse · December 5, 2014

Hi,

I'm having some trouble reconciling the figures reported during processing with those I see in Statistics, and was hoping someone could explain the differences.

Once all 11 steps have finished, the processing status screen tells me that there are 1,517,290 items in total and 1,081,355 duplicate items. So 1,517,290 minus 1,081,355 should leave me a unique deduplicated population of 435,935, and indeed the processing status screen reports 435,935 unique items. However, when I go into the Statistics screen, I'm told that there are 1,517,290 All items, but 443,206 after deduplication. So there appears to be a difference of 7,271 in what the population is after deduplication. Does the processing status screen calculate unique items differently from the Statistics screen? If the population after deduplication is 443,206, then the duplicate item count is 1,074,084, not 1,081,355.

Similarly, the processing status screen reports 115,408 exception items. I've naturally assumed that this is after deduplication. In the Statistics screen, Exceptions Items after Deduplication is 123,404, a difference of 7,996.

Should there be a difference?
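For anyone who wants to re-check the arithmetic, here is a minimal sketch that reproduces the gaps from the figures quoted above. The variable names are just labels chosen for illustration; they are not Intella field names.

```python
# Figures as quoted in this post (variable names are illustrative only).
total_items = 1_517_290
duplicates_processing = 1_081_355   # processing status screen
unique_statistics = 443_206         # Statistics screen, after deduplication
exceptions_processing = 115_408     # processing status screen
exceptions_statistics = 123_404     # Statistics screen, after deduplication

# Unique items implied by the processing status screen.
unique_processing = total_items - duplicates_processing

# Gap between the two "unique" figures.
unique_gap = unique_statistics - unique_processing

# Duplicate count implied by the Statistics figure.
implied_duplicates = total_items - unique_statistics

# Gap between the two exception counts.
exceptions_gap = exceptions_statistics - exceptions_processing

print(unique_processing)   # 435935
print(unique_gap)          # 7271
print(implied_duplicates)  # 1074084
print(exceptions_gap)      # 7996
```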

Andrej · December 8, 2014

Hello Mark,

During indexing, only the MD5 hash is used to deduplicate, while Intella's search functions (and the Statistics view) also use other ways to deduplicate items (see the Intella User Manual, section 13.1.12 MD5 and Message Hash).
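As a rough illustration of what MD5-only deduplication means in general, the sketch below counts unique and duplicate items by hashing raw content. This is a generic example and not Intella's implementation: the function names and sample data are made up, and the additional methods mentioned above (such as the message hash) are not modeled here.

```python
import hashlib
from collections import Counter


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def count_unique_and_duplicates(items):
    """Count unique items and duplicates when content MD5 is the only key.

    Every item beyond the first one that shares a digest is a duplicate.
    """
    digests = Counter(md5_hex(content) for content in items)
    unique = len(digests)
    duplicates = sum(digests.values()) - unique
    return unique, duplicates


# Hypothetical sample: three byte-identical items and one different item.
sample = [b"same content", b"same content", b"other content", b"same content"]
unique, duplicates = count_unique_and_duplicates(sample)
print(unique, duplicates)  # 2 2
```

Any tool that keys duplicates on something other than the raw content hash can arrive at different unique and duplicate counts for the same data set.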

markjrouse · December 9, 2014

So when someone asks me how many duplicates, what's the best figure to give: the figure on the processing status screen, or Statistics?

Chris · December 9, 2014

Hi Mark,

It is hard to give a general answer to this one; it really depends on the audience and the use case. Are you or your user interested in the most effective deduplication, to assess the time needed for review, or perhaps in a deduplicated count that can be verified by other tools that only use MD5 hashing?

I am making a note that we should address this in a future version of the user manual and the user interface. I can see that the user interface does not make it clear that different deduplication methods are used here.

dpmills · December 16, 2014

I have a similar question - why do the processing numbers vary so much between versions of Intella? I recently had a case that I had indexed in 1.6.2 with ~7mil entries; when I re-created it as a new case in 1.8.1, the number dropped to around half that. Do I need to be wary of the difference in numbers between versions, or are the entries just being de-duplicated far more aggressively?

Primoz · December 17, 2014

Hi dpmills,

It's hard to say what is causing this at first sight.

Further investigation will be required to find out why the number of discovered items dropped to around half.

Could you please open a support ticket? That way you will be able to provide us with all the necessary information.
