markjrouse Posted December 5, 2014 Hi, I'm having trouble reconciling the figures reported during processing with those I see in Statistics, and was hoping someone could explain the differences. When all 11 steps have finished, the processing status screen tells me that there are 1,517,290 items in total and 1,081,355 duplicate items. So 1,517,290 minus 1,081,355 should leave a unique, deduplicated population of 435,935, and the processing status screen does indeed report 435,935 unique items. However, the Statistics screen reports 1,517,290 for All items but 443,206 after deduplication. So there appears to be a difference of 7,271 in the size of the deduplicated population. Does the processing status screen calculate unique items differently from the Statistics screen? If the population after deduplication is 443,206, then the duplicate count is 1,074,084, not 1,081,355. Similarly, the processing status screen reports 115,408 exception items, which I naturally assumed was after deduplication. In the Statistics screen, Exception Items after Deduplication is 123,404, a difference of 7,996. Should there be a difference?
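For anyone who wants to check the numbers, the arithmetic in the post above can be reproduced with a short Python sketch. The figures are the ones quoted from the two screens; the variable names are made up for illustration and do not correspond to anything in Intella.

```python
# Figures as quoted in the post above (variable names are illustrative only).
total_items = 1_517_290
dupes_processing = 1_081_355       # duplicates, processing status screen
unique_statistics = 443_206        # items after deduplication, Statistics screen
exceptions_processing = 115_408    # exception items, processing status screen
exceptions_statistics = 123_404    # exception items after deduplication, Statistics

# Unique items implied by the processing status screen.
unique_processing = total_items - dupes_processing

# Duplicate count implied by the Statistics screen.
dupes_statistics = total_items - unique_statistics

# Gaps between the two screens.
unique_gap = unique_statistics - unique_processing
exceptions_gap = exceptions_statistics - exceptions_processing

print(f"Unique (processing screen): {unique_processing:,}")
print(f"Duplicates (Statistics):    {dupes_statistics:,}")
print(f"Gap in unique items:        {unique_gap:,}")
print(f"Gap in exception items:     {exceptions_gap:,}")
```

The sketch only confirms that the quoted figures are internally consistent; it says nothing about why the two screens disagree.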
Andrej Posted December 8, 2014 Hello Mark, during indexing only the MD5 hash is used to deduplicate, while Intella's search functions (and statistics view) also use other ways to deduplicate items (see Intella User Manual, section 13.1.12 MD5 and Message Hash).
markjrouse Posted December 9, 2014 So when someone asks me how many duplicates, what's the best figure to give: the figure on the processing status screen, or Statistics?
Chris Posted December 9, 2014 Hi Mark, This one is hard to give a general answer to; it really depends on the audience and use case. Are you or your user interested in the most effective deduplication, to assess the time needed for review, or in a deduplicated count that can be verified by other tools that only use MD5 hashing? I am making a note that we should address this in a future version of the user manual and user interface. I can see that the user interface does not make clear that the deduplication methods differ here.
dpmills Posted December 16, 2014 I have a similar question: why do the processing numbers vary so much between versions of Intella? I recently had a case that I had indexed in 1.6.2 with about 7 million entries; when I made it a new case in 1.8.1, the number dropped to around half that. Do I need to be wary about the difference in numbers between versions, or are the entries being deduplicated far more aggressively?
Primoz Posted December 17, 2014 Hi dpmills, It's hard to say at first sight what is causing this. Finding out why the number of discovered items dropped to around half will require further investigation. Could you please open a support ticket? That way you will be able to provide us with all the necessary information.