I'm trying to compute the Average Precision (and Mean Average Precision) on the Oxford Building image dataset.
Below there is the code that they provide for computing Average Precision. Notice that pos_set
is the union of the "optimal" and "good" images from the ground trouth set, while junk_set
is a set of not-relevant images.
void OxfordTest::computeAp(std::vector<std::string> &ranked_list){
float old_recall = 0.0;
float old_precision = 1.0;
float ap = 0.0;
size_t intersect_size = 0;
size_t i = 0;
size_t j = 0;
for ( ; i<ranked_list.size(); ++i) {
if(!pos_set.count(ranked_list[i]))
std::cin.get();
}
if (junk_set.count(ranked_list[i])) continue;
if (pos_set.count(ranked_list[i])) intersect_size++;
float recall = intersect_size / (float)pos_set.size();
float precision = intersect_size / (j + 1.0);
ap += (recall - old_recall)*((old_precision + precision)/2.0);
old_recall = recall;
old_precision = precision;
j++;
}
}
Which is something totally different from the notion given on the linked Wikipedia page. What is the correlation between these notions?
I'm more than sure that Wikipedia's notion is correct, since it corresponds with the one given in this answer and this article.
I don't understand why in the code above it is reported:
- The recall, while the Wikipedia's notion include only precision in the last formula.
- Even considering the formula with the delta recall, nobody talks about `(old_precision + precision) /2
Junk set
The original paper states:
I.e. junk images are not negatives. There are positives (OK+Good), ignores (Junk) and negatives (Absent). Note that all these are per-query, i.e. some images are junk for query 1 but not for query 15. If you look at the images that are 'junk' you'll see ambiguous examples, e.g. some cases have extreme zoom or blurring which will make you think if this image contains the queried landmark or not, and cases where only a tiny part of the object is visible so the image is too hard.
So the authors defined the junk set to be neither positives nor negatives - the images most likely depict the queried object, but for some of them we are not sure, or it would be too harsh to treat them as positives and ask the system to retrieve these examples (and therefore penalise if it doesn't). At the same time, it would also be harsh to treat them as negatives as if the system does retrieve them, it shouldn't be penalised. So all that needs to be done is that (on a per-query basis) you ignore the junks and treat them as if they don't exist. So you take the retrieved list, filter out all junk images for this query, then run normal AP computation on this filtered list. That's what the code is doing effectively - when the example is in amb(=junk), it is just skipped. Then if the example is not in amb, if it's in pos(itives) the intersect_size (current number of positives up until position i) is incremented. The quantity j (well, j-1) is the number of non-skipped elements in the list (it gets incremented only if the current element is not junk).
AP computation
You certainly need the recall in your AP computation, as explained by shiri in the previous answer, and as described in your article, p(r) is the precision at a particular recall. The best way to think of AP is not to examine a random formula but to understand what is the intuition and then see how the formula captures it, i.e. what wikipedia says at the start: you can plot precision as a function of recall, and AP is then simply the area under the curve. You want the precision to be high at all recalls, so the ideal curve is p(r)=1 which would maximise the AP.
So what is the code doing? It's computing the area under the precision-recall curve using the trapezoidal rule, see this equation on Wikipedia and you'll see it's identical to the code. The AP computation for the discrete case from your Wikipedia article is a (commonly used) worse approximation to the area under the precision-recall curve, the rectangle method.