
Calculating mean Average Recall (mAR), mean Average Precision (mAP) and F1-Score #2513

Open
WillianaLeite opened this issue Mar 22, 2021 · 30 comments


@WillianaLeite commented Mar 22, 2021

Hi guys!

I've been looking for a long time for the correct way to calculate the F1-score with the Mask-RCNN library. I opened several issues (2178, 2165, 2187, 2189), studied for a long time, and I believe I have found the right approach. Before presenting the code I used, here are the definitions I worked with.

mAP = mean Average Precision

mAR = mean Average Recall

f1-score = 2 * ((mAP * mAR) / (mAP + mAR))
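For example, with mAP = 0.80 and mAR = 0.60, the F1-score works out to 2 * (0.80 * 0.60) / (0.80 + 0.60) ≈ 0.686.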

Calculating mean Average Precision (mAP)

To calculate the mAP, I used the compute_ap function available in the utils.py module. For each image I call compute_ap, which returns the Average Precision (AP), and append it to a list. After going through all the images, I average the Average Precisions.

import numpy as np
from mrcnn.model import load_image_gt, mold_image
from mrcnn.utils import compute_ap

def evaluate_model(dataset, model, cfg):
  APs = []
  for image_id in dataset.image_ids:
    # Load the image with its ground-truth class IDs, boxes and masks
    image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
    # Normalize the image the same way the model does (mean-pixel subtraction)
    scaled_image = mold_image(image, cfg)
    # Add a batch dimension
    sample = np.expand_dims(scaled_image, 0)
    # Run detection
    yhat = model.detect(sample, verbose=0)
    r = yhat[0]
    # Average Precision for this image at IoU threshold 0.5
    AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask,
                                                   r["rois"], r["class_ids"], r["scores"],
                                                   r['masks'], iou_threshold=0.5)
    APs.append(AP)

  mAP = np.mean(APs)
  return mAP

Where:

  • dataset is an instance of a class that inherits from the Dataset class in utils.py;
  • model is an instance of the MaskRCNN class from model.py;
  • cfg is an instance of a class that inherits from the Config class in config.py (see the sketch below for how these might be created).
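For reference, here is a minimal sketch of how these three objects might be created and passed to evaluate_model. The dataset subclass, its loader method, the config values and the weights file name are placeholders for whatever your own project uses:

from mrcnn.config import Config
from mrcnn.model import MaskRCNN

class PredictionConfig(Config):              # example inference config
    NAME = "eval_cfg"
    NUM_CLASSES = 1 + 1                      # background + your classes
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1                       # detect() is called on one image at a time
    USE_MINI_MASK = False                    # full-size masks avoid shape mismatches in compute_ap

# dataset: an instance of your own subclass of mrcnn.utils.Dataset
test_set = MyDataset()                                  # hypothetical subclass
test_set.load_dataset('dataset/', is_train=False)       # hypothetical loader method
test_set.prepare()

cfg = PredictionConfig()
model = MaskRCNN(mode='inference', model_dir='./', config=cfg)
model.load_weights('mask_rcnn_trained.h5', by_name=True)    # placeholder weights file

mAP = evaluate_model(test_set, model, cfg)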

Calculating mean Average Recall (mAR)

To calculate the mAR I used the post An Introduction to Evaluation Metrics for Object Detection as a mathematical basis.

The calculation of the mAR is similar to the mAP, except that instead of analyzing precision vs. recall, we analyze how recall behaves across different IoU thresholds. In that post, Average Recall is defined as:

AR is the recall averaged over all IoU ∈ [0.5, 1.0] and can be computed as two times the area under the recall-IoU curve:

AR = 2 * ∫[0.5, 1.0] recall(o) do,  where o is the IoU threshold

In code, we need a function that calculates the Average Recall for a single image. Then we follow the same approach as for the mAP: go through each image, calculate its Average Recall, append it to a list, and at the end take the average to obtain the mAR.

from sklearn import metrics
from mrcnn.utils import compute_recall

def compute_ar(pred_boxes, gt_boxes, list_iou_thresholds):
    AR = []
    for iou_threshold in list_iou_thresholds:
        try:
            # Recall of the predicted boxes against the ground truth at this IoU threshold
            recall, _ = compute_recall(pred_boxes, gt_boxes, iou=iou_threshold)
            AR.append(recall)
        except Exception:
            # If recall cannot be computed (e.g. no predictions), count it as 0
            AR.append(0.0)

    # Area under the recall-IoU curve; the factor 2 rescales the 0.5-wide
    # integration interval [0.5, 1.0] so that perfect recall gives AR = 1
    AUC = 2 * metrics.auc(list_iou_thresholds, AR)
    return AUC

Basically, we call the compute_recall function from the utils.py module for each of the thresholds defined in the formula.

Where:
pred_boxes: the coordinates of the predicted bounding boxes;
gt_boxes: the coordinates of the ground-truth bounding boxes;
list_iou_thresholds: the list of IoU thresholds to use.
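As a quick sanity check of the factor 2, here is a small stand-alone example with made-up recall values: a detector whose recall stays at 1.0 for every IoU threshold gets AR = 1.0, while a recall curve that decays with the threshold gets a proportionally smaller score.

import numpy as np
from sklearn import metrics

thresholds = np.arange(0.5, 1.01, 0.1)          # [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
perfect    = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]     # recall stays 1.0 at every IoU
decaying   = [1.0, 0.9, 0.7, 0.4, 0.2, 0.0]     # made-up recall that drops with IoU

print(2 * metrics.auc(thresholds, perfect))     # 1.0 (area 0.5 over a 0.5-wide interval)
print(2 * metrics.auc(thresholds, decaying))    # ~0.54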

Now let's add mAR to our evaluate_model function.

def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
  if list_iou_thresholds is None:
    list_iou_thresholds = np.arange(0.5, 1.01, 0.1)

  APs = []
  ARs = []
  for image_id in dataset.image_ids:
    image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
    scaled_image = mold_image(image, cfg)
    sample = np.expand_dims(scaled_image, 0)
    yhat = model.detect(sample, verbose=0)
    r = yhat[0]
    # Per-image AP at IoU 0.5 and AR averaged over the IoU thresholds
    AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
    AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
    ARs.append(AR)
    APs.append(AP)

  mAP = np.mean(APs)
  mAR = np.mean(ARs)

  return mAP, mAR

Calculating F1-Score

Now that we have our mAP and mAR, we just apply the F1-score formula. Let's add it to our evaluate_model function.

def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
  if list_iou_thresholds is None:
    list_iou_thresholds = np.arange(0.5, 1.01, 0.1)

  APs = []
  ARs = []
  for image_id in dataset.image_ids:
    image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
    scaled_image = mold_image(image, cfg)
    sample = np.expand_dims(scaled_image, 0)
    yhat = model.detect(sample, verbose=0)
    r = yhat[0]
    AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
    AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
    ARs.append(AR)
    APs.append(AP)

  mAP = np.mean(APs)
  mAR = np.mean(ARs)
  # Harmonic mean of mAP and mAR
  f1_score = 2 * ((mAP * mAR) / (mAP + mAR))

  return mAP, mAR, f1_score

This is the way I found to calculate mAP, mAR and F1-score. What do you think? I believe I am on the right path, but I am not an expert in the area and had a lot of difficulty reaching this result, so I welcome any kind of feedback. I hope to contribute in some way!

@sohinimallick

Hello, did this method work for you?

@WillianaLeite (Author)

Hi @sohinimallick !
So far it has worked well

@wiktor-jurek commented Mar 24, 2021

Big thanks for this! It's working on my end so far.

Edit: No it's not, whoops. I'm getting an error when calling evaluate_model. Within the utils.compute_ap function, there is a shape mismatch when calculating intersections. Here's the error dump:

~/project/2_MaskRCNN/mrcnn/utils.py in compute_overlaps_masks(masks1, masks2)
    109 
    110     # intersections and union
--> 111     intersections = np.dot(masks1.T, masks2)
    112     union = area1[:, None] + area2[None, :] - intersections
    113     overlaps = intersections / union

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: shapes (2,65536) and (3136,51) not aligned: 65536 (dim 1) != 3136 (dim 0)

I have a feeling that it's either the fact that I'm using a newer version of TF, or the expand_dims function is not working correctly. What is the expected output when calling expand_dims?

Here's my code for reference 👇

def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None: list_iou_thresholds = np.arange(0.5, 1.01, 0.1)

    APs = []
    ARs = []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(dataset, cfg, image_id)
        scaled_image = modellib.mold_image(image, cfg)
        sample = np.expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = utils.compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)

    mAP = mean(APs)
    mAR = mean(ARs)
    f1_score = 2 * ((mAP * mAR) / (mAP + mAR))

    return mAP, mAR, f1_score

evaluate_model(dataset, model, config)

@WillianaLeite (Author) commented Mar 25, 2021

Hi @wiktor-jurek

I'm using the Colab environment to train my models, and I run this command (magic cell):

%tensorflow_version 1.x

And it gives me the whole environment configured to work with TensorFlow 1.15.2. Colab maintains stable versions of both TensorFlow 1 and 2. I believe the TensorFlow version may be the problem, but I also noticed that your compute_overlaps_masks function is slightly different from the one in my utils.py, so I'm sending you the Mask R-CNN utils.py that I have here: https://drive.google.com/file/d/1EWI3kVvBpKGBoBJ-f0rq_NpoURBszlrR/view?usp=sharing.

@sohinimallick commented Mar 28, 2021

(quoting @wiktor-jurek's comment and code above)

@wiktor-jurek I solved this by setting USE_MINI_MASK = False in both the inference and training configs.
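(That also explains the shape mismatch in the traceback above: 3136 = 56 * 56 is the flattened size of a mini mask, since the default MINI_MASK_SHAPE is (56, 56), while 65536 = 256 * 256 is a flattened full-resolution mask, so the ground-truth and predicted mask arrays cannot be multiplied together. Turning off USE_MINI_MASK, and passing use_mini_mask=False to load_image_gt, keeps both at full resolution.)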

@sohinimallick

BTW @WillianaLeite, do you have any suggestions for outputting the number of TP/FP somehow?
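One possible sketch for that, based on utils.compute_matches (the matching routine compute_ap uses internally; the helper name here is made up):

import numpy as np
from mrcnn import utils

def count_tp_fp_fn(gt_bbox, gt_class_id, gt_mask, r, iou_threshold=0.5):
    # compute_matches pairs predictions with ground-truth instances at the given IoU
    gt_match, pred_match, overlaps = utils.compute_matches(
        gt_bbox, gt_class_id, gt_mask,
        r["rois"], r["class_ids"], r["scores"], r["masks"],
        iou_threshold=iou_threshold)
    tp = np.sum(pred_match > -1)    # predictions matched to a ground-truth instance
    fp = np.sum(pred_match == -1)   # predictions with no matching ground truth
    fn = np.sum(gt_match == -1)     # ground-truth instances that were missed
    return tp, fp, fn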

@sain0722

Hello @WillianaLeite, I have a question.
Why use mold_image?
What's the difference from just passing in [image]?

@geomaticsbetul

Hello @WillianaLeite, I tried the code you wrote in my own work. I have 5 classes in my dataset.
When I computed the mAP using the method in issue #1839 it was 0.6, but when I tried yours I got 0.3 and the F1-score was 0. What should I do?
How can I increase these values?
I need the F1-score and mAP value for my thesis, please help.
Thanks.

Results:
(0.34649124363358835, 0.0, 0.0)

@wiktor-jurek

...your compute_overlaps_masks function is slightly different from the one in my utils.py, so I'm sending you the Mask R-CNN utils.py that I have here: https://drive.google.com/file/d/1EWI3kVvBpKGBoBJ-f0rq_NpoURBszlrR/view?usp=sharing.

Bingo. That made the difference. Thanks.

@CZ2021 commented Apr 30, 2021

Hello @sain0722, I had the same question regarding mold_image and [image]. Have you received an answer? Or do you know why it is necessary to mold the images before detection?
Thanks.

@temi92 commented May 24, 2021

@sain0722 I believe mold_image does the normalization, i.e. it subtracts the mean pixel values of the dataset.
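For reference, a rough sketch of what mold_image amounts to (not the exact source, but the gist: cast to float and subtract the mean pixel defined in the config):

import numpy as np

def mold_image_sketch(image, config):
    # Roughly what mrcnn.model.mold_image does: float conversion plus
    # mean-pixel subtraction using config.MEAN_PIXEL
    return image.astype(np.float32) - config.MEAN_PIXEL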

@salma-achour

(quoting @geomaticsbetul's question above)

If compute_ar() cannot find compute_recall(...) (for example because it was never imported), the try/except block silently appends 0.0 to the AR list instead of raising an exception.
To fix the 0.0 values in the mAR and F1-score, make sure compute_recall is available, e.g. by calling utils.compute_recall(...) inside the compute_ar() function.
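i.e. the corrected helper would look something like this (same logic as before, just resolving compute_recall through the utils module):

from sklearn import metrics
from mrcnn import utils

def compute_ar(pred_boxes, gt_boxes, list_iou_thresholds):
    AR = []
    for iou_threshold in list_iou_thresholds:
        try:
            # utils.compute_recall resolves even if compute_recall was never imported directly
            recall, _ = utils.compute_recall(pred_boxes, gt_boxes, iou=iou_threshold)
            AR.append(recall)
        except Exception:
            AR.append(0.0)
    return 2 * metrics.auc(list_iou_thresholds, AR)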

@kimile599

Hi, does anyone know how to calculate the mAP for bounding boxes? Most of the calculations I found focus on instance masks; how would you do it for detection? Thank you for the help.

@marcodelmoral

The problem is that compute_ap computes overlaps with masks while compute_recall uses bounding boxes, so the two are not directly comparable.

@nataliameira

@marcojulioarg
Do you disagree with this implementation? How did you do it?

@sauravsolanki commented Aug 29, 2021

Hey, actually, if you are using the matterport TF 2.0 version from here, then you have to add USE_MINI_MASK = False to the config before doing the rest.

@felipetobars commented Jan 5, 2022

@sain0722 @CZ2021 I found that in model.py, the detect() function already applies mold_image() when an image is passed to it, and there is another function called detect_molded() whose docstring says: "Runs the detection pipeline, but expect inputs that are molded already. Used mostly for debugging and inspecting the model." I don't know why mold_image() is applied before calling model.detect(). So should model.detect_molded() be used?

@andreaceruti

@WillianaLeite Hi! While doing my master's thesis I ran into the same problems. I found this issue and have followed the same path as you.
I am using detectron2 for my project, and the metrics available for instance segmentation tasks are AP and AR; in particular, there is an evaluator that uses the standard COCO metrics. To translate AP and AR into F1 I ended up using your same F1 formula.
Honestly, I do not know if this is the right way to present a publication, but IMO AP@IoU=0.50 should be the standard metric. An average F1 could be appended, specifying the method used to obtain it.

@guilhermemarim

. So should model.detect_molded() be used?

I understood the same as you, @felipetobars. I think it's not necessary to use mold_image before calling the detection function. I did some testing and got better results by not using mold_image beforehand.

@felipetobars

(quoting @guilhermemarim's reply above)

I also got better results without using the function @guilhermemarim

@ghost commented Apr 13, 2022

Hello,
looking at the detect() method in the model.py script, I see:

def detect(self, images, verbose=0):
    assert self.mode == "inference", "Create model in inference mode."
    assert len(
        images) == self.config.BATCH_SIZE, "len(images) must be equal to BATCH_SIZE"

    if verbose:
        log("Processing {} images".format(len(images)))
        for image in images:
            log("image", image)

    # Mold inputs to format expected by the neural network
    molded_images, image_metas, windows = self.mold_inputs(images)

    # Validate image sizes
    # All images in a batch MUST be of the same size
    image_shape = molded_images[0].shape
    for g in molded_images[1:]:
        assert g.shape == image_shape,\
            "After resizing, all images must have the same size. Check IMAGE_RESIZE_MODE and image sizes."

Why exactly do you mold the image before giving it to the detect() method? The way you do it, the input image gets molded twice, right?
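If so, a simpler evaluation loop could hand the raw image straight to detect() and let it do the molding internally, something like:

# detect() molds internally, so the raw (unmolded) image can be passed directly
yhat = model.detect([image], verbose=0)
r = yhat[0]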

@DavidRosen commented May 13, 2022

Are you aware that the weighted-average or micro-average recall is just another name for the ordinary accuracy score? Or that the macro-average recall (equal weight per class irrespective of imbalance in number of instances) is just another name for the balanced accuracy score?

Compare them for yourself. They match exactly:

classification_report(y_tst, y_pred_tst, digits=15) =
                  precision         recall            f1-score           support

              0  0.817246835443038 0.683879510095995 0.744638673634889      3021
              1  0.829678021465236 0.696708463949843 0.757401490947817      2552
              2  0.770103092783505 0.630912162162162 0.693593314763231      1184
              3  0.331294597349643 0.844155844155844 0.475841874084919       385
              4  0.394505494505494 0.920512820512820 0.552307692307692       390

       accuracy                                      0.700345193839618      7532
      macro avg  0.628565608309383 0.755233760175333 0.644756609147710      7532
   weighted avg  0.767319254559894 0.700345193839618 0.717240526308044      7532

balanced_accuracy_score(
              y_tst, y_pred_tst) = 0.755233760175333
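For example, with scikit-learn on tiny made-up labels:

from sklearn.metrics import recall_score, accuracy_score, balanced_accuracy_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 0, 1, 2, 2]

# micro-averaged recall == plain accuracy
print(recall_score(y_true, y_pred, average="micro"), accuracy_score(y_true, y_pred))
# macro-averaged recall == balanced accuracy
print(recall_score(y_true, y_pred, average="macro"), balanced_accuracy_score(y_true, y_pred))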

@aliffarisqi

@WillianaLeite thank you so much, it works on my project

@Akintola-Stephen

My question is: what needs to be passed into evaluate_model?
I'd like to see an example of someone calling the function with its parameters.

Thank you.

@Natriumpikant commented Oct 18, 2022

(quoting @andreaceruti's comment above)

Hey @andreaceruti! Facing the same issue at the moment: did you get it to work with detectron2? Thanks in advance :)

@Testbild commented Nov 4, 2022

Hi @WillianaLeite,

thanks for providing your code. I have a question regarding the compute_ap() function from mrcnn that you may be able to answer:

Does it compute the AP and mAP based on the boxes or based on the segmentation masks? I wrote a formula by hand myself that uses the segmentation, and I get different results from the compute_ap() built into mrcnn. However, I do not know whether I am doing something wrong in my function or whether the two are just using different inputs.

Thanks and regards!

@FiyinfobaO

@WillianaLeite I'm sorry, but how is the compute_ar function correct? In compute_ap, precision is calculated at a specific IoU threshold, but compute_ar loops through all the thresholds and calculates the AUC, which is returned as the AR. How does that work?

I mean, even if we wanted the AP across all the thresholds, why wouldn't we use compute_ap_range instead?

But back to the main point: the recall formula looks off to me, and I would appreciate any response, especially if I'm looking at it the wrong way.

@FiyinfobaO

I also just observed that compute_recall doesn't return an average recall the way compute_ap returns the average precision. Moreover, compute_recall is implemented purely from the bounding boxes and doesn't use the predicted masks or the ground-truth masks at all. So how do we compare the results of this function with the results of compute_ap, which uses both the bounding boxes and the masks?

I would appreciate any response, and if I'm wrong, please let me know as well.

@aa217 commented Mar 20, 2023

Hi @WillianaLeite,

I believe that your formula for computing the F1-score is not accurate and is not applicable to your model. Mean average precision (or average precision for a single class) is computed as an estimate of the area under the precision-recall curve. This unification is done because precision and recall are inversely related and change as you alter the threshold.

Furthermore, the F1-score formula is used for binary classification tasks, not for object detection or segmentation. You are better off sticking to mAP and AR scores to compare your different models.
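For reference, the area estimate from discrete precision/recall points looks roughly like this (made-up values; it mirrors the interpolated sum inside utils.compute_ap):

import numpy as np

# Made-up precision/recall points, already sorted by increasing recall
recalls    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precisions = np.array([1.0, 1.0, 0.8, 0.7, 0.5, 0.3])

# Make precision monotonically decreasing from the right, then sum the
# area under the resulting step curve
for i in range(len(precisions) - 2, -1, -1):
    precisions[i] = max(precisions[i], precisions[i + 1])
AP = np.sum((recalls[1:] - recalls[:-1]) * precisions[1:])
print(AP)   # ~0.66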


@qa1511 commented Aug 10, 2023

Hello.
I am a beginner. I appreciate the code and ideas you provided; in my project I followed your code hoping to output the F1-score, mAP and mAR, but it reported an error. I am inexperienced and hope you can help me.
Thank you very much.
Here is my code:
from tensorflow import expand_dims
from mrcnn.utils import Dataset
from tensorflow.python.keras.backend import mean

from build.lib.mrcnn.model import load_image_gt, mold_image
from build.lib.mrcnn.utils import compute_ap
from mrcnn.utils import compute_recall

def compute_ar(pred_boxes, gt_boxes, list_iou_thresholds):
    AR = []
    for iou_threshold in list_iou_thresholds:
        try:
            recall, _ = compute_recall(pred_boxes, gt_boxes, iou=iou_threshold)
            AR.append(recall)
        except:
            AR.append(0.0)
            pass

    AUC = 2 * (metrics.auc(list_iou_thresholds, AR))
    return AUC

def evaluate_model(Dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None: list_iou_thresholds = np.arange(0.5, 1.01, 0.1)

    APs = []
    ARs = []
    for image_id in Dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(Dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"],
                                                       r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)

    mAP = mean(APs)
    mAR = mean(ARs)
    f1_score = 2 * ((mAP * mAR) / (mAP + mAR))

    return mAP, mAR, f1_score

evaluate_model(Dataset, model, config)

The error it reports:
for image_id in Dataset.image_ids:
TypeError: 'property' object is not iterable
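(Note: this TypeError usually means the Dataset class itself was passed to evaluate_model rather than an instance of your own dataset subclass. A sketch of the fix, where MyDataset and its loader method are placeholders:)

# Build and prepare an instance of your own subclass of mrcnn.utils.Dataset,
# then pass the instance (not the Dataset class) into evaluate_model
test_set = MyDataset()                                 # hypothetical subclass
test_set.load_dataset("dataset/", is_train=False)      # hypothetical loader method
test_set.prepare()

mAP, mAR, f1 = evaluate_model(test_set, model, config)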
