The metrics:

  • Evaluation tool: BOPtoolkit
  • For 2D object detection: precision, recall, IoU, mAP
  • For 6D pose estimation: ADD, ADI, CoU, VSD

The first evaluation step computes 2D bounding-box detection metrics, namely the mean Average Precision (mAP) based on Intersection over Union (IoU) scores with a threshold of 0.5. For 6D pose estimation, the most widely used metric is the Average Distance to model points (ADD) error, e_ADD. If the model M has indistinguishable views, the error is computed instead as the Average Distance to the closest model point (ADI). An estimated pose is considered correct if e < θ = k · d, where k is a constant and d is the object diameter; we considered constants k equal to 0.1, 0.2 and 0.3. In addition, we computed recalls for the Complement over Union (CoU) error. This is a mask-based metric: the object is rendered from the CAD model under the estimated pose to obtain a mask, which is then compared with the ground-truth mask. CoU was computed with three thresholds: 0.3, 0.5 and 0.7.
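As a rough illustration of these error terms, the sketch below computes ADD, ADI and CoU with NumPy/SciPy on point clouds and binary masks. It is a minimal sketch, not the BOPtoolkit implementation: the function names are ours, and the estimated-pose mask for CoU is assumed to have been rendered beforehand from the CAD model.

```python
# Minimal sketch of the pose-error metrics described above (ADD, ADI, CoU).
# Function names are illustrative; they are not part of the BOPtoolkit API.
import numpy as np
from scipy.spatial import cKDTree


def transform(pts, R, t):
    """Apply a rigid transform (3x3 rotation R, 3-vector translation t) to Nx3 points."""
    return pts @ R.T + t


def add_error(model_pts, R_est, t_est, R_gt, t_gt):
    """ADD: mean distance between corresponding model points under the two poses."""
    est = transform(model_pts, R_est, t_est)
    gt = transform(model_pts, R_gt, t_gt)
    return np.linalg.norm(est - gt, axis=1).mean()


def adi_error(model_pts, R_est, t_est, R_gt, t_gt):
    """ADI: mean distance to the closest model point, used when the object has
    indistinguishable views (symmetries)."""
    est = transform(model_pts, R_est, t_est)
    gt = transform(model_pts, R_gt, t_gt)
    nn_dist, _ = cKDTree(est).query(gt, k=1)
    return nn_dist.mean()


def pose_is_correct(error, diameter, k=0.1):
    """A pose is accepted if the error is below k times the object diameter."""
    return error < k * diameter


def cou_error(mask_est, mask_gt):
    """CoU: complement over union (1 - IoU) of the rendered estimated-pose mask
    and the ground-truth mask, both binary arrays of the same shape."""
    inter = np.logical_and(mask_est, mask_gt).sum()
    union = np.logical_or(mask_est, mask_gt).sum()
    return 1.0 - inter / union if union > 0 else 1.0
```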

CoU        θ = 0.3    θ = 0.5    θ = 0.7
Box         94.6%     100.0%     100.0%
Cup         94.2%      99.4%     100.0%
Jug         79.6%      97.8%      99.6%
Hotstab     20.2%      65.6%      92.6%
Average    72.15%     90.57%     98.05%

ADI        k = 0.1    k = 0.2    k = 0.3
Box         21.6%      51.0%      60.4%
Cup         55.0%      77.8%      85.8%
Jug         72.2%      86.6%      93.0%
Hotstab     44.0%      64.4%      73.8%
Average     48.2%     69.95%     78.25%

The method comparison:

  • YOLOv4 + AAE
  • EfficientPose
  • YOLO-6D

Despite our dataset being composed mostly of symmetrical objects, we achieved better results than comparable state-of-the-art methods. The following tables compare three methods on the hotstab and the jug in terms of ADD, ADI and mAP. In particular, we chose for comparison YOLO-6D [1], a feature-based approach that relies on the Perspective-n-Point (PnP) algorithm, and EfficientPose [2], a full-frame, single-shot method that achieves some of the highest results on the LineMOD dataset. YOLO-6D and EfficientPose were trained with different input sizes, batch sizes and learning rates, using Adam as optimizer. For EfficientPose, the best performance was obtained with a batch size of 1, a learning rate of 0.0001 and 500 epochs. To ensure comparability with the other two methods, we opted for its lighter version by setting the scaling hyperparameter φ equal to 0. Despite its strong results on LineMOD, EfficientPose achieves very low scores on our data. YOLO-6D performs better than EfficientPose, but it does not reach the scores of our pipeline.

Hotstab          ADD       ADI       mAP (IoU = 0.5)
EfficientPose    1.32%     8.65%     0.7521
YOLO-6D          9.58%     36.71%    0.77
YOLOv4+AAE       11.4%     44.0%     0.99

Jug              ADD       ADI       mAP (IoU = 0.5)
EfficientPose    23%       54.8%     0.7341
YOLO-6D          26.50%    58.44%    0.816
YOLOv4+AAE       29.2%     78.2%     0.995
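
For context on the feature-based baseline, the snippet below sketches the PnP step that methods such as YOLO-6D rely on: the predicted 2D projections of known 3D model points are passed to OpenCV's solvePnP to recover the pose. The keypoint coordinates, box dimensions and camera intrinsics are placeholder values for illustration, not taken from our experiments.

```python
# Hedged sketch of the PnP step used by keypoint-based methods such as YOLO-6D.
# All numeric values below are placeholders, not data from our experiments.
import numpy as np
import cv2

# 3D keypoints defined on the object model, e.g. the 8 bounding-box corners (metres).
object_points = np.array([
    [-0.05, -0.05, -0.05], [ 0.05, -0.05, -0.05],
    [ 0.05,  0.05, -0.05], [-0.05,  0.05, -0.05],
    [-0.05, -0.05,  0.05], [ 0.05, -0.05,  0.05],
    [ 0.05,  0.05,  0.05], [-0.05,  0.05,  0.05],
], dtype=np.float64)

# 2D locations of the same keypoints as predicted by the network (pixels, placeholders).
image_points = np.array([
    [310, 240], [370, 238], [372, 300], [312, 303],
    [318, 232], [378, 230], [380, 292], [320, 295],
], dtype=np.float64)

# Pinhole camera intrinsics (placeholder focal length and principal point).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Solve for the object pose (rotation vector + translation) from 2D-3D correspondences.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)  # convert rotation vector to a 3x3 rotation matrix
print("Rotation:\n", R, "\nTranslation:\n", tvec.ravel())
```
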
[1] Tekin, Bugra, Sudipta N. Sinha, and Pascal Fua. "Real-time seamless single shot 6d object pose prediction." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[2] Bukschat, Yannick, and Marcus Vetter. "EfficientPose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach." arXiv preprint arXiv:2011.04307 (2020).