1 Introduction
Simultaneous Localization And Mapping (SLAM) is one of the fundamental problems in mobile robotics [2] and addresses the reconstruction of a previously unseen environment while simultaneously localizing a mobile robot with respect to it. While the representation of the robot pose depends on the degrees of freedom of its motion, the representation of the map depends on a multitude of factors, including the available sensors, computational resources, intended high-level task, and required precision. Many possible representations have been proposed.
For visual SLAM, the simplest representation of the map is a collection of 3D points that correspond to salient image feature points. This representation is sparse and efficient to compute and update, and point-based methods have been successfully used to map city-scale environments. However, this sparsity comes at a price: point-based maps lack semantic information and are not useful for high-level tasks such as grasping and manipulation. Although methods to compute denser representations have been proposed [6, 7, 23, 24, 22], these representations remain equivalent to a collection of points and therefore carry no additional semantic information.
Man-made environments contain many objects that could potentially be used as landmarks in a SLAM map, encapsulating a higher level of information than a set of points. Previous object-based SLAM efforts have mostly relied on a database of predefined objects, which must be recognised and to which a precise 3D model must be fit to establish correspondence with the observation in the image [29]. Other work [1] has admitted more general objects (and constraints), but only in a slow, offline structure-from-motion context. In contrast, we are concerned with live (real-time) SLAM, yet we seek to represent a wide variety of objects. Like [1], we are not so concerned with high-fidelity reconstruction of individual objects as with representing the location, orientation and rough shape of objects. A suitable representation is therefore potentially a quadric [10, 31], which allows a compact description of rough extent.
In addition to objects, much of the large-scale structure of a general scene (especially indoors) comprises dominant planar surfaces. Including planes in a SLAM map has also been explored before [28, 15]. Planes are also a good representation for feature-deprived regions, where they provide information complementary to points and can represent significant portions of the environment with very few parameters, leading to a representation that can be constructed and updated online [15]. Pertinent to our purpose, such a representation also provides the potential for additional constraints on the points that lie on one of the planes, and permits the introduction of useful affordance constraints between objects and their supporting planes, as we explain later in the paper. All of these constraints lead to a better estimate of the camera pose.
Modern SLAM is usually formulated as an unconstrained sparse nonlinear least-squares problem [13], and the sparsity structure of the problem greatly affects the computation time of the system. If planes and quadrics are to be introduced into a SLAM system, they should be represented in a way that is amenable to the nonlinear least-squares formulation and respects the sparsity pattern of the SLAM problem.
In this work, we propose a map representation that consists of points together with higher-level geometric entities, namely planes and objects, as landmarks. Unlike previous work such as [1], we explicitly target real-time performance and integrate within an online SLAM framework. Such performance would be impossible with uncritical choices of representation, and to that end we propose a novel representation of objects based on quadrics that decomposes to permit a clean, fast and effective real-time implementation. We show that this representation, along with point-plane, plane-plane (Manhattan), and plane-object (supporting) constraints, greatly reduces the error in the estimated camera trajectory without incurring great extra computational cost. Because of the higher-level primitives in the map, the representation remains compact, yet carries crucial semantic information about the scene. To the best of the authors' knowledge, this is the first real-time SLAM system in the literature that incorporates both higher-level plane primitives and previously unseen objects as landmarks.
The main contributions of this paper are: (1) a novel representation and decomposition of dual quadrics, with its related factors for integrating objects into the SLAM factor graph, which is amenable to the nonlinear least-squares framework and allows CNN-based object detections to be seamlessly integrated into SLAM; (2) a supporting affordance relationship between quadric objects and planes in the SLAM factor graph, made possible by the proposed representation; and (3) the integration of all of the higher-level primitives, planes and quadrics, along with points and the geometric relationships among them (Manhattan assumptions and supporting/tangency constraints), in a complete online keyframe-based SLAM system that performs in near real-time.
The remainder of the paper is organized as follows. In the next section, we present the background for SLAM as the solution of a factor graph, and explain how our proposal is integrated into such a framework. In particular, we give detailed descriptions of the mathematical representations of each landmark and the factors they induce. Section 4 presents an overview of how the preceding material is integrated into an overall SLAM system. Experiments showing the efficacy and comparative performance of our system are presented in Section 5. We conclude with a summary and discussion of future research directions.
2 Related Work
SLAM is a well-studied problem in mobile robotics and many different solutions have been proposed for solving it. The most recent of these is the graph-based approach, which formulates SLAM as a nonlinear least-squares problem [13]. SLAM with cameras has also seen advances in theory and good implementations that have led to many real-time systems, from sparse [20, 6] to semi-dense [7, 8] to fully dense [23, 22, 24].
Recently, there has been a lot of interest in extending the capability of a point-based representation, either by applying the same techniques to other geometric primitives or by fusing points with lines or planes for better accuracy. Several methods have explored replacing points with lines [17, 11]. However, lines present particular difficulty because of the lack of a good mathematical representation that is amenable to the least-squares framework. Works that use lines and points in the same framework [25, 12] have been more successful.
Recently, [15] proposed a representation for infinite planes that is amenable to use in a least-squares framework, and presented a method that works using only the information of the planes visible in the environment. Similarly, [35] use a monocular input to generate plane hypotheses with a Convolutional Neural Network (CNN), which are then refined over time using both the planes and the points in the images.
[33] proposed a method that fuses points and planes using an RGB-D sensor; the latter works try to fuse the information of plane entities to increase the accuracy of depth inference. A quadric-based representation was first proposed in [3] and later used in a structure-from-motion setup [10]. [32] presented a semantic mapping system that couples object detection with RGB-D SLAM to reconstruct precise object models in the environment; however, the object models do not inform localization. [29] presented an object-based SLAM system that uses pre-scanned object models as landmarks, but it cannot be generalized to unseen objects. [19] presented a system that fuses multiple semantic predictions with a dense map reconstruction; SLAM is used as the backbone to establish multiple-view correspondences for the fusion of semantic labels, again without informing localization.
3 Landmark Representations
For object-oriented SLAM, the map comprises not only points but higher-level entities representing landmarks that aim to be more semantically meaningful than sparse points. However, to maintain real-time operation, there is a trade-off between the complexity of the landmark representation and the computational cost of tracking and mapping.
In this work we consider two kinds of landmarks, which admit efficient implementation but can broadly capture the overall structure of many scenes, especially those captured indoors: a) plane landmarks, whose role is to encapsulate the high-level structure of regions; and b) quadrics (more specifically ellipsoids), which serve as a general representation of objects in the scene, capturing not detailed shape but key properties such as size, extent, position and orientation. We introduce representations for both types of primitive that allow for efficient implementation in a SLAM framework, as well as admitting clean and effective constraints between primitives, such as the supporting constraint between objects and planes.
3.1 Quadric Representation
As noted above, we represent general objects in a scene using an ellipsoid. Generally speaking, a quadric surface in 3D space can be represented by a homogeneous quadratic form defined on the 3D projective space which satisfies $\mathbf{x}^\top \mathbf{Q} \mathbf{x} = 0$, where $\mathbf{x}$ is a homogeneous 3D point and $\mathbf{Q}$ is the symmetric $4 \times 4$ matrix representing the quadric surface. However, the relationship between a point quadric and its projection into a camera (a conic) is not straightforward [14]. A widely accepted alternative is to make use of the dual space [3, 10, 31], in which a quadric is represented as the envelope of the set of planes tangent to it, viz:

$\boldsymbol{\pi}^\top \mathbf{Q}^* \boldsymbol{\pi} = 0 \qquad (1)$

where $\boldsymbol{\pi}$ is a homogeneous tangent plane and $\mathbf{Q}^*$ is the dual quadric (the adjugate of $\mathbf{Q}$).
This greatly simplifies the relationship between the quadric and its projection to a conic; however, a further problem remains in the context of optimisation in a graph-SLAM framework. The issue is that an update of $\mathbf{Q}^*$, given a 9-dim error vector $\boldsymbol{\delta}$ in the tangent space of $\mathbf{Q}^*$, should be constrained to lie along a geodesic of the manifold. But finding these geodesics and updating with respect to them is computationally expensive, making a "straightforward" quadric representation intractable for incremental optimisation. We seek to address both of these issues. For our object representation, we would like to restrict landmarks to the set of bounded quadrics, namely ellipsoids. Doing so requires imposing the constraint that $\mathbf{Q}^*$ must have 3 positive and 1 negative eigenvalues. Based on this restriction, the representation of a dual ellipsoid $\mathbf{Q}^*$ can be decomposed as:

$\mathbf{Q}^* = \mathbf{Z}\, \breve{\mathbf{Q}}^*\, \mathbf{Z}^\top \qquad (2)$

where $\mathbf{Z} \in SE(3)$ transforms an axis-aligned (canonical) quadric at the origin, $\breve{\mathbf{Q}}^* = \mathrm{diag}(a^2, b^2, c^2, -1)$, to a desired pose, and $a$, $b$, $c$ denote the scale of the canonical ellipsoid along its principal axes.
Optimizing over the space of quadrics must impose constraints on the eigenvalues of $\mathbf{Q}^*$ to force the solution to be an ellipsoid. Recently, [27] and [10] have parameterized ellipsoids to overcome this problem: they optimize over the space of ellipsoids to localise each quadric from its conic observations. However, their approach requires solving a constrained least-squares problem. While such a parametrization is useful when quadrics are observed in camera frames as conics, its constrained nature prevents its use for generic constraints in the graph-SLAM problem. The authors in [10] decompose the translation part of the representation, mainly for numerical stability in the optimisation given the different scales of the translation and the other parts of the representation, and impose some prior knowledge on the shape of the ellipsoids. For a more efficient representation of ellipsoids in graph-based SLAM, we exploit the underlying structure of $\mathbf{Q}^*$ to represent the dual quadric as follows:
$\mathbf{Q}^* = \mathbf{Z}\, \mathrm{diag}(e^{\tilde{a}}, e^{\tilde{b}}, e^{\tilde{c}}, -1)\, \mathbf{Z}^\top \qquad (3)$

with real numbers $\tilde{a} = \log a^2$, $\tilde{b} = \log b^2$ and $\tilde{c} = \log c^2$, so that the exponentials guarantee the required positive eigenvalues. We thus represent a dual ellipsoid using a tuple $(\mathbf{Z}, \mathbf{D})$, where $\mathbf{Z} \in SE(3)$ and $\mathbf{D} = \mathrm{diag}(\tilde{a}, \tilde{b}, \tilde{c})$ lives in $D(3)$, the space of real $3 \times 3$ diagonal matrices; i.e. an axis-aligned ellipsoid accompanied by a rigid transformation. This decomposition exploits the underlying structure of the manifold of $\mathbf{Q}^*$, ensuring we remain in the space of ellipsoids without needing to solve a constrained optimisation problem.
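As a concrete sketch, the decomposition of Eq. (2)-(3) can be written in a few lines of numpy; the function name and variable names are ours, and the placement of the shape parameters as exponentials on the diagonal follows the description above:

```python
import numpy as np

def dual_ellipsoid(Z, s_tilde):
    """Compose a dual ellipsoid Q* = Z diag(exp(s1), exp(s2), exp(s3), -1) Z^T.

    Z       -- 4x4 homogeneous rigid transform (an element of SE(3))
    s_tilde -- 3 unconstrained real shape parameters; the exponential keeps
               the first three eigenvalues positive, so the result always has
               the 3-positive / 1-negative signature of a bounded quadric.
    """
    D = np.diag(np.concatenate([np.exp(s_tilde), [-1.0]]))
    return Z @ D @ Z.T

# any real-valued shape vector yields a valid ellipsoid
Q = dual_ellipsoid(np.eye(4), np.array([0.1, -0.4, 2.0]))
eigs = np.linalg.eigvalsh(Q)
```

By Sylvester's law of inertia, the congruence by $\mathbf{Z}$ preserves the eigenvalue signature, which is why no constrained optimisation is needed.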
We update the two components separately, in the underlying 6D space of $SE(3)$ and the 3D space of $D(3)$; both are Lie groups and can be updated efficiently via their respective Lie algebras. Thus the proposed update rule is:

$(\mathbf{Z}, \mathbf{D}) \boxplus \boldsymbol{\delta} = \big(\mathbf{Z} \cdot \exp(\boldsymbol{\delta}_z),\; \mathbf{D} + \mathrm{diag}(\boldsymbol{\delta}_d)\big) \qquad (4)$

where $\boxplus$ is the mapping for updating ellipsoids, $\boldsymbol{\delta}_d$ is the update for $\mathbf{D}$, which comes from the first 3 elements of the error vector $\boldsymbol{\delta}$ and applies in the Euclidean space of $D(3)$, and $\boldsymbol{\delta}_z$ is the update for $\mathbf{Z}$, which comes from the last 6 elements of $\boldsymbol{\delta}$ and applies in the Lie algebra $\mathfrak{se}(3)$ of $SE(3)$. This decoupled update is a good approximation given the incremental nature of the evidence.
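The decoupled update can be sketched as follows; `se3_exp` is a textbook exponential map for SE(3), and the split of the 9-vector (first 3 entries for shape, last 6 for pose) follows the description above:

```python
import numpy as np

def se3_exp(xi):
    """Exponential map for SE(3); xi = (rho, phi), phi being the rotation part."""
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if theta < 1e-9:                       # small-angle limit
        R, V = np.eye(3) + K, np.eye(3)
    else:
        R = (np.eye(3) + np.sin(theta) / theta * K
             + (1 - np.cos(theta)) / theta**2 * K @ K)
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * K
             + (theta - np.sin(theta)) / theta**3 * K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ rho
    return T

def update(Z, s_tilde, delta):
    """Decoupled 9-dim update: the first 3 entries move the shape in R^3,
    the last 6 move the pose along SE(3) via the exponential map."""
    return Z @ se3_exp(delta[3:]), s_tilde + delta[:3]
```

A zero error vector leaves the ellipsoid unchanged, as expected of a retraction.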
This proposed representation of ellipsoids is particularly beneficial when we want to impose constraints on different parts of the representation. For instance, it makes it possible to apply prior knowledge about the shapes and sizes of objects using the $\mathbf{D}$ component, prior information about the location and orientation of the object using the $\mathbf{Z}$ component, and adjacency/supporting constraints (see Section 3.3).
3.2 Plane Representation
To represent planes as structural entities in the map, we represent an infinite plane by its normalised homogeneous coordinates $\boldsymbol{\pi} = (\mathbf{n}^\top, d)^\top$, where $\mathbf{n}$ is the unit normal vector and $d$ is the distance of the plane to the origin. The choice of normalised homogeneous vectors is inspired by [15]: it gives a minimal representation for planes, avoiding rank-deficient information matrices in the optimization. This representation of planes is isomorphic to the northern hemisphere of $S^3$, so the optimisation can be performed using three elements that represent a vector in its tangent space.
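A minimal sketch of this normalisation; the sign convention used to pick a unique hemisphere representative is our assumption:

```python
import numpy as np

def normalize_plane(pi):
    """Scale a homogeneous plane (n, d) to unit 4-vector norm.

    pi and -pi describe the same plane, so we also fix the sign to pick a
    unique representative on one hemisphere of S^3 (our convention: d >= 0).
    """
    pi = np.asarray(pi, dtype=float) / np.linalg.norm(pi)
    if pi[3] < 0:
        pi = -pi
    return pi

# the plane 2z - 4 = 0, i.e. z = 2, in un-normalised homogeneous form
p = normalize_plane([0.0, 0.0, 2.0, -4.0])
```

Both a plane and its negation map to the same representative, which is what makes the representation minimal.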
3.3 SLAM as a Factor Graph
Following the seminal work of [4], it is now well known that SLAM can be represented as a factor graph, in which the vertices represent the variables to be estimated, such as robot poses and 3D points, and the edges represent constraints (factors) between the vertices.
In a traditional point-based SLAM system, factors exist between points and the camera that seek to minimize the reprojection error:

$f(\mathbf{T}, \mathbf{X}) = \| \pi(\mathbf{T}, \mathbf{X}) - \mathbf{u} \|^2_{\Sigma} \qquad (5)$

where $\mathbf{X}$ represents a point in the world, $\mathbf{T} \in SE(3)$ is the pose of the camera, which takes a point in the current camera frame to a point in the world frame, $\mathbf{u}$ is the pixel location at which the point is observed, and $\pi(\cdot)$ is the function that projects a world point into the camera. $\|\mathbf{e}\|^2_{\Sigma}$ is the squared Mahalanobis norm, equal to $\mathbf{e}^\top \Sigma^{-1} \mathbf{e}$, where $\Sigma$ is the covariance matrix associated with the factor. Likewise, if odometry is known between two robot positions, a factor involving the two robot poses can be formulated as:

$f(\mathbf{T}_i, \mathbf{T}_j) = \| \log(\mathbf{T}_{ij}^{-1}\, \mathbf{T}_i^{-1} \mathbf{T}_j) \|^2_{\Sigma} \qquad (6)$

where $\mathbf{T}_{ij}$ is the measured odometry between poses $\mathbf{T}_i$ and $\mathbf{T}_j$, and $\log(\cdot)$ maps the residual pose to $\mathfrak{se}(3)$.
The solution to the SLAM problem is a configuration of the vertices that minimizes the error over all the involved factors.
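For concreteness, the reprojection residual of a single point factor can be sketched as below, assuming a pinhole model and, as in the text, a camera pose that maps camera coordinates to world coordinates; the helper names are ours:

```python
import numpy as np

def reproj_error(T_wc, X_w, u, K):
    """Pixel residual for a world point X_w observed at pixel u from pose T_wc
    (the transform taking camera coordinates to world coordinates)."""
    X_c = np.linalg.inv(T_wc) @ np.append(X_w, 1.0)   # world -> camera
    x = K @ X_c[:3]
    return x[:2] / x[2] - u                           # perspective division

def mahalanobis_sq(e, Sigma):
    """Squared Mahalanobis norm e^T Sigma^{-1} e used to weight a factor."""
    return float(e @ np.linalg.solve(Sigma, e))

# a point on the optical axis projects exactly onto the principal point
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
e = reproj_error(np.eye(4), np.array([0.0, 0.0, 2.0]), np.array([320.0, 240.0]), K)
```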
In our proposed object-oriented SLAM representation, the vertices in the SLAM graph consist not only of points but potentially also of planes and/or general objects (represented by quadrics). Fig. 1 shows the various factors involving cameras, points, planes, and quadric objects in our system. Below we describe in more detail how the new components of our SLAM system are introduced as additional factors in the graph.
3.3.1 Observations of Objects (Ellipsoids).
A quadric in the scene projects to a conic in an image [14]:

$\mathbf{C}^* = \mathbf{P}\, \mathbf{Q}^*\, \mathbf{P}^\top \qquad (7)$

where $\mathbf{P} = \mathbf{K} [\mathbf{R} \,|\, \mathbf{t}]$ is the projection matrix of the camera with calibration matrix $\mathbf{K}$, and $[\mathbf{R} \,|\, \mathbf{t}]$ is derived from the pose of the camera. For an observed conic $\hat{\mathbf{C}}^*$, we take the observation error for quadric $\mathbf{Q}^*$ to be the Frobenius norm of the difference between the normalized observed conic and the normalized projected conic:

$f(\mathbf{Q}^*, \mathbf{T}) = \left\| \frac{\mathbf{C}^*}{\|\mathbf{C}^*\|_F} - \frac{\hat{\mathbf{C}}^*}{\|\hat{\mathbf{C}}^*\|_F} \right\|_F \qquad (8)$
which forms a factor between the quadric and the camera pose.
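A sketch of this factor, with the projection written as $\mathbf{C}^* = \mathbf{P}\mathbf{Q}^*\mathbf{P}^\top$; normalising both conics by their Frobenius norms makes the error invariant to the arbitrary homogeneous scale:

```python
import numpy as np

def project_quadric(Q_star, K, T_wc):
    """Dual conic C* = P Q* P^T, with P = K [R|t] built from the inverse of
    the camera-to-world pose T_wc."""
    T_cw = np.linalg.inv(T_wc)
    P = K @ T_cw[:3, :]
    return P @ Q_star @ P.T

def conic_error(C_obs, C_proj):
    """Frobenius norm between the two Frobenius-normalised conics."""
    nrm = lambda C: C / np.linalg.norm(C)
    return float(np.linalg.norm(nrm(C_obs) - nrm(C_proj)))

# unit sphere at the origin (dual form, up to scale), seen from 5 m back on -z
Q = np.diag([1.0, 1.0, 1.0, -1.0])
T = np.eye(4)
T[2, 3] = -5.0
C = project_quadric(Q, np.eye(3), T)
```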
3.3.2 Observations of Planes.
If we denote the observation of a plane $\boldsymbol{\pi}$ from a camera with pose $\mathbf{T}$ by $\hat{\boldsymbol{\pi}}$, we can measure the observation error by:

$f(\boldsymbol{\pi}, \mathbf{T}) = d\big(\mathbf{T}^\top \boldsymbol{\pi},\; \hat{\boldsymbol{\pi}}\big) \qquad (9)$

where $\mathbf{T}^\top \boldsymbol{\pi}$ is the plane transformed into the camera coordinate frame and $d(\cdot, \cdot)$ is the distance function in the tangent space of $S^3$. For more details regarding plane updates and the corresponding exponential map, refer to [15].
3.3.3 Point-Plane Constraints.
If we believe that a point actually lies on a specific plane, it makes sense to impose a constraint between the point and the relevant plane landmark. To do so we introduce the following factor:

$f(\boldsymbol{\pi}, \mathbf{X}) = \mathbf{n}^\top (\mathbf{X} - \mathbf{X}_0) \qquad (10)$

which simply measures the orthogonal distance of the point $\mathbf{X}$ from the plane with unit normal vector $\mathbf{n}$; $\mathbf{X}_0$ is an arbitrary point on the infinite plane.
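With the normalised homogeneous plane $(\mathbf{n}, d)$, this residual reduces to $\mathbf{n}^\top \mathbf{X} + d$, since $\mathbf{n}^\top \mathbf{X}_0 = -d$ for any point on the plane; a sketch:

```python
import numpy as np

def point_plane_error(pi, X):
    """Signed orthogonal distance n.(X - X0) = n.X + d for a normalised
    homogeneous plane pi = (n, d) with unit normal n."""
    n, d = pi[:3], pi[3]
    return float(n @ X + d)

# the plane z = 1 written as (0, 0, 1, -1)
plane = np.array([0.0, 0.0, 1.0, -1.0])
```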
3.3.4 Plane-Plane Constraints (Manhattan Assumption).
Imposing constraints on relative plane orientations is simply a matter of introducing a factor on the plane surface normals. The most useful and common such constraints (especially indoors) are those associated with a Manhattan world, in which planes are mostly mutually orthogonal or parallel. Constraints between planes $\boldsymbol{\pi}_1$ and $\boldsymbol{\pi}_2$ with unit normal vectors $\mathbf{n}_1$ and $\mathbf{n}_2$, respectively, are implemented as:

$f_{\parallel}(\boldsymbol{\pi}_1, \boldsymbol{\pi}_2) = \| \mathbf{n}_1 \times \mathbf{n}_2 \| \qquad (11)$

$f_{\perp}(\boldsymbol{\pi}_1, \boldsymbol{\pi}_2) = \mathbf{n}_1^\top \mathbf{n}_2 \qquad (12)$
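One standard way to realise these two residuals on unit normals is a cross product for parallelism and a dot product for orthogonality; a sketch (the exact functional forms are a common choice, stated here as our reading of the factors):

```python
import numpy as np

def parallel_error(n1, n2):
    """Zero iff the two unit normals are parallel (or anti-parallel)."""
    return float(np.linalg.norm(np.cross(n1, n2)))

def perpendicular_error(n1, n2):
    """Zero iff the two unit normals are orthogonal."""
    return abs(float(n1 @ n2))

x_axis = np.array([1.0, 0.0, 0.0])
y_axis = np.array([0.0, 1.0, 0.0])
```

Using the cross-product norm makes the parallel factor insensitive to the sign ambiguity of the normals.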
3.3.5 Supporting/Tangency Constraints.
Almost all stable objects in a scene are supported by structural entities such as planes; objects are commonly found on the floor or on a desk, for example. Such an affordance relationship can be imposed between a quadric object and a structural infinite plane by introducing a geometric tangency constraint between them. To the best of our knowledge, this is the first time that such a constraint has been included in an online SLAM system.
Although imposing a tangency constraint in the space of point quadrics would be awkward, in the dual space such a constraint takes a particularly simple form:

$f(\mathbf{Q}^*, \boldsymbol{\pi}) = \boldsymbol{\pi}^\top \mathbf{Q}^* \boldsymbol{\pi} \qquad (13)$

where $\boldsymbol{\pi}$ is the normalised homogeneous plane that supports the quadric $\mathbf{Q}^*$; the factor is zero exactly when the plane is tangent to the quadric.
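The tangency residual is a single quadratic form; a sketch using the dual quadric of a unit sphere, for which any plane with unit normal at distance 1 from the origin is tangent:

```python
import numpy as np

def tangency_error(pi, Q_star):
    """pi^T Q* pi -- zero exactly when plane pi is tangent to the quadric."""
    return float(pi @ Q_star @ pi)

# dual quadric of the unit sphere at the origin (up to scale)
Q_sphere = np.diag([1.0, 1.0, 1.0, -1.0])
```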
4 System Implementation
A modern SLAM system can be divided into two functional parts: a) a frontend, which processes raw sensory input to initialize vertices and factors; and b) a backend, which optimizes the SLAM graph to produce an updated estimate of the vertices. In this section, we provide an overview of our frontend, which extracts the landmarks, observations and constraints of Section 3 to construct the SLAM graph.
The backend of our SLAM system optimises this graph using a least-squares framework [16]. All of the landmarks and constraints participate in the optimisation whenever a new keyframe is added, as well as when a loop closure is detected. Our system augments the RGB-D variant of the publicly available ORB-SLAM2 [20]. Loops are detected using a bag of words [9] based on ORB features. The pipeline of our system is shown in Fig. 2.
4.0.1 Point Observations.
We rely on the underlying ORB-SLAM2 RGB-D implementation for points: candidate features are extracted based on uniqueness and described using ORB features, with their depth initialized from the depth channel of the input. For data association across frames, ORB features are matched in a coarse-to-fine pyramid within a local window around the previous observation.
4.0.2 Plane Observations.
For planar landmarks, we are interested not only in the parameters of the infinite planes, but also in their visible extent in the current image, so that points can be associated with the planes on which they are observed.
Most plane-fitting models for RGB-D data use RANSAC, which is too slow for the purpose of building a near real-time online SLAM framework. Our plane segmentation instead follows [34], which segments point clouds from RGB-D data in near real-time. For data association across frames, we rely on the sparsity (few dominant planes in the scene) and inherent robustness (little variation frame-to-frame) of these landmarks: using the difference between normals and the distance between planes, data association is done in a nearest-neighbour fashion.
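This nearest-neighbour association might be sketched as follows; the combined cost (sum of normal and offset differences) and the threshold values are illustrative assumptions, not the values used in the system:

```python
import numpy as np

def match_plane(pi_obs, landmarks, ang_thresh=0.1, dist_thresh=0.1):
    """Return the index of the closest gated plane landmark, or None.

    Gating uses the difference between normals and the difference between
    plane offsets, as described in the text; thresholds are hypothetical.
    """
    best, best_cost = None, float("inf")
    for i, pi in enumerate(landmarks):
        ang = np.linalg.norm(pi_obs[:3] - pi[:3])   # normal difference
        off = abs(pi_obs[3] - pi[3])                # offset difference
        if ang < ang_thresh and off < dist_thresh:
            cost = ang + off
            if cost < best_cost:
                best, best_cost = i, cost
    return best

floor = np.array([0.0, 0.0, 1.0, 0.0])
```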
The plane segmentation and matching uses depth data, and is the only part of our system (other than ORB feature depth initialization) that relies on depth information. In the future we aim to remove even this requirement and make the system truly monocular by hypothesizing planes using single-view semantic segmentation, depth and normal estimation, as is now possible with deep networks [5].
4.0.3 Conic Observations.
We use Faster R-CNN [26] with a model pretrained on the COCO dataset [18] to detect bounding boxes for the objects of the scene. From each axis-aligned bounding box, the inscribed ellipse is computed as the conic observation of the corresponding quadric object. To avoid outliers and achieve robust detections, we only consider objects with a detection confidence of 95% or more.
For data association across frames, we utilize the semantic labels and rely on the detection of the object to match the corresponding landmark. If more than one instance of a semantic class is found, we use nearest-neighbour matching in the feature space generated by the detector. This simple strategy is successful with high-confidence object detections, as shown in Section 5.
Note that partial occlusions can result in inconsistent observations of the same object from different viewpoints, which can lead to inaccuracy in the trajectory and map. Our system mitigates the negative impact of partial occlusions as follows: (a) we use robust (Huber) kernels to guard against large errors, and (b) we only consider objects with a detection confidence of 95% or more. With these two measures we have observed largely consistent observations of COCO objects in the experiments of Section 5.
4.0.4 Point-Plane Constraints.
The association between points and planes is established during plane detection and segmentation. After detecting each plane and its finite boundary, its inlier points are determined to be those within a distance threshold, which we set as a function of the distance of the points from the camera, because more distant points have greater uncertainty.
4.0.5 Plane/Manhattan Constraints.
The number of planes detected by our system is sufficiently small that we can consider all possible pairs and introduce constraints with very little impact on the overall speed of operation. At present we adopt the expedient of imposing a parallel constraint if the angle between a pair of planes is less than a threshold, and a perpendicular factor if the angle is within a threshold of 90 degrees; fixed values for both thresholds are used in our experiments.
Manhattan constraints are imposed in a conservative manner with a large uncertainty and act as a prior on the relative orientation of the planes. Based on evidence gathered over image frames, they might end up being perpendicular or parallel but are not forced to be in that configuration if the data strongly favors an opposite interpretation.
4.0.6 Supporting/Tangency Constraints.
A supporting/tangency constraint between a quadric and a plane is imposed based on the orthogonal distance between the centroid of the quadric and the infinite plane: if this distance is less than a threshold, we enforce the tangency constraint. In our experiments this threshold depends on the size of the quadric, through $a$, $b$, and $c$, the half-lengths of the principal axes of the ellipsoid.
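A sketch of this gating rule; tying the threshold to the largest semi-axis and the scale factor of 1.0 are our assumptions, since the exact dependence on the axes is not fixed here:

```python
import numpy as np

def should_attach(pi, centroid, axes, scale=1.0):
    """Gate for adding a supporting factor: the quadric centroid's orthogonal
    distance to the normalised plane must be below a size-dependent threshold.

    Using the largest semi-axis and scale=1.0 are illustrative assumptions.
    """
    n, d = pi[:3], pi[3]
    dist = abs(float(n @ centroid + d))
    return dist < scale * max(axes)

floor = np.array([0.0, 0.0, 1.0, 0.0])   # the plane z = 0
```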
5 Experiments
We evaluate the performance of our SLAM system using the TUM RGB-D benchmark [30] and the NYU-Depth V2 dataset [21]. These sequences cover a wide range of conditions, from plane-rich scenes to scenes with little or no texture, and also scenes with common objects such as those in the COCO dataset [18]. We show qualitative as well as quantitative results of our system using different combinations of the proposed landmarks and constraints, and compare the accuracy of the estimated camera trajectory against the RGB-D variant of the state-of-the-art sparse mapping system ORB-SLAM2 [20].
5.1 Qualitative Results
Some sequences in the TUM RGB-D dataset contain little or no texture, which makes it difficult for point-based SLAM systems to extract and track keypoints. However, these sequences have rich planar structures, which are exploited by our SLAM system. The results of using planes with Manhattan constraints on fr3/str_notex_far and fr1/floor are given in Fig. 3; results for more sequences are reported in the supplementary material. The figure depicts the image frame along with tracked features, detected and segmented planes, and the reconstructed map consisting of points and planes from two different viewpoints. For these sequences, ORB-SLAM2 is unable to detect features in the environment with its normal settings and loses track. Lowering the feature detection threshold in ORB-SLAM2 yields a greater number of features, but also more outliers, leading to less accurate trajectories.








To show the quality of the mapping and tracking with planes and objects along with the Manhattan and supporting constraints, we use the sequences fr1/xyz and fr2/desk from the TUM dataset, and nyu/basement_1a and nyu/office_1 from the NYU dataset. The reconstructions are shown in Fig. 4. The reconstructed map of fr1/xyz is depicted in columns (c) and (d) of the first row. The planar structure of the map is consistent with the ground-truth scene, which consists of two planar monitors orthogonal to the green desk. Quadrics corresponding to objects on the desk have been reconstructed tangent to the desk, their supporting plane. Column (a) shows tracked ORB features and detected COCO objects with a confidence of at least 0.95 at the corresponding frame. The red ellipses in column (a) are the projections of the reconstructed quadric objects; they closely fit the detected blue bounding boxes and their corresponding green computed ellipses.
















We use fr3/cabinet to show the importance of the Manhattan constraint. The sequence contains a loop around a cabinet whose faces are all parallel or perpendicular to each other. Fig. 5 demonstrates the difference in the quality of the reconstruction of the cabinet's sides with and without the Manhattan assumption, in columns (a) and (b) respectively.



Figs. 6(a,b) show the reconstructed quadric corresponding to the object on the desk in fr1/xyz before and after imposing the tangency constraint. Enforcing the tangency constraint ensures that the quadric object is tangent to its supporting plane.


5.2 Quantitative Comparison
We compare the performance of the proposed SLAM system against the RGB-D variant of the state-of-the-art system ORB-SLAM2 on the TUM RGB-D sequences for which ground-truth trajectories are available. This baseline is a monocular point-based system that uses the depth information in the D-channel to initialize 3D points. Our implementation builds directly on their open-source C++ codebase, and we structure our results as an ablation study, considering the effects of introducing different landmarks and constraints. In each case, we report the RMSE Absolute Trajectory Error (ATE) in Table 1. (Comparisons of the RMSE of the relative errors, RTE and RRE, as well as a runtime analysis, are reported in the supplementary material.)
We first consider the case where points are augmented by plane information (PP). This already improves the ATE in each case over the baseline, and enforcing Manhattan constraints (PP+M) improves it further. The Manhattan constraint significantly reduces the trajectory error when dominant structure is present in the scene.
Some sequences do not contain objects similar to those in the COCO dataset. For those that do, we investigate using the combination of points and quadrics (PQ) as landmarks. While this reduces the trajectory drift compared to the baseline, the improvement is smaller than with PP+M. Finally, we report numbers for the full system (PPQ+MS), in which points, planes and quadrics are used as landmarks and the Manhattan and supporting constraints are enforced. For fr3/long_office the improvement is significant (51.07%) because of the presence of a large loop in this sequence, in which all of the point, plane and quadric landmarks participate and are updated by the loop closure.
Dataset  ORB-SLAM2  PP  PP+M  PQ  PPQ+MS

fr1/floor  1.4399  1.3798  1.3246 [8.01%]  —  — 
fr3/cabinet  7.9602  7.3724  2.1675 [72.77%]  —  — 
fr3/str_notex_near  1.6882  1.0883  1.0648 [36.93%]  —  — 
fr3/str_notex_far  2.0007  1.9092  1.3722 [31.41%]  —  — 
fr1/xyz  1.0457  0.9647  0.9231  0.9544  0.9038 [13.57%] 
fr1/desk  2.2668  1.5267  1.4831  1.9821  1.4029 [38.11%] 
fr2/xyz  0.3634  0.3301  0.3174  0.3453  0.3097 [14.78%] 
fr2/rpy  0.3207  0.3126  0.3011  0.3195  0.2870 [10.51%] 
fr2/desk  1.2962  1.2031  1.0186  1.1132  0.8655 [33.23%] 
fr3/long_office  1.5129  1.0601  0.9902  1.3644  0.7403 [51.07%] 
Comparison of the estimated trajectories of our system against ground truth is presented in Fig. 7 for two example TUM sequences.


6 Conclusions
In this work, we have explored the effects of incorporating planes and quadrics as higher-level geometric entities in a sparse point-based SLAM framework. To do so we have introduced a new ellipsoid representation that is easily and effectively updated, and that admits a simple method for imposing constraints between planes and objects. The improved performance due to using points and planes is clearly shown by the experiments, most noticeably when dominant planar structure is present. In cases where not enough planes are present, the point-based SLAM still functions as usual.
Currently, the method works with RGB-D input. As in "vanilla" ORB-SLAM2, 3D map points are initialized with depth obtained from the D-channel of the RGB-D camera. We also use the D-channel to initialise planes; this is both a computational bottleneck and a limitation on the sensor. In future work, we will explore methods that can provide plane estimates from monocular input, enabling a transition to a purely monocular implementation. We also hope to further explore additional inter-object relations and to introduce greater rigour into how and when the constraints are applied.
References

[1]
Bao, S.Y., Bagra, M., Chao, Y.W., Savarese, S.: Semantic structure from motion with points, regions, and objects. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (2012)
 [2] Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics 32(6), 1309–1332 (2016)
 [3] Cross, G., Zisserman, A.: Quadric reconstruction from dual-space geometry. In: Computer Vision, 1998. Sixth International Conference on. pp. 25–31. IEEE (1998)
 [4] Dellaert, F., Kaess, M.: Factor graphs for robot perception. Foundations and Trends in Robotics 6(1-2), 1–139 (2017). https://doi.org/10.1561/2300000043, http://dx.doi.org/10.1561/2300000043
 [5] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. pp. 2650–2658 (2015). https://doi.org/10.1109/ICCV.2015.304
 [6] Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
 [7] Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: Large-scale direct monocular SLAM. In: European Conference on Computer Vision. pp. 834–849. Springer (2014)
 [8] Forster, C., Pizzoli, M., Scaramuzza, D.: SVO: Fast semi-direct monocular visual odometry. In: Robotics and Automation (ICRA), 2014 IEEE International Conference on. pp. 15–22. IEEE (2014)
 [9] Gálvez-López, D., Tardos, J.D.: Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28(5), 1188–1197 (2012)
 [10] Gay, P., Bansal, V., Rubino, C., Bue, A.D.: Probabilistic structure from motion with objects (PSfMO). In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 3094–3103 (Oct 2017). https://doi.org/10.1109/ICCV.2017.334
 [11] Gee, A.P., Mayol-Cuevas, W.: Real-time model-based SLAM using line segments. In: International Symposium on Visual Computing. pp. 354–363. Springer (2006)
 [12] Gomez-Ojeda, R., Moreno, F.A., Scaramuzza, D., Gonzalez-Jimenez, J.: PL-SLAM: A stereo SLAM system through the combination of points and line segments. arXiv preprint arXiv:1705.09479 (2017)
 [13] Grisetti, G., Kummerle, R., Stachniss, C., Burgard, W.: A tutorial on graph-based SLAM. IEEE Intelligent Transportation Systems Magazine 2(4), 31–43 (2010)
 [14] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, New York, NY, USA, 2 edn. (2003)
 [15] Kaess, M.: Simultaneous localization and mapping with infinite planes. In: Robotics and Automation (ICRA), 2015 IEEE International Conference on. pp. 4605–4611. IEEE (2015)
 [16] Kümmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: g2o: A general framework for graph optimization. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on. pp. 3607–3613. IEEE (2011)
 [17] Lemaire, T., Lacroix, S.: Monocularvision based slam using line segments. In: Robotics and Automation, 2007 IEEE International Conference on. pp. 2791–2796. IEEE (2007)
 [18] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
 [19] McCormac, J., Handa, A., Davison, A., Leutenegger, S.: Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on. pp. 4628–4635. IEEE (2017)
 [20] MurArtal, R., Montiel, J.M.M., Tardos, J.D.: Orbslam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31(5), 1147–1163 (2015)
 [21] Nathan Silberman, Derek Hoiem, P.K., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV (2012)
 [22] Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: Realtime dense surface mapping and tracking. In: Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on. pp. 127–136. IEEE (2011)
 [23] Newcombe, R.A., Lovegrove, S.J., Davison, A.J.: Dtam: Dense tracking and mapping in realtime. In: Computer Vision (ICCV), 2011 IEEE International Conference on. pp. 2320–2327. IEEE (2011)
 [24] Prisacariu, V.A., Kähler, O., Golodetz, S., Sapienza, M., Cavallari, T., Torr, P.H., Murray, D.W.: Infinitam v3: A framework for largescale 3d reconstruction with loop closure. arXiv preprint arXiv:1708.00783 (2017)
 [25] Pumarola, A., Vakhitov, A., Agudo, A., Sanfeliu, A., MorenoNoguer, F.: Plslam: Realtime monocular visual slam with points and lines. In: Proc. International Conference on Robotics and Automation (ICRA), IEEE (2017)
 [26] Ren, S., He, K., Girshick, R., Sun, J.: Faster RCNN: Towards realtime object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS) (2015)
 [27] Rubino, C., Crocco, M., Bue, A.D.: 3d object localisation from multiview image detections. IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99), 1–1 (2018). https://doi.org/10.1109/TPAMI.2017.2701373
 [28] SalasMoreno, R.F., Glocken, B., Kelly, P.H.J., Davison, A.J.: Dense planar slam. In: 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). pp. 157–164 (Sept 2014). https://doi.org/10.1109/ISMAR.2014.6948422
 [29] SalasMoreno, R.F., Newcombe, R.A., Strasdat, H., Kelly, P.H.J., Davison, A.J.: SLAM++: simultaneous localisation and mapping at the level of objects. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2013. pp. 1352–1359 (2013). https://doi.org/10.1109/CVPR.2013.178
 [30] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgbd slam systems. In: Proc. of the International Conference on Intelligent Robot Systems (IROS) (Oct 2012)
 [31] Sünderhauf, N., Milford, M.: Dual Quadrics from Object Detection BoundingBoxes as Landmark Representations in SLAM. preprints arXiv:1708.00965 (Aug 2017)
 [32] Sünderhauf, N., Pham, T.T., Latif, Y., Milford, M., Reid, I.: Meaningful maps with objectoriented semantic mapping. In: Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. pp. 5079–5085. IEEE (2017)
 [33] Taguchi, Y., Jian, Y.D., Ramalingam, S., Feng, C.: Pointplane slam for handheld 3d sensors. In: Robotics and Automation (ICRA), 2013 IEEE International Conference on. pp. 5182–5189. IEEE (2013)
 [34] Trevor, A., Gedikli, S., Rusu, R., Christensen, H.: Efficient organized point cloud segmentation with connected components. In: 3rd Workshop on Semantic Perception Mapping and Exploration (SPME), Karlsruhe, Germany (2013)
 [35] Yang, S., Song, Y., Kaess, M., Scherer, S.: Popup slam: Semantic monocular plane slam for lowtexture environments. In: Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. pp. 1222–1229. IEEE (2016)
7 Experiments
In addition to the results in the main paper, the performance of our SLAM system is evaluated on further sequences from the publicly available TUM RGB-D and NYU Depth V2 datasets, as well as on our own RGB-D sequence captured with a UR5 robot arm in our lab.
7.1 Qualitative Results
7.1.1 TUM Dataset.
Results for two additional sequences, fr3/str_notex_near and fr1/desk, are illustrated in Fig. 8. The figure shows the image frame with tracked features and detected objects (column (a)), detected and segmented planes (column (b)), and the reconstructed map consisting of points, planes, and quadric objects, seen from two different viewpoints (columns (c) and (d)).
7.1.2 UR5 Sequence.
To evaluate the reconstruction quality of our SLAM system on a sequence containing more quadric objects, we captured an RGB-D sequence using a UR5 robot arm and a Kinect. In this sequence, the camera moves along a smooth trajectory over a table holding multiple objects. The smooth motion of the robot avoids motion blur and rolling-shutter effects, which helps achieve robust object detection. The UR5 robot arm setup is shown in Fig. 9.
The detected objects in two different frames of the UR5 sequence, together with the reconstructed map, are shown in Fig. 10. No planes are detected in this sequence; the map therefore consists of points and quadric objects as landmarks, without any additional constraints.
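For context, each quadric landmark in the map is a constrained dual quadric (an ellipsoid), in the spirit of [10, 27, 31]. The sketch below illustrates this parameterisation only; it is not our actual implementation, and the matrix type and function names are our own:

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<std::array<double, 4>, 4>;

// Naive 4x4 matrix product (no external linear-algebra dependency).
Mat4 matmul(const Mat4& A, const Mat4& B) {
    Mat4 C{};
    for (int i = 0; i < 4; ++i)
        for (int k = 0; k < 4; ++k)
            for (int j = 0; j < 4; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

Mat4 transpose(const Mat4& A) {
    Mat4 T{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            T[i][j] = A[j][i];
    return T;
}

// Constrained dual quadric (ellipsoid): Q* = Z diag(s1^2, s2^2, s3^2, -1) Z^T,
// where Z is the 4x4 SE(3) pose of the ellipsoid and (s1, s2, s3) are its
// semi-axis lengths: a compact representation of location, orientation, and
// rough extent.
Mat4 dualQuadric(const Mat4& Z, double s1, double s2, double s3) {
    Mat4 D{};
    D[0][0] = s1 * s1;
    D[1][1] = s2 * s2;
    D[2][2] = s3 * s3;
    D[3][3] = -1.0;
    return matmul(matmul(Z, D), transpose(Z));
}
```

For an ellipsoid at the origin with identity rotation, the result is simply the diagonal matrix diag(s1^2, s2^2, s3^2, -1).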
7.2 Quantitative Comparison
The performance of our proposed SLAM system is compared against the RGB-D variant of the state-of-the-art system ORB-SLAM2 on the TUM RGB-D sequences for which ground-truth trajectories are available. The results are structured as an ablation study, considering the effect of introducing different landmarks and constraints. The RMSE of the Absolute Trajectory Error (ATE) is reported in the main paper; the RMSEs of the Relative Translational Error (RTE) and the Relative Rotational Error (RRE) are reported in Table 2 and Table 3, respectively.
Table 2. RMSE of the Relative Translational Error (RTE). Percentages in brackets give the relative improvement over ORB-SLAM2.

Dataset  ORB-SLAM2  PP  PP+M  PQ  PPQ+MS
fr1/floor  4.2161  3.8789  3.6381 [13.71%]  —  — 
fr3/cabinet  15.1002  14.4081  5.1328 [66.01%]  —  — 
fr3/str_notex_near  4.0383  2.4540  2.3533 [41.73%]  —  — 
fr3/str_notex_far  3.4869  3.3523  3.0834 [11.57%]  —  — 
fr1/xyz  1.5693  1.4675  1.2876  1.3795  1.2464 [20.38%] 
fr1/desk  4.0835  3.2994  3.1174  3.7453  3.0237 [25.95%] 
fr2/xyz  1.2107  1.0765  0.9659  1.1964  0.9309 [23.11%] 
fr2/rpy  0.5534  0.5322  0.5073  0.5484  0.4883 [11.76%] 
fr2/desk  4.7783  4.7110  3.6209  4.6309  3.5545 [25.61%] 
fr3/long_office  3.0555  2.6223  2.4750  2.5887  1.8906 [38.12%] 
Table 3. RMSE of the Relative Rotational Error (RRE). Percentages in brackets give the relative improvement over ORB-SLAM2.

Dataset  ORB-SLAM2  PP  PP+M  PQ  PPQ+MS
fr1/floor  3.3229  2.8856  2.7839 [16.22%]  —  — 
fr3/cabinet  6.8639  6.5623  2.9125 [57.57%]  —  — 
fr3/str_notex_near  1.8476  1.1541  1.1125 [39.79%]  —  — 
fr3/str_notex_far  0.8479  0.7679  0.6695 [21.04%]  —  — 
fr1/xyz  0.9871  0.9534  0.9037  0.9433  0.8822 [10.63%] 
fr1/desk  1.8547  1.7817  1.6932  1.8044  1.5214 [17.97%] 
fr2/xyz  0.5036  0.4888  0.4456  0.4904  0.4308 [14.46%] 
fr2/rpy  0.9667  0.9586  0.9452  0.9611  0.8770 [9.28%] 
fr2/desk  1.6062  1.4232  1.1236  1.2559  1.0034 [37.53%] 
fr3/long_office  0.8927  0.8062  0.7109  0.8744  0.6230 [30.21%] 
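All table entries are RMSE values, and the bracketed percentages correspond to the relative improvement over the ORB-SLAM2 baseline. A minimal sketch of how such numbers are computed (the helper names are our own, not from the released code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Root-mean-square of a set of per-frame error magnitudes.
double rmse(const std::vector<double>& errors) {
    double sum = 0.0;
    for (double e : errors) sum += e * e;
    return std::sqrt(sum / errors.size());
}

// Relative improvement (in percent) of our RMSE over the baseline RMSE,
// as reported in brackets in Tables 2 and 3.
double improvementPercent(double baseline, double ours) {
    return 100.0 * (baseline - ours) / baseline;
}
```

For example, for fr1/floor in Table 2, (4.2161 - 3.6381) / 4.2161 yields the reported 13.71% improvement.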
8 Runtime Analysis
All experiments with our SLAM system were carried out on a commodity machine with an Intel Core i7-4790 CPU at 3.6 GHz. The source code is implemented in C++.
In terms of runtime, the bottleneck of the system is the object detection component, which is based on Faster R-CNN and operates at less than 10 frames per second (more than 100 msec per frame). The object detections were therefore pre-computed for all sequences, and the per-frame detection results were fed to the system during online operation. This is not a fundamental restriction of our system and can be alleviated in the future by incorporating a real-time object detection method.
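As an illustration of feeding pre-computed detections to the system, the sketch below parses one detection per line from a cached results stream. The record format and all names here are hypothetical, chosen only for illustration:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// A single cached detection: class label, confidence score, and an
// axis-aligned bounding box (x1, y1, x2, y2) in pixel coordinates.
struct Detection {
    std::string label;
    double score;
    double x1, y1, x2, y2;
};

// Parse one whitespace-separated record per line, e.g.
//   "monitor 0.93 120.0 80.5 310.2 240.0"
// Lines that fail to parse are skipped.
std::vector<Detection> parseDetections(std::istream& in) {
    std::vector<Detection> out;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        Detection d;
        if (ss >> d.label >> d.score >> d.x1 >> d.y1 >> d.x2 >> d.y2)
            out.push_back(d);
    }
    return out;
}
```

During online operation, such records would be read once per frame and passed to the data-association stage in place of a live detector.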
The runtime analysis and average statistics for the different components and threads of our SLAM system, evaluated on the TUM RGB-D and NYU Depth V2 datasets, are shown in Table 4. The system consists of three parallel modules: tracking, local map updates, and a global map update when a loop is closed. The tracking thread must run at frame rate, while the other two can operate at a slower pace. Plane segmentation is performed per frame for data association against the planes already present in the map. The reported numbers are for the full system, which utilizes all landmark types (points, planes, and quadric objects). The local map optimisation is carried out in a parallel thread after a keyframe is created and added to the map.
Table 4. Average runtime of the main components and threads of the system.

Main Components and Threads  Runtime (msec)
Plane Segmentation  23.6 
Tracking & Matching Landmarks  27.1 
Local Mapping Optimization  348.4 
Global Bundle Adjustment  2170.6 
Average Frame Time  51.9 
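The tracking / local-mapping split described above can be sketched as a small producer-consumer pattern. This is an illustrative simplification with invented names, not our actual threading code, and the loop-closure (global bundle adjustment) thread is omitted:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Tracking must keep frame rate, so it only enqueues keyframe IDs;
// local mapping consumes and optimises them asynchronously.
class MappingQueue {
public:
    void push(int keyframeId) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(keyframeId); }
        cv_.notify_one();
    }
    void finish() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
    // Blocks for work; returns false once drained and finish() was called.
    bool pop(int& keyframeId) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        keyframeId = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
    bool done_ = false;
};

// Runs the sketch: every keyframeEvery-th frame becomes a keyframe.
// Returns the number of keyframes the mapping thread processed.
int runPipeline(int numFrames, int keyframeEvery) {
    MappingQueue queue;
    std::atomic<int> optimised{0};
    std::thread localMapping([&] {
        int id;
        while (queue.pop(id)) ++optimised;  // stand-in for local optimisation
    });
    for (int f = 0; f < numFrames; ++f)     // stand-in for the tracking loop
        if (f % keyframeEvery == 0) queue.push(f);
    queue.finish();
    localMapping.join();
    return optimised.load();
}
```

The condition-variable queue lets the tracking loop return immediately after enqueueing, mirroring how the local map optimisation runs in a parallel thread after a keyframe is added.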