In this paper, we propose a new visual-inertial Simultaneous Localization and Mapping (SLAM) algorithm. With the tightly coupled fusion of a global-shutter monocular camera and a low-cost Inertial Measurement Unit (IMU), the algorithm achieves robust, real-time estimates of the sensor poses in unknown environments. To address the real-time visual-inertial fusion problem, we present a parallel framework with a novel IMU initialization method. Our algorithm also benefits from a novel IMU factor, a continuous preintegration method, a vision factor based on directional error, the separability trick and a robust initialization criterion, which together efficiently produce reliable estimates in real-time on a modern Central Processing Unit (CPU). Extensive experiments validate the proposed algorithm and show that it is comparable to state-of-the-art methods.
Simultaneous Localization and Mapping (SLAM) has attracted a lot of attention from both the robotics and industrial communities. Laser scanners were the primary sensors in early SLAM works (e.g., [1,2]). However, the size and weight of a laser scanner significantly constrain the agility of the platform, so vision sensors gradually became the trend [3,4,5,6,7,8]. The advantages of vision sensors include lower price, lighter weight and lower power consumption, which are essential for mobile platforms (e.g., Micro Aerial Vehicles). Furthermore, vision sensors can capture the environment’s appearance, color and texture, making it possible to perform high-level tasks such as scene recognition. While stereo cameras [9,10,11], RGB-D cameras [12,13] and omnidirectional cameras have been proven suitable in certain scenarios and applications, monocular SLAM provides a fundamental solution.
A typical SLAM system is composed of a front-end and a back-end. The front-end is in charge of performing data association for the back-end module. For visual SLAM, the feature-based method and the direct method are the two main approaches in the front-end module. The direct method (e.g., [15,16,17]) directly uses the intensity values in the image to estimate structure and motion, and is often more robust than feature-based methods in texture-less scenarios. However, the feature-based method is much less sensitive to exposure adjustment in video/images, and it may be a better choice for robust tracking in richly textured environments due to the invariance of the descriptor. The back-end in SLAM is in charge of state inference after data association. From the viewpoint of the probabilistic framework, the purpose of the back-end is to output the maximum a posteriori (MAP) estimate given the measurements from the front-end. For this purpose, back-end solutions have evolved from filter-based approaches [3,18,19,20,21,22] to graph optimization methods [7,8,23]. The first real-time monocular SLAM system was presented by Davison  within an Extended Kalman Filter (EKF) framework, and Civera  improved its performance with the inverse-depth feature parametrization. However, maintaining the dense covariance matrix in an EKF is very expensive, so the number of features has to be very limited. Compared to filter-based methods, graph optimization exploits the sparse structure of the problem and thus enables fast computation using sparse linear solvers. Current optimization solvers (e.g., g2o , Ceres , iSAM , GTSAM ) are able to solve typical optimization problems with tens of thousands of variables in a few seconds. There also exist different strategies for combining the front-end and back-end. Klein  presented a novel parallel system. This real-time system consists of a tracking thread and a mapping thread.
Motivated by the parallel design, Raul  presented an improved system with the concept of a co-visibility graph for local mapping to efficiently maintain consistency in large-scale environments. Forster  also utilized a parallel system, combining direct tracking for pose estimation with a depth filter for feature estimation. Besides these methods, the sliding-window strategy also shows good performance [11,28,29,30] and keeps the computational time bounded by marginalizing out old states.
On the other hand, the inertial measurement unit (IMU), as a complementary sensor to the camera, is increasingly used in SLAM because it allows the recovery of the global roll, pitch and the undetermined scale in monocular SLAM. The early works on visual-inertial fusion were loosely coupled approaches [21,30,31,32]; tightly coupled approaches, which jointly optimize all state variables, later demonstrated superior performance [11,33,34,35]. Among the tight fusion approaches mentioned above, [11,20,29] are feature-based approaches, which require feature points with a high degree of saliency. Mourikis  provided the MSCKF algorithm, whose consistency analysis was followed up in [35,36], and [11,28,29] adopted an optimization framework with a sliding window to limit the computation. Forster  proposed IMU preintegration on a manifold for sensor fusion together with the iSAM back-end for incremental optimization. In contrast to these feature-based approaches, the direct method fused with an inertial sensor by Alejo Concha  is the first work that combines the direct method with inertial fusion. Although direct methods are able to track features very efficiently, they are more likely to fail due to exposure adjustment in the camera. From the viewpoint of computational complexity, sliding-window approaches like MSCKF can be regarded as constant-time solutions for each visual frame, but they suffer from relatively larger drift since (a) the commonly used marginalization step usually leads to inconsistent estimates because the invariance property is violated [35,37]; and (b) the earlier observations are neglected. The work in  uses an incremental optimization strategy (iSAM), but one disadvantage is the unbounded memory complexity, which can grow continuously over time.
In this paper, we present a visual-inertial navigation system (VINS) that combines the visual SLAM approach and the IMU preintegration technique [33,38] building on the frameworks of ORB-SLAM  and PTAM . Firstly, we derive a new IMU factor, motivated by the work in , with the corresponding preintegration method. The derivation is based on the continuous form, which allows the use of high-order integration schemes such as Runge-Kutta. We stress that the derived IMU factor does not rely on the assumption that the IMU biases remain unchanged between two consecutive keyframes, so our proposed IMU factor can better capture the correlation of state uncertainties. Thanks to the proposed IMU factor, given the IMU poses (up to a scale) and the preintegrated measurements, we derive a linear least squares formulation to initialize the system, which does not need to estimate the state variables separately. More importantly, since the proposed initialization method accounts for the propagated uncertainty and the magnitude of the gravitational vector, we obtain a robust mechanism to decide whether the current information is sufficient for initialization, which is beyond the discussion in [38,39,40]. We then propose a well-designed parallel framework (Figure 1) that runs a tracking thread and a local mapping thread at the same time. In the tracking thread, we only optimize the current IMU state with the IMU preintegration technique and the current vision factor for low computational cost. In the local mapping thread, we optimize all IMU states in the co-visibility graph and all map points observed in it together for a more consistent map. For faster convergence, we employ the separability trick in the optimization, which subtly uses the overlooked property that the IMU velocity and biases are linear in the cost function of the proposed IMU factor.
The rest of the paper is organized as follows. Section 2 introduces the graph optimization used in estimation and the proposed IMU factor with the corresponding preintegration method. Section 3 presents our tightly coupled visual-inertial SLAM algorithm. Section 4 gives the principle of initialization for our monocular visual-inertial SLAM algorithm. The initialization scheme is by no means trivial for a monocular visual-inertial SLAM system, because the initial feature depths and IMU biases can have significant effects on a tightly coupled SLAM system and the estimator usually suffers from ill-conditioned cases (e.g., constant velocity). Notations: To simplify the presentation, the vector transpose operators are omitted for the case .
In this section, we adopt the formalism of the factor graph  and derive a nonlinear least squares formulation to calculate the maximum a posteriori (MAP) estimate of the visual-inertial state estimation problem.
The IMU state to be estimated can be represented by a tuple, i.e.,
where , denotes the IMU pose in the global frame, denotes the IMU velocity expressed in the global frame, denotes the gyroscope bias and denotes the accelerometer bias.
An IMU sensor consists of a 3-axis gyroscope and a 3-axis accelerometer. The gyroscope reading at time t is corrupted by the bias and noise: , where denotes the actual IMU angular velocity at time t and is assumed to be a white Gaussian noise. Note that the effect of the Earth’s rotation is neglected. The accelerometer reading at time t is also corrupted by the bias and noise: , where is the gravity vector in the global frame and is also modeled as a white Gaussian noise.
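As a concrete illustration, the measurement model above can be simulated as follows (a minimal sketch with hypothetical function and variable names: the measured angular velocity is the true value plus a bias and white noise, and the accelerometer measures the specific force, i.e., the body-frame difference between acceleration and gravity, plus a bias and white noise):

```python
import numpy as np

def gyro_reading(omega_true, bias_g, sigma_g, rng):
    """Gyroscope model: measured rate = true rate + bias + white Gaussian noise."""
    return omega_true + bias_g + rng.normal(0.0, sigma_g, 3)

def accel_reading(a_true, R_wb, g_w, bias_a, sigma_a, rng):
    """Accelerometer model: specific force in the body frame.
    R_wb is the body-to-world rotation; g_w is the gravity vector in the
    global (world) frame."""
    return R_wb.T @ (a_true - g_w) + bias_a + rng.normal(0.0, sigma_a, 3)
```

With zero noise and zero bias, a stationary IMU (zero acceleration, identity attitude) reads the negated gravity vector on its accelerometer, as the model predicts.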
Employing the IMU measurement model above and the random walk model for the time-varying IMU biases, we can derive the IMU motion model as follows:
where the skew symmetric operator is given in Appendix, are the measurements from IMU and are the white Gaussian noise with the known covariance .
Ignoring the IMU noise , we have a nominal IMU motion model:
where denotes the nominal IMU state. Given the IMU state and the IMU measurements between the time step i and j, the predicted IMU state at the time-step j can be recursively computed via the nominal motion model
Note that the transformation above represents a series of integral operations, so a naive implementation of computing is time-consuming and memory-intensive. Later we will provide a method to efficiently compute without the need for re-integration.
To concisely quantify the effects of IMU noise, we employ an error between the nominal IMU state and the actual IMU state, motivated by 
Note that the error-state propagation Equation (2) is almost an autonomous linear system, independent of state . Therefore,
The matrix in (12) contains 225 elements. Fortunately, we can simplify the expression of as the following:
where , , , () can be preintegrated by the following differential equation:
with the initial guess and ().
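To make the preintegration idea concrete, the following is a minimal, bias-free Euler-scheme sketch (our own illustration with hypothetical names; the continuous formulation in the paper also admits higher-order integrators such as Runge-Kutta). The increments are accumulated in the frame of keyframe i, so they are independent of the global state and never need to be re-integrated when that state changes:

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix such that skew(w) @ v == cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def expm_so3(phi):
    """Rodrigues' formula for the SO(3) exponential map."""
    theta = np.linalg.norm(phi)
    if theta < 1e-10:
        return np.eye(3) + skew(phi)
    K = skew(phi / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K

def preintegrate(gyro, accel, dt):
    """Accumulate rotation/velocity/position increments in the frame of
    keyframe i (Euler scheme for clarity)."""
    dR = np.eye(3)      # relative rotation increment
    dv = np.zeros(3)    # velocity increment
    dp = np.zeros(3)    # position increment
    for w, a in zip(gyro, accel):
        dp += dv * dt + 0.5 * dR @ a * dt**2
        dv += dR @ a * dt
        dR = dR @ expm_so3(w * dt)
    return dR, dv, dp
```

For example, one second of constant unit acceleration along x with zero rotation yields a velocity increment of 1 m/s and a position increment of 0.5 m, matching the constant-acceleration kinematics.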
Compared to the methods proposed by Forster  and HKST , our derivation is more straightforward and simple. Firstly, our proposed preintegration is continuous by construction. Secondly, unlike the preintegration of Forster and HKST, both of which fix the bias when computing the three preintegration factors and simply use a first-order Taylor expansion for the approximate Jacobian of the IMU bias and the IMU factor, our proposed IMU preintegration factor is based on the entire IMU state (including the biases) and uses continuous differential equations, which can better capture the correlations inside the IMU state. Thirdly, the defined error is invariant under the yaw angle transformation.
Given the IMU state and the IMU measurements between time-steps i and j, we have the predicted state as presented in (4). According to the error definition (5) and the error-state motion model (6), we can obtain the uncertainty between the predicted state and the actual state
where the covariance matrix is integrated from the differential equation (9) with the initial state . In terms of factor graph, we have derived an IMU factor :
Then the proposed preintegration elements in (14) results in a closed-form solution of the predicted state and therefore here we provide the closed form of the error function of the proposed IMU factor (16):
where , and are given in the Appendix. Note that the proposed error function is linear in all variables except and . Later we will use this linearity to design the optimization algorithm.
The conventional vision factor employs the re-projection error as the cost function, which is
where is the projection function, is the camera pose, is the camera calibration matrix, is the transformation from camera to IMU and is the pixel observation of the map point . However, a zero re-projection error only implies that the predicted vector is parallel to the measured , which can result in a large number of local minima. To alleviate this shortcoming, we employ the directional error as the cost function, resulting in a different vision factor. We present this improvement in more detail in Figure 2.
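The distinction between the two error types can be sketched as follows (a simplified illustration with hypothetical names; the directional error compares unit bearing vectors rather than pixel coordinates, so it penalizes the angular deviation between the predicted and measured rays):

```python
import numpy as np

def reprojection_error(p_c, uv, K):
    """Pixel error between the projected 3D point p_c (in the camera frame)
    and the pixel observation uv, given intrinsics K."""
    proj = K @ (p_c / p_c[2])
    return proj[:2] - uv

def directional_error(p_c, uv, K):
    """Difference between the unit bearing vector of the predicted point and
    the unit bearing vector back-projected from the observation."""
    b_pred = p_c / np.linalg.norm(p_c)
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    b_meas = ray / np.linalg.norm(ray)
    return b_pred - b_meas
```

When the prediction matches the observation exactly, both errors vanish; they differ in how residuals grow for points far from the optical axis or behind the camera.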
In our proposed system, optimization is used to correct the error due to sensor noise. Given all IMU measurements and camera measurements along the trajectory, the MAP estimate is
where includes the observed map points and the IMU states from time step 0 to N. Note that and correspond to the IMU factor and the vision factor, as discussed in Section 2.1 and Section 2.2. Based on the theory of factor graph, the optimization problem above can be abstracted into a graph (Figure 3) that consists of nodes ( and ) and factors ( and ). The MAP estimate inference can be transformed into the following nonlinear least squares problem:
Different from the standard least squares problem, the state space of the problem above is non-Euclidean space and thus we integrate the “lift-retraction” strategy into the conventional Gauss-Newton method for solving optimization, which has been summarized in Algorithm 1. Note that in Algorithm 1 is user-defined and we choose
where is given in (11).
|Algorithm 1: Solving Equation (21) by Using the Gauss-Newton Algorithm|
Optimization can provide a relatively accurate estimate of the visual reconstruction and the IMU state. However, the Cholesky decomposition used in solving the normal equation (23) suffers from the complexity, where has the same sparsity as and only contains 1 and 0. To alleviate this, we present a novel way to solve (21), which exploits the overlooked partially linear structure of (21) and the local observability of VINS. The related details are given in Section 3.
Our system is inspired by ORB-SLAM  which simultaneously runs the tracking thread and the local mapping thread in real-time.
The tracking thread is in charge of estimating the latest IMU state, which involves two optimizations. In the first optimization, the initial value is given by the IMU preintegration. Then we search for map point observations by projecting the 3D points. Finally, we perform the small-size optimization (24), which is solved by Algorithm 2. This optimization can be seen as an extension of the pose-only bundle adjustment in ORB-SLAM . Different from , the state variables brought in by the IMU factor are considered. We separate the state vector into two groups and optimize them separately. In the first step, we only update , which allows us to avoid the large drift caused by the low-cost IMU sensor. In the second step, the optimization turns into a linear least squares problem w.r.t. . We describe this solution with more mathematical detail in Remark 2.
After the first optimization, we perform a guided search of the map points from the last frame. A wider search of the map points is used if not enough matches are found. For efficient and robust data association, we use the projection method from the frames in the co-visibility graph to the current camera frame to establish feature correspondences. In order to keep the computational complexity bounded, we only deal with keyframes in the local mapping thread. Therefore, there is a mechanism at the end of the tracking thread that decides whether the current frame is a new keyframe. To insert a new keyframe, all the following conditions must be met: (1) the camera has moved more than 5 cm since the last keyframe; (2) more than 20 frames have passed since the last keyframe; (3) the current frame tracks at least 50 points and the number of points shared with the last keyframe is less than of the last keyframe. The last condition ensures visual change, and the first and second conditions reduce the number of unnecessary keyframes. We also send a waiting signal to stop the local mapping thread, so it can process the new keyframe as soon as possible. The framework of tracking is summarized in Figure 4.
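The keyframe decision above can be written down directly (a sketch with hypothetical names; the overlap fraction in condition (3) is not specified in the text, so `overlap_ratio` below is a placeholder value):

```python
def is_new_keyframe(dist_from_last_kf, frames_since_last_kf,
                    tracked_points, common_points, last_kf_points,
                    overlap_ratio=0.9):
    """All three conditions from the tracking thread must hold.
    overlap_ratio is an assumed placeholder for the paper's elided fraction."""
    cond_translation = dist_from_last_kf > 0.05   # moved more than 5 cm
    cond_frames = frames_since_last_kf > 20       # more than 20 frames passed
    cond_visual = (tracked_points >= 50 and
                   common_points < overlap_ratio * last_kf_points)
    return cond_translation and cond_frames and cond_visual
```

A frame that has moved 10 cm over 25 frames while sharing only 60% of the last keyframe's points qualifies; failing any single condition rejects it.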
To quickly output the estimate , we employ a small-size optimization (24) instead of the full optimization:
where is the previous IMU state and denotes the map points observed in the current step. For efficient estimation, both and are fixed in (24). In addition, an often-overlooked property of (24) is that given the current pose , the optimization becomes a linear least squares problem w.r.t. . Therefore, we employ the separability trick for solving (24), which is summarized in Algorithm 2.
|Algorithm 2: Optimization (24) in Tracking|
Once a keyframe is inserted from the tracking thread, the local mapping thread will begin its work that includes creating map points, deleting map points, deleting keyframes and performing optimization. The flowchart of the local mapping thread is presented in Figure 5, and we also add a graph to present the local mapping thread in Figure 6.
When the local mapping thread gets a new keyframe, the new map points observed in the new keyframe and the local keyframes are created by triangulation. The main steps are the following. First, the projection method from the local keyframes is used to search for feature correspondences. The search is performed in time order: it begins from the last keyframe and stops once it fails to find a match. With the new feature correspondences from the search, we then calculate the coordinates of new map points using fast linear triangulation. In order to get rid of spurious data associations, we only keep the new map points that are observed at least three times. Finally, the co-visibility graph is updated by adding undirected edges between the keyframes that share the same map points.
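The linear triangulation step can be sketched as the standard DLT construction (our own minimal two-view version, assuming 3×4 projection matrices and normalized pixel observations; each observation contributes two linear constraints on the homogeneous point):

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation: stack two constraints per view and take
    the null vector of the resulting 4x4 system via SVD."""
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # homogeneous solution (smallest singular value)
    return X[:3] / X[3]        # dehomogenize
```

With more than two observing keyframes, the extra rows are simply stacked into A, which is what makes the multi-view extension fast.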
Considering that outliers or incorrect feature correspondences will significantly affect the system performance, map points culling is needed before optimization. In the local mapping thread, we check the epipolar constraint and reprojection error of each map point for each keyframe which observes this point. In addition, we also check the parallax angle of each point. If the maximum parallax value is below a threshold, the map point will be removed. In this step, only the map points in the local map are processed.
Deleting redundant keyframes is beneficial for optimization, as it saves computational time. We discard keyframes whose of map points have been seen in at least three other keyframes. When deleting a keyframe, we need to merge the two IMU factors connected to this keyframe into one IMU factor connecting the previous keyframe and the next keyframe. According to the theory of linear systems, we derive an integration algorithm, summarized in Algorithm 3. Figure 6 shows the keyframe process in the optimization graph, from which we can easily see how IMU factors are fused when deleting a keyframe.
|Algorithm 3: The Fusion of Two Consequential IMU Factors|
|Input: two consecutive IMU factors and |
Output: IMU factor
Connected Nodes: the IMU state , the IMU state and the IMU state .
Covariance matrix: .
Measurements: the pre-integrated matrix and the IMU biases .
The last step of the local mapping thread is the optimization (21) with the nodes:
The involved factors are:
To maintain a consistent estimate, we fix the IMU states in (b). Typically, a naive implementation of the optimization here suffers from the computational complexity of solving the reduced normal equation, where n is the number of IMU states in (a). Similar to the separability trick in Algorithm 2, we also use the separability strategy on the optimization here so that the computational complexity can be reduced to , as given in Algorithm 4.
|Algorithm 4: Optimization in Local Mapping|
In this section, we propose a novel initialization method that provides a robust estimate at the beginning stage. Initialization is critical to a visual-inertial SLAM system due to the nonlinearity in the optimization. Inspired by the linearity of the variables in the IMU factor, we propose a linear least squares problem that can estimate the scale, the velocity, the IMU biases and their covariance matrix. To achieve reliable estimates and handle cases of poor observability, the linear estimator keeps running until the uncertainty falls below a threshold. The whole initialization scheme is presented in Figure 7.
In the first step, we employ pure monocular ORB-SLAM  to produce estimates of the frames and their IMU body poses. Note that the absolute scale s is unobservable in pure visual odometry. The output from the pure visual odometry is thus up to the scale s, i.e.,
for , where is the undetermined scale.
Along with the visual estimation, we also perform the IMU preintegration as shown in Section 2.1. This step will output N IMU factors: the IMU factors , , , . Note that here the nominal IMU biases used for the preintegration of (12) and (9) are zeros.
After visual estimation and IMU preintegration, we perform visual-inertial alignment to roughly estimate the scale, gravity, velocity and IMU biases. First of all, we substitute (28) into the IMU cost function; then we can see that the variables s, , and are linear in the term . Fixing the variables for , the MAP problem from (21) becomes
where . Note that here is almost linear to the variable block X. Thus we can straightforwardly obtain the solution of (29) using the linear least square. However, the linear solution for (29) does not consider the magnitude and thus it easily gets ill-conditioned. If we consider the magnitude constraint of , the linear optimization (29) becomes
The new optimization problem (30) is a quadratically constrained quadratic program (QCQP), which is a convex problem. It is well known that a local minimum of a convex optimization problem is always a global minimum. Thus we convert (30) to an equivalent unconstrained form formulated in the factor graph
with a corresponding retraction
where , and can be regarded as the null space of . The use of (32) guarantees that the magnitude of remains unchanged after optimization.
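The retraction idea can be sketched numerically (hypothetical helpers of our own; the increment lives in the two-dimensional tangent plane of the gravity vector, which is the null space of its transpose, so the magnitude constraint is preserved by construction after renormalization):

```python
import numpy as np

def tangent_basis(g):
    """Two orthonormal vectors spanning the tangent plane of g
    (i.e., the null space of g^T)."""
    g = g / np.linalg.norm(g)
    # pick any reference vector not parallel to g
    a = np.array([1.0, 0.0, 0.0]) if abs(g[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    b1 = np.cross(g, a)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(g, b1)
    return np.column_stack([b1, b2])

def retract_gravity(g, delta):
    """Update g with a 2D increment in its tangent plane, then rescale so
    that ||g|| is unchanged by the optimization step."""
    norm = np.linalg.norm(g)
    g_new = g + tangent_basis(g) @ delta
    return g_new * (norm / np.linalg.norm(g_new))
```

Parameterizing the update in the tangent plane also removes the redundant degree of freedom along g itself, which keeps the linear system well conditioned.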
It is well known that a good visual-inertial alignment requires sufficient motion. For robust estimates, we introduce a checking step that is in charge of deciding whether the estimate from the last step (Section 4.3) is safe or not. Here we adopt a value to quantify the accuracy/uncertainty of the estimate , which is the worst-case estimation error [41,42]
where is the information matrix of the scale and gravity, extracted from the Hessian matrix of the optimization problem (31), evaluated at the point . Note that a larger value of indicates larger uncertainty of the gravity and scale.
If is less than a threshold , the system will accept the estimate . To refine the estimate , we perform the optimization process (21) with all IMU preintegration and visual measurements. After this step, the whole initialization is finished.
If is more than the threshold , the system will reject the estimate and wait one time-step before reinitialization. The reinitialization is then performed with all measurements from time-step 0 to time-step . Before the reinitialization, we fuse the IMU factors and to bound the number of IMU factors. The fusion algorithm is given in Algorithm 3.
The algorithm is implemented in C++11 with ceres-solver  as the nonlinear optimization framework. The proposed method runs in real-time (20 Hz) for all experiments on a standard computer (Intel Pentium CPU G840, 2.8 GHz, dual-core, 8 GB RAM). We test and evaluate our monocular visual-inertial SLAM system on both a low-cost, off-the-shelf visual-inertial sensor (Figure 8) and the EuRoC dataset .
At the beginning of the tracking thread (Section 3.1), we select keypoints that are well distributed over the current image. First we detect FAST corners in the 4 pyramid levels of the image. We then split the image into blocks. For each block, we calculate the average Shi-Tomasi score  of the FAST corners inside this block. Then we filter out the FAST corners below a specific threshold and count the remaining FAST corners in each block. If there is a block in which the number of FAST corners is very small (below 20 percent of the median value over all blocks), we set the threshold to half of the original one to get more FAST corners. If there is a block that does not contain any FAST corner, we split the image into blocks and repeat the selection steps above. After extracting the FAST corners, we calculate the orientation and ORB descriptor for each retained corner. This stage takes about 17 ms on our computer.
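The block-wise adaptive thresholding can be sketched as follows (a simplified stand-alone version with hypothetical names, operating on a precomputed corner-score map rather than a real FAST detector; the 20% and halving rules follow the description above):

```python
import numpy as np

def block_corner_counts(scores, grid=(2, 2), thresh=20.0):
    """Count corners above the threshold per block, then halve the threshold
    in blocks whose count falls below 20% of the median block count."""
    H, W = scores.shape
    bh, bw = H // grid[0], W // grid[1]
    blocks = [scores[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
              for i in range(grid[0]) for j in range(grid[1])]
    counts = [int((b > thresh).sum()) for b in blocks]
    median = np.median(counts)
    # second pass: relax the threshold where too few corners survived
    return [int((b > thresh / 2).sum()) if c < 0.2 * median else c
            for b, c in zip(blocks, counts)]
```

A block whose corners all score below the global threshold recovers its corners once the threshold is halved, which is what keeps the keypoints well distributed.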
After obtaining these keypoints with descriptors, we use the preintegrated IMU measurements (Section 2.1) to get the initial guess of the IMU state (1). In order to deal with extreme cases for the low-cost accelerometer, we filter out accelerometer readings that are more than 50 times the last reading. Then we start to perform the guided search of map points in the tracking thread (Section 3.1). Note that the feature correspondences in this step are coupled with the invariance property of the ORB descriptor such that the keypoints in the current frame can be matched with some earlier observations. In addition, we use the efficient subspace dog-leg algorithm in ceres-solver  to implement the nonlinear optimization (24). Multiple threads are used to compute the cost functions and Jacobians to speed up the system.
We pay particular attention to outliers in the local mapping thread (Section 3.2). In order to gain robust performance, the Huber loss function with a scale value of 0.2 is used in the vision factors. To get rid of the effects caused by outliers, we first optimize with the Huber loss function and then delete the vision factors whose cost function value is more than 0.2. We also check the estimated depth between each map point and the keyframes. Map points with a negative depth value are regarded as outliers and deleted. After deleting those map points that are outliers, we perform the nonlinear optimization without the Huber loss function.
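The robust-cost gating described above can be sketched as follows (a simplified stand-alone version of our own; the actual system uses ceres-solver's built-in Huber loss):

```python
import numpy as np

def huber(r, delta=0.2):
    """Huber loss: quadratic near zero, linear in the tails, so large
    residuals (likely outliers) are down-weighted during optimization."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def filter_outliers(residuals, delta=0.2):
    """Keep factors whose robust cost stays below the outlier threshold,
    mirroring the two-pass scheme: robust optimization, then pruning."""
    cost = huber(np.asarray(residuals, dtype=float), delta)
    return cost < delta
```

Because the loss grows only linearly in the tails, a gross outlier cannot dominate the first optimization, and the subsequent cost check cleanly separates it from the inliers.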
The proposed VINS initialization is evaluated on the sequence . Because robust estimates from visual odometry also need enough information, we perform the estimation of the IMU poses during the first 3 s with the visual initialization from ORB-SLAM  and then apply the initialization method presented in the last section. Figure 9 shows the uncertainties of gravity and scale, which converge after 11 s. The convergence means that the information is enough and the system can then work with a reliable initial estimate. The novelty of our method is
Since the proposed initialization method is convex, which implies a unique minimum, we optimize the initialization with the Gauss-Newton method for faster convergence. The Gauss-Newton method is implemented in our own source code with the Eigen C++ library . The time cost of this optimization in the initialization method is 23 ms on average. Note that we do not use the Huber loss function here because there are no outliers in the IMU measurements. Neither RANSAC nor a multi-threaded implementation is needed. After this initialization module, we scale the poses of the cameras and the positions of the map points.
In this subsection, the adopted visual-inertial sensor is the Loitor inertial stereo camera, which is a low-cost device. The Loitor device contains a synchronized global-shutter stereo camera that outputs images at 30 Hz. The device also includes an MPU-6050 IMU running at 200 Hz. The stereo camera and the IMU sensor are synchronized. The sensor is calibrated with the calibration toolbox Kalibr . Note that although the device contains a stereo camera, we only use the output of the left camera for testing our system. The entire algorithm is implemented in C++ using ROS for acquiring device data.
The algorithm is tested in an indoor scene with random texture. No chessboard or special visual tag is used. The algorithm runs at 20 Hz on our computer, with a hand-held Loitor device. Figure 10 shows the well-distributed keypoints in the images. The top view of the estimated trajectory from the proposed system is plotted in Figure 11.
The accuracy of our visual-inertial SLAM is evaluated on the 11 sequences of the EuRoC dataset. The EuRoC dataset provides synchronized global-shutter WVGA stereo images at 20 Hz with MEMS IMU measurements at 200 Hz and ground-truth trajectories in different rooms (Figure 12). The dataset was collected by a MAV, and the ground truth was obtained by a Vicon motion capture system that provides 6-DOF (degree-of-freedom) pose measurements at 100 Hz. For more details we refer to .
IMU initialization is performed inside the SLAM system. Our system fails to run the sequence since the visual-only SLAM fails to initialize under the extreme movement. For the other data sequences, our SLAM algorithm runs in real-time without tracking loss. Table 1 presents the RMSE and standard deviation (in terms of translation) for the different data sequences in the EuRoC dataset. We evaluate the trajectories with the ATE method , which first aligns the trajectories, then groups them by distance, and finally computes the RMSE for each group. We present more details of the comparisons in Figure 13. In Figure 14 we plot some trajectories of our SLAM estimates and the ground truth (in top view).
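The evaluation pipeline can be sketched as follows (a simplified ATE of our own: rigid Kabsch/Umeyama alignment without scale, followed by translational RMSE; the distance-based grouping is omitted for brevity):

```python
import numpy as np

def align_and_rmse(est, gt):
    """Align the estimated trajectory (Nx3) to ground truth with the
    closed-form rigid alignment, then report the RMSE of the translational
    residuals."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(E.T @ G)
    # sign correction keeps the solution a proper rotation (det = +1)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                 # rotation mapping est -> gt
    aligned = (R @ E.T).T + mu_g
    err = aligned - gt
    return np.sqrt((err**2).sum(axis=1).mean())
```

A trajectory that differs from the ground truth only by a rigid transform aligns exactly, giving an RMSE of (numerically) zero.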
We compare our proposed system with two state-of-the-art systems: the stereo-inertial odometry OKVIS  and VINS-MONO  without loop closure, with VINS-MONO with loop closure included for completeness. Figure 13 shows the results of the three systems in terms of RMSE. From the error-bar results in (b), (d), (e) and (f) in Figure 13, we can see that our algorithm significantly outperforms the state-of-the-art algorithms VINS-MONO  and OKVIS  with a stereo camera, which can be explained by the following:
We note that although our algorithm performs with high precision, it fails to run the V103 data sequence. This data sequence has extreme movement at the beginning, which is the main reason for the initialization failure in our system. In comparison, OKVIS  with a stereo camera can run without performing initialization, which allows the algorithm to handle this data sequence easily. However, OKVIS fails to process the V203 data sequence. On the other hand, VINS-MONO  uses sparse optical flow tracking as an independent front-end module for data association. Optical flow is a robust way of tracking features in video, which makes the initialization succeed and lets the algorithm process all the data sequences in the EuRoC dataset. However, that algorithm suffers from lower precision because it ignores the early observations. Besides, optical flow is not accurate at the sub-pixel level for feature tracking.
In this paper, based on the pure monocular ORB-SLAM , we present a monocular visual-inertial SLAM system with our new IMU factor, vision factor and initialization. The proposed visual-inertial SLAM has high precision over the EuRoC dataset. One of the main reasons behind this is the co-visibility graph we adopt from ORB-SLAM : we found that even the stereo ORB-SLAM  without an IMU can have higher precision than OKVIS  with a stereo-inertial sensor. The co-visibility graph makes it possible to utilize the early observations, while sliding-window-based methods ignore them. Our algorithm shows substantial improvement on the Machine Hall data sequences in EuRoC, since the visual texture there is very friendly for ORB feature tracking and co-visibility graph construction. Meanwhile, our algorithm achieves comparable performance with the state-of-the-art method OKVIS, which uses a stereo camera and an IMU sensor, on the Vicon Room data sequences. In these data sequences our algorithm occasionally lost tracking, since they contain many white walls and gray boards on which it is hard to detect any features. This effect can easily be seen in Figure 12.
Since the algorithm still uses the same front-end module and the same keyframe decision as visual ORB-SLAM, we believe there is much room for further improvement. (1) For the front-end, the direct method, which uses gray-scale values directly in the optimization, is promising since it can exploit weak features that carry only gradient information. For example, by modeling the exposure adjustment of the optical camera, the DSO algorithm achieves surprising precision with impressive robustness. (2) For the vision factor in the back-end, we found that our implementation of the directional error has higher precision, but is somewhat slower than the original bundle adjustment; a more detailed comparison between the directional and projection vision factors is left to future research. We also think that discarding map points whose parallax is below a certain threshold is not ideal, because it loses the rotation information those points provide. However, map points with low parallax make the system ill-posed, since their locations are not observable. Therefore, developing a vision factor that can exploit low-parallax points is essential. (3) Some basic techniques such as on-line sensor calibration and loop closure could also be added to the system, and machine learning methods that detect movable objects in the visual observations could be tried.
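To make the distinction between the two vision factors concrete, the following sketch (not from the paper; the helper names and pinhole parameters are our own illustrative choices) contrasts the standard reprojection residual with a directional residual measured as the angle between predicted and observed bearing vectors. The directional form stays well defined even for distant, low-parallax points whose depth is nearly unobservable:

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def reprojection_error(p_cam, obs_px, fx, fy, cx, cy):
    """Standard bundle-adjustment residual: pixel difference after
    pinhole projection of a point p_cam given in the camera frame."""
    u = fx * p_cam[0] / p_cam[2] + cx
    v = fy * p_cam[1] / p_cam[2] + cy
    return [u - obs_px[0], v - obs_px[1]]

def directional_error(p_cam, obs_bearing):
    """Directional residual: angle between the predicted bearing
    (unit vector toward p_cam) and the observed bearing vector.
    Depth cancels out, so distant points still constrain rotation."""
    b_pred = normalize(p_cam)
    b_obs = normalize(obs_bearing)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(b_pred, b_obs))))
    return math.acos(dot)

# A point 5 m in front of the camera, observed exactly along its true ray:
p = [0.1, -0.05, 5.0]
err_px = reprojection_error(p, [330.0, 235.0], 500.0, 500.0, 320.0, 240.0)
err_dir = directional_error(p, [0.2, -0.1, 10.0])  # same direction, doubled depth
```

Note that scaling the observed bearing (i.e., changing the point's depth) leaves the directional residual unchanged, which is exactly why low-parallax points could still contribute rotation information under such a factor.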
This paper demonstrates a new monocular visual-inertial state estimation algorithm with a real-time implementation. The proposed IMU preintegration not only matches the state-of-the-art efficiency, but also has a more linear form that better captures the correlation of the state uncertainties. To increase the speed of the algorithm, the separability trick and the novel vision factor for fast computation are used in both the tracking thread and the local-mapping thread. Thanks to the proposed IMU preintegration with better linearity, the proper weighting and the reasonable criterion for checking the reliability of the estimates, our initialization method is fast and reliable, solving a convex optimization with less uncertainty. We have built a tightly coupled visual-inertial SLAM system that runs in real time in unknown environments. Future work will mainly focus on providing more data to give further insight into the performance of our initialization, and on seeking a better model for the back-end of tightly coupled visual-inertial SLAM.
This project is supported by the National Science Fund of China under Grant 6171136, the Research Fund for the Doctoral Program of Higher Education of China under Grant 20110142110069, and the CALT Aerospace Fund. I would also like to give my special thanks to Dr. Teng Zhang from the Centre for Autonomous Systems, University of Technology Sydney, who gave me special insight into IMU preintegration and the evaluation of the initialization's robustness.
Here we provide the mathematical functions used in this paper. The operator $(\cdot)^\wedge$ transforms a 3-dimensional vector into a skew-symmetric matrix:

$$\boldsymbol{\phi}^\wedge = \begin{bmatrix} 0 & -\phi_3 & \phi_2 \\ \phi_3 & 0 & -\phi_1 \\ -\phi_2 & \phi_1 & 0 \end{bmatrix},$$

for $\boldsymbol{\phi} = [\phi_1, \phi_2, \phi_3]^T \in \mathbb{R}^3$. The exponential mapping transforms a 3-dimensional vector into a rotation matrix (Rodrigues' formula):

$$\mathrm{Exp}(\boldsymbol{\phi}) = \mathbf{I} + \frac{\sin\|\boldsymbol{\phi}\|}{\|\boldsymbol{\phi}\|}\,\boldsymbol{\phi}^\wedge + \frac{1-\cos\|\boldsymbol{\phi}\|}{\|\boldsymbol{\phi}\|^2}\,(\boldsymbol{\phi}^\wedge)^2,$$

for $\boldsymbol{\phi} \in \mathbb{R}^3$. The logarithm mapping is its inverse:

$$\mathrm{Log}(\mathbf{R}) = \frac{\theta\,(\mathbf{R}-\mathbf{R}^T)^\vee}{2\sin\theta}, \qquad \theta = \arccos\!\left(\frac{\mathrm{tr}(\mathbf{R})-1}{2}\right),$$

for $\mathbf{R} \in SO(3)$, where $(\cdot)^\vee$ denotes the inverse of $(\cdot)^\wedge$. The right Jacobian of $SO(3)$:

$$\mathbf{J}_r(\boldsymbol{\phi}) = \mathbf{I} - \frac{1-\cos\|\boldsymbol{\phi}\|}{\|\boldsymbol{\phi}\|^2}\,\boldsymbol{\phi}^\wedge + \frac{\|\boldsymbol{\phi}\|-\sin\|\boldsymbol{\phi}\|}{\|\boldsymbol{\phi}\|^3}\,(\boldsymbol{\phi}^\wedge)^2,$$

for $\boldsymbol{\phi} \in \mathbb{R}^3$.
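The mappings above can be checked numerically. The following plain-Python sketch (names such as `hat`, `exp_so3` and `log_so3` are our own, not from the paper) implements the skew-symmetric operator, Rodrigues' formula and the logarithm map, and verifies the round trip Log(Exp(φ)) = φ:

```python
import math

def hat(v):
    """Map a 3-vector to its skew-symmetric matrix (the ^ operator)."""
    x, y, z = v
    return [[0.0,  -z,   y],
            [  z, 0.0,  -x],
            [ -y,   x, 0.0]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def mat_add(*Ms):
    return [[sum(M[i][j] for M in Ms) for j in range(3)] for i in range(3)]

def mat_scale(A, s):
    return [[A[i][j] * s for j in range(3)] for i in range(3)]

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def exp_so3(phi):
    """Rodrigues: Exp(phi) = I + (sin t / t) phi^ + ((1 - cos t) / t^2) (phi^)^2."""
    theta = math.sqrt(sum(c * c for c in phi))
    if theta < 1e-10:
        return mat_add(I3, hat(phi))  # first-order approximation near zero
    Phi = hat(phi)
    return mat_add(I3,
                   mat_scale(Phi, math.sin(theta) / theta),
                   mat_scale(mat_mul(Phi, Phi), (1.0 - math.cos(theta)) / theta**2))

def log_so3(R):
    """Logarithm map: theta = arccos((tr R - 1) / 2), axis from the antisymmetric part."""
    tr = R[0][0] + R[1][1] + R[2][2]
    theta = math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0)))
    s = 0.5 if theta < 1e-10 else theta / (2.0 * math.sin(theta))
    return [s * (R[2][1] - R[1][2]),
            s * (R[0][2] - R[2][0]),
            s * (R[1][0] - R[0][1])]

# Round-trip check: Log(Exp(phi)) should recover phi.
phi = [0.3, -0.2, 0.1]
back = log_so3(exp_so3(phi))
```

The same scaffolding extends directly to the right Jacobian, whose series differs from Rodrigues' formula only in its coefficients.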
Yi Liu conceived and designed the experiments and wrote the paper; Zhong Chen and Jianguo Liu contributed analysis tools; Hao Wang and Wenjuan Zhen analyzed the data; Yi Liu performed the experiments.
The authors declare no conflict of interest.