Speed up pointcloud output by colorizing depth images instead of depthifying color images #92
Conversation
Thank you for your comparisons. But we had another related issue before about "color to depth", and @skalldri argued that he prefers "depth to color".

EDIT: Here is the description of the differences on the MS Docs page:

Just wanted to mention this. Nevertheless, I am all for an option to let the user decide which transformation is chosen for point cloud generation. |
@RoseFlunder thanks for being such an expert on this driver! :) |
I am glad to help when I can. |
This seems like the wrong solution, for two reasons.

First, as RoseFlunder mentioned, I've previously articulated why using RGB-to-depth results in incorrect RGB point clouds. To re-iterate: we cannot know which RGB pixels are observed only by the RGB camera and not by the depth camera (or vice versa), so when colorizing the depth map with an RGB frame you will incorrectly colorize depth pixels that the RGB camera did not observe. This results in "color bleed" from objects observed by the RGB camera onto depth pixels not associated with that object. This effect is more pronounced with objects closer to the camera. The Azure Kinect SDK documentation suggests using depth-to-RGB for this exact reason.

@helenol, you say:
This isn't correct. This is why the depth_image_proc/register nodelet exists: you must re-project depth images into the RGB co-ordinate frame before running them through depth_image_proc/point_cloud_xyzrgb. You can even see a comment that implies this in the point_cloud_xyzrgb code. The original Kinect for Xbox One ROS driver does this as well. You also say:
I agree with your statement if we were talking about un-colorized point clouds, and indeed you can see that when publishing un-colorized point clouds they are published with the depth camera TF frame.

Fundamentally, the Azure Kinect cannot capture RGB point clouds. It captures sufficient data to reconstruct an RGB point cloud, but RGB point clouds are not a native data format of the Azure Kinect. You claim it is wrong to publish this reconstructed point cloud in the RGB frame, since the depth data comes from the depth frame. But conversely, you could also claim it's wrong to publish it in the depth frame, since the RGB data that colorized each pixel came from the RGB frame. In both cases, some of the data is being published in the "wrong" frame from the "wrong" perspective.

Since either publishing frame (depth or RGB) can be considered the "wrong" frame, we should instead consider the accuracy of re-projection into either frame. Re-projecting depth to RGB is easy, since each depth pixel represents a point in 3D space. Producing a new depth image based on these 3D points can be done with extremely high accuracy, taking into account occlusions which may occur in the new co-ordinate frame. Re-projecting RGB to depth is not easy: there is no Z-depth associated with the pixels, so the re-projection cannot account for new occlusions that might occur when observing the scene from a new perspective. It is therefore reasonable to publish this data from the perspective of the RGB camera, since it uses the highest-accuracy re-projection method to preserve the image quality.

Second, the point of opening this bug was to indicate that there is some significant inefficiency in the driver's RGB point cloud generation.

EDIT: removed incorrect reference to depth_image_proc overview |
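For concreteness, here is a minimal sketch of the two Sensor SDK re-projection directions being compared in this thread. The k4a_transformation_* and k4a_image_create calls are the real SDK API; the wrapper functions, error handling, and image lifetime management are illustrative only (callers would release the returned images with k4a_image_release).

```cpp
#include <k4a/k4a.h>

// Depth -> RGB: produce a depth image in the color camera geometry.
// Each depth pixel is a 3D point, so occlusions in the new viewpoint can be handled.
k4a_image_t depth_in_color_geometry(k4a_transformation_t transformation,
                                    const k4a_image_t depth_image,
                                    int color_width, int color_height)
{
  k4a_image_t transformed_depth = nullptr;
  k4a_image_create(K4A_IMAGE_FORMAT_DEPTH16, color_width, color_height,
                   color_width * (int)sizeof(uint16_t), &transformed_depth);
  k4a_transformation_depth_image_to_color_camera(transformation, depth_image,
                                                 transformed_depth);
  return transformed_depth;
}

// RGB -> depth: colorize the depth image. The SDK needs both images here, because
// color pixels carry no Z and must be looked up via the depth geometry.
k4a_image_t color_in_depth_geometry(k4a_transformation_t transformation,
                                    const k4a_image_t depth_image,
                                    const k4a_image_t color_image,
                                    int depth_width, int depth_height)
{
  k4a_image_t transformed_color = nullptr;
  k4a_image_create(K4A_IMAGE_FORMAT_COLOR_BGRA32, depth_width, depth_height,
                   depth_width * 4 * (int)sizeof(uint8_t), &transformed_color);
  k4a_transformation_color_image_to_depth_camera(transformation, depth_image,
                                                 color_image, transformed_color);
  return transformed_color;
}
```

The asymmetry in the two calls mirrors the argument above: the depth-to-color direction needs only the depth image, while the color-to-depth direction needs the depth image as well, because color pixels on their own carry no range information.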
@skalldri Thanks for the long explanation. I understand that the color of the point cloud pixels is not guaranteed to be correct and there will be color spill when using "color to depth". But when I mostly care about accurate positions (x, y, z) of the points and the color is just a nice-to-have visual feature, I could use "color to depth" anyway, or did I understand it wrong? Because the positions of the points are the same as in the non-RGB point cloud then, right?

Nevertheless, it would be fantastic if the transfer of the SDK pointcloud data into a PointCloud2 message could be done faster. If it could run at 30 FPS (<33 ms) with 1536P and "depth to color", it would be perfect. |
The points in the colorized cloud should have the same positional accuracy as the points in the un-colorized point cloud. That being said, if you don't actually need the color for your application, I highly recommend turning it off (color_enabled:=false, rgb_point_cloud:=false) since the color camera consumes more power, more CPU time, and more USB bandwidth. Using the colorized point cloud also restricts the field of view of the output.

The point of outputting a colorized point cloud is not just visual appeal. There are applications of the colorized point cloud that require high-accuracy color. Consider object scanning, where someone might be trying to produce a 3D model of an object, including color. This could be a static object or a person detected by the body tracking SDK. Accurately colorizing the point cloud is important in these scenarios.

Consider also a SLAM/mapping algorithm like RTAB-Map. RTAB-Map collects colorized point cloud data and stitches it into an accurate colorized mesh of the environment it's in. Not only is this useful for debugging the map data, but it could be used to produce hypothesized locations for visual SLAM. The SLAM algorithm could use this colorized mesh by rendering it into a 2D camera capture from various perspectives, and compare these renders with the actual camera input to refine its position estimate. Without accurate color capture, this wouldn't be possible. |
Here is a concrete example of what I've been describing here. I took a recording where I put my hand close to the camera, since this illustrates the problem well. I then ran the recording through the current tip of each branch being compared below.

Recording settings:
Using the same command line for both branches: the tip of the current driver produces the first image, and the tip of this PR branch produces the second.

Please notice that my fingertips are drawn across the wall significantly more in the second image. Please also note that there appears to be a bug in this branch that has swapped the red and blue color channels, producing a very off-color image (notice that the red and blue objects on the left of the image have swapped colors).

Now, if I launch with
We get this image (last frame of recording):

We see that now the fingertips are not drawn across the wall, which is correct behavior. However, we do see that the red and blue color channels are still swapped. |
Hey @skalldri! Great to have your input in the discussion! Thank you for the thorough analysis. I think we absolutely agree on some things: if you want high-quality color data, it is better to depth-ify the color image when you output pointclouds. And there are definitely applications where that is more important, like adding texture to existing structure. Great! This is left as an option to the user. Where I disagree is that this is a better default for robotics applications when outputting the pointcloud.
Absolutely, and if you have applications where you care more about the quality of your RGB data when using the pointcloud topic than the quality of your depth data, then you would not use this.
Absolutely again, if you care about the quality of the color the most when using a pointcloud.
I would argue getting the 3D structure correct is more important during the 3D reconstruction step than the texture for this scenario, but again, your mileage may vary. You should create the 3D model with the actual depth data and texture it afterwards with the full color image if you want the highest-quality color possible, as the color images tend to be an order of magnitude higher resolution than the depth data.
Sure! RTAB-Map is doing RGB-D SLAM on the depth and RGB images. I understand the importance of co-registering them! But it is not using the pointcloud data, it is using the reprojected images; it is not fusing the pointcloud data directly. I would argue this should have no effect on what the default pointcloud output is.

Bear with me here on why not rescaling the depth image into the RGB image size and coordinate frame matters for 3D reconstruction applications (which is what I assume most depth sensors are used for in robotics).

Great, let's take it one step further. Let's say that instead of using KinectFusion's very naive weighting when fusing depth rays, you identify a sensor model of your sensor, following Nguyen 2012. This models the error behavior of the depth rays depending on position in the depth image, incidence angle with the surface, and other factors. This model can then be used to fuse depth data together more accurately.

I hope this explanation is clear! I am happy to clarify any of this. I hope you agree that RGB in Depth is a better default for robotics and 3D reconstruction applications. I completely agree there are cases where RGB in RGB makes more sense, which is why it is left as an option to the user. |
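To make the sensor-model idea concrete, here is a hedged sketch of a depth- and incidence-angle-dependent fusion weight in the spirit of Nguyen et al. 2012. The functional form and coefficients are illustrative placeholders, not values taken from this driver, from voxblox, or calibrated for the Azure Kinect.

```cpp
#include <cmath>

// Illustrative axial noise model in the spirit of Nguyen et al. 2012: noise grows
// roughly quadratically with depth and blows up at grazing incidence angles.
// The coefficients below are placeholders, not calibrated values.
double axial_noise_sigma(double depth_m, double incidence_angle_rad)
{
  const double kHalfPi = 1.5707963267948966;
  const double kBase = 0.0012;       // [m] noise floor near the optimal range
  const double kQuadratic = 0.0019;  // [1/m] growth with distance from ~0.4 m
  const double kAngular = 0.0001;    // angular term scale (placeholder)
  const double grazing = kHalfPi - incidence_angle_rad;
  return kBase + kQuadratic * (depth_m - 0.4) * (depth_m - 0.4) +
         kAngular * incidence_angle_rad * incidence_angle_rad /
             (std::sqrt(depth_m) * grazing * grazing);
}

// A TSDF integrator could then down-weight noisy rays (inverse-variance weighting)
// instead of giving every observation the same KinectFusion-style constant weight.
double fusion_weight(double depth_m, double incidence_angle_rad)
{
  const double sigma = axial_noise_sigma(depth_m, incidence_angle_rad);
  return 1.0 / (sigma * sigma);
}
```

Note that such a model is defined in terms of the depth camera's own pixels and rays, which is part of the argument above: once the cloud has been re-projected and up-sampled into the RGB frame, the per-pixel error model no longer matches the data being fused.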
@RoseFlunder Thanks for your help with this! :) Sounds like an awesome project, I'd love to see any results that come out! |
@helenol I observed that publishing an RGB point cloud at 30 Hz with NFOV_UNBINNED consumes about 400 MB/s of bandwidth (measured with rostopic bw). But well, that's a bit off-topic. |
@RoseFlunder PCL/ROS pointclouds are definitely not a compact representation by any stretch of the imagination. Passing depth images + RGB encodings as you've done is definitely a more bandwidth-efficient method. If the final pointcloud is mostly for visualization, I would just decimate it as much as you can get away with before sending it to rviz. |
@helenol
If I interpret the loop correctly, we also have xyz xyz xyz xyz xyz xyz in the SDK's image buffer,
so in the first iteration the indices for x, y, z are 0, 1, 2. I found the implementation in the SDK, which uses SSE instructions.
But maybe there is a better way. I don't know if precomputing the xy_tables ourselves and using this method would yield better performance. |
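To make the buffer layout being discussed concrete, here is a rough sketch of walking the SDK point-cloud image (interleaved int16 x, y, z per pixel, in millimetres, as produced by k4a_transformation_depth_image_to_point_cloud) and copying it into a PointCloud2 with the iterator API. The helper name is made up and this is not the driver's actual implementation; it is only meant to show the interleaved layout.

```cpp
#include <limits>
#include <k4a/k4a.h>
#include <sensor_msgs/PointCloud2.h>
#include <sensor_msgs/point_cloud2_iterator.h>

// Sketch: copy the SDK point-cloud image (interleaved int16 X, Y, Z per pixel,
// in millimetres) into an XYZ PointCloud2. Invalid pixels have Z == 0.
void fill_cloud_from_k4a(const k4a_image_t point_cloud_image, sensor_msgs::PointCloud2& cloud)
{
  const int width = k4a_image_get_width_pixels(point_cloud_image);
  const int height = k4a_image_get_height_pixels(point_cloud_image);
  const int16_t* buffer =
      reinterpret_cast<const int16_t*>(k4a_image_get_buffer(point_cloud_image));

  sensor_msgs::PointCloud2Modifier modifier(cloud);
  modifier.setPointCloud2FieldsByString(1, "xyz");
  modifier.resize(static_cast<size_t>(width) * height);

  sensor_msgs::PointCloud2Iterator<float> iter_x(cloud, "x");
  sensor_msgs::PointCloud2Iterator<float> iter_y(cloud, "y");
  sensor_msgs::PointCloud2Iterator<float> iter_z(cloud, "z");

  for (int i = 0; i < width * height; ++i, ++iter_x, ++iter_y, ++iter_z)
  {
    // In the first iteration the indices for x, y, z are 0, 1, 2; then 3, 4, 5; and so on.
    const int16_t x_mm = buffer[3 * i + 0];
    const int16_t y_mm = buffer[3 * i + 1];
    const int16_t z_mm = buffer[3 * i + 2];
    if (z_mm == 0)
    {
      *iter_x = *iter_y = *iter_z = std::numeric_limits<float>::quiet_NaN();
      continue;
    }
    *iter_x = x_mm / 1000.0f;
    *iter_y = y_mm / 1000.0f;
    *iter_z = z_mm / 1000.0f;
  }
}
```

Whether a loop like this is faster than the existing code would have to be measured; the SDK's own SSE path already handles the projection itself.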
Thank you for your detailed explanation. However, I still disagree that this should be the new default mode for this driver.

You make an excellent point about up-scaling the depth data to the RGB resolution. This is adding more detail to the depth sensor output than actually exists, which can introduce error. It's also far more computationally and memory expensive than it needs to be. It seems very reasonable that the RGB point cloud should always be limited to the resolution of the depth camera, regardless of RGB camera resolution. That is absolutely a change that should be made, as it will help with #56.

However, the situations you're describing (KinectFusion-related) are not what PointCloud2 messages are intended to be used for. PointCloud2 is a universal representation of the world in 3D that can be produced by a wide range of sensors, with optional color. These representations can come from a projector-based depth camera like a Kinect, or a stereo camera, or from a 3D LIDAR like a Velodyne. There are depth sensors that haven't been invented yet which will (likely) one day emit PointCloud2 messages. Each of these sensors has a different underlying data format that is "more native" to the sensor hardware: for Kinect-like cameras, we have depth images. For stereo cameras, we have disparity images. 3D laser scanners like Velodyne can often output PointCloud2 directly.

REP-118 describes the rationale behind using Image messages to transport depth information from Kinect-style cameras, and it touches on some of the topics you bring up here. It also mentions that PointCloud2 messages are specifically intended to be more generic (emphasis mine):
What I'm trying to get at here is that software which consumes a PointCloud2 message needs to be more flexible than the types of restrictions you're describing. The problems you've raised here are all related to the particular operation of the Azure Kinect: it emits rays of IR light, which bounce off objects and return to a TOF camera. The camera produces a 2D image, where each pixel value in that image represents a depth in millimeters. To produce a point cloud from this data, we "raycast" out from each depth pixel using the intrinsic calibration data from the camera.

My point is that when we emit a PointCloud2, we intentionally discard all this extra information. We are only outputting 3D points in the world, with no assumptions about how that data was collected. If software needs to know these details (like KinectFusion does), then it should not be consuming a PointCloud2 message: it should be consuming the native sensor format directly. Software that consumes a PointCloud2 is intentionally not given this information, to improve compatibility between sensors. Such software needs to be capable of accepting data from multiple types of sensors, and cannot make assumptions about whether or not the sensor can provide intrinsics about the lens distortion used to produce the PointCloud2: that data may not exist. The data may not even have been captured through a traditional camera-lens system, as in the case of Velodyne scans.

I should not have used RTAB-Map as an example in my previous post, since it is much more dependent on the underlying camera hardware than I'm trying to convey here. It can accept PointCloud2 messages, but is designed to accept RGB and depth Images, more like KinFu. Instead, I should have used Google Cartographer as an example, since it only consumes PointCloud2 messages natively. Cartographer is a SLAM algorithm that uses the Ceres scan matcher to operate on raw 3D points. Because it only consumes PointCloud2 messages, and makes no assumptions about the underlying sensor hardware, it can consume data from any type of depth sensor. Cartographer doesn't use the colorized point cloud data internally, but will preserve the color data so that colorized point clouds can be produced from the output. This proposed change would break Cartographer scan colorization, since the PointCloud2s would contain large RGB artifacts.

I hope this clears up why I'm so resistant to this change. PointCloud2 messages are not intended to convey sensor-specific data to software, since that hinders compatibility between algorithms and the underlying sensors. This change also regresses the quality of the RGB data in the |
@skalldri I feel like you're not understanding my point, though. Google's Cartographer does raycasting into the world when creating its maps. It casts rays from the sensor frame (in your output, the RGB camera frame) into its coordinate system. It assumes that your sensor is a LIDAR, which has different error characteristics than a depth camera, so this is already somewhat incorrect. However, you are still tracing rays from the camera center of the RGB camera (which Cartographer assumes to be a LIDAR) to the depth points, which traces a different physical path for the ray than the physical sensor did. And again, the same restrictions about rescaling apply.

I understand that your way makes the color look nice, and there are many cases where this makes more sense, but for most robotics applications the structure is far more important than the color. To give some context, I want to use the Azure Kinect with VIO (rovio, vins mono) and voxblox for dense reconstruction, but it is incorrect to feed the RGB-in-RGB pointcloud into a dense reconstruction framework. I think this would allow the Azure Kinect to be much more useful for roboticists in a wide variety of applications. |
In summary: if there is a need for this mode we should have it. I'd like it to be a non-default option that users can enable, and I would like to see some documentation updated to explain when to use it.
That's why I opened #56 when it was reported as a problem. My intention was to go back and optimize the RGB point cloud rendering function, since I suspect the use of PointCloud iterator is slowing it down.
You're probably right. I don't have a strong enough maths background to properly understand how many of these tracking / reconstruction algorithms work, so I'm absolutely willing to acknowledge that this is a problem for those algorithms. Having a mode in the driver to help these algorithms work better sounds fine to me. However, I don't want it to be the default mode if it comes at the expense of significant colorization artifacts in the output. If you have some literature on the topic of why this is such a serious problem, I would love to spend some time reading it. Currently, beyond accidentally raycasting out cells in a voxel grid when it wasn't supposed to, I'm not able to see why this would degrade the quality of SLAM.
What I'm arguing is that they are equally important. The user has explicitly indicated they want color data in their point cloud. The expectation is that the node will provide artifact-free output data that represents the real world. We know there are significant colorization artifacts when using RGB to depth re-projection, and I feel that we should avoid introducing artifacts into data we are emitting. You're right: re-projecting the depth to the RGB frame is also an artifact, but it's an artifact that preserves the accuracy of the output data while introducing an artifact into the way the data was captured, which I feel is a reasonable tradeoff. We should make it clear in the documentation that there is a tradeoff between colorization accuracy and raycasting accuracy between these two modes, just like we document that enabling color point clouds will reduce the FOV of the output. I'm curious: in your example of collecting a dense reconstruction, how are you planning on removing the colorization artifacts once they have been introduced to your reconstruction? Detecting them in post processing sounds very difficult. Is your use case such that having colorization artifacts in the output is acceptable? If so, why bother collecting color data?
Aesthetics is not why I'm pushing back on this: I want the output of the node to best reflect what the sensor captured. The colorization artifacts introduced through the RGB to depth re-projection do not accurately reflect the scene captured by the sensor. |
@skalldri Great, thanks a lot for seeing my point of view! I really appreciate it. I can agree to keep the current state as the default as long as the option exists. But I think I've been super hasty in responding without thinking about the problem properly, and I sincerely apologize, since I think I've wasted a lot of time for both of us.
We actually can know! Given that we have two fully-calibrated cameras and undistorted images from both, the left being RGB and the right being a depth camera, we can associate every point in the depth camera image with a single point in the RGB image.

So why do we get the type of "halo" or color bleed that's in the photo below when we colorize the depth cloud? Because there are actually two ways to align an RGB image and a depth image. The first is the way I just described, which involves casting a ray through the depth image onto the structure and back for every pixel that has a valid depth. The second is to assume the world is a far-away plane and warp one image onto the other using only the calibration. Without knowing the depth of the pixels, this is the best you can do. And it's pretty alright, except that objects closer to the camera and near depth discontinuities get assigned to the wrong pixel, because that's where the assumption that the world is a far-away plane is most violated. I'm not 100% sure that this is what we're doing with the SDK calls, but from the results, this is my best guess for what it appears to be. That, or it's a calibration issue.

Re: what do I do with unmatched color when doing 3D reconstruction? I mostly just use the color for visualization, so I don't worry too much, I guess. ;) The way we're fusing color in voxblox over multiple scans is anyway super naive (weighted average!), and it's not uncommon to have weird effects from changing exposure, etc. I totally agree that how much this matters is very application-dependent. But yeah, let's fix the root cause!

Re: affecting SLAM performance... It's definitely worst for actually having accurate free-space vs. occupied information. Depending on the camera configuration, it may or may not have any measurable effect on SLAM performance. If the projective centers of the cameras are very close relative to the distance to obstacles, you might not notice any difference. |
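A sketch of the per-pixel association described above (un-project each valid depth pixel to 3D, then project it into the color camera) using the SDK's calibration helpers. The wrapper function is hypothetical; distortion edge cases, occlusion handling, and SDK result codes are omitted for brevity.

```cpp
#include <cstdint>
#include <k4a/k4a.h>

// Hypothetical helper: find the color pixel observed along the same ray as a given
// depth pixel. Returns false when the depth is invalid or the point falls outside
// the color image.
bool depth_pixel_to_color_pixel(const k4a_calibration_t& calibration,
                                float depth_u, float depth_v, uint16_t depth_mm,
                                k4a_float2_t* color_pixel)
{
  if (depth_mm == 0)
  {
    return false;  // no valid depth measurement for this pixel
  }

  // 1. Un-project the depth pixel to a 3D point in the depth camera frame.
  k4a_float2_t depth_point;
  depth_point.xy.x = depth_u;
  depth_point.xy.y = depth_v;
  k4a_float3_t point_3d;
  int valid = 0;
  k4a_calibration_2d_to_3d(&calibration, &depth_point, static_cast<float>(depth_mm),
                           K4A_CALIBRATION_TYPE_DEPTH, K4A_CALIBRATION_TYPE_DEPTH,
                           &point_3d, &valid);
  if (!valid)
  {
    return false;
  }

  // 2. Project that 3D point into the color camera (extrinsics applied internally).
  k4a_calibration_3d_to_2d(&calibration, &point_3d,
                           K4A_CALIBRATION_TYPE_DEPTH, K4A_CALIBRATION_TYPE_COLOR,
                           color_pixel, &valid);
  return valid != 0;
}
```

This is the "first way" above: it needs a valid depth for every pixel it associates, which is exactly why going the other direction (RGB to depth without per-pixel occlusion checks) can bleed color across depth discontinuities.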
@helenol |
@skalldri @RoseFlunder So you can do it, but doing it properly is probably going to be expensive. I'll think of ways to maybe do it cheaper... But OK, yes, it's really not trivial; I totally agree with keeping the RGB depth output as the default.

Re: pointclouds, I'm also totally wrong here; the SDK also stores xyz xyz xyz in the image. The problem here is more about packing, actually: PCL packs 4 floats for position, while the SDK packs 3 int16s. We could change the SDK to output 4 padded floats and then do some memcpy hacks... Not so sure how much faster it would be, though. |
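As a rough sketch of the padded-float idea: if positions arrived as x, y, z, pad (four float32s per point, matching the 16-byte xyz layout that PointCloud2Modifier produces), the per-point loop could collapse into a single memcpy. This is hypothetical, since the SDK does not currently output that format, and the function below is illustrative only.

```cpp
#include <cstring>
#include <vector>
#include <sensor_msgs/PointCloud2.h>
#include <sensor_msgs/point_cloud2_iterator.h>

// Hypothetical: positions already packed as x, y, z, pad (four float32s = 16 bytes
// per point). If that matches the PointCloud2 "xyz" layout, the whole cloud can be
// copied with one memcpy instead of a per-point loop.
void fill_cloud_from_padded_floats(const std::vector<float>& padded_xyz,  // 4 floats per point
                                   sensor_msgs::PointCloud2& cloud)
{
  const size_t num_points = padded_xyz.size() / 4;

  sensor_msgs::PointCloud2Modifier modifier(cloud);
  modifier.setPointCloud2FieldsByString(1, "xyz");
  modifier.resize(num_points);

  // Only valid if the message layout really is 16 bytes per point (x, y, z, padding).
  if (cloud.point_step != 4 * sizeof(float))
  {
    return;  // a real implementation would fall back to a per-point copy
  }
  std::memcpy(cloud.data.data(), padded_xyz.data(), num_points * cloud.point_step);
}
```

Whether the memcpy actually wins would depend on the SDK change itself and on how much time the current per-point loop really costs.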
@helenol
https://docs.microsoft.com/en-us/azure/kinect-dk/use-image-transformation
The example picture there shows it very well. For "color to depth" they state:
But I don't really know why they are only doing it in one direction. Maybe because it's easier for depth to color. I am no expert here. EDIT:
@helenol thanks for the detailed explanations! The drawings of what you're describing are very helpful, and now I've got some new reading material :) I think we're on the same page now. Your concern is that, when using Depth->RGB, we are pretending that the "yellow volume" is empty, when in reality we have no information about that area since the depth camera did not observe it. It might contain geometry, it might not. We can't know for sure, but we are emitting data about that region of space when in fact we have no idea what's there. SLAM algorithms might expect to find some geometry there while doing scan matching, and could get confused about their position, or might raytrace through some existing geometry, due to this inaccuracy. That makes complete sense to me, and we should definitely have the ability to change to RGB->Depth mode to resolve that issue.
Over the weekend I had been thinking about this exact idea, but figured it would be too expensive to do it in real time. It's been suggested to me that I should look into screen-space reflection algorithms for potential ways of solving this in an efficient way. That's a bit beyond my expertise, but maybe it makes sense to you? |
To pick this back up again... I think I have this PR in a state that we're all OK with, and I think it's about time I merged it. ;)

@skalldri I'm not familiar with that class of algorithms! It definitely sounds interesting and worth investigating. I think there are quite a few things we could do, some even in real time... If you choose to investigate it further, I'm excited to see what you come up with! :)

And as one final note, a good example of why I don't personally care too much about color accuracy... Here's the same scene with RGB in RGB: 🤣 🤣 🤣 I think both are equally horrible. I think this is why texturing over the structure is the right answer for anything you actually care about, in any case. |
Fixes
Partially addresses #56
Hopefully makes @ooeygui and @RoseFlunder happier with the default behavior of the driver. :)
Description of the changes:
The default behavior for the RGB colored pointcloud was to take the depth image, project it into the RGB coordinate frame (and image size), and publish the pointcloud in the RGB TF frame. While this makes sense for some applications (e.g., texturing meshes) as it gives you the biggest possible pointcloud, it also, well, gives you a large pointcloud that takes a while to compute.
I would argue that in the vast majority of cases, the correct thing to do is to colorize the depth cloud instead of depth-ifying the color image. This generally leads to a smaller pointcloud, and is also technically correct from a sensor-integration point of view.
You can imagine that if you project these pointclouds into a volumetric map, having the sensor TF frame be the RGB frame rather than the depth frame is incorrect, as the measurements are actually made from the depth frame. Similarly, projecting the depth into the RGB frame creates overconfidence in the depth measurements, since interpolated depth is then treated as real sensor measurements.
This PR adds a point_cloud_in_depth_frame parameter, which defaults to true (the new behavior). This parameter makes the colorized pointcloud be published in the depth coordinate frame rather than the RGB image frame.

Before submitting a Pull Request:
I tested changes on:
Timing analysis
Attempting to address exactly the issue in #56, which is that the pointcloud output cannot run at 30 Hz if the RGB color is used.
Settings: WFOV_2x2BINNED, 1536P, point_cloud = true, rgb_point_cloud = true, point_cloud_in_depth_frame = false (old behavior, pre-PR):
Settings: WFOV_2x2BINNED, 1536P, point_cloud = true, rgb_point_cloud = true, point_cloud_in_depth_frame = true (new behavior):
The table below is a more comprehensive analysis of the time it takes to build and publish a pointcloud to ROS.
Depth in Depth: point_cloud = true, rgb_point_cloud = false, point_cloud_in_depth_frame = false
RGB in Depth: point_cloud = true, rgb_point_cloud = true, point_cloud_in_depth_frame = true (new)
RGB in RGB: point_cloud = true, rgb_point_cloud = true, point_cloud_in_depth_frame = false (old)
As can be seen, the new setting is about 2x faster in all cases except 720P with WFOV Unbinned, since there the RGB image actually has a smaller resolution than the depth image, so projecting into the RGB space creates a smaller pointcloud.
Screenshots
Depth in Depth
RGB in Depth
(small point size)
(large point size)
RGB in RGB