This repository has been archived by the owner on Jul 26, 2024. It is now read-only.

Speed up pointcloud output by colorizing depth images instead of depthifying color images #92

Merged
merged 7 commits into microsoft:melodic on Oct 25, 2019

Conversation

helenol
Contributor

@helenol helenol commented Oct 18, 2019

Fixes

Partially addresses #56

Hopefully makes @ooeygui and @RoseFlunder happier with the default behavior of the driver. :)

Description of the changes:

The default behavior for the RGB colored pointcloud was to take the depth image and project it into the RGB coordinate frame (and image size) and publish the pointcloud in the RGB TF frame. While this makes sense for some applications (e.g., texturing meshes), as it gives you the biggest possible pointcloud, it also, well, gives you a large pointcloud that takes a while to compute.
I would argue that in the vast majority of cases, the correct thing to do is to colorize the depth image instead of depth-ifying the color image. This generally leads to a smaller pointcloud, and it is also technically correct from a sensor-integration point of view.

You can imagine if you project these pointclouds into a volumetric map, having the sensor TF frame be the RGB frame rather than the depth is incorrect, as the measurements are actually made from the depth frame. Similarly, projecting the depth into the RGB frame creates overconfidence in the depth measurements, since interpolated depth is then treated as real sensor measurements.

  • Add a point_cloud_in_depth_frame parameter, which defaults to true (the new behavior). This parameter makes the colorized pointcloud be published in the depth coordinate frame rather than the RGB image frame (see the launch example below).
  • This speeds up the pointcloud publishing significantly, by publishing a smaller pointcloud (see analysis below).
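
For example, with the package's driver.launch entry point, the new behavior can be selected explicitly with:

roslaunch azure_kinect_ros_driver driver.launch rgb_point_cloud:=true point_cloud_in_depth_frame:=true

and the old behavior can be restored with point_cloud_in_depth_frame:=false.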

Before submitting a Pull Request:

I tested changes on:

  • Windows
  • Linux

Timing analysis

Attempting to address exactly the issue in #56, which is that the pointcloud output cannot run at 30 Hz if the RGB color is used.

Settings: WFOV_2x2BINNED, 1536P, point_cloud = true, rgb_point_cloud = true, point_cloud_in_depth_frame = false (old behavior, pre-PR):

rostopic hz /points2
subscribed to [/points2]
average rate: 9.198
	min: 0.084s max: 0.118s std dev: 0.01112s window: 7

Settings: WFOV_2x2BINNED, 1536P, point_cloud = true, rgb_point_cloud = true, point_cloud_in_depth_frame = true (new behavior):

rostopic hz /points2
subscribed to [/points2]
average rate: 31.655
	min: 0.007s max: 0.034s std dev: 0.00590s window: 26

The table below is a more comprehensive analysis of the time it takes to build and publish a pointcloud to ROS.
Depth in Depth: point_cloud = true, rgb_point_cloud = false, point_cloud_in_depth_frame = false
RGB in Depth: point_cloud = true, rgb_point_cloud = true, point_cloud_in_depth_frame = true (new)
RGB in RGB: point_cloud = true, rgb_point_cloud = true, point_cloud_in_depth_frame = false (old)

Depth Mode          WFOV Unbinned         WFOV 2x2 Binned
                    720P       1536P      720P       1536P
Depth in Depth      10 ms      10 ms      10 ms      10 ms
RGB in Depth        33 ms      40 ms      9 ms       10 ms
RGB in RGB          21 ms      75 ms      21 ms      70 ms

As can be seen, the new setting is about 2x faster in all cases except 720P with WFOV Unbinned, since in that case the RGB image actually has a smaller resolution than the depth image and projecting into RGB space creates a smaller pointcloud.

Screenshots

Depth in Depth

Screenshot from 2019-10-18 15-53-56

RGB in Depth

(small point size)

Screenshot from 2019-10-18 15-54-29

(large point size)

Screenshot from 2019-10-18 15-55-46

RGB in RGB

Screenshot from 2019-10-18 15-55-04

@helenol helenol requested a review from ooeygui October 18, 2019 17:08
@RoseFlunder
Contributor

RoseFlunder commented Oct 18, 2019

Thank you for your comparisons.
I switched to the "color to depth" transformation myself some time ago for performance reasons.

But we had another related issue before about "color to depth", and @skalldri argued that he prefers "depth to color".
Here is his opinion:
#63 (comment)

EDIT: Here is the description of the differences on the MS Docs page:
https://docs.microsoft.com/en-us/azure/kinect-dk/use-image-transformation
Their recommendation is "depth to color" too.

Just wanted to mention this. Nevertheless I am all in for an option to let the user decide which transformation is chosen for point cloud generation.

@helenol
Contributor Author

helenol commented Oct 18, 2019

@RoseFlunder thanks for being such an expert on this driver! :)
Interesting discussion there!
But I agree letting users decide is the better way, and I am pretty sure that rgb in depth is the standard way (see depth_image_proc: https://github.com/ros-perception/image_pipeline/blob/melodic/depth_image_proc/src/nodelets/point_cloud_xyzrgb.cpp and other depthcam drivers) and makes more sense from a sensor fusion point of view as well.

@RoseFlunder
Contributor

I am glad to help when I can.
I am no expert in robotics or computer vision, but at the moment I am trying to learn what I can about working with the Azure Kinect devices, because I am going to use them in my master's thesis project.
The department at my university has a multi-Kinect system in a ROS environment and just switched from the previous Kinect v2 to the Azure Kinects.
I will be using them to implement a body-tracking data fusion algorithm with some constraints like "works with at least 4 persons" and "persons can be partially occluded for some of the cameras", etc.
So my starting point was this package.

@skalldri
Contributor

skalldri commented Oct 18, 2019

This seems like the wrong solution, for two reasons.

First, as RoseFlunder mentioned, I've previously articulated why using RGB-to-depth results in incorrect RGB point clouds. To re-iterate: we cannot know which RGB pixels are observed only by the RGB camera and not by the depth camera (or vice versa), so when colorizing the depth map with an RGB frame you will incorrectly colorize depth pixels that the RGB camera did not observe. This results in "color bleed" from objects observed by the RGB camera onto depth pixels not associated with that object. This effect is more pronounced with objects closer to the camera. The Azure Kinect SDK documentation suggests using depth-to-RGB for this exact reason.

@helenol, you say:

But I agree letting users decide is the better way, and I am pretty sure that rgb in depth is the standard way (see depth_image_proc: https://github.com/ros-perception/image_pipeline/blob/melodic/depth_image_proc/src/nodelets/point_cloud_xyzrgb.cpp and other depthcam drivers) and makes more sense from a sensor fusion point of view as well.

This isn't correct. This is why the depth_image_proc/register nodelet exists: you must re-project depth images into the RGB co-ordinate frame before running them through depth_image_proc/point_cloud_xyzrgb.

You can even see a comment that implies this in the point_cloud_xyzrgb code. The original Kinect for Xbox One ROS Driver does this as well.

You also say:

You can imagine if you project these pointclouds into a volumetric map, having the sensor TF frame be the RGB frame rather than the depth is incorrect, as the measurements are actually made from the depth frame.

I agree with your statement if we were talking about un-colorized point clouds, and indeed you can see that when publishing un-colorized point clouds they are published with the depth camera TF frame.

Fundamentally, the Azure Kinect cannot capture RGB point clouds. It captures sufficient data that allows us to reconstruct an RGB point cloud, but RGB point clouds are not a native data format of the Azure Kinect. You claim it is wrong to publish this reconstructed point cloud in the RGB frame, since the depth data comes from the depth frame. But conversely, you could also claim it's wrong to publish it in the depth frame, since the RGB data that colorized each pixel came from the RGB frame. In both cases, some of the data is being published in the "wrong" frame from the "wrong" perspective.

Since either publishing frame (depth or RGB) can be considered the "wrong" frame, we should instead consider the accuracy of re-projection into either frame. Re-projecting depth to RGB is easy, since each depth pixel represents a point in 3D space. Producing a new depth image based on these 3D points can be done with extremely high accuracy, taking into account occlusions which may occur in the new co-ordinate frame. Re-projecting RGB to depth is not easy: there is no Z-depth associated with the pixels, so the re-projection cannot account for new occlusions that might occur when observing the scene from a new perspective.

It is therefore reasonable to publish this data from the perspective of the RGB camera, since it uses the highest accuracy re-projection method to preserve the image quality.
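
To make the re-projection argument concrete, here is a rough, untested sketch of the depth-to-RGB direction with a z-buffer (plain pinhole models, hypothetical Intrinsics/R/t stand-ins for the calibration data, distortion ignored; this is not the SDK's actual implementation):

// Rough sketch: re-project a depth image into the RGB camera's frame, keeping
// only the nearest depth per target pixel (a z-buffer) so occlusions in the new
// viewpoint are handled. Intrinsics, R and t are hypothetical stand-ins.
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

struct Intrinsics { float fx, fy, cx, cy; int width, height; };

std::vector<float> DepthToRgbFrame(const std::vector<uint16_t>& depth_mm,
                                   const Intrinsics& depth_K, const Intrinsics& rgb_K,
                                   const float R[9], const float t[3])  // depth -> RGB extrinsics
{
  std::vector<float> out(rgb_K.width * rgb_K.height,
                         std::numeric_limits<float>::infinity());  // z-buffer, meters
  for (int v = 0; v < depth_K.height; ++v)
  {
    for (int u = 0; u < depth_K.width; ++u)
    {
      const float d = depth_mm[v * depth_K.width + u] * 0.001f;  // mm -> m
      if (d <= 0.0f) continue;
      // Back-project the depth pixel to a 3D point in the depth camera frame.
      const float X = (u - depth_K.cx) / depth_K.fx * d;
      const float Y = (v - depth_K.cy) / depth_K.fy * d;
      // Transform the point into the RGB camera frame.
      const float Xc = R[0] * X + R[1] * Y + R[2] * d + t[0];
      const float Yc = R[3] * X + R[4] * Y + R[5] * d + t[1];
      const float Zc = R[6] * X + R[7] * Y + R[8] * d + t[2];
      if (Zc <= 0.0f) continue;
      // Project into the RGB image plane and keep the nearest point per pixel.
      const int uc = static_cast<int>(std::lround(rgb_K.fx * Xc / Zc + rgb_K.cx));
      const int vc = static_cast<int>(std::lround(rgb_K.fy * Yc / Zc + rgb_K.cy));
      if (uc < 0 || uc >= rgb_K.width || vc < 0 || vc >= rgb_K.height) continue;
      float& zbuf = out[vc * rgb_K.width + uc];
      if (Zc < zbuf) zbuf = Zc;
    }
  }
  return out;
}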

Second, the point of opening this bug was to indicate that there is some significant inefficiency in the getRgbPointCloud() function. Publishing a point cloud with fewer points does not change the fact that this function should have no problem rendering an RGB point cloud in significantly less than 30 ms on pretty much any machine. RoseFlunder's previous investigations into timing showed that the bulk of that time was not spent in Azure Kinect SDK functions, so the source of the slowdown must lie in this driver.

EDIT: removed incorrect reference to depth_image_proc overview

@RoseFlunder
Contributor

RoseFlunder commented Oct 18, 2019

@skalldri Thanks for the long explanation. I understand that the color of the point cloud pixels is not guaranteed to be correct and there will be color spill when using "color to depth", but if I mostly care about accurate positions (x, y, z) of the points and the color is just a nice-to-have, visually pleasing feature, I could still use it, or did I understand that wrong? Because the positions of the points are then the same as in the non-RGB point cloud, right?

Nevertheless, it would be fantastic if transferring the SDK pointcloud data into a PointCloud2 message could be made faster. If it ran at 30 FPS (<33 ms) with 1536P and "depth to color", it would be perfect.

@skalldri
Contributor

The points in the colorized cloud should have the same positional accuracy as the points in the un-colorized point cloud. That being said, if you don't actually need the color for your application, I highly recommend turning it off (color_enabled:=false, rgb_point_cloud:=false) since the color camera consumes more power, more CPU time, and more USB bandwidth. Using the colorized point cloud also restricts the /points2 output to the overlap between the depth and color camera FOVs, which is often significantly less than the FOV of the depth camera alone.
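
For example (hypothetical command line, mirroring the driver.launch invocations used elsewhere in this thread):

roslaunch azure_kinect_ros_driver driver.launch color_enabled:=false rgb_point_cloud:=false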

The point of outputting a colorized point cloud is not for visual appeal. There are applications of the colorized point cloud that require high accuracy color.

Consider object scanning, where someone might be trying to produce a 3D model of an object, including color. This could be a static object or a person detected by the body tracking SDK. Accurately colorizing the point cloud is important in these scenarios.

Consider also a SLAM/Mapping algorithm like RTAB-Map. RTAB-Map collects colorized point cloud data and stitches it into an accurate colorized mesh of the environment it's in. Not only is this useful for debugging the map data, but this could be used to produce hypothesized locations for visual SLAM. The SLAM algorithm could use this colorized mesh by rendering it into a 2D camera capture from various perspectives, and compare these renders with the actual camera input to refine its position estimate. Without accurate color capture, this wouldn't be possible.

@skalldri
Contributor

Here is a concrete example of what I've been describing here.

I took a recording where I put my hand close to the camera, since this illustrates the problem well. I then ran the recording through the current tip of melodic and the tip of feature/pointcloud_speedup.

Recording settings:

> k4arecorder --record-length 10 --color-mode 1536p --depth-mode WFOV_UNBINNED --rate 5 color_spill.mkv

Using the command line

roslaunch azure_kinect_ros_driver driver.launch recording_file:=/home/salldritt/color_spill.mkv

the tip of melodic produces (last frame of recording):

no_color_spill

Using the same command line, the tip of features/pointcloud_speedup produces (last frame of recording):

color_spill

Please notice that my fingertips are drawn across the wall significantly more in this image. Please also note that there appears to be a bug in this branch that has swapped the red and blue color channels, producing a very off-color image (notice the red and blue objects on the left of the image have swapped colors).

Now, if I launch with

roslaunch azure_kinect_ros_driver driver.launch recording_file:=/home/salldritt/color_spill.mkv point_cloud_in_depth_frame:=false

We get this image (last frame of recording):

no_color_spill_rb_swap

We see that now the fingertips are not drawn across the wall, which is correct behavior. However, we do see that the red and blue color channels are still swapped.

@helenol
Contributor Author

helenol commented Oct 19, 2019

Hey @skalldri! Great to have your input in the discussion!
And great catch with the color channels, thanks a ton for noticing! I'll fix this first thing on Monday morning.

Thank you for the thorough analysis. I think we absolutely agree on some things: if you want high-quality color data, it is better to depth-ify the color image when you output pointclouds. And there are definitely applications where that is more important, like adding texture to existing structure. Great! This is left as an option to the user.

Where I disagree is that this is a better default for robotics applications when outputting the pointcloud.

Fundamentally, the Azure Kinect cannot capture RGB point clouds. It captures sufficient data that allows us to reconstruct an RGB point cloud, but RGB point clouds are not a native data format of the Azure Kinect. You claim it is wrong to publish this reconstructed point cloud in the RGB frame, since the depth data comes from the depth frame. But conversely, you could also claim it's wrong to publish it in the depth frame, since the RGB data that colorized each pixel came from the RGB frame. In both cases, some of the data is being published in the "wrong" frame from the "wrong" perspective.

Absolutely, and if you have applications where you care more about the quality of your RGB data when using the pointcloud topic than the quality of your depth data, then you would not use this.

It is therefore reasonable to publish this data from the perspective of the RGB camera, since it uses the highest accuracy re-projection method to preserve the image quality.

Absolutely again, if you care about the quality of the color the most when using a pointcloud.

Consider object scanning, where someone might be trying to produce a 3D model of an object, including color. This could be a static object or a person detected by the body tracking SDK.
Accurately colorizing the point cloud is important in these scenarios.

I would argue that getting the 3D structure correct is more important than the texture during the 3D reconstruction step for this scenario, but again, your mileage may vary. You should create the 3D model with the actual depth data and texture it afterwards with the full color image if you want the highest-quality color possible, as the color images tend to be an order of magnitude higher resolution than the depth data.

Consider also a SLAM/Mapping algorithm like RTAB-Map. RTAB-Map collects colorized point cloud data and stitches it into an accurate colorized mesh of the environment it's in. Not only is this useful for debugging the map data, but this could be used to produce hypothesized locations for visual SLAM. The SLAM algorithm could use this colorized mesh by rendering it into a 2D camera capture from various perspectives, and compare these renders with the actual camera input to refine its position estimate. Without accurate color capture, this wouldn't be possible.

Sure! RTAB-Map is doing RGB-D SLAM on the depth and RGB images. I understand the importance of co-registering them! But it is not fusing the pointcloud data directly; it is using the reprojected images. I would argue this should have no effect on what the default pointcloud output is.

Bear with me here on why not rescaling the depth image into the RGB image size and coordinate frame is important for 3D reconstruction applications (which is what I assume a depth sensor is mostly used for in robotics).
Let's say you have an Azure Kinect in a room. You would like to create a 3D reconstruction of this room. You turn to the literature and use a KinectFusion-based method, which is currently the standard approach.
KinectFusion ingests a pointcloud by raycasting from the sensor center to each point in the pointcloud, projected into its world frame.
So if we publish the pointcloud in the RGB frame, then we are raycasting depth data from the RGB camera center to the depth points. This carves different paths through space than the original depth rays did during exposure! This violates your sensor model.

Great, let's take it one step further. Let's say that instead of using KinectFusion's very naive weighting when fusing depth rays, you identify a sensor model for your depth camera, following Nguyen 2012. This models the error behavior of the depth rays depending on position in the depth image, incidence angle with the surface, and other factors. This model can then be used to fuse depth data together more accurately.
If you then feed the "RGB in RGB" pointcloud that is the current output of this driver into it, you immediately hit several issues.
First of all, the points you get from the pointcloud are not the raw sensor measurements that you are modeling. They are transformed and interpolated to scale up. This means you are adding any error in the extrinsics between the RGB and depth cameras into your 3D reconstruction. If your RGB image is, say, 2x the size of the depth image in each dimension, you are also casting 4x rays per actual depth measurement. If you use a sensor model that matches the characteristics of the actual depth sensor, you are counting every actual measurement 4 times and are 4 times overconfident about the error characteristics of your depth measurements. This is again violating your sensor model.

I hope this explanation is clear! I am happy to clarify any of this.

I hope you agree that the RGB in Depth is a better default for robotics and 3D reconstruction applications. I completely agree there are cases where RGB in RGB makes more sense, which is why it is left as an option to the user.

@helenol
Contributor Author

helenol commented Oct 19, 2019

@RoseFlunder Thanks for your help with this! :) Sounds like an awesome project, I'd love to see any results that come out!
I recently joined the Zurich Mixed Reality and AI lab at Microsoft, and I'm pushing very hard to have better connections between Microsoft and the robotics research community (which is why this driver is very important to me too!).

@RoseFlunder
Contributor

RoseFlunder commented Oct 19, 2019

@helenol
Sounds great and your input with all your experience is very valuable.
I marked the lines where you need to swap red and blue points.
Working on this package is a great learning experience for me.
Although I may end up using only parts in my own project and not the whole package.

I observed that publishing an RGB point cloud at 30 Hz with NFOV_UNBINNED consumes about 400 MB/s of bandwidth (measured with rostopic bw).
That will be too much for the network when I use four Kinects (one PC per Kinect) and send the data to another PC.
So in the last days I switched to sending the depth images, transformed color images, an xy-table and body tracking data via protocol buffers to the main computer. The main computer has a server network card with four RJ45 inputs and the Kinect PCs are directly connected to this network card.
On the main computer the xy-table is used with the images to generate pointclouds, which are published to the ROS system and are usable at least by other nodelets on this main PC.
One of these nodelets will fuse the body tracking data, but users want to see the result live in RViz in an RGB pointcloud.
I am not sure if I can unify the four registered clouds into a single one live at 30 Hz.
Just overlaying the four different pointclouds is too much for RViz and the RViz framerate will drop into single digits.

But well, that's a bit off-topic.
Back to the topic and this package:
Do you know a way we could speed up transferring the pointcloud data from the SDK into the ROS PointCloud2 message?
Is there something faster than using the PointCloud2 iterators?
One option would be reimplementing the SDK method so that the results are stored directly in the ROS message and we don't have to copy them like we do now, but I hope there is an easier way so that we can still use the SDK's pointcloud method.

@helenol
Contributor Author

helenol commented Oct 19, 2019

@RoseFlunder
Re: making the pointcloud output faster: unfortunately there's no free lunch on this one without changing the outputs of the SDK directly. The issue is how the points are stored in memory: the output of the SDK is a 3-channel image, with each channel encoding a different axis. PCL, on the other hand, stores a vector of point objects.
So from the SDK we get:
xxxx..xxxx, yyyy....yyyy, zzzz....zzzz
and we need to pack it into PCL's:
xyz xyz xyz xyz xyz xyz
As far as I know, there's no better way to do this than with individual iterators.
It might be worth checking whether you can use depth_image_proc directly rather than the SDK methods (for the depth image -> pointcloud projection), but I really don't know if this will be better or worse. The only other thing you can really do to make pointcloud creation and transmission faster is to have smaller pointclouds. ;)

PCL/ROS pointclouds are definitely not a compact representation by any stretch of the imagination. Passing depth images + RGB encodings as you've done is definitely a more bandwidth-efficient method. If the final pointcloud is mostly for visualization, I would just decimate it as much as you can get away with before sending it to rviz.
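
Something like a PCL voxel-grid filter between the driver and RViz is usually enough for that (rough, untested sketch; the 5 cm leaf size is arbitrary and pcl_conversions is assumed to be available):

// Downsample a colorized cloud before visualization. The leaf size trades detail
// for message size and RViz framerate.
#include <pcl/filters/voxel_grid.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl_conversions/pcl_conversions.h>
#include <sensor_msgs/PointCloud2.h>

sensor_msgs::PointCloud2 decimateForRviz(const sensor_msgs::PointCloud2& in, float leaf_m = 0.05f)
{
  pcl::PointCloud<pcl::PointXYZRGB>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZRGB>);
  pcl::fromROSMsg(in, *cloud);

  pcl::VoxelGrid<pcl::PointXYZRGB> filter;
  filter.setInputCloud(cloud);
  filter.setLeafSize(leaf_m, leaf_m, leaf_m);  // at most one (centroid) point per occupied voxel

  pcl::PointCloud<pcl::PointXYZRGB> filtered;
  filter.filter(filtered);

  sensor_msgs::PointCloud2 out;
  pcl::toROSMsg(filtered, out);
  out.header = in.header;  // keep the original frame and stamp
  return out;
}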

@RoseFlunder
Contributor

RoseFlunder commented Oct 19, 2019

@helenol
Are you sure about this?

So from the SDK we get:
xxxx..xxxx, yyyy....yyyy, zzzz....zzzz

If I interpret the loop correctly, we also have xyz xyz xyz xyz xyz xyz in the SDK's image buffer.
We iterate over the buffer like this:

const int16_t* point_cloud_buffer = reinterpret_cast<const int16_t*>(pointcloud_image.get_buffer());
for (size_t i = 0; i < point_count; i++, ++iter_x, ++iter_y, ++iter_z)
{
  float z = static_cast<float>(point_cloud_buffer[3 * i + 2]);

  if (z <= 0.0f)
  {
    *iter_x = *iter_y = *iter_z = std::numeric_limits<float>::quiet_NaN();
  }
  else
  {
    constexpr float kMillimeterToMeter = 1.0 / 1000.0f;
    *iter_x = kMillimeterToMeter * static_cast<float>(point_cloud_buffer[3 * i + 0]);
    *iter_y = kMillimeterToMeter * static_cast<float>(point_cloud_buffer[3 * i + 1]);
    *iter_z = kMillimeterToMeter * z;
  }
}

So in the first iteration the indices for x, y, z are 0, 1, 2;
in the second iteration the indices for x, y, z are 3, 4, 5;
etc.
So xyz xyz xyz.
Or am I wrong?


I found the implementation in the SDK which uses SSE instructions.
https://github.com/microsoft/Azure-Kinect-Sensor-SDK/blob/develop/src/transformation/rgbz.c#L1091
But I am not familiar with such low level stuff.
I mean, I kind of understand the idea and how the result is transferred into the image buffer, but I don't have a clue how to efficiently add the conversion to meters for ROS there without extracting each single value as an int16 and multiplying it by 0.001.
Something like this at the bottom of the for loop instead of placing it into xyz_data_m128i:

int mask = 0;
for (int j = 0; j < 8; ++j, ++mask, ++iter_x, ++iter_y, ++iter_z)
{
  *iter_x = _mm_extract_epi16(x, mask) * 0.001;
  *iter_y = _mm_extract_epi16(y, mask) * 0.001;
  *iter_z = _mm_extract_epi16(z, mask) * 0.001;
}

But maybe there is a better way.
Instead of doing the conversion like this there might be a nice way to do it with SSE too and use no iterators at all?
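
One possible shape for that (completely untested sketch, assumes SSE4.1; note that _mm_extract_epi16 needs a compile-time constant index, so the extract-in-a-loop version above probably wouldn't compile as written):

#include <smmintrin.h>  // SSE4.1 for _mm_cvtepi16_epi32

// x, y, z are the __m128i registers from my snippet above, each holding 8 int16 values.
const __m128 scale = _mm_set1_ps(0.001f);
float xf[8], yf[8], zf[8];
_mm_storeu_ps(xf + 0, _mm_mul_ps(_mm_cvtepi32_ps(_mm_cvtepi16_epi32(x)), scale));
_mm_storeu_ps(xf + 4, _mm_mul_ps(_mm_cvtepi32_ps(_mm_cvtepi16_epi32(_mm_srli_si128(x, 8))), scale));
_mm_storeu_ps(yf + 0, _mm_mul_ps(_mm_cvtepi32_ps(_mm_cvtepi16_epi32(y)), scale));
_mm_storeu_ps(yf + 4, _mm_mul_ps(_mm_cvtepi32_ps(_mm_cvtepi16_epi32(_mm_srli_si128(y, 8))), scale));
_mm_storeu_ps(zf + 0, _mm_mul_ps(_mm_cvtepi32_ps(_mm_cvtepi16_epi32(z)), scale));
_mm_storeu_ps(zf + 4, _mm_mul_ps(_mm_cvtepi32_ps(_mm_cvtepi16_epi32(_mm_srli_si128(z, 8))), scale));
for (int j = 0; j < 8; ++j, ++iter_x, ++iter_y, ++iter_z)
{
  *iter_x = xf[j];
  *iter_y = yf[j];
  *iter_z = zf[j];
}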

Don't know if precomputing the xy_tables ourselves and using this method would yield better performance.
Maybe I will give it a try instead of using the "depth_image_to_point_cloud" method some day.

@skalldri
Contributor

Thank you for your detailed explanation. However, I still disagree that this should be the new default mode for this driver.

You make an excellent point about up-scaling the depth data to the RGB resolution. This is adding more detail to the depth sensor output than actually exists, which can introduce error. It's also far more computationally / memory expensive than it needs to be. It seems very reasonable that the RGB point cloud should always be limited to the resolution of the depth camera, regardless of RGB camera resolution. That is absolutely a change that should be made, as it will help with #56.

However, the situations you're describing (KinectFusion-related) are not what PointCloud2 messages are intended to be used for. PointCloud2 is a universal representation of the world in 3D that can be produced by a wide range of sensors, with optional color. These representations can come from a projector-based depth camera like a Kinect, or a stereo camera, or from a 3D LIDAR like a Velodyne. There are depth sensors that haven't been invented yet which will (likely) one day emit PointCloud2 messages.

Each of these sensors has a different underlying data format that is "more native" to the sensor hardware: for Kinect-like cameras, we have depth images. For stereo cameras, we have disparity images. 3D laser scanners like Velodyne can often output PointCloud2 directly.

REP-118 describes the rationale behind using Image messages to transport depth information from Kinect-style cameras, and it touches on some of the topics you bring up here. It also mentions that PointCloud2 messages are specifically intended to be more generic (emphasis mine):

With the addition of depth images, ROS now has three messages suitable for representing dense depth data: sensor_msgs/Image, sensor_msgs/DisparityImage, and sensor_msgs/PointCloud2. PointCloud2 is more general than a depth image, but also more verbose.

What I'm trying to get at here is that software which consumes a PointCloud2 message needs to be more flexible than the types of restrictions you're describing.

The problems you've raised here are all related to the particular operation of the Azure Kinect: it emits rays of IR light, which bounce off objects, and return to a TOF camera. The camera produces a 2D image, where each pixel value in that image represents a depth in millimeters. To produce a point cloud from this data, we "raycast" out from each depth pixel using the intrinsic calibration data from the camera.

My point is that when we emit a PointCloud2 we intentionally discard all this extra information. We are only outputting 3D points in the world, with no assumptions about how that data was collected. If software needs to know these details (like KinectFusion does) then it should not be consuming a PointCloud2 message: it should be consuming the native sensor format directly.

Software that consumes a PointCloud2 is intentionally not being given this information to improve compatibility between sensors. Such software needs to be capable of accepting data from multiple types of sensors, and cannot make assumptions about whether or not the sensor can provide intrinsics about the lens distortion used to produce the PointCloud2: that data may not exist. The data may not even have been captured through a traditional camera-lens system, in the case of Velodyne scans.

I should not have used RTAB-Map as an example in my previous post, since it is much more dependent on the underlying camera hardware than I'm trying to convey here. It can accept PointCloud2 messages, but is designed to accept RGB and depth Images, more like KinFu.

Instead, I should have used Google Cartographer as an example, since it only consumes PointCloud2 messages natively. Cartographer is a SLAM algorithm that uses the Ceres scan-matcher to operate on raw 3D points. Because it only consumes PointCloud2 messages, and makes no assumptions about the underlying sensor hardware, it can consume data from any type of depth sensor. Cartographer doesn't use the colorized point cloud data internally, but will preserve the color data so that colorized point clouds can be produced from the output. This proposed change would break Cartographer scan colorization since the PointCloud2s would contain large RGB artifacts.

I hope this clears up why I'm so resistant to this change. PointCloud2 messages are not intended to convey sensor-specific data to software, since that hinders compatibility between algorithms and the underlying sensors. This change also regresses the quality of the RGB data in the /points2 topic for any software that is actually consuming PointCloud2 messages as they were intended to be used.

@helenol
Contributor Author

helenol commented Oct 19, 2019

@skalldri
Let's see where we can meet in the middle. Do you agree at least having both options available to the user is correct?
I can create a separate launch file for any application I support and you can keep the driver defaults, though as @ooeygui mentioned the slow runtimes with default settings are a problem for some customers.

I feel like you're not understanding my point though. Google's Cartographer does raycasting into the world for creating its maps. It casts rays from the sensor frame (in your output, the RGB camera frame) into its coordinate system. It assumes that your sensor is a LIDAR, which has different error characteristics than a depth camera, so this is already somewhat incorrect. However, you are still tracing rays from the camera center of the RGB camera (which Cartographer assumes to be a LIDAR) to the depth points, which traces a different physical path than the physical sensor's rays. And again, the same restrictions about rescaling apply.
What you're suggesting is akin to saying that, given a LIDAR with an RGB sensor attached, the most correct thing to do would be to reproject the LIDAR data into the RGB frame for the color pointcloud topic, and then feed that into 3D reconstruction frameworks like Cartographer.
And you appear to be arguing that having the scan colorization in Cartographer correct is more important than having the structure in Cartographer correct!

I understand that your way makes the color look nice and there are many cases where this makes more sense but for most robotics applications the structure is far more important than the color.

To give some context, I want to use the Azure Kinect with VIO (rovio, vins mono) and voxblox for dense reconstruction but it is incorrect to feed the rgb in rgb pointcloud into a dense reconstruction framework. I think this would allow the Azure Kinect to be much more useful for roboticists in a wide variety of applications.

@skalldri
Contributor

In summary: if there is a need for this mode we should have it. I'd like it to be a non-default option that users can enable, and I would like to see some documentation updated to explain when to use it.

though as @ooeygui mentioned the slow runtimes with default settings are a problem for some customers.

That's why I opened #56 when it was reported as a problem. My intention was to go back and optimize the RGB point cloud rendering function, since I suspect the use of the PointCloud2 iterators is slowing it down.

I feel like you're not understanding my point though.

You're probably right. I don't have a strong enough maths background to properly understand how many of these tracking / reconstruction algorithms work, so I'm absolutely willing to acknowledge that this is a problem for those algorithms.

Having a mode in the driver to help these algorithms work better sounds fine to me. However, I don't want it to be the default mode if it comes at the expense of significant colorization artifacts in the output.

If you have some literature on the topic of why this is such a serious problem, I would love to spend some time reading it. Currently, beyond accidentally raycasting out cells in a voxel grid when it wasn't supposed to, I'm not able to see why this would degrade the quality of SLAM.

And you appear to be arguing that having the scan colorization in Cartographer correct is more important than having the structure in Cartographer correct!

What I'm arguing is that they are equally important. The user has explicitly indicated they want color data in their point cloud. The expectation is that the node will provide artifact-free output data that represents the real world. We know there are significant colorization artifacts when using RGB to depth re-projection, and I feel that we should avoid introducing artifacts into data we are emitting. You're right: re-projecting the depth to the RGB frame is also an artifact, but it's an artifact that preserves the accuracy of the output data while introducing an artifact into the way the data was captured, which I feel is a reasonable tradeoff.

We should make it clear in the documentation that there is a tradeoff between colorization accuracy and raycasting accuracy between these two modes, just like we document that enabling color point clouds will reduce the FOV of the output.

I'm curious: in your example of collecting a dense reconstruction, how are you planning on removing the colorization artifacts once they have been introduced to your reconstruction? Detecting them in post processing sounds very difficult. Is your use case such that having colorization artifacts in the output is acceptable? If so, why bother collecting color data?

I understand that your way makes the color look nice and there are many cases where this makes more sense but for most robotics applications the structure is far more important than the color.

Aesthetics is not why I'm pushing back on this: I want the output of the node to best reflect what the sensor captured. The colorization artifacts introduced through the RGB to depth re-projection do not accurately reflect the scene captured by the sensor.

@RoseFlunder
Contributor

@skalldri do you already have an approach to speed up #56 that's in the works?

@helenol
Contributor Author

helenol commented Oct 21, 2019

@skalldri Great, thanks a lot for seeing my point of view! I really appreciate it. I can agree to keep the default the current state as long as the option exists.

But I think I've been super hasty in responding without thinking about the problem properly, and I sincerely apologize, since I think I've wasted a lot of both of our time.
There actually shouldn't be colorization artifacts in the first place. The pixel halo in the colorized depth image isn't a fact of how projections work, it's just a bug.
It's a bug I wanted to look more into when I first started using the Azure Kinect, but then got side-tracked with other stuff and forgot about it. I also think I thought it was a calibration issue because, like a good roboticist, I blame calibration for 100% of my problems.

To re-iterate: we cannot know which RGB pixels are observed only by the RGB camera and not by the depth camera (or vice versa), so when colorizing the depth map with an RGB frame you will incorrectly colorize depth pixels that the RGB camera did not observe. This results in "color bleed" from objects observed by the RGB camera onto depth pixels not associated with that object.

We actually can know!
Please enjoy some PowerPoint figures I made to explain this concept. For a less ad-hoc, terrible explanation, please check out Hartley and Zisserman's Multiple View Geometry: chapters 2-3 cover the math basics (which I also need to brush up on), and chapter 10 covers the basics of 3D reconstruction.
We actually have a much easier case than in the book, because we do not need to recover depth from two projected images; the depth is already recovered for us.

Given that we have two fully-calibrated cameras and undistorted images from both, the left being RGB and the right being a depth camera, we can associate every point in the depth camera image with a single point in the RGB image.
This is done by casting a ray into 3D space through a pixel on the projective plane, for the distance d which is the depth measurement stored in that pixel.
projective_transforms
This gives you a fully-defined, metric point in 3D in the depth camera's frame. To recover the corresponding pixel in the RGB image, you simply need to project that 3D point into the RGB image plane.
Do this for every valid depth pixel and you have a valid colorized depth image, no halos!
You can think about this similar to converting the depth image into the RGB frame, colorizing a pointcloud, and then converting it back to the depth frame, except with a lot fewer steps.
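
In code, the association I'm describing looks roughly like this (untested sketch with hypothetical pinhole-intrinsics structs, distortion ignored; not the actual SDK code):

// Rough sketch: colorize each valid depth pixel by back-projecting it to a 3D
// point in the depth frame and projecting that point into the RGB image.
// Intrinsics, R and t are hypothetical stand-ins for the calibration data.
#include <cmath>
#include <cstdint>
#include <vector>

struct Intrinsics { float fx, fy, cx, cy; int width, height; };
struct Color { uint8_t r, g, b; };

std::vector<Color> ColorizeDepth(const std::vector<uint16_t>& depth_mm,
                                 const std::vector<Color>& rgb,
                                 const Intrinsics& depth_K, const Intrinsics& rgb_K,
                                 const float R[9], const float t[3])  // depth -> RGB extrinsics
{
  std::vector<Color> colorized(depth_K.width * depth_K.height, Color{0, 0, 0});
  for (int v = 0; v < depth_K.height; ++v)
  {
    for (int u = 0; u < depth_K.width; ++u)
    {
      const float d = depth_mm[v * depth_K.width + u] * 0.001f;  // mm -> m
      if (d <= 0.0f) continue;
      // Back-project pixel (u, v) with depth d to a metric 3D point in the depth frame.
      const float X = (u - depth_K.cx) / depth_K.fx * d;
      const float Y = (v - depth_K.cy) / depth_K.fy * d;
      // Transform into the RGB camera frame and project onto the RGB image plane.
      const float Xc = R[0] * X + R[1] * Y + R[2] * d + t[0];
      const float Yc = R[3] * X + R[4] * Y + R[5] * d + t[1];
      const float Zc = R[6] * X + R[7] * Y + R[8] * d + t[2];
      if (Zc <= 0.0f) continue;
      const int uc = static_cast<int>(std::lround(rgb_K.fx * Xc / Zc + rgb_K.cx));
      const int vc = static_cast<int>(std::lround(rgb_K.fy * Yc / Zc + rgb_K.cy));
      if (uc < 0 || uc >= rgb_K.width || vc < 0 || vc >= rgb_K.height) continue;
      colorized[v * depth_K.width + u] = rgb[vc * rgb_K.width + uc];
    }
  }
  return colorized;
}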

So why do we get the type of "halo" or color-bleed that's in the photo below when we colorize the depth cloud?
image

Because there are actually two ways to align an RGB image and a depth image. The first is the way I just described, which involves casting a ray through the depth image onto the structure and back for every pixel that has a valid depth.
The second way is to treat the depth image as a regular projective image and, without using depth information, warp the RGB image into the depth image's coordinate frame. Imagine just going through the extrinsic and intrinsic calibration between the two cameras to map what the RGB camera sees into the depth camera's image frame. Here's a drawing which may or may not be helpful.
projective_transforms (1)
The green projective plane is what the RGB camera sees warped into the depth camera's view.
The structure seen by the RGB camera is assumed to be either planar at a fixed distance or far away, to accommodate this warping.
Nice slide deck on these concepts here: http://graphics.cs.cmu.edu/courses/15-463/2004_fall/www/Lectures/mosaic.pdf and I stole the figure from slide 10 here:
image

Without knowing the depth of the pixels, this is the best you can do. And it's pretty alright, except that objects close to the camera and near depth discontinuities get assigned to the wrong pixels, because that's where the assumption that the world is a far-away plane is most violated.
This approach has one huge advantage: it's fast. Knowing the calibration between two fixed cameras, you can pre-compute a look-up table to map every pixel in one image to another.

I'm not 100% sure that this is what we're doing with the SDK calls, but from the results, this is my best guess for what it appears to be. That or it's a calibration issue.
At any rate, I think investigating and fixing this problem is a high priority for me now.
EDIT: Investigation work has led to the conclusion that what I described first is what the SDK is doing, not the homography warp! https://github.com/microsoft/Azure-Kinect-Sensor-SDK/blob/develop/src/transformation/rgbz.c#L111-L148

Re: what do I do with unmatched color when doing 3D reconstruction? I mostly just use the color for visualization, so I don't worry too much, I guess. ;) The way we're fusing color in voxblox over multiple scans is super naive anyway (a weighted average!), and it's not uncommon to have weird effects from changing exposure, etc. I totally agree that how much this matters is very application-dependent. But yeah, let's fix the root cause!

Re: affecting SLAM performance... The biggest effect is on actually having accurate free-space vs. occupied information. Depending on the camera configuration, it may or may not have any measurable effect on SLAM performance. If the projective centers of the cameras are very close relative to the distance to obstacles, you might not notice any difference.
However, you end up with this entire yellow region that you are completely hallucinating from the color camera when doing raycasting (green dotted lines):
projective_transforms (2)
If there are any obstacles inside this hallucinated space, Cartographer's performance would diminish because, when scan matching against its map, it would expect to see obstacles there but wouldn't.
This is why the case of an RGB camera strapped to a LIDAR seems even more ridiculous to me: the larger the distance between the sensor centers, the bigger this yellow region grows and the more hallucinated data you feed into your algorithms.

@RoseFlunder
Contributor

@helenol
I don't understand everything from your post, but it sounds like your approach to getting valid RGB values for the depth pixels could be a new addition to the SDK itself?
That could be something useful in general for users of the Azure Kinect.
The Sensor SDK is on GitHub too :)

@helenol
Contributor Author

helenol commented Oct 21, 2019

@skalldri @RoseFlunder
Well I said a lot of things totally wrong! Probably disregard my last message completely. I apologize even more sincerely!
Spent the day on a deep dive through the driver... Yes color bleed is from occlusions between RGB and depth camera. Yes you can know about it, but not without basically rendering the image into the RGB frame and then back (or doing z-buffering on the RGB image). Tried a very very very hacky thing to invalidate the occluded pixels by going backwards through the depth image and allowing only the first hit from the depth camera (only works for low camera res):

Before:
Screenshot from 2019-10-21 20-36-14

After:
Screenshot from 2019-10-21 20-33-09

So you can do it, but doing it properly is probably gonna be expensive. I'll think of ways to maybe do it cheaper... But ok yes it's really not trivial, I totally agree with keeping the RGB depth output as default.

Re: pointclouds, I'm also totally wrong here; the SDK image also stores xyz xyz xyz. The problem here is actually more about packing: PCL packs 4 floats for position, while the SDK packs 3 int16s. We could change the SDK to output 4 padded floats and then do some memcpy hacks... Not so sure how much faster it'll be though.
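
For reference, skipping the iterators with today's 3-int16 buffer would look roughly like this (untested sketch; it assumes the x/y/z fields sit at the start of each point, which is how setPointCloud2FieldsByString lays them out):

#include <cstddef>
#include <cstdint>
#include <sensor_msgs/PointCloud2.h>
#include <sensor_msgs/point_cloud2_iterator.h>

// Fill the PointCloud2 with direct pointer writes instead of per-field iterators.
// point_cloud_buffer is the SDK image buffer: packed int16 xyz triplets in mm.
// (Invalid-point handling with NaNs is omitted for brevity.)
void fillCloudDirect(const int16_t* point_cloud_buffer, size_t point_count,
                     sensor_msgs::PointCloud2& cloud)
{
  sensor_msgs::PointCloud2Modifier modifier(cloud);
  modifier.setPointCloud2FieldsByString(1, "xyz");  // x, y, z as float32
  modifier.resize(point_count);

  uint8_t* dst = cloud.data.data();
  for (size_t i = 0; i < point_count; ++i, dst += cloud.point_step)
  {
    float* xyz = reinterpret_cast<float*>(dst);
    xyz[0] = point_cloud_buffer[3 * i + 0] * 0.001f;
    xyz[1] = point_cloud_buffer[3 * i + 1] * 0.001f;
    xyz[2] = point_cloud_buffer[3 * i + 2] * 0.001f;
  }
}

Whether that actually beats the iterators (they do essentially the same pointer arithmetic underneath) is exactly what I'm not sure about.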

@RoseFlunder
Contributor

RoseFlunder commented Oct 21, 2019

@helenol
Since you mention the Z-buffer: MS writes in the docs about how they implemented the depth-to-color transformation:

This transformation function is more complex than simply calling k4a_calibration_2d_to_2d() for every pixel. It warps a triangle mesh from the geometry of the depth camera into the geometry of the color camera. The triangle mesh is used to avoid generating holes in the transformed depth image. A Z-buffer ensures that occlusions are handled correctly. GPU acceleration is enabled for this function by default.

https://docs.microsoft.com/en-us/azure/kinect-dk/use-image-transformation

The example picture shows that very well.
image
We can see in the bottom-left picture that all occluded areas get invalidated (black pixels); the color-to-depth version on the bottom right, on the other hand, has no occlusion checking.

For color to depth, they state:

As this method produces holes in the transformed color image and does not handle occlusions, we recommend using the function k4a_transformation_depth_image_to_color_camera() instead

But I don't really know why they are doing it only in one direction. Maybe because it's easier for depth to color. I am no expert here.

EDIT:
If you want to dig through the SDK itself, here you can see the non-GPU-accelerated version of depth to color (I think it will be easier to understand than the GPU-accelerated one):
https://github.com/microsoft/Azure-Kinect-Sensor-SDK/blob/develop/src/transformation/rgbz.c#L506

@skalldri
Contributor

@helenol thanks for the detailed explanations! The drawings of what you're describing are very helpful, and now I've got some new reading material :)

I think we're on the same page now. Your concern is that, when using Depth->RGB, we are pretending that the "yellow volume" is empty, when in reality we have no information about that area since the depth camera did not observe it. It might contain geometry, it might not. We can't know for sure, but we are emitting data about that region of space when in fact we have no idea what's there.

SLAM algorithms might expect to find some geometry there while doing scan matching, and could get confused about their position, or might raytrace through some existing geometry, due to this inaccuracy.

That makes complete sense to me, and we should definitely have the ability to change to RGB->Depth mode to resolve that issue.

Yes you can know about it, but not without basically rendering the image into the RGB frame and then back (or doing z-buffering on the RGB image).

So you can do it, but doing it properly is probably gonna be expensive. I'll think of ways to maybe do it cheaper... But ok yes it's really not trivial, I totally agree with keeping the RGB depth output as default.

Over the weekend I had been thinking about this exact idea, but figured it would be too expensive to do it in real time. It's been suggested to me that I should look into screen-space reflection algorithms for potential ways of solving this in an efficient way. That's a bit beyond my expertise, but maybe it makes sense to you?

@helenol
Contributor Author

helenol commented Oct 25, 2019

To pick this back up again... I think I have this PR in a state that we're all ok with, and I think it's about time I merged it. ;)

@skalldri I'm not familiar with that class of algorithms! Definitely sounds interesting and worth investigating. I think there's quite a few things we could do, some even in real-time... If you choose to investigate it further, I'm excited to see what you come up with! :)
I see you're following up with a more thorough investigation of #56 , that's great, thank you!
I agree on leaving that bug open to see if there are more performance improvements to be made.

And as one final note, a good example of why I don't personally care too much about color accuracy...
Here's a quick 3D reconstruction with voxblox with 2.5 cm voxels of the sheep scene above with color bleeding (RGB in depth):
Screenshot from 2019-10-22 09-58-32

Here's the same scene with RGB in RGB:
Screenshot from 2019-10-22 09-59-13

🤣 🤣 🤣 I think both are equally horrible. I think this is why texturing over the structure is the right answer for anything you actually care about, in any case.

@helenol helenol merged commit 94497b3 into microsoft:melodic Oct 25, 2019