Half precision floating point support. #8428
You can take a look at this PR: #6447 (the changes are merged). There are no plans to add FP16 as ...
Thanks.
If we get a time machine and go back ~10 years, I would add CV_16F as another regular data type, supported by CvMat and cv::Mat. But even now we can probably squeeze it in with a little bit of effort. We use the 3 lower bits of the CvMat/cv::Mat type field to represent the data type. Those are:
0 - CV_8U
1 - CV_8S
2 - CV_16U
3 - CV_16S
4 - CV_32S
5 - CV_32F
6 - CV_64F
7 - CV_USRTYPE1
As you can see, we have occupied all the possible values, and increasing the number of bits per depth from 3 to 4 or more is not feasible - there is too much code that may rely on it. But there is still a workaround available. Suppose that we've overridden depth=7 as CV_16F. What's next? C++ does not even know about such a type, as far as I know. So we'd need to do the following:
It's a big project. I think we could do steps 1-4, then stop and wait for subsequent contributions. Now the question is: who will do it? :)
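For illustration, here is a minimal sketch (with simplified, hypothetical helper names; OpenCV's real macros are CV_MAKETYPE and CV_MAT_DEPTH) of how the depth and channel count share the type field, and how the last depth slot, 7, could be reused:

```cpp
#include <cstdio>

// The 3 lowest bits hold the depth (0..7); the remaining bits hold channels-1.
enum { DEPTH_BITS = 3, DEPTH_MASK = (1 << DEPTH_BITS) - 1 };

constexpr int makeType(int depth, int channels)
{
    return (depth & DEPTH_MASK) | ((channels - 1) << DEPTH_BITS);
}
constexpr int typeDepth(int type)    { return type & DEPTH_MASK; }
constexpr int typeChannels(int type) { return (type >> DEPTH_BITS) + 1; }

int main()
{
    const int DEPTH_16F = 7;           // reusing the last free depth value
    int t = makeType(DEPTH_16F, 3);    // a hypothetical CV_16FC3
    std::printf("depth=%d channels=%d\n", typeDepth(t), typeChannels(t));
    return 0;
}
```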
There are still no efficient implementations of arithmetic operations on almost all platforms (no support on x86 - conversions only; ARM has alternative half floats). The current FP16 domain is special accelerators (power-efficient), and usage on CPUs is very limited. There is also an alternative to FP16 - 16-bit fixed-point numbers, like Q8.7. Both approaches are interesting, but they are probably useful for specific tasks only.
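For a sense of what the Q8.7 alternative looks like, a tiny sketch (hypothetical helpers, saturation handling omitted): 8 integer bits and 7 fractional bits in a signed 16-bit value, i.e. a fixed scale of 1/128:

```cpp
#include <cstdint>

// Q8.7: value = raw / 128. Range is roughly [-256, 256) with a fixed
// resolution of 1/128 ~ 0.0078, which is why a per-image scale factor
// is often needed (see the next comment).
std::int16_t to_q8_7(float x)          { return (std::int16_t)(x * 128.0f); }
float        from_q8_7(std::int16_t q) { return q / 128.0f; }
```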
I believe the main point here is to have some standard, "official" container for such data. Then people can add their own algorithms working on such data, including the obvious scenario where a Mat is converted to CV_32F, processed, and converted back to CV_16F if needed. Also, Intel, NVidia and AMD GPUs already have hardware FP16 support. There are AVX intrinsics to convert between half and float, so any more or less complex operation can do the conversion on the fly, and the performance will be comparable with CV_32F: a few more conversion operations vs 2x smaller memory traffic. Half is definitely more convenient to use than Q8.7: you still have a decent representation range, +/-65504, and at the same time can represent (normalized) values as small as 6*10^-5. With Q8.7 data you'd often need to supply a scale factor to map your data into that quite narrow range.
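To make the on-the-fly conversion concrete, a minimal sketch of that convert-process-convert pattern using the F16C intrinsics (assuming an F16C-capable x86 CPU, compiled with -mf16c; the scalar tail for the n % 8 leftover elements is omitted for brevity):

```cpp
#include <immintrin.h>
#include <stddef.h>

// Multiply an array of IEEE halves by a scalar: widen to fp32, do the math,
// narrow back. Two extra conversion instructions per 8 elements buy 2x
// smaller memory traffic compared to storing fp32.
void scale_fp16(unsigned short* dst, const unsigned short* src, size_t n, float k)
{
    __m256 vk = _mm256_set1_ps(k);
    for (size_t i = 0; i + 8 <= n; i += 8)
    {
        __m128i h = _mm_loadu_si128((const __m128i*)(src + i));
        __m256  f = _mm256_cvtph_ps(h);                      // 8 halves -> 8 floats
        f = _mm256_mul_ps(f, vk);                            // work in fp32
        h = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);   // 8 floats -> 8 halves
        _mm_storeu_si128((__m128i*)(dst + i), h);
    }
}
```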
@vpisarev let me shed some light on why @mkimi-lightricks and our team require half-float support, and how we currently make OpenCV "work" with it. We have an engine based on OpenGL running on iOS. On these devices the GPU's and the CPU's memory is shared, so it is zero-copy to map GPU memory to the CPU and work directly on GPU textures. Our CPU matrix data type is ... Since OpenGL ES (3.0) supports half-float out of the box (and not float), we require support for that type on the CPU side as well. Additionally, since the data stored in these matrices is usually HDR images, we do not require wider data types, and the amount of RAM on these devices is fairly limited. Converting a 16MP image from half-float to float is therefore a huge waste of memory and computation power we do not have. Our current solution is a hack - we defined a type ...
The underlying data type is ... Obviously, almost all OpenCV operations don't work, but the ...
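A minimal sketch of such a zero-copy hack (the placeholder depth CV_16UC4 here is an assumption, chosen only because it has the same element width as a 4-channel half pixel): cv::Mat can wrap externally owned, GPU-mapped memory without copying:

```cpp
#include <opencv2/core.hpp>

// Wrap GPU-mapped half-float pixels in a Mat header; this Mat constructor
// neither copies nor owns the data, so CPU code can read/write the mapped
// texture memory in place.
cv::Mat wrapHalfTexture(void* mappedPixels, int rows, int cols, size_t stepBytes)
{
    return cv::Mat(rows, cols, CV_16UC4, mappedPixels, stepBytes);
}
```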
There was an idea of creating a special "fp16" module which would handle operations on the half-float type (probably via ...). But at least such functionality would be useful for testing or prototyping (for example, it could detect "overflows" in some simulation mode). Also, such an approach would not explode the size of the OpenCV binaries with "unused" functionality.
Let me add some background here: 16-bit floats are about to appear in C++20 (and C, at some point) as a new "short float" fundamental type: ...
@alalek: efficient FP16 GPU implementations go way back, and most professional photo and video formats use it. Once the fundamental type appears in the standard, OpenCV will have no choice but to extend for it - better to start early!
@vpisarev, @mkimi-lightricks: as a main driver of the "short float" proposal in ANSI C++, I would definitely volunteer to do steps 1-4. Whoever has implemented any bits of FP16 support, please send those to me. Our company would be 100% behind it, as you might guess, but I am not authorized to speak for it :)
@StatusReport: I do not see any OpenCV forks on your profile - if you care to share any FP16 code you have produced so far, I would appreciate it and would incorporate it, with all proper credits, into the (hopefully happening) PR for steps 1-4!
@mkimi-lightricks: I do not see any public forks of OpenCV in your profile either - I would be happy to consider your PR into what may become the steps 1-4 implementation.
@borisfom I was not aware of the upcoming "short float" type. We haven't forked OpenCV so far, as we tried to avoid diverging from the mainline releases. The code I posted, plus using the ... The pending question is whether such PRs will be accepted by the maintainers.
@StatusReport: until there is "short float", there is a problem with a common type definition if one can't assume CUDA is present - otherwise one can't import the CUDA half type and has to come up with a distinct CPU type. Trying to solve the same issue for, say, the Torch DL framework was not successful.
I would go for ... @borisfom, can you please elaborate on the problem of sharing different types? Isn't this solvable by defining implicit/explicit cast operators between them?
The problem was: if you can't rely on the CUDA type being defined, you have to define it somehow for CPU/OpenCL. Even if it has the same name as the CUDA type and is defined in exactly the same literal way, the C++ compiler has every right to consider it a different type, so you would have to jump through the hoops of conversions - yes, it is solvable, but it would be better to avoid it.
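One common way around this (a sketch with hypothetical names, not what any framework settled on) is to alias a single half type to CUDA's __half under NVCC and to a bit-compatible struct elsewhere, so that exactly one type exists rather than two identical but distinct ones:

```cpp
#ifdef __CUDACC__
  #include <cuda_fp16.h>
  typedef __half half_t;          // the native CUDA half type
#else
  struct half_t
  {
      unsigned short bits;        // same size and layout as __half
  };
#endif
```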
Sorry for the delay with the reply. OK, so here is the minimal float16_t implementation that we can put into some header in opencv2/core:

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

namespace cv {

struct CV_EXPORTS float16_t
{
    explicit float16_t(float x)
    {
        union { unsigned u; float f; } in;
        in.f = x;
        unsigned abs_f = in.u & 0x7fffffff;
        unsigned t = (abs_f >> 13) - 0x1c000;
        unsigned sign = (in.u & 0x80000000) >> 16;
        unsigned e = abs_f & 0x7f800000;
        t = e < 0x38800000 ? 0 : t; // Flush-to-zero
        t = e < 0x47800000 ? t : abs_f > 0x7f800000 ? t - 0x1c000 : 0x7c00; // overflow to Inf, NaN stays NaN
        t = (e == 0 ? 0 : t) | sign;
        bits = (unsigned short)t;
    }

    operator float() const
    {
        union { int u; float f; } out;
        unsigned t = ((bits & 0x7fff) << 13) + 0x38000000; // extract and adjust mantissa and exponent
        unsigned sign = (bits & 0x8000) << 16; // extract and shift the sign bit
        unsigned e = bits & 0x7c00;
        out.u = (e >= 0x7c00 ? t + 0x38000000 : e == 0 ? 0 : t) | sign; // convert denormals to 0's
        return out.f;
    }

    unsigned short bits;
};

}

int main(int argc, char** argv)
{
    float a[] = { 1.f, 0.f, 0.001f, 0.000001f, 1234.56f, 1e6f,
                  (float)(4*atan(1.)), (float)(1./0.), (float)(0./0.), (float)log(0.) };
    int i, n = 10;
    for( i = 0; i < n; i++ )
    {
        cv::float16_t h(a[i]);
        printf("%d. flt=%f, half=%f (%04x)\n", i, a[i], (float)h, h.bits);
    }
    return 0;
}
```

Who can volunteer to prepare the pull request where ...
@vpisarev: I would not sign up for other things - but I would like to help define the type right. We had a few rounds of integrating short float support into Torch/Caffe at NVIDIA - Caffe's was definitely cleaner, as it used a (fancier) C++ class, lifted from here: http://half.sourceforge.net/. We also added CUDA-specific sections to make use of native CUDA support - not too many, but in important places (conversion). It's not up on GitHub yet; we'll have a public release next week. Another crucial thing we learned is that for DL (and, I believe, for image processing as well) a clamping conversion is a must (in the example above, 1e6f would translate to the max positive value, not +Inf). Otherwise many algorithms go nuts.
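A sketch of that clamping policy (a hypothetical helper, reusing the cv::float16_t struct from the previous comment for the final narrowing): finite out-of-range values saturate to +/-65504 instead of overflowing to +/-Inf:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

std::uint16_t float_to_half_saturating(float x)
{
    if (std::isnan(x))
        return 0x7e00;                               // a quiet-NaN bit pattern
    const float HALF_MAX = 65504.f;                  // largest finite half
    x = std::min(std::max(x, -HALF_MAX), HALF_MAX);  // saturate, don't overflow
    return cv::float16_t(x).bits;                    // e.g. 1e6f -> 0x7bff (65504)
}
```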
@borisfom, thanks for the link! If I understand correctly, it needs C++11, which is still optional for OpenCV. In fact, the actual half implementation is not very important. We just want some placeholder for the half type. As soon as you add it, you can write code like:

```cpp
Mat_<float16_t> m(1080, 1920);
for(...)
{
    m(i, j) = ...;
}
```

In other words, declaring a half type in OpenCV does not prevent you from using your own implementation of half.
@vpisarev while this will solve (1), it will not solve (2)-(4), since we'll need to implement the basic half-float operations ourselves. Wouldn't it be easier to add a battle-tested third-party implementation such as half_float? Also, ...
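For reference, this is roughly what using that library (http://half.sourceforge.net/) looks like; a sketch based on its documented interface, in which half_float::half is designed to behave like a built-in arithmetic type:

```cpp
#include <half.hpp>

int main()
{
    using half_float::half;

    half a(3.5f), b(1.25f);
    half c = a + b;                  // arithmetic is routed through float internally
    float f = static_cast<float>(c); // widen back to float
    return f == 4.75f ? 0 : 1;       // 4.75 is exactly representable in half
}
```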
@StatusReport, we need to make sure that operations on halves are not just template instantiations of generic code, because at least for now some major archs, like x86/x64, do not support half in hardware (except for conversion to/from float in AVX2). So the instantiations would run significantly more slowly than even the double-precision versions. From this point of view my dummy implementation of float16_t is even better than half_float::half, because there is a better chance that the instantiations will not compile. If you want to provide certain operations on half floats, e.g. in the DNN module, you need to implement such operations explicitly. That's the point.
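To illustrate that design point (a sketch with hypothetical names, not the proposed OpenCV code): if the placeholder type deliberately has no arithmetic operators and only explicit conversions, generic templates that try to do math on it fail at compile time instead of silently instantiating slow scalar code:

```cpp
struct half_placeholder
{
    unsigned short bits = 0;
    half_placeholder() = default;
    explicit half_placeholder(float) {}              // conversion stub
    explicit operator float() const { return 0.f; }  // explicit: no implicit math
    // deliberately no operator+, operator*, ...
};

template <typename T>
T generic_sum(const T* p, int n)
{
    T s = p[0];
    for (int i = 1; i < n; i++)
        s = s + p[i];   // no operator+ for half_placeholder: compile error
    return s;
}

// generic_sum<float> works as usual; instantiating
// generic_sum<half_placeholder> fails to compile, which is exactly the
// desired behavior described above.
```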
@StatusReport: yes, C++11 is optional. My point exactly: this implementation was tested with a lot of scrutiny; NVIDIA alone ran many CPU- (and GPU-) weeks through it, uncovering every possible rounding/conversion/normalization issue.
@StatusReport, @vpisarev: now that I have some free time after GTC, I would be interested in getting some version of float.cpp with CUDA support into OpenCV - including the CV_16F plumbing - please let me know how to proceed.
@StatusReport, @vpisarev, @mkimi-lightricks: I'd like to take a cut at FP16 support (steps 1-4); let me know if there is any new work on that in progress.
Sorry, but there's nothing on my end.
@borisfom please proceed, as it seems that you have strong opinions about (1). We'll be happy to chime in after the basic types are merged.
@StatusReport, @mkimi-lightricks, @vpisarev Update: I started working on it. Resolutions so far: ... I have submitted two PRs so far, one fixing the CUDA 9 build and another extending the CPU dispatch for the tests. Both are ready to go in: ...
@borisfom, thank you for working on it! Hopefully, at some point we will have a more or less universal, Halide-based solution to support ...
@vpisarev: You are welcome!
I think we can close it. See #12463.
System information (version)
OpenCV => iOS 3.1
Operating System / Platform => macOS Sierra (v 10.12.3), iMac
Compiler => Apple LLVM version 8.0.0 (clang-800.0.42.1)
Detailed description
Are there plans to add half precision floating point support to OpenCV? If yes, when?
Steps to reproduce
None.