Commit
Merge pull request #119 from TimmyLiu/master
merge develop branch into master branch. bump the version number to 2.6
Showing 134 changed files with 48,875 additions and 1,264 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,3 +17,6 @@ | |
|
||
# Generated kernel template files | ||
*.clT | ||
|
||
# flags.txt file | ||
*flags.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,6 +20,24 @@ library does generate and enqueue optimized OpenCL kernels, relieving | |
the user from the task of writing, optimizing and maintaining kernel | ||
code themselves. | ||
|
||
## clBLAS update notes 04/2015 | ||
- A subset of GEMM and TRSM kernels can be compiled off-line for Hawaii, Bonaire and Tahiti devices when the library is built. This feature | ||
eliminates the overhead of calling clBuildProgram() at run-time. | ||
- Off-line compilation can be done with the OpenCL 1.1, OpenCL 1.2 and OpenCL 2.0 runtimes. However, OpenCL 2.0 is recommended for better | ||
performance. Library users can select "OCL_VERSION" in CMake to build the library for a given | ||
OpenCL version. It is the library user's responsibility to ensure compatible hardware and driver. | ||
- Added a flags_public.txt file that contains the OpenCL compiler flags used by off-line compilation. The flags_public.txt file | ||
is only loaded when OCL_VERSION is 2.0. | ||
- Users can off-line compile for one or more supported devices by selecting | ||
OCL_OFFLINE_BUILD_BONAIRE_KERNEL | ||
OCL_OFFLINE_BUILD_HAWII_KERNEL | ||
OCL_OFFLINE_BUILD_TAHITI_KERNEL. | ||
However, compiling for more than one device at a time might exhaust heap memory, so compiling for | ||
one device at a time is recommended. | ||
- Users may also supply a specific OpenCL compiler path with OCL_COMPILER_DIR; otherwise the library loads the default OpenCL compiler. | ||
- The minimum driver requirement for off-line compilation is 14.502. | ||
|
||
|
||
## clBLAS library user documentation | ||
|
||
[Library and API documentation][] for developers is available online as | ||
|
@@ -48,15 +66,12 @@ how to contribute code to this open source project. The code in the | |
be made against the /develop branch. | ||
|
||
## License | ||
|
||
The source for clBLAS is licensed under the [Apache License, Version | ||
2.0][] | ||
The source for clBLAS is licensed under the [Apache License, Version 2.0]( http://www.apache.org/licenses/LICENSE-2.0 ) | ||
|
||
## Example | ||
The simple example below shows how to use clBLAS to compute an OpenCL accelerated SGEMM | ||
|
||
The simple example below shows how to use clBLAS to compute an OpenCL | ||
accelerated SGEMM | ||
|
||
```c | ||
#include <sys/types.h> | ||
#include <stdio.h> | ||
|
||
|
@@ -170,42 +185,30 @@ accelerated SGEMM | |
|
||
return ret; | ||
} | ||
``` | ||
## Build dependencies | ||
|
||
### Library for Windows | ||
|
||
- Windows® 7/8 | ||
|
||
- Visual Studio 2010 SP1, 2012 | ||
|
||
- An OpenCL SDK, such as APP SDK 2.9 | ||
|
||
- Latest CMake | ||
* Windows® 7/8 | ||
* Visual Studio 2010 SP1, 2012 | ||
* An OpenCL SDK, such as APP SDK 2.8 | ||
* Latest CMake | ||
### Library for Linux | ||
|
||
- GCC 4.6 and onwards | ||
|
||
- An OpenCL SDK, such as APP SDK 2.9 | ||
|
||
- Latest CMake | ||
* GCC 4.6 and onwards | ||
* An OpenCL SDK, such as APP SDK 2.9 | ||
* Latest CMake | ||
### Library for Mac OSX | ||
|
||
- Recommended to generate Unix makefiles with cmake | ||
* Recommended to generate Unix makefiles with cmake | ||
### Test infrastructure | ||
|
||
- Googletest v1.6 | ||
|
||
- ACML on windows/linux; Accelerate on Mac OSX | ||
|
||
- Latest Boost | ||
* Googletest v1.6 | ||
* ACML on windows/linux; Accelerate on Mac OSX | ||
* Latest Boost | ||
### Performance infrastructure | ||
|
||
- Python | ||
* Python | ||
[Library and API documentation]: http://clmathlibraries.github.io/clBLAS/ | ||
[[email protected]]: https://groups.google.com/forum/#!forum/clmath | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
S. Chauveau | ||
CAPS Entreprise | ||
clBLAS Project | ||
------------------------------ | ||
April 30, 2014 | ||
|
||
|
||
The implementation of a binary cache for CL programs can be found in the | ||
files src/include/binary_lookup.h and src/library/blas/generic/binary_lookup.cc. | ||
|
||
The cache is currently disabled by default. It can be enabled by | ||
setting the environment variable 'CLBLAS_CACHE_PATH' to the directory | ||
containing the cache entries. | ||
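As a rough illustration of this opt-in behaviour, the sketch below reads the variable with `std::getenv`. The helper names `cachePathFromEnv` and `cacheEnabled` are hypothetical and not part of clBLAS, and treating an empty value as "disabled" is an assumption; the actual logic lives in binary_lookup.cc and may differ.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not a clBLAS function): returns the directory named
// by CLBLAS_CACHE_PATH, or an empty string when the variable is not set.
std::string cachePathFromEnv()
{
    const char * path = std::getenv("CLBLAS_CACHE_PATH");
    return (path != nullptr) ? std::string(path) : std::string();
}

// Assumption: the cache counts as enabled only when a non-empty path is given.
bool cacheEnabled()
{
    return !cachePathFromEnv().empty();
}
```

With this sketch, exporting `CLBLAS_CACHE_PATH` before starting the application is enough to switch the cache on; leaving it unset keeps the default, disabled behaviour.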
|
||
In the code itself, accesses to the cache are controlled by the | ||
BinaryLookup class. A typical cache query looks as follows: | ||
|
||
(1) Create a local instance of BinaryLookup | ||
|
||
(2) Specify the additional characteristics (i.e. variants) of the | ||
requested program. That information combined with the program name | ||
and the OpenCL context and device shall form a unique signature | ||
for the binary program. | ||
|
||
(3) Perform the actual search by calling the 'found' method. | ||
|
||
(4a) If the search was successful, then the cl_program can be retrieved | ||
by a call to the 'getProgram' method. | ||
|
||
(4b) If the search was not successful, then a cl_program | ||
must be created, passed to the lookup object by a call | ||
to the 'setProgram' method, and stored in the cache by a call to 'populateCache'. | ||
|
||
(5) Destroy the BinaryLookup local instance. | ||
|
||
|
||
So in practice a typical query looks as follows: | ||
|
||
cl_program program ; | ||
|
||
// The program name is part of the signature and shall be unique | ||
const char * program_name = "... my unique program name ... " ; | ||
|
||
BinaryLookup bl(context, device, program_name); | ||
|
||
// Specify some additional information used to build a | ||
// unique signature for that cache entry | ||
|
||
bl.variantInt( vectorSize ); | ||
bl.variantInt( hasBorder ); | ||
... | ||
|
||
// Perform the query | ||
if ( bl.found() ) | ||
{ | ||
// Success! use the cl_program retrieved from the cache | ||
program = bl.getProgram(); | ||
} | ||
else | ||
{ | ||
// Failure! we need to build the program | ||
program = build_my_program(context,device,vectorSize,...) ; | ||
// and inform the lookup object of the program | ||
bl.setProgram(program); | ||
// and finally populate the cache | ||
bl.populateCache(); | ||
} | ||
|
||
// The BinaryLookup shall now be destroyed |
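The query above needs a live OpenCL context and device, so it cannot be run on its own. The following sketch reproduces the same control flow with a hypothetical in-memory stand-in, `MockLookup`, in which a "program" is a plain string and the signature is the program name followed by the variant values. Only the method names mirror the text; the class, the global map, and the `getOrBuild` helper are assumptions made for illustration.

```cpp
#include <map>
#include <string>

// In-memory stand-in for the on-disk cache: signature -> "binary".
static std::map<std::string, std::string> g_cache;

// Hypothetical mock, NOT the real BinaryLookup class.
class MockLookup
{
public:
    explicit MockLookup(const std::string & name) : signature_(name) {}

    // Each variant is appended to the signature, so different variant
    // values lead to different cache entries.
    void variantInt(int value) { signature_ += ":" + std::to_string(value); }

    // Searches the cache and remembers the program on a hit.
    bool found()
    {
        std::map<std::string, std::string>::const_iterator it = g_cache.find(signature_);
        if (it == g_cache.end()) return false;
        program_ = it->second;
        return true;
    }

    const std::string & getProgram() const { return program_; }
    void setProgram(const std::string & program) { program_ = program; }
    void populateCache() { g_cache[signature_] = program_; }

private:
    std::string signature_;
    std::string program_;
};

static int g_builds = 0;   // counts how often the expensive build ran

// Query-or-build helper following steps (1) to (5) of the text.
std::string getOrBuild(const std::string & name, int vectorSize)
{
    MockLookup bl(name);            // (1) local instance
    bl.variantInt(vectorSize);      // (2) additional characteristics
    if (!bl.found())                // (3) search
    {
        ++g_builds;                 // (4b) build, register, store
        bl.setProgram("binary of " + name);
        bl.populateCache();
    }
    return bl.getProgram();         // (4a) or the freshly built program
}                                   // (5) bl is destroyed here
```

Calling `getOrBuild("sgemm", 4)` twice triggers a single build, while a call with a different `vectorSize` produces a new signature and therefore a second build.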
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
S. Chauveau | ||
CAPS Entreprise | ||
April 30, 2014 | ||
|
||
The Functor concept was introduced in clBLAS to simplify the creation | ||
of specialized versions for dedicated architectures. | ||
|
||
The original system, referred to as the 'Solver' system in this document, | ||
is very centralized and not flexible enough to accommodate customized kernels. | ||
|
||
The Functor | ||
=========== | ||
|
||
A functor is simply a C++ object that provides an implementation of | ||
a function. In the current case, that function is one of the BLAS calls | ||
implemented in OpenCL. | ||
|
||
The base class of all functors is clblasFunctor | ||
- see src/library/blas/functor/include/functor.h | ||
- see src/library/blas/functor/functor.cc | ||
|
||
That class does not provide much by itself but it is supposed to be derived | ||
once for each BLAS function to be implemented. | ||
|
||
For instance the clblasSgemmFunctor class will be the base class of all | ||
functors providing a generic or specific implementation of SGEMM. | ||
|
||
A generic functor is one that is applicable to all possible arguments of the | ||
function it implements. In most cases, there will be at least one generic | ||
functor that will simply call the existing Solver-based implementation of the | ||
function. For SGEMM, that is the class clblasSgemmFunctorFallback. | ||
|
||
A specific functor is one that is applicable to only a subset of the possible | ||
arguments of the function it implements. For instance, an SGEMM functor could | ||
implement it only for matrices of a given block size, only for square | ||
matrices, or only for a specific device architecture (e.g. AMD Hawaii). | ||
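The distinction between generic and specific functors can be sketched with a small class hierarchy. Everything in this block is hypothetical: the class names, the `SgemmArgs` struct, and the `accepts` method are illustration only and do not reproduce the declarations in functor.h.

```cpp
#include <string>

// Hypothetical argument bundle; the real SGEMM arguments are far richer.
struct SgemmArgs { int M, N, K; };

// Stand-in for the common base class of all functors.
class Functor
{
public:
    virtual ~Functor() {}
};

// One derived base class per BLAS function.
class SgemmFunctor : public Functor
{
public:
    virtual bool accepts(const SgemmArgs & args) const = 0;
    virtual std::string name() const = 0;
};

// Generic functor: applicable to every argument combination.
class SgemmFallback : public SgemmFunctor
{
public:
    bool accepts(const SgemmArgs &) const override { return true; }
    std::string name() const override { return "fallback"; }
};

// Specific functor: only handles square problems.
class SgemmSquareOnly : public SgemmFunctor
{
public:
    bool accepts(const SgemmArgs & a) const override { return a.M == a.N && a.N == a.K; }
    std::string name() const override { return "square-only"; }
};

// Prefer the specific functor when it applies, else use the generic one.
const SgemmFunctor & choose(const SgemmArgs & args)
{
    static SgemmSquareOnly specific;
    static SgemmFallback   generic;
    if (specific.accepts(args)) return specific;
    return generic;
}
```

Because the generic functor accepts any arguments, `choose` always has something to return, which is the role the fallback functor plays in the text.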
|
||
The Functor Selector | ||
==================== | ||
|
||
Multiple generic and specific functors may be available to implement each | ||
clBLAS call. The selection of the proper functor is delegated to the class | ||
clblasFunctorSelector whose default implementation typically returns the | ||
fallback functors. | ||
|
||
- see src/library/blas/functor/include/functor_selector.h | ||
- see src/library/blas/functor/functor_selector.cc | ||
|
||
So clblasFunctorSelector provides a large set of virtual selection methods. | ||
Typically, a method to select a specific functor will be provided for each | ||
supported BLAS function. Another method may be provided to select a generic | ||
functor but that is not mandatory. | ||
|
||
In the default implementation of clblasFunctorSelector, the specific selection | ||
method typically redirects to the generic one, which returns the fallback | ||
functor (and so uses the existing Solver-based implementation). | ||
|
||
|
||
The class clblasFunctorSelector is meant to be derived once for each | ||
supported architecture (e.g. Hawaii, Tahiti, ...) and a single global instance | ||
of each of those derived classes shall be created. This is important because | ||
those instances register themselves in a global data structure that is later | ||
used to find the proper clblasFunctorSelector according to the architecture | ||
(see clblasFunctorSelector::find()). | ||
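The self-registration idea can be sketched as follows. The `Selector` and `HawaiiSelector` classes, the string architecture key, and the `sgemmChoice` method are hypothetical simplifications; the real selector identifies the architecture differently and exposes many more selection methods.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a functor selector; one subclass per architecture.
class Selector
{
public:
    virtual ~Selector() {}

    // Default behaviour: fall back to the generic implementation.
    virtual std::string sgemmChoice() const { return "fallback"; }

    // Returns the selector registered for 'arch', or the default one.
    static const Selector & find(const std::string & arch)
    {
        std::map<std::string, const Selector *>::const_iterator it = registry().find(arch);
        return (it != registry().end()) ? *it->second : defaultSelector();
    }

protected:
    Selector() {}

    // Derived selectors pass their architecture name to register themselves.
    explicit Selector(const std::string & arch) { registry()[arch] = this; }

private:
    // Function-local static: constructed on first use, which avoids relying
    // on the initialization order of global objects.
    static std::map<std::string, const Selector *> & registry()
    {
        static std::map<std::string, const Selector *> r;
        return r;
    }

    static const Selector & defaultSelector()
    {
        static Selector s;
        return s;
    }
};

class HawaiiSelector : public Selector
{
public:
    HawaiiSelector() : Selector("Hawaii") {}
    std::string sgemmChoice() const override { return "hawaii-specific"; }
};

// The single global instance; its constructor performs the registration.
static HawaiiSelector g_hawaiiSelector;
```

An architecture without a dedicated selector, such as "Tahiti" in this sketch, simply receives the default selector and therefore the fallback choice.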
|
||
|
||
Functor Management & Cache | ||
========================== | ||
|
||
Each functor contains a reference counter that, when it reaches zero, causes | ||
the functor to be destroyed. See the members clblasFunctor::retain() and | ||
clblasFunctor::release(). | ||
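A minimal sketch of such a reference counter is shown below. The `RefCounted` and `CountingFunctor` classes are hypothetical, the initial count of one is an assumption, and the counter is not thread-safe; the block only illustrates the retain/release idea and not the actual clblasFunctor code.

```cpp
// Hypothetical ref-counted base class.
class RefCounted
{
public:
    RefCounted() : count_(1) {}   // assumption: the creator holds the first reference

    void retain() { ++count_; }

    // Dropping the last reference destroys the object.
    void release() { if (--count_ == 0) delete this; }

protected:
    virtual ~RefCounted() {}      // destruction only through release()

private:
    int count_;
};

static int g_alive = 0;           // number of live CountingFunctor objects

class CountingFunctor : public RefCounted
{
public:
    CountingFunctor() { ++g_alive; }

protected:
    ~CountingFunctor() override { --g_alive; }
};
```

Each `retain()` must be balanced by one `release()`, plus one final `release()` for the reference held by the creator.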
|
||
Of course, to be efficient, functors must be reusable between BLAS calls so | ||
some mechanisms must be implemented to manage the functors. | ||
|
||
Some functors, such as the fallback functors, are independent of the | ||
arguments and of the OpenCL context & device. Those can typically be | ||
implemented using a single global instance that will never be destroyed. | ||
|
||
Other functors, such as those that manage a cl_program internally, are | ||
dependent on the OpenCL context & device and sometimes on some arguments. | ||
They need to be stored in caches using some information as keys. | ||
|
||
In the current implementation, we propose that each functor class shall | ||
implement its own private cache. Such functors shall not be created directly | ||
using their constructor but via a dedicated 'provide' function (the name 'provide' | ||
is not mandatory) that will take care of managing the internal cache. | ||
|
||
The template class clblasFunctorCache<F> is provided as a simple | ||
implementation of a cache of functors of type F. Use of that cache is not a | ||
mandatory part of the functor design. Other strategies could be to keep a | ||
single instance of the functor and implement a cache for the cl_program or to | ||
implement a global cache shared by multiple functor classes. | ||
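The private-cache strategy with a 'provide' function can be sketched as follows. The `DeviceFunctor` class is hypothetical, plain integers stand in for the OpenCL context and device handles, and the sketch does not use clblasFunctorCache<F>, whose interface is not described here. Cached instances are never released in this simplified version.

```cpp
#include <map>
#include <utility>

// Hypothetical functor bound to a (context, device) pair.
class DeviceFunctor
{
    typedef std::pair<int, int> Key;   // cache key: (context, device)

    // Private constructor: instances are obtained through provide() only.
    DeviceFunctor() { ++counter(); }

    // Private cache owned by this functor class.
    static std::map<Key, DeviceFunctor *> & cache()
    {
        static std::map<Key, DeviceFunctor *> c;
        return c;
    }

    static int & counter() { static int n = 0; return n; }

public:
    // Looks in the cache first and constructs a functor only on a miss.
    static DeviceFunctor * provide(int context, int device)
    {
        const Key key(context, device);
        std::map<Key, DeviceFunctor *>::iterator it = cache().find(key);
        if (it != cache().end()) return it->second;      // cache hit: reuse

        DeviceFunctor * functor = new DeviceFunctor();   // cache miss: create
        cache()[key] = functor;
        return functor;
    }

    // Number of functors constructed so far.
    static int created() { return counter(); }
};
```

Two calls with the same context and device return the same instance, so any expensive state held by the functor (for example a built program) is set up only once per key.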
|
||
|
||
|
||
|
||
|
||
|