About the OpenCL backend¶
This is a brief note about the OpenCL backend.
OpenCL standard version¶
Currently the backend only requires and uses features provided by OpenCL 1.2, but OpenCL 2.0 support is planned.
Program caching¶
The OpenCL backend caches programs automatically. If a program cast already exists in the kernel manager, asking it to compile another version of cast does nothing. Set the force_override flag to force a compilation and override the cached program.
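The caching rule above can be sketched in a few lines of C++. This is only an illustration: ProgramCache, Program and compile are hypothetical names, not Etaler's actual kernel manager API, and the "compilation" is replaced by simply storing the source string.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a compiled OpenCL program.
struct Program {
    std::string source;
};

// Hypothetical cache keyed by program name (not Etaler's real class).
class ProgramCache {
public:
    // Compiles `source` under `name` unless a program with that name is
    // already cached. Returns true if a compilation actually happened.
    bool compile(const std::string& name, const std::string& source,
                 bool force_override = false) {
        assert(!name.empty());
        auto it = programs_.find(name);
        if (it != programs_.end() && !force_override)
            return false;                   // already cached: do nothing
        programs_[name] = Program{source};  // stands in for the real build
        return true;
    }

    const Program& get(const std::string& name) const {
        return programs_.at(name);
    }

private:
    std::map<std::string, Program> programs_;
};
```

With this sketch, a second compile call for cast leaves the first version in place, and only a call with force_override set to true replaces it.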
Local Memory¶
Because HTM is very memory bandwidth hungry and the input SDR is relatively small, the OpenCL backend tries to store data in the GPU’s local (on-chip) memory so more bandwidth can be used for fetching synapses.
However, this also poses limitations. Since the input Tensor is copied into local memory, the size of the input Tensor cannot exceed the size of local memory (48KB on NVIDIA cards and 64KB on AMD cards). If larger inputs are encountered, Etaler simply switches to a version that does not use local memory. Also, some GPUs (all of them mobile GPUs) don’t have such local memory and emulate it using global memory. On those GPUs, Etaler uses kernels without the local memory optimization.
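The decision described above might look roughly like the following. This is a sketch under assumptions: LocalMemInfo, KernelVariant and chooseVariant are made-up names, and the real backend may also reserve part of local memory for other data. In the OpenCL API the two inputs would come from clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_LOCAL_MEM_TYPE.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of the two device properties that matter here.
struct LocalMemInfo {
    std::size_t size_bytes;  // local memory available to a work group
    bool dedicated;          // false when local memory is emulated in global memory
};

enum class KernelVariant { LocalMemory, GlobalMemory };

// Picks the kernel variant for an input of `input_bytes` bytes.
KernelVariant chooseVariant(std::size_t input_bytes, const LocalMemInfo& info) {
    assert(input_bytes > 0);
    if (!info.dedicated)
        return KernelVariant::GlobalMemory;  // emulated local memory gives no speedup
    if (input_bytes > info.size_bytes)
        return KernelVariant::GlobalMemory;  // input does not fit on-chip
    return KernelVariant::LocalMemory;
}
```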
Global inhibition¶
Currently, finding the top-k values is done by a quick counting-based scanning method. It is performed by a single work group, which stores its counters in local memory. This is fast and accurate as long as the size of the input array is small (the max value is less than the size of the counter). Thus another algorithm might be needed for big arrays.
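A sequential C++ sketch of one counting-based scan is shown below. It is an interpretation of the description above, not the actual kernel: the function name topKThreshold is hypothetical, the counters are indexed by value (which is why every value must be smaller than the number of counters), and the real implementation runs in parallel inside one work group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the smallest value that still belongs to the top k.
// Every value must lie in [0, num_counters).
int topKThreshold(const std::vector<int>& values, std::size_t k,
                  std::size_t num_counters) {
    assert(k > 0 && k <= values.size());
    // Count how often each value occurs.
    std::vector<std::size_t> counters(num_counters, 0);
    for (int v : values) {
        assert(v >= 0 && static_cast<std::size_t>(v) < num_counters);
        ++counters[static_cast<std::size_t>(v)];
    }
    // Scan from the largest value down until at least k elements are covered.
    std::size_t seen = 0;
    for (std::size_t v = num_counters; v-- > 0;) {
        seen += counters[v];
        if (seen >= k)
            return static_cast<int>(v);
    }
    return 0;  // not reached when k <= values.size()
}
```

Elements whose value is at or above the returned threshold are the winners. With ties at the threshold, more than k elements can pass, so a caller that needs exactly k has to break ties itself.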
OpenCL kernel distribution¶
The OpenCL backend searches for the kernels it needs in the following paths: ./kernels/, ../kernels/, /usr/local/share/Etaler/kernels and /usr/share/Etaler/kernels/. If the file is found, it is read and cached in memory. If not, an exception is raised.
The kernel files are installed when installing the library.
If your kernels are stored at a different location, you can set the ETALER_KERNEL_PATH environment variable to make your path available.
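The lookup can be sketched with std::filesystem as follows. The function names are hypothetical, and two details are assumptions rather than documented behaviour: that ETALER_KERNEL_PATH holds a single directory, and that it is searched before the built-in paths. The in-memory caching step is left out to keep the sketch short.

```cpp
#include <cassert>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Builds the list of directories to search. `env_path` is the value of
// ETALER_KERNEL_PATH, or nullptr when it is unset.
std::vector<fs::path> kernelSearchPaths(const char* env_path) {
    std::vector<fs::path> paths;
    if (env_path != nullptr && *env_path != '\0')
        paths.emplace_back(env_path);  // assumption: user path is tried first
    for (const char* p : {"./kernels/", "../kernels/",
                          "/usr/local/share/Etaler/kernels",
                          "/usr/share/Etaler/kernels/"})
        paths.emplace_back(p);
    return paths;
}

// Returns the contents of the first matching file, or throws if none exists.
std::string readKernel(const std::string& file_name,
                       const std::vector<fs::path>& paths) {
    assert(!file_name.empty());
    for (const auto& dir : paths) {
        std::ifstream in(dir / file_name);
        if (!in)
            continue;  // not in this directory, try the next one
        std::ostringstream content;
        content << in.rdbuf();
        return content.str();
    }
    throw std::runtime_error("Cannot find kernel file: " + file_name);
}

// Convenience wrapper that reads the environment variable itself.
std::string loadKernel(const std::string& file_name) {
    return readKernel(file_name,
                      kernelSearchPaths(std::getenv("ETALER_KERNEL_PATH")));
}
```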
JIT compiling views¶
The OpenCL backend generates the OpenCL kernels that copy from and write to Tensor views at runtime. Thus copying from a view might be slow. If this turns out to be too big a problem, it will be changed.
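To give an idea of what generating a kernel at runtime means, here is a deliberately simplified sketch for a one-dimensional strided view. Everything in it is an assumption made for illustration (the View1D struct, the function name, the kernel name copy_view and the shape of the generated source). The kernels Etaler actually generates are not shown in this note and handle general views.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical 1-D view: `count` elements starting at `offset`, `stride` apart.
struct View1D {
    std::size_t offset;
    std::size_t stride;
    std::size_t count;
};

// Emits OpenCL C source that copies the viewed elements into a dense buffer.
// The view parameters are baked into the source as constants.
std::string generateCopyFromViewKernel(const View1D& view,
                                       const std::string& type) {
    assert(view.stride > 0);
    std::ostringstream src;
    src << "kernel void copy_view(global const " << type << "* src, "
        << "global " << type << "* dst)\n"
        << "{\n"
        << "    size_t i = get_global_id(0);\n"
        << "    if (i < " << view.count << ")\n"
        << "        dst[i] = src[" << view.offset << " + i * "
        << view.stride << "];\n"
        << "}\n";
    return src.str();
}
```

Because the source depends on the view, each new view shape needs its own build, which is where the runtime cost mentioned above comes from.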
NVIDIA’s OpenCL implementation¶
NVIDIA’s OpenCL implementation can crash without notifying the user (kernels can crash without aborting, error codes can be generated at the wrong places, etc…). Use POCL’s CUDA backend to verify that the kernel is running correctly.
Program name mangling¶
Since the OpenCL backend tracks programs using a key, name mangling (in Etaler’s case, appending the hash of the compiler arguments to the end of the key) is required to support multiple versions of the same program (with different -D defines, etc…).
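A minimal sketch of such a mangling scheme is given below. The function name mangleProgramKey and the use of std::hash are assumptions. The note only states that a hash of the compiler arguments is appended to the key, not which hash function is used or how it is formatted.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Appends a hash of the compiler arguments to the program name so that
// builds of the same program with different flags get distinct keys.
std::string mangleProgramKey(const std::string& name,
                             const std::string& compiler_args) {
    assert(!name.empty());
    const std::size_t args_hash = std::hash<std::string>{}(compiler_args);
    return name + std::to_string(args_hash);
}
```

Two builds of cast with different -D defines then map to two different keys, while repeating the same arguments reproduces the same key and hits the cache.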
RPi VC4CL Support¶
VC4CL is not supported for now. The limitation of VC4CL only supporting up to 12 PEs per work group is not taken into account in the OpenCL backend. (And VC4 uses global memory to emulate local memory, so it is going to be slow.)
Altera AOCL / Xilinx SDAccel support¶
FPGA-based OpenCL, although interesting, is not supported for now due to the lack of an API-callable compiler.