I'm working with matrices that range in size from 2,000x2,000 up to 5,000x5,000, doing operations such as multiplication and QR decomposition. I'm curious whether, for example, I should align the stride to a multiple of 64 for all matrices for best performance. Also, should I avoid strides that are a multiple of some page size because of cache associativity, or does that not apply to GPU memory?
What stride should I use for matrices in CUDA for the fastest possible speed?
82 views. Asked by meisel.
1 Answer
I imagine most people trust `cudaMallocPitch` or `cudaMalloc3D` to provide the proper alignment, as this is their stated purpose. While not explicitly clarified in the runtime documentation, they align to `cudaDeviceProp::textureAlignment` (512 bytes on current hardware). There are also NPP's allocator functions, which seem to have different alignment strategies (or at least did so in the past). See "How does CUDA's nppiMalloc... function guarantee alignment?" for some discussion on that.

The lack of a pitched allocator function for the stream-ordered memory allocator suggests that alignment may not be as relevant today. Or it might be an oversight in the API, who knows?
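To make the alignment above concrete, here is a host-side sketch of the arithmetic a pitched layout implies: the row width in bytes rounded up to a multiple of the alignment, plus the usual row/column addressing for a pitched buffer. Both `padded_pitch` and `element_at` are hypothetical helpers, not CUDA APIs, and the 512-byte value is just the texture alignment quoted above. `cudaMallocPitch` may pick a different pitch on a given device, so real code should use the pitch it returns rather than this estimate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API): round a row width in bytes up to the
// next multiple of `alignment`.
constexpr std::size_t padded_pitch(std::size_t row_bytes, std::size_t alignment) {
    return (row_bytes + alignment - 1) / alignment * alignment;
}

// Address of element (row, col) in a row-major pitched buffer of floats.
// The row offset is computed in bytes because the pitch is a byte count.
inline float* element_at(void* base, std::size_t pitch_bytes,
                         std::size_t row, std::size_t col) {
    assert(pitch_bytes % sizeof(float) == 0);  // rows must stay float-aligned
    char* row_start = static_cast<char*>(base) + row * pitch_bytes;
    return reinterpret_cast<float*>(row_start) + col;
}
```

For the matrix sizes in the question, a row of 2,000 floats (8,000 bytes) would pad to 8,192 bytes and a row of 5,000 floats (20,000 bytes) to 20,480 bytes under this assumed 512-byte alignment.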
What we do know from different parts of the programming guide is that `memcpy_async` requires 16-byte alignment for best performance. The best practices guide simply recommends 32-byte-aligned memory transactions.
I'm not aware of a list of cache parameters for each generation. Turing's L2 is 4 MiB, 16-way set associative with 64-byte lines, and the memory pages are 2 MiB. If I did the math right (4 MiB / 16 ways = 256 KiB per way, so addresses 256 KiB apart land in the same set), this means a stride of 256 KiB would be pathological. With these numbers I'd imagine you could start seeing effects with strides of 16 KiB or more, but I'm not aware of any official guidance on the subject.
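The arithmetic behind those numbers can be sketched on the host. The cache parameters are the ones quoted above; the plain "line index modulo number of sets" mapping is an assumption made for illustration, since real GPU L2 caches may hash addresses across slices. `set_index` and `distinct_sets` are hypothetical helper names.

```cpp
#include <cstddef>
#include <set>

// L2 parameters quoted above for Turing.
constexpr std::size_t kCacheBytes = 4u * 1024 * 1024;  // 4 MiB
constexpr std::size_t kWays       = 16;
constexpr std::size_t kLineBytes  = 64;

constexpr std::size_t kNumSets   = kCacheBytes / (kWays * kLineBytes);  // 4096 sets
constexpr std::size_t kSetStride = kNumSets * kLineBytes;               // 256 KiB

// ASSUMPTION: simple modulo indexing, with no address hashing.
constexpr std::size_t set_index(std::size_t byte_offset) {
    return (byte_offset / kLineBytes) % kNumSets;
}

// Number of distinct cache sets hit by the first element of each of `rows`
// consecutive rows, i.e. by walking down one column of a pitched matrix.
inline std::size_t distinct_sets(std::size_t pitch_bytes, std::size_t rows) {
    std::set<std::size_t> seen;
    for (std::size_t r = 0; r < rows; ++r) {
        seen.insert(set_index(r * pitch_bytes));
    }
    return seen.size();
}
```

Under this simplified model, walking a column of 5,000 rows with a 256 KiB pitch touches a single set, a 16 KiB pitch touches 16 sets, and a pitch of 20,032 bytes (an odd number of cache lines) spreads over all 4,096 sets. Whether that shows up in measured performance depends on the real hardware mapping.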
Personally, I stick with the pitched allocators. When I don't use them, I use the texture alignment, except for smaller line sizes, where I just use the next power of 2 so as not to waste so much memory, unless I plan to use texture binding.
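One possible reading of that rule of thumb, as a host-side sketch: `choose_pitch` and `next_pow2` are hypothetical names, and treating "smaller line sizes" as "rows shorter than the texture alignment" is an assumption about where the cutoff lies. The alignment would come from `cudaDeviceProp::textureAlignment` in real code; here it is just a parameter.

```cpp
#include <cstddef>

// Smallest power of two that is >= n (returns 1 for n <= 1).
constexpr std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Hypothetical pitch policy:
//  - rows shorter than the texture alignment: next power of two, to limit padding
//  - everything else: round up to a multiple of the texture alignment
// ASSUMPTION: the cutoff between the two cases is the alignment itself.
constexpr std::size_t choose_pitch(std::size_t row_bytes, std::size_t texture_alignment) {
    return row_bytes < texture_alignment
               ? next_pow2(row_bytes)
               : (row_bytes + texture_alignment - 1) / texture_alignment * texture_alignment;
}
```

With a 512-byte alignment, a 100-byte row gets a 128-byte pitch instead of 512, while an 8,000-byte row is padded to 8,192 bytes. If the buffer will be bound to a texture, the full texture alignment should be used regardless of row size.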