How many threads/work-items are used?


I am trying to understand the architecture of a GPU and estimate the latency of one arithmetic statement without compiling or running it.

I suppose the following code will only use one thread/work-item although I specify local size = 32. Is it correct?

/* A, B, C and D are assumed to be declared earlier (e.g. as float or double). */
int k = 0;
for (; k < 32000; k++) {
    A = C * (B + D);
}

If I run a programme that uses the double precision unit (DPU), and there is 1 DPU per SM on an NVIDIA Tesla GPU, what is the size of a warp? Is it still 32 threads (1 thread uses the DPU, plus 31 threads use SPs)?

One more question: according to this GPU architecture, there are no threads on a real GPU. Is a thread a virtual concept for programmers?


There is 1 answer

user703016 On

I am trying to understand the architecture of a GPU and estimate the latency of one arithmetic statement without compiling or running it.

I do not believe this is publicly specified anywhere, and it varies between vendors and models. Modern discrete GPUs by AMD and NVIDIA typically have pipelines of around 20 stages.

I suppose the following code will only use one thread/work-item although I specify local size = 32. Is it correct?

If you specify an NDRange of 32 work items, irrespective of the local size, you get 32 work items. You haven't shown how you launch your kernel, so your question here is unclear.
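To make the point about the NDRange concrete, here is a minimal host-side sketch in plain C that emulates a launch. It is not OpenCL API code: `kernel_body` and `launch_ndrange` are hypothetical names, the input values are made up, and the divisibility check reflects the OpenCL 1.x rule that the global size must be a multiple of the local size. The only thing it shows is that every work-item runs the whole loop, and that the local size merely groups the work-items.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel from the question:
   each work-item executes the complete 32000-iteration loop by itself. */
static float kernel_body(float b, float c, float d)
{
    float a = 0.0f;
    for (int k = 0; k < 32000; k++) {
        a = c * (b + d);
    }
    return a;
}

/* Sequential emulation of an NDRange launch (an assumption for
   illustration, not how a GPU schedules work).
   Returns the number of work-items that ran the kernel body. */
size_t launch_ndrange(size_t global_size, size_t local_size, float *out)
{
    /* OpenCL 1.x requires the global size to be a multiple of the local size. */
    assert(local_size > 0 && global_size % local_size == 0);

    size_t executed = 0;
    for (size_t group = 0; group < global_size / local_size; group++) {
        for (size_t lid = 0; lid < local_size; lid++) {
            size_t gid = group * local_size + lid;      /* global id */
            out[gid] = kernel_body(1.0f, 2.0f, 3.0f);   /* made-up inputs */
            executed++;
        }
    }
    return executed;
}
```

Calling `launch_ndrange(32, 32, out)` and `launch_ndrange(32, 8, out)` both run the kernel body 32 times; only the grouping into work-groups differs.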

If I run a programme that uses the double precision unit (DPU), and there is 1 DPU per SM on an NVIDIA Tesla GPU, what is the size of a warp?

The size of a warp does not depend on the type of instruction being executed. A warp is a hardware concept, akin to a SIMD vector whose lanes are the individual threads. You cannot change its size. On NVIDIA hardware, it is always 32.

This has nothing to do with SPUs and DPUs. The number of SPUs and DPUs constrains how many single precision and double precision instructions can be issued/retired every cycle. The exact constraints vary between hardware generations, and it is not always possible to issue both types of instruction in the same cycle.

Assuming a fictitious SM with 32 SPUs and 1 DPU, this means you can issue up to 32 single precision instructions and 1 double precision instruction every cycle.

If all 32 of your threads need to execute a single precision instruction, it will be issued in a single cycle. If they all need to execute a double precision instruction, it will be issued over 32 cycles. And if we assume the SM can do both in parallel, then it can also issue 1 double precision instruction and 31 single precision instructions in a single cycle.
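The arithmetic for the fictitious SM can be written down as a small C sketch. This is a deliberately simplified issue-throughput model, not a description of any real scheduler: `issue_cycles` and `mixed_issue_cycles` are hypothetical helper names, pipeline latency is ignored, and the mixed case assumes both unit types can be fed in the same cycle.

```c
#include <assert.h>

/* Cycles needed to issue one instruction for `lanes` threads when only
   `units` execution units of the required type exist (ceiling division). */
unsigned issue_cycles(unsigned lanes, unsigned units)
{
    assert(units > 0);
    return (lanes + units - 1) / units;
}

/* Mixed case, ASSUMING single and double precision instructions can be
   issued in parallel: the slower of the two groups sets the cycle count. */
unsigned mixed_issue_cycles(unsigned sp_lanes, unsigned dp_lanes,
                            unsigned spus, unsigned dpus)
{
    unsigned sp = issue_cycles(sp_lanes, spus);
    unsigned dp = issue_cycles(dp_lanes, dpus);
    return sp > dp ? sp : dp;
}
```

With 32 SPUs and 1 DPU, `issue_cycles(32, 32)` is 1, `issue_cycles(32, 1)` is 32, and `mixed_issue_cycles(31, 1, 32, 1)` is 1, matching the three cases described above.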

Is a thread a virtual concept for programmers?

Yes. In CUDA parlance, the term "thread" is completely unrelated to its usual meaning; it is akin to a "SIMD lane". Note, however, that OpenCL does not use the term thread; it uses work-item. The underlying execution mechanism is unspecified and need not map to any hardware concept.
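As a rough illustration of the "thread as SIMD lane" view on NVIDIA hardware, the following C sketch maps a linear thread index to a warp and a lane within it. The names `lane_id`, `locate_thread` and `warps_needed` are hypothetical, and the mapping assumes threads are packed into warps by consecutive linear index with a warp size of 32. OpenCL itself promises no such mapping.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };  /* fixed on NVIDIA hardware, as stated above */

struct lane_id {
    unsigned warp;  /* which warp of the block the thread belongs to */
    unsigned lane;  /* position of the thread inside that warp (0..31) */
};

/* ASSUMPTION: threads are grouped into warps by consecutive linear index. */
struct lane_id locate_thread(unsigned linear_tid)
{
    struct lane_id id = { linear_tid / WARP_SIZE, linear_tid % WARP_SIZE };
    return id;
}

/* Warps occupied by a block; a partially filled warp still counts as one. */
unsigned warps_needed(unsigned threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}
```

Under these assumptions, thread 33 of a block is lane 1 of warp 1, and a block of 33 threads occupies two warps, the second of which has 31 unused lanes.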