Loading...
Different results between GPU and CPU: what is the matter?
Fri, 10/08/2010 - 05:36
Hi all,
I have a simple cuda program which executed on the CPU and on the GPU, gives different results when the GPU is a Tesla card and gives the same results when the GPU is a Fermi card.
Does anybody know any explanation for the difference of the results?
I post also the code of my program:
/* File: testDefinitions.h */ #define MVMul(v1, m, v2) \ (v1).x = (m)[0] * (v2).x + (m)[3] * (v2).y + (m)[6] * (v2).z, \ (v1).y = (m)[1] * (v2).x + (m)[4] * (v2).y + (m)[7] * (v2).z, \ (v1).z = (m)[2] * (v2).x + (m)[5] * (v2).y + (m)[8] * (v2).z typedef struct { float x, y, z; } triple; typedef struct { float u[9]; } ninuple; /* File testc.c */ #include #include #include #include "testDefinitions.h" float uNewGpuHost[3]; void preparation_gpu_matrix_product(triple, ninuple); // main routine that executes on the host int main(int argc, char **argv) { triple uOld, uNewGpu, uNewHost; ninuple Ru; uOld.x = 2.39282537; uOld.y = 0.653135121; uOld.z = -0.224300578; Ru.u[0] = 1; Ru.u[1] = 0; Ru.u[2] = 0; Ru.u[3] = 0; Ru.u[4] = 0.999999523; Ru.u[5] = -0.000974848808; Ru.u[6] = 0; Ru.u[7] = 0.000974848808; Ru.u[8] = 0.999999523; MVMul(uNewHost, Ru.u, uOld); printf("\nValue of uNew calculated by host: %.20f %.20f %.20f\n", uNewHost.x, uNewHost.y, uNewHost.z); preparation_gpu_matrix_product(uOld, Ru); uNewGpu.x = uNewGpuHost[0]; uNewGpu.y = uNewGpuHost[1]; uNewGpu.z = uNewGpuHost[2]; printf("Value of uNew calculated by GPU: %.20f %.20f %.20f\n", uNewGpu.x, uNewGpu.y, uNewGpu.z); if(((uNewHost.x-uNewGpu.x)==0) && ((uNewHost.y-uNewGpu.y)==0) && ((uNewHost.z-uNewGpu.z)==0)) { printf("Test PASSED\n"); } else printf("Test FAILED\n"); return 0; } /* File test.cu */ #include #include __global__ void matrix_product(triple, ninuple, float*); extern float uNewGpuHost[3]; extern "C" { void preparation_gpu_matrix_product(triple uOld, ninuple Ru) { float* uNewDev; cudaMalloc((void **) &uNewDev, 3*sizeof(float)); // Allocate array on device cudaError_t error=cudaGetLastError(); if(error != cudaSuccess)printf("cudaMalloc: %s\n",cudaGetErrorString(error)); // Do calculation on device: matrix_product <<< 1, 16 >>> (uOld, Ru, uNewDev); cudaThreadSynchronize(); error=cudaGetLastError(); if(error != cudaSuccess)printf("Synchronize kernel: %s\n",cudaGetErrorString(error)); // Retrieve result from device and store it in host array cudaMemcpy(uNewGpuHost, uNewDev, sizeof(float)*3, cudaMemcpyDeviceToHost); error=cudaGetLastError(); if(error != cudaSuccess)printf("cuda copy device to host: %s\n",cudaGetErrorString(error)); cudaFree(uNewDev); } } __global__ void matrix_product(triple uOld, ninuple Ru, float* uNewDev) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if(idx==0) { triple uNew; MVMul(uNew, Ru.u, uOld); uNewDev[0]=uNew.x; uNewDev[1]=uNew.y; uNewDev[2]=uNew.z; } } The results of this simple program targeted to a Tesla card: x295 are: Value of uNew calculated by host: 2.39282536506652832031 0.65291619300842285156 -0.22493718564510345459 Value of uNew calculated by GPU: 2.39282536506652832031 0.65291619300842285156 -0.22493717074394226074 Test FAILED insted when run on a Fermi card the results are: Value of uNew calculated by host: 2.39282536506652832031 0.65291619300842285156 -0.22493718564510345459 Value of uNew calculated by GPU: 2.39282536506652832031 0.65291619300842285156 -0.22493718564510345459 Test PASSED Does anybody know why? Thank you in advance, Ardita
Fri, 10/08/2010 - 15:27
#2
Hi kpesler,
thank you very
Hi kpesler,
thank you very much for your answer. I find it very useful.
Tue, 10/12/2010 - 07:23
#3
Using double precision I have
Using double precision I have also a difference in the results of GPU and CPU.
For example if I have an instruction a*b + c*d + e*f where a, b, c, d, e, f are double variables, the CPU and GPU execute it in different ways so as to give different results many times.
How can I find a general way such as to have the same results frome both CPU and GPU?
Thank you in advance,
Ardita

BayWebSoft
In binary, the two numbers that differ between CPU and Tesla are given below:
-0.22493718564510345459 = 10111110011001100101010111101111
-0.22493717074394226074 = 10111110011001100101010111101110
As you can see, the two differ only by the last bit. Fermi's floating point unit has been updated to the latest IEEE standard, hence bit-for-bit agreement with the CPU result. The single bit error may have resulted from the use of the single-precision MAD (multiply-add) instruction. If this type of error will be problematic for your application, you may need to use double precision.