Calling CUDA functions from Ice

aabramov (Alexey Abramov), Member. Organization: Georg-August-University. Project: Using CUDA together with Ice
Dear All,
I'm working on integrating some of my shared libraries that use CUDA (for an Nvidia Tesla T10 card) into the Ice framework. However, I'm facing the following problem.
For example, a class that performs some computations on the GPU using CUDA is declared within a shared library. In the Ice module (within the Ice interface function) an object of this class is created only once and used for all incoming Ice calls. But it works fine only for the very first call (the first portion of data); during the second call it crashes. I've figured out that the CUDA calls in the shared library fail when they are invoked a second time from our Ice module through the created object. The error message is "cudaSafeCall() Runtime API error : invalid argument". Here is an example of a test class that doesn't work with Ice:
CCudaTest::CCudaTest(): m_pData(0){

  // allocate and initialize the host buffer
  m_pData = new unsigned char[320 * 256];
  bzero(m_pData, 320 * 256);

  for(int i = 0; i < 320 * 256; ++i)
    *(m_pData + i) = i;

  // allocate device memory and upload the host data
  SC( cudaMalloc((void**)&m_d_CudaData, 320 * 256) );
  SC( cudaMemcpy(m_d_CudaData, m_pData, 320 * 256, cudaMemcpyHostToDevice) );

}

CCudaTest::~CCudaTest(){

  delete[] m_pData;

  // release device memory
  SC( cudaFree(m_d_CudaData) );

}

void CCudaTest::PrintString(){

  std::cout << "CCudaTest: print string function !!!" << std::endl;

}

void CCudaTest::CudaTestFunction(){

  bzero(m_pData, 320 * 256);

  // copy the device data back to the host (checked like the other calls)
  SC( cudaMemcpy(m_pData, m_d_CudaData, 320 * 256, cudaMemcpyDeviceToHost) );

  for(int i = 0; i < 100; ++i)
    std::cout << "val = " << (int)*(m_pData + i) << " ";

}


Calling "CudaTestFunction" from the Ice interface function with the same object fails on the second call, and no proper values are printed. However, it works without problems if an object of this class is created on every Ice call. Furthermore, everything works fine if this function is called in a loop by the same object outside the Ice functions. The "PrintString" function works in all cases.

Does anybody here have experience using Ice with CUDA? I suspect that something weird happens here with GPU memory addresses or thread contexts.

Any suggestions and/or ideas are kindly welcome!
Thanks a lot for your help!

Best,
Alexey

Comments

  • xdm (Jose Gutierrez de la Concha), ZeroC Staff, La Coruña, Spain. Organization: ZeroC, Inc. Project: Ice
    I have no experience with CUDA, but it seems like a problem with using the GPU buffer from multiple threads.

    See:

    http://stackoverflow.com/questions/5616538/cudamemcpy-invalid-argument

    Have you changed Ice.ThreadPool.Server.Size? The default is one; in that case all requests are dispatched in the same thread. But there could still be problems depending on which thread calls the constructor and destructor of the CCudaTest object, since you are allocating/deallocating buffers there.

    I think you should have a dedicated WorkQueue to use the GPU, and your Ice objects can push jobs to the WorkQueue. There is a workqueue demo included in Ice; see cpp/demo/IceUtil/workqueue
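    The dedicated-thread idea above can be sketched without any Ice or CUDA dependency. This is a minimal illustration, not the Ice demo itself; the `WorkQueue`/`post` names are made up for this sketch, and the lambda jobs stand in for the real CUDA calls (cudaMemcpy, kernel launches), which would then all run in one thread and therefore one CUDA context:

    ```cpp
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    // All GPU access is funneled through one worker thread.
    class WorkQueue
    {
    public:
        WorkQueue() : _done(false), _worker([this]{ run(); }) {}

        ~WorkQueue()
        {
            {
                std::lock_guard<std::mutex> lock(_mutex);
                _done = true;
            }
            _cond.notify_one();
            _worker.join(); // drains remaining jobs before exiting
        }

        // Called from any Ice dispatch thread: enqueue a job for the
        // GPU thread instead of touching the device directly.
        void post(std::function<void()> job)
        {
            {
                std::lock_guard<std::mutex> lock(_mutex);
                _jobs.push(std::move(job));
            }
            _cond.notify_one();
        }

    private:
        void run() // body of the single worker thread
        {
            for(;;)
            {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lock(_mutex);
                    _cond.wait(lock, [this]{ return _done || !_jobs.empty(); });
                    if(_jobs.empty())
                        return; // only reached when _done is set
                    job = std::move(_jobs.front());
                    _jobs.pop();
                }
                job(); // every job executes in the same thread
            }
        }

        std::mutex _mutex;
        std::condition_variable _cond;
        std::queue<std::function<void()>> _jobs;
        bool _done;
        std::thread _worker;
    };

    int main()
    {
        int counter = 0;
        {
            WorkQueue q;
            for(int i = 0; i < 10; ++i)
                q.post([&counter]{ ++counter; }); // stand-in for a GPU job
        } // destructor joins the worker, so counter is safe to read now
        std::cout << "processed " << counter << " jobs" << std::endl;
        return 0;
    }
    ```

    In a real servant, `post` would be called from the Ice operation, and the job would capture the request data; the device buffers would live only in jobs run by the worker thread.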
  • aabramov (Alexey Abramov), Member
    xdm wrote: »
    I have no experience with CUDA, but it seems like a problem with using the GPU buffer from multiple threads.

    See:

    http://stackoverflow.com/questions/5616538/cudamemcpy-invalid-argument

    Have you changed Ice.ThreadPool.Server.Size? The default is one; in that case all requests are dispatched in the same thread. But there could still be problems depending on which thread calls the constructor and destructor of the CCudaTest object, since you are allocating/deallocating buffers there.

    I think you should have a dedicated WorkQueue to use the GPU, and your Ice objects can push jobs to the WorkQueue. There is a workqueue demo included in Ice; see cpp/demo/IceUtil/workqueue

    Thank you for your reply! Yes, I set Ice.ThreadPool.Server.Size to one, but it didn't help and the problem remains the same. I also switched to CUDA 4.1, which is more thread-safe. Furthermore, I'm now creating the CCudaTest object in the constructor of the Ice component (not in the Ice interface function), and it fails as well. However, the same code works fine on the same machine and the same card without Ice. What else could cause this problem? Thanks!
  • xdm (Jose Gutierrez de la Concha), ZeroC Staff
    Hi, Alexey

    Can you test whether something like this works for you:
    // Slice
    module Test
    {
        interface Cuda
        {
            void simple();
        };
    };
    
    //
    // C++
    //
    class CudaI : public Test::Cuda
    {
    public:
        virtual void simple(const Ice::Current&)
        {
            // a fresh CCudaTest per request: the device buffer is
            // allocated and freed on the dispatching thread
            CCudaTest c;
            c.CudaTestFunction();
        }
    };
    

    Here each request allocates/deallocates a different buffer, so there shouldn't be any problems. You probably don't want to do that in your application, but it is just a test to find out whether the problems come from using the GPU from different threads.

    If that works, I think you can isolate the GPU from the different threads by creating a WorkQueue and Work classes.

    So your Ice operation creates a Work item and pushes it to the WorkQueue to be processed; that way all the GPU interaction is done in the WorkQueue thread, and you can keep the GPU buffers alive in the WorkQueue.

    Then you can wait for the work to be completed on the Ice operation thread, or, for better performance, use AMD and avoid blocking the Ice thread.
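    The blocking variant above can be sketched with a promise/future pair. This is a hypothetical illustration with plain std::thread standing in for the WorkQueue's GPU thread; with AMD, the worker would invoke the AMD response callback instead of setting the promise, so the dispatch thread never blocks:

    ```cpp
    #include <future>
    #include <iostream>
    #include <thread>

    int main()
    {
        std::promise<int> result;
        std::future<int> f = result.get_future();

        // stands in for the WorkQueue's dedicated GPU thread
        std::thread worker([&result]{
            int sum = 0;
            for(int i = 0; i < 100; ++i) // stand-in for GPU work
                sum += i;
            result.set_value(sum); // signal completion with the result
        });

        // the "Ice operation" blocks here until the job completes
        std::cout << "result = " << f.get() << std::endl;
        worker.join();
        return 0;
    }
    ```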

    For AMD usage see Asynchronous Method Dispatch (AMD) in C++ - Ice 3.4 - ZeroC

    For a sample WorkQueue implementation, see demo/IceUtil/workqueue in the Ice distribution.