Is it possible to overlap batched FFTs with CUDA's cuFFT library and cufftPlanMany?


By : meizaps
Date : November 22 2020, 03:03 PM
If you use the Advanced Data Layout, the idist parameter lets you set an arbitrary offset between the starting points of two successive transform input sets, so consecutive batches can overlap when idist is smaller than the transform length.
For the 1D case, element x of batch b is read from the following location, based on the parameters you pass:
code :
input[ b * idist + x * istride]


How are batched DWR requests handled differently from non-batched on the server side?


By : DreamCatcher
Date : March 29 2020, 07:55 AM
In almost all cases, batching acts the same way as a series of synchronous unbatched calls. See the Caveats section of DWR Batching for potential pitfalls.

Batched FFTs using cufftPlanMany


By : jie zhou
Date : March 29 2020, 07:55 AM
Here is a full example of how to use cufftPlanMany to perform batched forward and inverse transforms in CUDA. The example transforms float data to cufftComplex and back. The result of the forward plus inverse transformation is correct up to a multiplicative constant equal to the number of elements in each transform, nRows*nCols.
code :
#include <stdio.h>
#include <stdlib.h>
#include <cufft.h>
#include <assert.h>

/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { getchar(); exit(code); }
    }
}

/*********************/
/* CUFFT ERROR CHECK */
/*********************/
static const char *_cudaGetErrorEnum(cufftResult error)
{
    switch (error)
    {
        case CUFFT_SUCCESS:
            return "CUFFT_SUCCESS";

        case CUFFT_INVALID_PLAN:
            return "CUFFT_INVALID_PLAN";

        case CUFFT_ALLOC_FAILED:
            return "CUFFT_ALLOC_FAILED";

        case CUFFT_INVALID_TYPE:
            return "CUFFT_INVALID_TYPE";

        case CUFFT_INVALID_VALUE:
            return "CUFFT_INVALID_VALUE";

        case CUFFT_INTERNAL_ERROR:
            return "CUFFT_INTERNAL_ERROR";

        case CUFFT_EXEC_FAILED:
            return "CUFFT_EXEC_FAILED";

        case CUFFT_SETUP_FAILED:
            return "CUFFT_SETUP_FAILED";

        case CUFFT_INVALID_SIZE:
            return "CUFFT_INVALID_SIZE";

        case CUFFT_UNALIGNED_DATA:
            return "CUFFT_UNALIGNED_DATA";
    }

    return "<unknown>";
}

#define cufftSafeCall(err)      __cufftSafeCall(err, __FILE__, __LINE__)
inline void __cufftSafeCall(cufftResult err, const char *file, const int line)
{
    if (CUFFT_SUCCESS != err) {
        // Report the caller's file and line (not this helper's) plus the cuFFT status name.
        fprintf(stderr, "CUFFT error in file '%s', line %d\nerror %d: %s\nterminating!\n",
                file, line, (int)err, _cudaGetErrorEnum(err));
        cudaDeviceReset(); assert(0);
    }
}

/********/
/* MAIN */
/********/
int main() {

    cufftHandle forward_plan, inverse_plan; 

    int batch = 3;
    int rank = 2;

    int nRows = 5;
    int nCols = 5;
    int n[2] = {nRows, nCols};

    int idist = nRows*nCols;
    int odist = nRows*(nCols/2+1);

    int inembed[] = {nRows, nCols};
    int onembed[] = {nRows, nCols/2+1};

    int istride = 1;
    int ostride = 1;

    cufftSafeCall(cufftPlanMany(&forward_plan,  rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, batch));

    float *h_in = (float*)malloc(sizeof(float)*nRows*nCols*batch);
    for(int i=0; i<nRows*nCols*batch; i++) h_in[i] = 1.f;

    float2* h_freq = (float2*)malloc(sizeof(float2)*nRows*(nCols/2+1)*batch);

    float*  d_in;   gpuErrchk(cudaMalloc(&d_in,   sizeof(float)*nRows*nCols*batch));
    float2* d_freq; gpuErrchk(cudaMalloc(&d_freq, sizeof(float2)*nRows*(nCols/2+1)*batch));

    gpuErrchk(cudaMemcpy(d_in,h_in,sizeof(float)*nRows*nCols*batch,cudaMemcpyHostToDevice));

    cufftSafeCall(cufftExecR2C(forward_plan, d_in, d_freq));

    gpuErrchk(cudaMemcpy(h_freq,d_freq,sizeof(float2)*nRows*(nCols/2+1)*batch,cudaMemcpyDeviceToHost));

    for(int i=0; i<nRows*(nCols/2+1)*batch; i++) printf("Direct transform: %i %f %f\n",i,h_freq[i].x,h_freq[i].y); 

    cufftSafeCall(cufftPlanMany(&inverse_plan, rank, n, onembed, ostride, odist, inembed, istride, idist, CUFFT_C2R, batch));

    cufftSafeCall(cufftExecC2R(inverse_plan, d_freq, d_in));

    gpuErrchk(cudaMemcpy(h_in,d_in,sizeof(float)*nRows*nCols*batch,cudaMemcpyDeviceToHost));

    for(int i=0; i<nRows*nCols*batch; i++) printf("Inverse transform: %i %f \n",i,h_in[i]); 

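    // Release the plans and the device and host buffers.
    cufftSafeCall(cufftDestroy(forward_plan));
    cufftSafeCall(cufftDestroy(inverse_plan));
    gpuErrchk(cudaFree(d_in));
    gpuErrchk(cudaFree(d_freq));
    free(h_in);
    free(h_freq);

    getchar();

}
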
1D batched FFTs of real arrays


By : Jumbaro
Date : March 29 2020, 07:55 AM
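As mentioned by Robert Crovella, and as reported in the cuFFT User Guide - CUDA 6.5, batched 1D transforms can be planned with cufftPlanMany:
code :
int rank = 1;                           // --- 1D FFTs
int n[] = { DATASIZE };                 // --- Size of the Fourier transform
int istride = 1, ostride = 1;           // --- Distance between two successive input/output elements
int idist = DATASIZE, odist = (DATASIZE / 2 + 1); // --- Distance between batches
int inembed[] = { 0 };                  // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 };                  // --- Output size with pitch (ignored for 1D transforms)
int batch = BATCH;                      // --- Number of batched executions
cufftPlanMany(&handle, rank, n, 
              inembed, istride, idist,
              onembed, ostride, odist, CUFFT_R2C, batch);
For this contiguous layout, the call above plans the same transforms as the simpler cufftPlan1d call, which appears commented out in the full example:
code :
cufftPlan1d(&handle, DATASIZE, CUFFT_R2C, BATCH);
Note that the R2C transform of DATASIZE real samples produces DATASIZE / 2 + 1 complex values per batch, which is why odist and the output buffers use that size. The full example:
code :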
#include <cuda.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free, exit
#include <math.h>

#define DATASIZE 8
#define BATCH 2

/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

/********/
/* MAIN */
/********/
int main ()
{
    // --- Host side input data allocation and initialization
    cufftReal *hostInputData = (cufftReal*)malloc(DATASIZE*BATCH*sizeof(cufftReal));
    for (int i=0; i<BATCH; i++)
        for (int j=0; j<DATASIZE; j++) hostInputData[i*DATASIZE + j] = (cufftReal)(i + 1);

    // --- Device side input data allocation and initialization
    cufftReal *deviceInputData; gpuErrchk(cudaMalloc((void**)&deviceInputData, DATASIZE * BATCH * sizeof(cufftReal)));
    gpuErrchk(cudaMemcpy(deviceInputData, hostInputData, DATASIZE * BATCH * sizeof(cufftReal), cudaMemcpyHostToDevice));

    // --- Host side output data allocation
    cufftComplex *hostOutputData = (cufftComplex*)malloc((DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex));

    // --- Device side output data allocation
    cufftComplex *deviceOutputData; gpuErrchk(cudaMalloc((void**)&deviceOutputData, (DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex)));

    // --- Batched 1D FFTs
    cufftHandle handle;
    int rank = 1;                           // --- 1D FFTs
    int n[] = { DATASIZE };                 // --- Size of the Fourier transform
    int istride = 1, ostride = 1;           // --- Distance between two successive input/output elements
    int idist = DATASIZE, odist = (DATASIZE / 2 + 1); // --- Distance between batches
    int inembed[] = { 0 };                  // --- Input size with pitch (ignored for 1D transforms)
    int onembed[] = { 0 };                  // --- Output size with pitch (ignored for 1D transforms)
    int batch = BATCH;                      // --- Number of batched executions
    cufftPlanMany(&handle, rank, n, 
                  inembed, istride, idist,
                  onembed, ostride, odist, CUFFT_R2C, batch);

    //cufftPlan1d(&handle, DATASIZE, CUFFT_R2C, BATCH);
    cufftExecR2C(handle,  deviceInputData, deviceOutputData);

    // --- Device->Host copy of the results
    gpuErrchk(cudaMemcpy(hostOutputData, deviceOutputData, (DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex), cudaMemcpyDeviceToHost));

    for (int i=0; i<BATCH; i++)
        for (int j=0; j<(DATASIZE / 2 + 1); j++)
            printf("%i %i %f %f\n", i, j, hostOutputData[i*(DATASIZE / 2 + 1) + j].x, hostOutputData[i*(DATASIZE / 2 + 1) + j].y);

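    // --- Release the plan and the device and host buffers
    cufftDestroy(handle);
    gpuErrchk(cudaFree(deviceOutputData));
    gpuErrchk(cudaFree(deviceInputData));
    free(hostOutputData);
    free(hostInputData);

    return 0;
}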
Using cufft when operating with the thrust library


By : ahmad irham
Date : March 29 2020, 07:55 AM
thrust::complex and std::complex share the same data layout as cuDoubleComplex. As a result, all that is required to make your example work is to cast the data in your device_vector to raw pointers and pass them to cuFFT. Thrust itself can't work with cuDoubleComplex in most operations because that type is a simple container that defines none of the operators Thrust expects for POD types.
This should work:
code :
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>   // thrust::multiplies
#include <thrust/sequence.h>
#include <thrust/complex.h>
#include <iostream>
#include <cstdlib>               // EXIT_SUCCESS
#include <cufft.h>

int main()
{
    int length = 5;
    thrust::device_vector<thrust::complex<double> > V1(length);
    thrust::device_vector<thrust::complex<double> > V2(length);
    thrust::device_vector<thrust::complex<double> > V3(length);
    thrust::sequence(V1.begin(), V1.end(), 1);
    thrust::sequence(V2.begin(), V2.end(), 2);
    thrust::transform(V1.begin(), V1.end(), V2.begin(), V3.begin(), 
                         thrust::multiplies<thrust::complex<double> >());
    cufftHandle plan;
    cufftPlan1d(&plan, length, CUFFT_Z2Z, 1);
    cuDoubleComplex* _V1 = (cuDoubleComplex*)thrust::raw_pointer_cast(V1.data());
    cuDoubleComplex* _V2 = (cuDoubleComplex*)thrust::raw_pointer_cast(V2.data());

    cufftExecZ2Z(plan, _V1, _V2, CUFFT_FORWARD);
    for (int i = 0; i < length; i++)
        std::cout << V1[i] << ' ' << V2[i] << ' ' << V3[i] << '\n';
    std::cout << '\n';
    cufftDestroy(plan);   // release the plan before exiting
    return  EXIT_SUCCESS;
}
Multi-GPU batched 1D FFTs: only a single GPU seems to work


By : user3570165
Date : March 29 2020, 07:55 AM
Thanks to @RobertCrovella for the answer:
As of CUDA 10.2.89, according to the documentation, strided input and output are not supported for multi-GPU transforms.