MapReduce is presented by Dean and Ghemawat as a programming model for parallelizing the computations of data-intensive applications across a cluster. It provides two primitives, map and reduce. Map computes a set of intermediate key/value pairs for each input record; reduce is then applied to all values sharing the same key, typically merging them into a smaller set of values.
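To make the two primitives concrete, the following is a minimal sketch of the classic word-count example, assuming a purely sequential in-memory runtime (the function names `map_fn` and `reduce_fn` are illustrative, not part of any framework's API):

```python
from collections import defaultdict

def map_fn(record):
    # Emit an intermediate (word, 1) pair for every word in the record.
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Merge all values that share the same key into a single count.
    return key, sum(values)

def map_reduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # "shuffle" step: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = map_reduce(["the cat", "the dog"])  # {'the': 2, 'cat': 1, 'dog': 1}
```

A real implementation runs many map and reduce tasks in parallel across machines; the grouping loop above stands in for the distributed shuffle phase.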
Several implementations of MapReduce have been developed for different hardware infrastructures. Phoenix, for example, is a shared-memory implementation of MapReduce. Apache Hadoop is another example, designed for running applications on large clusters built of commodity hardware.
MATE-CG is another MapReduce-like framework, designed for programming CPU+GPU clusters. It aims to accelerate MapReduce applications in heterogeneous parallel environments, especially CPU+GPU clusters, and supports three execution schemes: depending on its dataset, an application can run under the CPU-only, GPU-only, or CPU+GPU scheme to obtain the best performance. APIs allow the implementer to specify reduction functions for the CPU and the GPU; the GPU reduction is implemented as CUDA kernels. The user also defines application-specific partitioner and splitter functions: the former partitions the dataset among the computing nodes, while the latter splits data blocks into smaller chunks to be processed on the CPUs and GPUs. In the runtime system, the CPU executes partitioning, scheduling, and similar tasks, while the GPU is mainly responsible for accelerating the computation. Data is distributed among the nodes according to the user-defined partitioning parameter. Once data block distribution completes, each block is cut into two parts, and the CPU and the GPU start executing concurrently. The data parts are split into smaller chunks to enable load balancing. The CPU-GPU data distribution fraction is tuned automatically by an analytical model; the chunk sizes for the CPU and the GPU are likewise derived from an analytical model, but can also be supplied by the developer.
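The block-splitting step described above can be sketched as follows. This is a hypothetical illustration, not MATE-CG's actual code: it assumes a data block is a list of records, that `gpu_frac` is the fraction assigned to the GPU (in MATE-CG this fraction comes from the analytical model), and that each device's portion is then cut into fixed-size chunks for load balancing:

```python
def split_block(block, gpu_frac, cpu_chunk_size, gpu_chunk_size):
    """Cut a data block into a GPU part and a CPU part, then chunk each part."""
    cut = int(len(block) * gpu_frac)
    gpu_part, cpu_part = block[:cut], block[cut:]

    def chunks(part, size):
        # Split a part into consecutive chunks of at most `size` records.
        return [part[i:i + size] for i in range(0, len(part), size)]

    return chunks(cpu_part, cpu_chunk_size), chunks(gpu_part, gpu_chunk_size)

# 100 records, 80% to the GPU: 4 CPU chunks of 5 and 4 GPU chunks of 20.
cpu_chunks, gpu_chunks = split_block(list(range(100)), 0.8, 5, 20)
```

The two chunk lists would then be consumed concurrently by CPU worker threads and CUDA kernel launches; using smaller chunks on the slower device keeps both devices busy until the block is exhausted.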
In [Ravi et al.], two scheduling schemes are presented: one for a single CPU/GPU node and one for a cluster of CPU/GPU nodes. The first scheme outperforms the blind round-robin scheme....