Fault Tolerance in GPGPU Essay

[paper report 5] studies three software approaches to GPGPU reliability, all based on redundant execution. The first approach, called R-Native, executes the kernel twice, so its performance overhead is around 100 percent. The other two approaches interleave the execution of the main kernel with redundant threads. The paper also examines whether ECC or parity bits in the memories are worth the overhead they impose. One drawback of R-Native is that a permanent hardware defect affects both executions in the same way, so the resulting error cannot be detected by comparing the two results. This could be avoided by reorganizing the input data for redundant ...

The results for six application benchmarks show that the benefit of the more complex approaches depends on the application and on the GPU architecture. Therefore, simply executing the kernel twice is sufficient in most cases.
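
The run-twice idea (R-Native) described above can be illustrated with a minimal sketch. It is plain Python rather than GPU code, and the names `kernel`, `run_redundant`, and the `fault` hook are hypothetical, introduced only for illustration. It assumes an error can be modeled as a single bit flip in one output element. It shows why a transient error is caught by comparing the two runs, while a permanent defect that corrupts both runs identically is not.

```python
def kernel(data, fault=None):
    """Stand-in for a GPU kernel: squares every element.

    `fault` is a hypothetical hook (index, bit) that flips one bit of one
    output element to model a hardware error.
    """
    out = [x * x for x in data]
    if fault is not None:
        index, bit = fault
        out[index] ^= 1 << bit
    return out


def run_redundant(data, fault_first=None, fault_second=None):
    """Run the kernel twice and return the indices where the results differ."""
    first = kernel(data, fault_first)
    second = kernel(data, fault_second)
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


data = list(range(8))

# Transient error: only the first run is affected, so the comparison flags it.
transient_mismatches = run_redundant(data, fault_first=(3, 0))

# Permanent defect: both runs are corrupted identically, so nothing is flagged.
permanent_mismatches = run_redundant(data, fault_first=(3, 0), fault_second=(3, 0))

print(transient_mismatches)  # [3]
print(permanent_mismatches)  # []
```

In this toy model the second case returns an empty mismatch list even though both outputs are wrong, which is the drawback the paper attributes to executing the same kernel twice on the same hardware.
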
In [paper – report5], the reliability properties of GPGPUs are studied using error injection. The study considers transient errors (single-event upsets, SEUs) in the ALU and LD/ST units. Errors are injected with a purpose-built error injector, and a heuristic method is used to identify reliability hot spots in the code; depending on the type of hot spot, a suitable error detector is inserted into the code.
The error injector injects errors at the assembly-code level by choosing a register at random, following a normal distribution, and injecting an error into it. One error is injected per execution, and the result is considered only if the error becomes active. The reliability hot spots are grouped into three categories: loop conditions, branches that depend on a thread or block index, and computational statements. Each category is equipped with an appropriate error detector. The paper considers only errors that cause silent data corruption (SDC), i.e. an incorrect result. In one experiment, 8 to 40 percent of the errors caused SDCs, which shows that SDCs account for a considerable share of the outcomes. The results show that 60 percent of SDC errors are covered by the presented scheme, at an overhead of between 35 and 95 percent.
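
The injection procedure described above (one error per execution, counted only if it becomes active) can be sketched as a small simulation. This is not the paper's injector. `RegisterFile`, `toy_kernel`, and `campaign` are hypothetical names, the "kernel" is a three-instruction toy program, the error is assumed to be a single bit flip, and the register is picked uniformly for simplicity, which differs from the distribution named in the text.

```python
import random


class RegisterFile:
    """Toy 32-bit register file that tracks whether an injected error is used."""

    def __init__(self, size):
        self.values = [0] * size
        self.faulty = None      # index of the register holding an injected error
        self.activated = False  # True once the corrupted value has been read

    def write(self, index, value):
        self.values[index] = value & 0xFFFFFFFF
        if index == self.faulty:
            self.faulty = None  # overwritten before use: the error is erased

    def read(self, index):
        if index == self.faulty:
            self.activated = True
        return self.values[index]

    def flip(self, index, bit):
        self.values[index] ^= 1 << bit
        self.faulty = index


def toy_kernel(a, b, injection=None):
    """Return (result, activated) for an optional (register, bit) injection."""
    regs = RegisterFile(4)
    regs.write(0, a)
    regs.write(1, b)
    regs.write(3, 0x1234)
    if injection is not None:
        regs.flip(*injection)                   # one error per execution
    regs.write(2, regs.read(0) + regs.read(1))  # r2 = r0 + r1
    low = regs.read(3) & 0xFF                   # only the low byte of r3 matters
    return regs.read(2) + low, regs.activated


def campaign(runs, seed=0):
    """Inject one random bit flip per run and classify each outcome."""
    rng = random.Random(seed)
    golden, _ = toy_kernel(5, 7)
    counts = {"not_activated": 0, "masked": 0, "sdc": 0}
    for _ in range(runs):
        injection = (rng.randrange(4), rng.randrange(32))  # uniform choice (assumption)
        result, activated = toy_kernel(5, 7, injection)
        if not activated:
            counts["not_activated"] += 1
        elif result == golden:
            counts["masked"] += 1
        else:
            counts["sdc"] += 1
    return counts


counts = campaign(1000)
active = counts["masked"] + counts["sdc"]
print(counts, "SDC share of activated errors:", counts["sdc"] / active)
```

In this toy program a flip in `r2` is never activated because the register is overwritten before it is read, a flip in the upper bits of `r3` is activated but masked, and the remaining flips produce an incorrect result, which corresponds to the SDC class the paper focuses on.
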

The vulnerability of GPU cores to soft errors is also studied in [paper report5]. The paper presents two low-overhead techniques for detecting soft errors and improving the reliability of streaming multiprocessors (SMs). During branch divergence and pipeline stalls, SMs are underutilized, and the paper suggests using these idle times to execute redundant threads, which improves reliability and increases error coverage. The authors call this approach RISE. RISE is...

