GPU-Accelerated Programming with Amplifier.NET

Deepak Battini
5 min read · May 18, 2019


Whether you are building sophisticated AI or a complex algorithm to solve a hard problem, the need for faster code execution is a constant challenge. For many years, GPUs have powered the display of images and motion on computer screens, but they are technically capable of much more. Graphics processors come into play when massive calculations are needed on a single task. GPGPU (general-purpose computing on graphics processing units) is currently one of the hot topics in the machine learning world, where it is used heavily to run matrix and vector computations.

A typical CPU has one or a handful of cores, whereas a GPU has thousands. Now think about the amount of parallelism you can achieve by spreading your code across those cores: the throughput of your algorithm simply multiplies. Writing GPU code is not quite as simple, though; most programming languages are polished for CPU execution, but with a little extra effort you can develop your own GPU-based programs. In this post, we will use Amplifier.NET to write a GPU program and execute it from C# .NET. Yes, C# and not C or C++, which makes a .NET developer's life much easier.

Amplifier.NET is a high-level wrapper that makes your life easier, with OpenCL under the hood.

“OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages (based on C99 and C++11) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.”

More about OpenCL: https://en.wikipedia.org/wiki/OpenCL

What is a kernel?

Code that gets executed on a device is called a kernel in OpenCL. Kernels are written in a C dialect, which is mostly straightforward C with a number of built-in functions and additional data types; for instance, 4-component vectors are a built-in type just like integers. For the first example, we will implement a simple SAXPY kernel (single-precision a·x plus y, i.e. y[i] = a * x[i] + y[i]).

Let’s Implement

Create a new .NET console project (.NET Core or .NET Framework) and name it “MyFirstGPGPU”. Now add Amplifier.NET via Manage NuGet Packages and install it from there. Link to the NuGet package: https://www.nuget.org/packages/Amplifier.NET/

Now it is time to create your kernel functions. Create a new class file, “SimpleKernels.cs”. The class inherits from OpenCLFunctions, which gives you the OpenCL-supported datatypes and functions needed to build your kernels in C#. Below are three kernels: AddData, Fill and SAXPY.

  • AddData: adds two float arrays (the kernel also scales the second array by 0.5 first). When you pass an array, Amplifier splits the work across the cores and executes it in parallel.
  • Fill: fills a float array with a constant value.
  • SAXPY: one of the simplest BLAS routines, y = a·x + y.
class SimpleKernels : OpenCLFunctions
{
    [OpenCLKernel]
    void AddData([Global] float[] a, [Global] float[] b, [Global] float[] r)
    {
        int i = get_global_id(0);
        b[i] = 0.5f * b[i];
        r[i] = a[i] + b[i];
    }

    [OpenCLKernel]
    void Fill([Global] float[] x, float value)
    {
        int i = get_global_id(0);

        x[i] = value;
    }

    [OpenCLKernel]
    void SAXPY([Global] float[] x, [Global] float[] y, float a)
    {
        int i = get_global_id(0);

        y[i] += a * x[i];
    }
}

Multiple kernels can be present in a single source file, called a program.

The [Global] attribute indicates that the parameter lives in global memory, i.e. memory allocated on the device. OpenCL supports different address spaces; in this sample, we will use only global memory. You probably wonder where the SAXPY loop went; this is one of the key design decisions behind OpenCL, so let's first understand how the code is executed.

Unlike C, OpenCL is designed for parallel applications. The basic assumption is that many instances of the kernel are executed in parallel, each processing a single work item. Multiple work items are executed together as part of a work-group. Inside a work-group, each kernel instance can communicate with the other instances. This is the only execution-ordering guarantee OpenCL gives you; there is no specified order in which work items inside a group are processed, and the work-group execution order is also undefined. In fact, you cannot tell whether the items are executed in parallel, sequentially, or in random order. This freedom, together with the minimal amount of data exchange and dependencies between work items, is what makes OpenCL so fast. Work-groups do allow for some ordering, as all items in a work-group can be synchronized. This comes in handy if you want to, for instance, load part of an image into the cache, process it, and write it back from the cache. For the example at hand, however, we will ignore work-groups completely.

In order to identify the kernel instance, the runtime environment provides an id. Inside the kernel, we use get_global_id, which returns the id of the current work item in the first dimension. We will start as many instances as there are elements in our vector, and each work item will process exactly one entry.
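To see where the loop went, compare with the same SAXPY written as an ordinary sequential C# method. The kernel above is essentially just the body of this loop; the loop itself disappears because the runtime launches one kernel instance per element and get_global_id(0) supplies the index. (This CPU version is only a sketch for comparison; it is not part of the Amplifier sample.)

//Sequential CPU version of SAXPY for comparison: y = a * x + y
//In the OpenCL kernel the loop vanishes; each work item handles exactly one i
static void SaxpyOnCpu(float[] x, float[] y, float a)
{
    for (int i = 0; i < y.Length; i++)
    {
        y[i] += a * x[i];
    }
}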

Now it's time to execute the kernels. Open Program.cs and add the following code step by step.

First, create an instance of the OpenCL compiler; it exposes the list of available devices that can execute your code.

//Create an instance of the OpenCL compiler
var compiler = new OpenCLCompiler();

//Get the available device list
Console.WriteLine("\nList Devices----");
foreach (var item in compiler.Devices)
{
    Console.WriteLine(item);
}
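If your machine exposes more than one OpenCL device (for example an integrated GPU next to a discrete one), it helps to print the index alongside each entry so you know which number to pass to UseDevice in the next step. This is just a small variation of the loop above, using only the Devices property already shown:

//Print each device with its index; this index is what UseDevice(n) expects
int deviceIndex = 0;
foreach (var device in compiler.Devices)
{
    Console.WriteLine($"[{deviceIndex++}] {device}");
}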

Let's select the first device (you can choose any of them) and then compile the kernel class created above.

//Select a default device
compiler.UseDevice(0);

//Compile the sample kernels
compiler.CompileKernel(typeof(SimpleKernels));

//See all the kernel methods
Console.WriteLine("\nList Kernels----");
foreach (var item in compiler.Kernels)
{
    Console.WriteLine(item);
}

Awesome! Once compiled, you will see all the kernel methods listed in the console. Now let's create some sample variables and execute the Fill and SAXPY methods. Each call copies the data to device memory, performs the kernel execution, and copies the result back to the CPU for display. All of this is taken care of by Amplifier; you don't have to write any extra code to manage the memory.

//Create variables x and y with sample data
Array x = new float[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
Array y = new float[9];

//Get the execution engine
var exec = compiler.GetExec<float>();

//Execute the Fill kernel to fill y with the constant value 0.5
exec.Fill(y, 0.5f);

//Execute the SAXPY kernel
exec.SAXPY(x, y, 2f);

//Print the result
Console.WriteLine("\nResult----");
for (int i = 0; i < y.Length; i++)
{
    Console.Write(y.GetValue(i) + " ");
}

When you run the program, Fill sets every element of y to 0.5 and SAXPY then adds 2 * x[i], so each element ends up as y[i] = 2 * x[i] + 0.5 and the printed result is: 2.5 4.5 6.5 8.5 10.5 12.5 14.5 16.5 18.5

If you understand how to break your code down into simple kernels, you can use the GPU to its full potential. Enjoy GPUing… 🙂
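As a final illustration, almost any element-wise operation can be turned into a kernel by following the same recipe used above: take the arrays as [Global] parameters, read the work-item index with get_global_id(0), and operate on a single element. For example, a hypothetical element-wise multiply kernel (not part of the original sample, just the same pattern applied again) would look like this:

[OpenCLKernel]
void Multiply([Global] float[] a, [Global] float[] b, [Global] float[] r)
{
    //Each work item multiplies exactly one pair of elements
    int i = get_global_id(0);
    r[i] = a[i] * b[i];
}

Add it to SimpleKernels, recompile with CompileKernel, and it should be callable through the same execution engine as exec.Multiply(a, b, r).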

Project with Examples: https://github.com/tech-quantum/Amplifier.NET

Originally published at https://www.tech-quantum.com on May 18, 2019.
