OpenACC: View Generated CUDA Code

This is going to be a short post on how one can view the actual compiler-generated CUDA code when running OpenACC on NVIDIA hardware. It is a warmup to my upcoming post on where data goes with OpenACC's various data directives.

When compiling an OpenACC accelerated program, here's what a canonical compile command would look like for the PGI compiler

(Yes, this is from my first OpenACC tutorial)

With this set of options, the intermediate PTX or CUDA code is not visible to the user. However, if we add keepgpu,nollvm to the -ta=nvidia option, the compiler dumps those files for us to see. With only keepgpu, you would get just the PTX source code and binaries.

With this, ordinary OpenACC’d code like this:

Would generate an intermediate file that looks like this:

While this is legal CUDA code, it's quite cluttered. With a neat Perl one-liner, we can make it a bit cleaner and easier to read:

Code is now:

Much better.

That's it! Hope this (and my next OpenACC post TODO:insert link ) helps you guys.

PS: I wonder if a screencast is a better medium for my (b)log.

Performance Prediction of CUDA Kernels

This post is going to be more loud thinking, and less code - (‘cuz there isn’t any yet :) )

NVIDIA CUDA is supported on a wide range of hardware platforms they sell, right from their Tegra mobile SoCs, to notebook GPUs (that's what I have right now), desktop and workstation class GPUs, and the server-class Tesla series - the most recent (and most powerful) of which is the mighty Tesla P100.

Given such a wide range of hardware and their relative costs, there must be a way for a potential buyer or user to establish (even approximately) how much money spent would result in what improvement in performance of their CUDA or OpenACC accelerated code. A couple of ways come to mind on how to achieve this, listed below.

1. Create C++/SystemC models of all these available GPUs, and run your program on these to artificially judge their performance. Or,

2. Maybe they (NVIDIA) could provide the capability to 'test-drive' these GPUs, say in the cloud, for some trial period, so users could judge their potential return on hardware investment. I recall seeing a link for something like this somewhere in CUDA Zone - though only for their Tesla M40 series. My guess is this service is made available via a GPU-enabled AWS instance that actually has an M40. So this idea could work, but it is not scalable, because we'd need a whole room full of the entire product portfolio and then allow N users access to it. Bleh.

3. Hey, it's April 2016. No, it's not super relevant, other than the fact that it'll mark a few months of my machine learning experience and that the world has just been introduced to the DGX-1. More the latter :) The point I'm making is we could adopt a machine learning based approach! Let me summarize this below:

• Create a feature vector for each CUDA kernel - this could be a bunch of stats from nvprof, NVIDIA's handy universal profiler for all GPGPU code. Selecting the right features is the first step here. We'd then do all the usual tricks with it: normalize, scale, etc.

• Then, once we can represent any CUDA kernel in the world with a feature vector, we'd need the data. The labelled training data in this case would be feature-vector/wall-clock-time pairs. We could run the same kernels on a range of hardware and generate the full spread of training data. Given that nvprof already knows a whole lot about your kernel, this data collection would be best handled by it. Maybe, and I'm getting crazy here - we could even crowdsource that data! Many programs already do that type of thing for usage data, so NVIDIA could add it to nvprof ('Do you want to submit anonymous report stats to NVIDIA? Click Yes to help the machines rise.') That way, a ton of data would pour in (and keep pouring in) from customers, developers and datacenters all around the world. (Well, that's only if the option to upload said data is not at an annoying point - like Firefox meekly asking after it crashes, 'Do you want to report this..' angry shouts 'I don't care about you, damn Mozilla! I just lost all my browsing data!') I see the availability of data as the real bottleneck here for someone to create that. Once again, an example of the fact that advances in machine learning are not going to come from the people with the best ideas, algorithms or even the best hardware - but from whoever has the data. Information is power.

• Say you have this data. Then you could run your favorite regression algorithm to predict wall-clock time on hardware your kernel has never actually run on! Bloody brilliant! ..the added awesomeness comes from the fact that the crowdsourced data is like free fuel for this prediction engine!

But, I don't have that data, or the hardware to collect it. So this idea kind of hits a dead end, but I'm leaving it here on the blog for now. I wonder if I could publish it someplace…

OpenACC: Analyze, Express, Tweak! - Part 1

What's OpenACC?

OpenACC is a directive-based programming model designed to provide a simple yet powerful approach to accelerators without significant programming effort.

What that means is, you can pick up existing code written for an x86 CPU, add some compiler #pragmas, compile with an OpenACC capable compiler - and voila! You get accelerated binaries for a range of hardware accelerators - NVIDIA GPUs, AMD GPUs and even Intel multi-core CPUs. That's really the USP of OpenACC - a single copy of the source code will deliver performance portability across this range of hardware platforms. So, to be successful with OpenACC, all you need are strong concepts in parallel programming and some know-how about OpenACC syntax, and you're good to go! You don't really need to know too many lower-level hardware details with OpenACC, as opposed to, say, CUDA C. However, this is a double-edged sword - I will revisit this later in this post. Remember, OpenACC is about expressing parallelism - it's not GPU programming.

There are some really good tutorials on OpenACC itself available online:
1. Jeff Larkin’s post on the Parallel Forall blog
2. Jeff Larkin’s sessions from GTC 2013 - recordings on Youtube here : Part1 Part2

The recommended approach for parallelism anywhere is to:
1. Try to use existing parallel optimized libraries like cuBLAS, cuDNN, etc. if they exist for your application.
2. If those don't exist, try OpenACC on your code. That should get you about 80% of the maximum available performance. Of course, that is a very rough number and is subject to, you guessed it, your code and the GPU hardware you're running on.
3. Roll your own CUDA kernels. This is definitely the most involved of the three options, but it will allow you to squeeze every last drop of that good perf juice from your software and hardware.

OpenACC tutorials online often use Jacobi iteration/SAXPY examples to demonstrate OpenACC, but all that those examples teach us are syntax constructs. However, if you use OpenACC in the real world, you'll know it's all about how you analyze your source code, understand its scope for parallelism, and finally express that formally via OpenACC syntax. What this post is really about is the analysis of a simple program, which is hopefully a little less trivial than the Jacobi-type examples all over the net. Also, this is not one of those '100X in 2 hours' posts, because that does not always happen.

Setup

First off, some logistics about tool installation and setup.

If you have everything correctly set up, try pgcc --version as shown below

The target

Now, onto our target today - a subroutine that converts a hexadecimal string to base64. I picked this up from the Matasano cryptography challenges I'm attempting on the side, and decided it'd be a good example for this tutorial.

Here's a brief overview of the algorithm itself:
1. Take 3 bytes of input hex data at a time,
2. Do some bitwise concatenation (shift and OR) to get the indices of the 4 base64 characters that these 3 bytes are encoded into,
3. Look up the actual base64 characters using these indices.
..and here's a quick diagram to explain that:

Let's look at what we'll start with here:

Usually, you'd just throw some #pragma accs at the loops in the problem and let the compiler guide you. But the idea of this tutorial is to help develop some analysis skills, so we'll look through the program first.

Now, the function basically takes in a character array of a fixed size, and generates an output array, also of a known size (4 x input_size/3). The sizes are important to know, because the compiler needs to know how many bytes to transfer over the CPU<->GPU link. (Side note - if you don't specify those sizes clearly, the compiler will throw: Accelerator restriction: size of the GPU copy of output is unknown.) We need to copy the input array from the CPU to the GPU - or Host and Device respectively, in CUDA terminology. Sometimes, OpenACC documentation refers to the CPU as 'Self' and the GPU as 'Device'. And when it is done processing, we must copy the output array back to the CPU. Also, base64_LUT is a common array used by all threads, so that too will need to be on the GPU. So that's the basic data movement defined right there, and it's what you should aim to isolate first: "What's my input? What's my output?"

That for (i=0.. loop can be parallelized to operate on chunks of the input in parallel. But, hang on. The next thing I'd like to draw your attention to is data dependence between loop iterations. What? Where? Well, if you take a closer look at how we're updating the output array, you'll quickly realize that j++ implies that you rely on the previous value of j - i.e. the previous iteration. Why is that a problem? Well, for us to run the conversion in parallel, each thread must know its input index and output index without communicating with other threads. Because if it needed to communicate, that would defeat the purpose of parallelization - that's as good as sequential CPU code. So, that's the first thing that needs fixing. Don't worry, the compiler will warn you about this, but it helps to develop what I like to call dependence vision - the ability to "see" the data dependence. That'll help you with complex code bases where things are not so obvious. Moral of the story: try to code in a way that keeps the array indices independent of the previous loop iteration, and dependent on only the current iteration.

Going further, the decoded_octets variable is used as a scratch variable to hold 4 values that we eventually push to the output array. This means each iteration of the loop uses it for itself - something we need to tell the compiler. This is a private variable for each iteration, or each parallel thread.

Because we’re dealing with pointers to access data arrays, there is an additional complication - but I’ll get to that later.

Armed with this non-zero knowledge of not-so-hidden parallelism in the program, we will now use OpenACC directives to express these ideas of parallelism and data movement.

Let's look at the #pragmas required to express this parallelism. I've also added some crude instrumentation to measure the elapsed time for the function as a whole.

1. #pragma acc kernels This tells the compiler - "Hey, I think this section of code can be parallelized. Go try and do that for me." Remember, pragmas apply to the immediately following code block. So this one applies to the for (i=0.. loop. As you will soon learn, adding this pragma does not mean that parallel code will be generated. The compiler will try, and it might fail - so watch the compile output closely for such cases.

2. #pragma acc data present(input[0:size]), present(base64_LUT[64]), copyout(output[0:4*size/3]) Here, we're using the present clause to tell the compiler about data arrays that have already been copied into GPU memory beforehand. Specifically, I have done that just before the call to encode_block, using the copyin clause. The copyout clause, as the name suggests, directs the compiler to copy the array output[0:4*size/3] out from the GPU to the CPU at the end of the parallel execution.

3. #pragma acc loop private(decoded_octets, k) This one tells the compiler - "Look, the variables decoded_octets and k are private to each iteration of the loop, or each parallel thread. So create private copies of those variables, and don't assume they carry dependences between loop iterations."

With these changes in place, try giving it a whirl - run make. This is what you can expect:

That slew of messages about line 92 - that's our for (i=0.. loop. Let's look at what these messages mean:
1. Loop carried dependence due to exposed use of output[:?] prevents parallelization What do you mean, exposed? Enter: the restrict keyword. By default, the compiler will assume that the underlying data object that a pointer points to can be manipulated through other pointers, from other threads too. Super paranoid (as it should be!). So this is perceived as 'data dependence' and the whole story goes south. As the programmer, we must give the compiler the assurance that only the specified pointer variable (or expressions using it) will be used to access that underlying data. So, in our case -
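In our case that means qualifying the pointer parameters with restrict (a sketch; the real signature lives in the linked repo):

```c
/* restrict promises the compiler that input and output are the only
 * handles to their underlying arrays, so accesses through them can be
 * parallelized without aliasing worries. */
void encode_block(const unsigned char *restrict input,
                  char *restrict output, int size);
```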

A compile with this change will see most of the issues above resolved. But the compiler still thinks there is some lingering data dependence. Our analysis, however, shows it's all good and thread-safe. Let's reassure the compiler by adding the independent clause to the #pragma acc loop line.
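That is, assuming the same private variables as before, the loop directive becomes:

```c
#pragma acc loop independent private(decoded_octets, k)
```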

The compiler will now successfully generate a parallel kernel (CUDA-speak for a GPU function). Here's what that'll look like:

Finally! The line Generating Tesla Code simply implies that it will generate parallel code for NVIDIA hardware. It doesn't mean that my 760M GPU is a Tesla-class card =D. The part about 'gang' and 'vector(128)' is to do with the CUDA programming model.

Basically in CUDA, we have threads. And a collection of threads forms a thread-block. A collection of thread-blocks forms a grid. And you can express the number of threads, blocks and grids as 3 dimensional co-ordinates. Pretty handy for intuition when working with images and such.

Heres how that maps to OpenACC’s hardware agnostic hierarchy:

CUDA                          OpenACC
Set of blocks (blockIdx.x)    Gang
Set of blocks (blockIdx.y)    Worker
Set of threads                Vector

So, it has produced 1 gang of 128 threads (it didn't create an additional notion of workers here). That's a default value, so you can use pragmas to set it to a more realistic value for our problem size. Say, 32?

One should always tweak the vector() and gang() clauses for optimum device utilization. Value for cores? (like value for money..). Most modern GPUs can support thousands of threads, but generating extra empty threads will eat into performance, because they will be scheduled just the same as active threads and will consume slots that could have been used for some real, active work on the GPU.

Note the device_type(nvidia) clause, which means that this vector(32) will be applied only for NVIDIA devices. And with OpenACC 2.0, you can have different configurations of these for different devices - giving you control without sacrificing performance portability:
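A sketch of what that multi-device tuning can look like (in OpenACC 2.0 syntax, clauses before the first device_type are the defaults; the radeon device type name is illustrative):

```c
#pragma acc loop vector(16) \
                 device_type(nvidia) vector(32) \
                 device_type(radeon) vector(256)
```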

So, it's 32 for NVIDIA cards, 256 for AMD Radeon (LoL) and 16 by default if the device is neither.

Code

Complete code available on github at - matasano-cuda/tree/master/challenge1/openacc

Hope this wall of text has helped you better understand OpenACC and parallel programming in general. That's where Part 1 ends. Part 2 will cover profiling, tweaking and more best practices.

I’d like to thank @JeffLarkin for releasing all this awesome training content on the internet and for patiently guiding a newbie like myself through some of the trickier bits.

How to Remap Dead Keys in Ubuntu

My personal computer is an old Lenovo G560, whose keyboard recently started showing signs of wear, with the double-quote/single-quote key being the first to give out. As someone who likes to write code, losing the quote key is a real inconvenience. Sure, I could spend some cash on a new keyboard, but what good is your knowledge of software if it can't help work around hardware problems? :P

So I decided to implement the lost key by remapping an existing key to behave like it. I chose the semicolon/colon key. The idea is simple: I'd remap the semicolon/colon key to print single and double quotes with key modifiers (Right Alt and Shift):

Right Alt + Semicolon = Single Quote (')

..and, Shift + Right Alt + Semicolon = Double Quote (")

To achieve this, what you need to do is create a custom keyboard layout. In Ubuntu, all the keyboard layouts (for different languages and such) are located at /usr/share/X11/xkb/symbols/. Now, you can choose to create a new file altogether, or add a variant to an existing file. I chose the latter, so I added the following lines to the US keyboard layout located at /usr/share/X11/xkb/symbols/us
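The exact lines aren't reproduced above, so here is a reconstruction based on the description that follows (the variant body and description string may differ slightly from my original):

```
partial alphanumeric_keys
xkb_symbols "kmmankad" {
    include "us(basic)"
    name[Group1]= "English (US, quotes on semicolon key)";

    // plain, Shift, AltGr, Shift+AltGr
    key <AC10> { [ semicolon, colon, apostrophe, quotedbl ] };
};
```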

Here, I define a new keyboard layout named kmmankad, and its description is specified under name[Group1]. The include "us(basic)" line pulls in the basic US keyboard layout, since ours is really just one key away from the default US basic. That's what the next line is about.

The key name AC10 breaks down as: A (for alphanumeric); the second letter takes values from A to E (rows 1 to 5, bottom to top - row A is the spacebar row); and the number, 1-12, is the key position in the row, going left to right. So with that co-ordinate system, we locate the semicolon key as AC10. After the key is defined, the next four values correspond to what the key will produce by itself, with Shift, with AltGr (usually the right Alt key), and with Shift+AltGr, respectively. Easy peasy.

Now that we have this defined, we need to update /usr/share/X11/xkb/rules/evdev.xml to include this new layout.
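The change is to add a variant entry under the us layout's variantList - roughly like this (the description text here is illustrative):

```xml
<variant>
  <configItem>
    <name>kmmankad</name>
    <description>English (US, quotes on semicolon key)</description>
  </configItem>
</variant>
```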

Note that the name and description should match what you entered earlier in /usr/share/X11/xkb/symbols/us.

Now, restart your machine and you should have your new keyboard layout available under System Settings -> Keyboard -> Text Entry (see bottom right) -> +

Select, Test and Done. Hope this was helpful!

So You Want to CUDA?

This is a post about various available resources, and how you could go about becoming a real CUDA pro. This post isn’t about convincing you about why you should definitely learn CUDA - I’ll leave that to the voices in and around your head.

To start out, I would highly recommend going through the free MOOC from Udacity - Intro to Parallel Programming. This is a course that isn't too technical right off the bat, and yet its assignments are non-trivial and could be a bit challenging for some. But they really help you get some real-world exposure to parallel programming in general, apart from the CUDA-specific knowledge you gain in the process. The course really helps develop a 'think parallel' mindset - which I feel is as important (if not more so) as knowledge of the actual semantics of a specific programming language or platform. The best part? You can do this without any special hardware - it's all in the cloud!

Along with the udacity course, there are a couple of great texts I would urge you guys to get:

1. Sanders & Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming This first one is a good text for beginners because it presents a very approachable learning curve. It has lots of small code examples, something I personally like. It lives up to its title in that respect. Having digestible code examples allows you to tinker with different concepts till you get the hang of things, without the overhead of programming full assignments. The book's code is available for download on NVIDIA's site here, and serves as a handy reference later on as well. However, this book does not go too deep into the application side, and the 'bigger picture' of parallel programming. That's where the next book is better.

2. Kirk & Hwu. Programming Massively Parallel Processors This book definitely dives a bit deeper with regard to the technical aspects. Since it was created keeping in mind a typical graduate-level course on this subject, each chapter has exercises as well. Chapter 6 on performance considerations, and Chapter 7 on floating point math, are two I consider particularly important for a learner to understand early on. The chapters on computational thinking and OpenCL make this a complete text on parallel programming. In addition, the code for the case studies discussed has been made freely available online.

And as you get more hands-on with the programming aspects of it, you will be able to appreciate the wealth of info in the CUDA Best Practices Guide. I actually have a printed copy I refer to often.

Among must-read blogs, there is NVIDIA’s Parallel Forall blog that has some really well written articles on a wide variety of topics and applications in accelerated computing. Most of the CUDA related content posted here is best understood by someone who already has a higher-than-basic understanding of CUDA. Still, do subscribe.

I almost forgot to mention the hands-on labs offered by NVIDIA via qwiklabs. While these aren’t anywhere as fully featured as the resources mentioned above, these serve as good exercises nonetheless. These are also in the cloud, hosted on GPU enabled AWS instances.

Though there are lots of such free(-ish) learning resources out there, you really need access to some hardware in order to really sharpen your skills. But this does not mean you need to spend big bucks. Lots of older GPUs support CUDA, and if you’re part of an academic institution, you could also look at Nvidia’s hardware grant program. You can also run your CUDA code on your multicore CPU (coming-soon-a-link-to-a-tutorial-on-how-to-do-that)

And finally, you need to have a project that you really want to invest your sweat and skills into. Something to tie all of this together. It could be a cryptographic algorithm, a massively parallel numerical method, or perhaps something cool in the field of machine learning. Maybe you could build an encoder/decoder for an image format. Basically, you can CUDA-fy almost anything compute intensive around you. I'm not saying that everything is going to work well with CUDA - that's a topic for another blog post. But as someone starting out, one shouldn't be overly picky about that.

Oh, and there's always Stack Overflow, /r/CUDA and NVIDIA's developer forum if you get stuck somewhere - or even just want to discuss your ideas.

As with any new endeavor, you will fail and learn a lot. But the key as always is to persevere and accept experience that comes your way, whatever the form.

Replace Text in Incoming Email With Outlook VBA

After a quick and very helpful Markdown tutorial over at www.markdowntutorial.com, here's a post.

As someone who works for a large tech company, I need to write a lot of email where I refer to one or more bugs or commit IDs, and manually adding hyperlinks to emails soon lost its charm. In an age where almost everything we encounter in our day can be programmed, surely there had to be a better way to do this. I had done some tinkering earlier with Excel VBA, but Outlook was something I had never looked deeper into.

Some googling later, I chanced on this superuser answer, that seemed to document exactly what I wanted. Which was basically, I’d write an email with some text like this:

Please pull the fix for Bug#123456

and I wanted a script to turn this into a hyperlink to a shortlink created from the Bug’s ID# (123456), like so:

Please pull the fix for Bug#123456

So, I plonked the prescribed code in from the superuser answer to see if it would work, and it didn’t. While it did do the actual replace-text-with-a-hyperlink, it stripped the email of all formatting, and the hyperlink wasn’t the kind I was expecting either (I wanted the link text to remain intact). Some more time reading about Outlook VBA and the MailItem class revealed that this wasn’t the right way to tackle this when your Outlook Client uses the HTML Editor for composing and viewing email. So I decided to roll my own solution.

The code is available in this Github repo - OutlookEmailHyperlinker - and works with Outlook 2013. I'm not sure a walkthrough of its gory details would be that attractive, so I'll keep it to this brief description for now.

Pull requests are most welcome!

My New Blog

Now that I have this shiny new blog setup, here’s where I will be continuing (albeit after more than 2 years) writing about my explorations. My older blogspot blog will stay on though.

Here are some links on setting up Octopress with github pages that I found useful:

Oh, and I’m using the whitespace theme.