We've all heard it before - "premature optimization is the root of all evil" - yet many embedded systems have reliability, cost, and performance requirements that demand a development style where we optimize early and often. Performance needs to be designed into a system, from architecture to algorithms to data structures to coding guidelines.
In this webinar, we reviewed ten top C coding and other embedded development techniques to maximize the performance of an embedded system--making the executable faster and smaller.
Slide 1: Top Ten Development Tips for High Performance Embedded Systems
Good afternoon and thank you for attending Barr Group’s webinar on “Top Ten Development Tips for High Performance Embedded Systems.” My name is Jennifer, and I will be moderating today’s webinar. Our presenters today will be Michael Barr, Chief Technical Officer at Barr Group, and Nigel Jones, Chief Engineer at Barr Group.
Today’s presentation will be about 45 minutes long, after which there will be a moderated Question and Answer session. To send a question during the event, please type your question into the rectangular space near the bottom right of your screen then click the send button. Please hold your technical questions until the end when the presenters will be addressing them.
At the start of the webinar, check to be sure that your speakers are on and your volume is turned up. Please make sure that all background programs are turned off so they do not affect your audio feed.
I am pleased to present Michael Barr and Nigel Jones’ webinar on “Top Ten Development Tips for High Performance Embedded Systems.”
Slide 2: About Barr Group
Thank you Jennifer. Welcome everyone to today’s webinar. My name is Michael Barr, Chief Technical Officer and Co-Founder of Barr Group. I just want to briefly introduce you to Barr Group, and then I’ll turn it over to the main speaker for today’s webinar in just a minute.
Barr Group’s mission is to help engineers build safer, more reliable, more secure embedded systems. To do this we principally provide four types of services. We consult with engineering organizations on the architecture or re-architecture of their designs and also on process improvement.
We also take on development projects – electronics, software, or both. Additionally we provide training, such as these webinars as well as public and private courses, which I’ll talk about in a minute. And we also work with attorneys when things go bump in the night and products become dangerous or insecure; in that role we serve as testifying experts or expert witnesses.
Slide 3: Upcoming Public Training
As I mentioned we do a lot of training. Some of this training is at private companies. Could be your company, in which case we’ll see you at your location.
We also do training in public locations. We have announced already our upcoming training dates. You can find the full details in the firmware training area of our website.
Slide 4: Nigel Jones, Chief Engineer
I’m now going to turn things over for today’s webinar to our instructor, Nigel Jones. He’s the Chief Engineer here and has more than 30 years of experience designing embedded systems, including a number of systems that could kill you if you use them. He has a wide range of experience with different programming languages, processors, and operating systems.
And he engages in exactly the kinds of work that I talked about that Barr Group does: engineering consulting, expert witnessing and obviously training here today. So, Nigel?
Slide 5: High Performance Embedded Systems
Thank you Michael for the introduction. Today I will be talking about how to build high performance embedded systems.
Slide 6: Introduction
The speed at which embedded systems perform their tasks is usually of critical interest to engineers. Sometimes the speed is an inviolate requirement as in hard real time systems, whereas in other cases it is about minimizing energy consumption or allowing the use of cheaper hardware or simply improving the user experience. I also suspect that many of you find tremendous personal satisfaction in building high performance systems.
In this webinar we will be touching on a smorgasbord of techniques that will allow you to maximize the performance of your embedded system. Given the vast scope of the material to be covered, we will not be going into detail on any given topic, but will instead provide links that allow you to explore the topics of interest to you in greater detail. At the end of this webinar my hope is to have given you enough food for thought that you’ll be inspired to go and test some of the techniques that have been espoused.
Slide 7: High Performance Principles
When building a high performance system I like to keep in mind the hierarchy shown here. The points at the top of the list have a much bigger impact on system performance than those at the bottom. However, those at the bottom of the list tend to have wider applicability. I also find that the techniques at the top of the list are used early in a project, whereas those at the bottom are used later, so where you are in your development cycle will dictate which of these techniques you can use.
Slide 8: Hardware Configuration
I hope it isn’t exactly news to you that firmware runs on real hardware that has real limitations. Consequently it’s essential that your hardware be configured for optimal performance. As an absolute minimum this means that you have the processor’s clock programmed correctly – which given the complexity of setting up oscillators on today’s processors is no easy task. One tip for checking that the processor clock is set correctly is to make use of the clock output feature that many processors support. This allows you to route the processor clock to a port pin, such that it can be monitored with an oscilloscope.
For high-speed processors it’s also essential that you have the correct number of wait states set for the various memory spaces. With most processors, as you increase the clock speed, it also becomes necessary to increase the number of wait states, particularly with flash memory. Consequently, increasing clock speed does not always lead to an increase in performance. Hence it’s essential that you understand the tradeoff between clock speed and actual throughput.
For processors that support them, do you have the instruction and data caches turned on? If you do, do you know how to tell the compiler to use cached versus non-cached memory as appropriate for the problem you are trying to solve? Failure to set these up correctly will have a huge impact on your system performance and so checking they are set up correctly is an essential first step.
Slide 9: Direct Memory Access
One hardware feature that I rarely see used is direct memory access or DMA. DMA is great whenever you have to transfer large amounts of data – particularly at high frequency. DMA allows you to do memory to memory transfer, peripheral to memory and so on. In addition a lot of DMA controllers are now offering nifty features such as on the fly CRC calculation. Now it’s normally quite hard to set up DMA controllers, which explains why I see it used so infrequently. Note that a large amount of the performance gain comes from minimizing the interrupt overhead, and so to really get the benefits of DMA make sure that you are only interrupting when you really need to.
Slide 10: Other Specialized HW
As a general rule, if you’ve paid the money for a processor with a hardware feature, then it’s a mistake to not take advantage of that feature. For example most processors will have FIFOs on communication interfaces. Turning these on, particularly for high-speed interfaces can have a dramatic impact on performance.
If you have a DSP-like multiply-accumulate unit, then you can typically do certain types of calculations much faster with it than with ‘standard C code’.
If you have a processor with a floating-point unit, then it will behoove you to understand what it takes to use it. In particular, make sure the compiler knows about it, since it’s a bit embarrassing to have the compiler link in a floating-point library when your processor can do it in HW!
If you think I’m joking, then trust me – I’ve seen it done. It’s also important to understand how the FPU handles single vs. double precision arithmetic. In some cases it’s faster to use double precision exclusively for all storage and calculations, while in other cases there’s an advantage to doing computations in single precision. So if the extra precision doesn’t matter to you, then do the tests and find out which works better for you.
I also strongly recommend finding out whether your CPU contains a barrel shifter. If you don’t know what a barrel shifter is, it’s a piece of HW that can shift a word any number of places in one clock cycle, such that shifting right 14 places takes just as long as shifting right one place. This is important as it can influence how you write certain types of code and I’ll give you some examples later on in this webinar.
Finally, what other specialized bits of HW are on your CPU that you can use to boost performance? I’m always amazed when I see people using fancy processors with all sorts of neat features on them and most of them are not being used. So if you really want to build a high performance system, look carefully into the hardware available and use it.
Slide 11: Algorithms
Let’s face it, hardware is great for certain problems, but once things get complicated it’s time to turn to software. In terms of software there is nothing you can do that outperforms choosing the correct algorithm. As the quote on the slide states, “A good algorithm badly coded will always beat a bad algorithm coded superbly.”
Unfortunately, it’s been my experience that engineers can get quite passionate about algorithms – which I’ve always found puzzling since it’s a topic about which one should be dispassionate, not passionate! Consequently it’s always worthwhile to research algorithms and to ignore the folklore that’s out there. For example, I’ve been told on many occasions that the best way to determine a square root is to use Newton’s method, or that qsort is the best sorting algorithm. Well, rather than engage in a meaningless debate, I always prefer to simply benchmark the various algorithms in my application – and that’s a key point – in my application – and then make an objective and informed decision. Incidentally, when comparing algorithms, don’t worry too much about the implementation. Implementation details normally give you a few percent improvement, while the best algorithm is typically a factor of two to ten better than the worst algorithm.
Slide 12: Friden Integer Square Root
I just mentioned square root algorithms. I needed a high performance square root algorithm a few years ago and spent a lot of time researching them. The best one I found is the algorithm shown here, described by Jack Crenshaw in his wonderful book Math Toolkit for Real-Time Programming (CMP Books, ISBN 1-929629-09-5). Having read Crenshaw’s description of the algorithm and having studied the code, I feel fairly confident that this algorithm is the best one out there. However, if someone comes along with another algorithm then you can bet I’ll be most interested in comparing it to Crenshaw’s method.
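The slide’s Friden/Crenshaw code isn’t reproduced in this transcript, but to give a flavor of the same family of techniques, here is a sketch of the classic bit-by-bit binary integer square root, which likewise avoids multiply and divide entirely. This is my own illustrative version, not Crenshaw’s exact code:

```c
#include <stdint.h>

/* Classic bit-by-bit integer square root: returns floor(sqrt(x)).
 * Uses only shifts, adds, and compares -- no multiply or divide. */
uint16_t isqrt16(uint16_t x)
{
    uint16_t root = 0;
    uint16_t bit = 1u << 14;    /* highest power of 4 that fits a 16-bit input */

    while (bit > x)
    {
        bit >>= 2;
    }
    while (bit != 0)
    {
        if (x >= root + bit)
        {
            x -= root + bit;
            root = (root >> 1) + bit;
        }
        else
        {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}
```

On processors without hardware divide, this kind of shift-and-subtract formulation is typically far faster than Newton’s method, which needs a division per iteration.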
By the way, you’ll notice a link at the bottom of the slide. For those of you that don’t know, I maintain a popular blog on embedded systems and the link is to a blog posting I made on this topic. You’ll see other links throughout this webinar.
Slide 13: Sorting
To give you another illustration about algorithms, let me tell you about an exercise I went through back in 2009. I found myself needing to sort a small array of about 20 integers. I blithely dropped in a call to qsort() since I ‘knew’ that qsort gave good results, and I moved on. A while later I decided to go back and verify that qsort was indeed doing a good job for me. What I found to my horror was that it was consuming 1500 bytes out of a total available memory of 16K, and furthermore it wasn’t exactly getting the job done quickly either. I therefore decided to investigate alternatives and ended up doing a systematic evaluation of different algorithms. You can see the results here for sorting a 32-element array of integers. I also did the same tests for smaller and larger arrays and also looked at the code size of the different algorithms.
I published this in a blog posting which got picked up on reddit and which resulted in a lot of heated commentary. I found it remarkable how many people confidently declared that ‘algorithm X’ is the best sorting algorithm even when faced with objective evidence to the contrary. For example, for small data sets such as what is being tested here, the shell sort performs very well across a wide range of conditions, whereas the bubble sort has a massive ratio of 21:1 between best and worst.
However, if your data is always close to being sorted, then a bubble sort might be the best bet for you, since it will sort an already sorted array six times faster than a shell sort. The bottom line – test competing algorithms for your application and make a decision based on objective evidence.
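For reference, a shell sort of the kind that did well in those small-array tests can be written in a dozen or so lines. This is a minimal sketch using Shell’s original gap sequence (halving the gap each pass); other gap sequences may perform better, so treat it as a starting point for your own benchmarking, not a definitive winner:

```c
#include <stdint.h>
#include <stddef.h>

/* Shell sort: insertion sort over progressively smaller gaps.
 * Compact, in-place, and with far less call overhead than a
 * generic qsort() on small embedded targets. */
void shell_sort(int16_t a[], size_t n)
{
    for (size_t gap = n / 2; gap > 0; gap /= 2)
    {
        for (size_t i = gap; i < n; i++)
        {
            int16_t tmp = a[i];
            size_t  j   = i;

            while ((j >= gap) && (a[j - gap] > tmp))
            {
                a[j] = a[j - gap];
                j -= gap;
            }
            a[j] = tmp;
        }
    }
}
```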
Slide 14: Middle of 3 Values
Sometimes we aren’t even aware that a problem needs an algorithm. For example, consider the problem of finding the middle of three values. Superficially this seems to be quite easy. However when you get into it, it turns out to be a bit harder than you’d first think. Furthermore, the obvious or direct solution ends up being suboptimal as it fails to take into account some of the rules of Boolean algebra. The example shown here is as close to optimal as I’ve found so far. More importantly it outperformed my attempt at rolling my own code, which shows the advantage of searching for algorithms even when you are not sure you even need one.
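The slide’s exact code isn’t reproduced here, but a common near-optimal formulation uses at most three comparisons by exploiting the ordering each branch has already established. This sketch is my own rendering of that idea:

```c
#include <stdint.h>

/* Middle (median) of three values in at most three comparisons.
 * Each return is justified by the ordering established so far. */
uint16_t mid3(uint16_t a, uint16_t b, uint16_t c)
{
    if (a > b)
    {
        if (b > c)      { return b; }   /* a > b > c       */
        else if (a > c) { return c; }   /* a > c >= b      */
        else            { return a; }   /* c >= a > b      */
    }
    else
    {
        if (a > c)      { return a; }   /* b >= a > c      */
        else if (b > c) { return c; }   /* b > c >= a      */
        else            { return b; }   /* c >= b >= a     */
    }
}
```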
Slide 15: Median Filtering
If you do any sort of processing of real world signals, you’ll soon find that the signals are often corrupted by noise spikes. While you can attempt to remove these spikes with a classical low pass filter, you’ll often find that the spike still has a significant effect on the output. One way to reject these noise spikes is to use a median filter. However, median filtering typically involves some sort of sorting algorithm – which as we’ve just seen can be quite time consuming.
To this end a gentleman named Phil Ekstrom published an article in Embedded Systems Programming magazine many years ago that uses a linked-list approach, which is both dramatically faster and more stable than approaches based on sorting. Unfortunately, the code published in Ekstrom’s original article had some issues – which just goes to show that you should never blindly use code you find on the web! That caveat applies of course to code in this webinar as well.
Anyway, I’ve provided a link to my blog where you can read all about the algorithm – and find code that I’m reasonably sure is correct.
Slide 16: Integer Log
The last algorithm I’ll mention is this integer log function. This is another gem from Jack Crenshaw. It computes the log to base 2 of a 16-bit integer. I think you’ll find its performance quite astonishing and with this I’ll move on to other topics.
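Again, the slide’s code isn’t reproduced in this transcript, so as a stand-in here is a common binary-search formulation of floor(log2) for a 16-bit integer – not Crenshaw’s exact code, but in the same spirit of replacing a loop with a handful of compares:

```c
#include <stdint.h>

/* floor(log2(x)) for a non-zero 16-bit integer via binary search on
 * the bit position -- four compares instead of up to 15 loop trips.
 * log2(0) is undefined; handling a zero input is the caller's job
 * (this sketch arbitrarily returns 0 for it). */
uint8_t ilog2_u16(uint16_t x)
{
    uint8_t r = 0;

    if (x >= (1u << 8)) { x >>= 8; r += 8; }
    if (x >= (1u << 4)) { x >>= 4; r += 4; }
    if (x >= (1u << 2)) { x >>= 2; r += 2; }
    if (x >= (1u << 1)) {          r += 1; }
    return r;
}
```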
Slide 17: Choose the Best Tools
So you are using the HW to its full capabilities and you know that you’ve identified the best algorithms, and so now it’s time to start implementing your high performance system. It’s at this point I see so many people shoot themselves in the foot by looking for low-cost, and preferably free, tools. While there is nothing intrinsically wrong with low-cost or free tools, I don’t think I’m exactly being controversial in observing that high performance tools cost money. For example, Microchip now offers free versions of all their compilers, but requires you to pay if you wish to unlock the full optimizer. I’d love to know what the ratio of free to paid Microchip compilers is, but I’d guess that less than 10% of Microchip’s compilers are paid for.
So, in making the case to management that they should pay for high performance tools, just ask yourself the question ‘what’s my time worth’? If management would rather you spend a week hand tuning a piece of code, instead of paying out a few hundred dollars more for an optimizing compiler, then I’d suggest it’s time to look for a new job.
The bottom line is: High performance systems require high performance tools.
Slide 18: Configure Compiler for C99
Having chosen a high performance compiler, I’m always surprised at how many people leave the compiler in its default mode of being C89 compliant. I strongly recommend that you configure the compiler for C99 compliance, as it gives you access to a number of features that allow you to build higher performance systems. I’ve listed a few of the benefits here. I’ll be talking about some of them in more detail shortly.
Slide 19: Allow Compiler Extensions
The next thing you should do is to forget about ensuring your code is ‘standard C’ and instead turn on as many of the compiler extensions as possible. Very often these extensions are crucial to getting maximum performance out of the system. For example they typically allow you to access registers in an optimal way. They also often include intrinsic functions such as the ones shown here. Note that while you can nearly always do what intrinsic functions do by writing standard C code, the bottom line is that intrinsic functions are normally dramatically faster. In addition, language extensions are normally necessary for interrupt support and for accessing different memory spaces. These are typically two areas that can have a dramatic impact on performance and I’ll talk a little more about these in future slides.
Slide 20: Use Full Optimization
My next observation is that if you’re interested in building a high performance system then you should turn on full optimization. The vast majority of embedded code that I see doesn’t use optimization – or at least doesn’t use full optimization. When I ask engineers about this, I normally get told that the code doesn’t need it or that the optimizer breaks the code. Well even in systems that quote ‘don’t need it’, you can usually reduce the system’s energy consumption by optimizing – and let’s face it, the world desperately needs us to minimize the energy consumption of our embedded designs.
So what then of the argument that optimization breaks the code? Well, it’s been my experience that code broken by the optimizer usually has one or more of the characteristics listed here. If you don’t know how and when to use volatile then do some research.
Next up is misuse of ‘restrict’. We’ll talk about it some more shortly – just make sure you know what you are doing when you use it.
The third and fourth points are favorites of mine – if you insist on walking into the dark corners of the language and / or writing horrendously convoluted code then don’t be surprised when your actions blow up in your face.
Next, if the compiler is issuing warnings and you choose to ignore them, then you only have yourself to blame.
Regular readers of my blog will know I’m a huge fan of PC-Lint. It’s found bugs in my code more times than I care to admit. Importantly, sometimes these bugs only manifest themselves once the optimizer is turned on, and so Lint is a great front-line defense against ‘optimizer induced bugs’.
Finally, if it genuinely is the compiler breaking your code, then are you using a cheap compiler? If so, see my previous comments about whether you can really afford cheap tools or not.
Slide 21: Speed Optimization > Size Optimization
Talking of optimization, it’s the default on most compilers to optimize for size. While this made sense twenty years ago, it rarely makes sense today. On the assumption that your application will fit into the available memory, there is zero benefit to optimizing for size – and multiple benefits to optimizing for speed.
As listed here, speed optimization gives faster code – which after all is the point of higher performance systems.
Speed optimization can also use dramatically less stack space than size optimization. The reason is that size optimization works in part by breaking up code into dozens or hundreds of code fragments which are then called as subroutines. If you’ve ever programmed in assembly language on a tiny memory device then you’ll know exactly what I’m talking about. Each subroutine call will often call another subroutine and so on such that the call stack can get very big in a hurry.
Finally there are some cases in which optimizing for speed can result in smaller code size than optimizing for size!
The bottom line – use speed optimization.
Slide 22: Memory Models
Some processors have a single linear address space whereas others have multiple memory spaces. For example my recollection is that the 8051 has five different memory spaces. For processors that have multiple memory spaces it’s common for compilers to support the concept of memory models. These models usually have names such as Tiny, Small, Large and Huge.
The difference between the various memory models is where they place variables by default. Typically the smaller memory models place variables by default in memory that is quickly accessed but is limited in size, while the bigger memory models place variables by default into larger, but much slower, memory.
With this in mind it’s easy to see how choosing the wrong memory model can really impact your system performance. Apropos of this, the default memory model for many compilers is to use the largest model possible commensurate with the target processor. While this is the easiest model to program for, it’s rarely the best performing – and is yet another good reason for configuring your compiler carefully.
On a related topic, on higher end processors it’s common for some memories to be cached and others not to be. Unless you have a very specific reason for requiring that a variable not be cached, you’ll get a huge increase in performance by using cached memory spaces.
Slide 23: Integer Sizes
It’s time now to start looking at some coding constructs.
Without a doubt one of the more important things you can do is to choose the correct sized integer for the task and the processor at hand. Now for all processors, choosing an integer size larger than the natural word length of the CPU is usually a bad idea. Thus for an 8 bit processor, using an 8-bit integer whenever possible is almost certainly the correct thing to do, while 16 / 32-bit integers will carry a significant penalty.
For 16 and especially 32-bit processors, however, we also have to consider the opposite condition of what happens when we choose an integer type smaller than the natural word length of the machine. Whether this is good, bad or indifferent depends upon the CPU and also sometimes the compiler.
It used to be that you had to pore over compiler manuals to find out the best integer word length to use. However, with the advent of C99 this problem was effectively solved by the introduction of the data types shown here. Of particular note are the ‘fast’ types: uint_fast16_t, for example, is the fastest unsigned type that is at least 16 bits wide. So if you need the fastest integer capable of holding at least 16 bits, use uint_fast16_t; similarly there are uint_fast8_t, uint_fast32_t and uint_fast64_t.
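As a quick illustration of the idea, the sketch below (my own hypothetical example) lets the compiler pick the optimal width on each target – on an 8-bit AVR uint_fast16_t is typically 16 bits, while on a 32-bit ARM it is typically 32 bits, with no source change:

```c
#include <stdint.h>

/* Sum a buffer of 16-bit samples. The 'fast' types tell the compiler:
 * "at least this many bits, but use whatever width is fastest here". */
uint_fast32_t sum_samples(const uint16_t *samples, uint_fast16_t count)
{
    uint_fast32_t total = 0;

    for (uint_fast16_t i = 0; i < count; i++)
    {
        total += samples[i];
    }
    return total;
}
```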
Particularly on smaller processors, there’s also usually a performance advantage to using unsigned data types.
Finally, I would be remiss in not mentioning the C integer promotion rules. The promotion rules can lead to unexpected results and also some performance hits – so make sure you understand the rules when writing high performance code!
Slide 24: Structure Alignment
Another common area where small changes can make a big difference is in the area of structure alignment. While on most 8-bit systems this makes little or no difference, on 16 and 32-bit systems correctly ordering parameters within a data structure can have a significant impact on system throughput. Pay particular attention to the packing and alignment rules so as to prevent non-aligned memory accesses.
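To make the ordering point concrete, here is an illustrative pair of structures. The exact padding is implementation-defined, but on a typical 32-bit target with natural alignment the first layout wastes bytes and the second does not:

```c
#include <stdint.h>

/* Poorly ordered: on a typical 32-bit target, padding is inserted so
 * that count lands on a 4-byte boundary and id on a 2-byte boundary,
 * commonly giving 12 bytes total. */
struct badly_ordered
{
    uint8_t  flag;      /* 1 byte + (typically) 3 padding bytes */
    uint32_t count;     /* 4 bytes                              */
    uint8_t  mode;      /* 1 byte + (typically) 1 padding byte  */
    uint16_t id;        /* 2 bytes                              */
};

/* Largest members first: no internal padding, every member naturally
 * aligned, commonly 8 bytes total -- same data, smaller and faster. */
struct well_ordered
{
    uint32_t count;
    uint16_t id;
    uint8_t  flag;
    uint8_t  mode;
};
```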
On a related topic, be very careful with bit fields. Not only are they inherently non-portable / ill defined, they also can be extremely inefficient – particularly if you define field widths greater than 1 bit.
Slide 25: High Performance Coding Techniques
Having looked at some general things that you can do to make your system run faster, it’s time to mention some of my favorite tips for speeding things up. In particular, we will look at these three techniques – integer division by a constant, fixed-point arithmetic and lookup tables.
Slide 26: Integer Division by a Constant
I’m sure it’s not news to you that division by a constant is equivalent to multiplication by the reciprocal of the constant. To illustrate this fact, consider the examples shown here. You’ll note that I’ve listed some common integer divisors as well as some non-integer divisors. These expressions are many times faster than a standard divide.
You’ll notice that all of the expressions in the top half of the slide have a shift right of 16 followed by another shift right of some other value. The reason this is done is because on 8 / 16-bit processors, the compiler will recognize that a shift right of 16 bits is equivalent to just using the upper half of a 32-bit word, and so no shifting of 16 is actually done, resulting in a significant performance boost. By contrast, if you are on a 32-bit processor with a barrel shifter then you’re better off combining the shifts into a single operation as shown in the lower half of the slide.
Incidentally, there is nothing magic about these expressions. It’s quite easy to develop an expression for any constant divisor. If you go to the link shown you’ll see the entire algorithm explained as well as a link to an include file that includes expressions for every integer divisor up to 32768!
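As one worked example of the technique, division by 10 can be written as multiplication by a scaled reciprocal: 52429 / 2^19 ≈ 1/10, and the error term is small enough that the result is exact for every 16-bit input. The helper name is my own:

```c
#include <stdint.h>

/* x / 10 for any 16-bit x, via multiply-and-shift.
 * 52429 = ceil(2^19 / 10), so (x * 52429) >> 19 == x / 10 exactly
 * for all x < 262144. The ">> 16 >> 3" split lets an 8/16-bit
 * compiler take the upper half of the product for free. */
uint16_t div10(uint16_t x)
{
    return (uint16_t)((((uint32_t)x * 52429uL) >> 16) >> 3);
}
```

On a 32-bit CPU with a barrel shifter, the equivalent single-shift form would be `((uint32_t)x * 52429uL) >> 19`.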
Slide 27: Fixed-point Arithmetic
Unless you are using a processor with built in floating-point support, the use of floating-point arithmetic will usually have an adverse impact on your system performance. A good way to get many of the benefits of floating-point arithmetic while still maintaining high performance is to make use of fixed-point arithmetic. Fixed-point integers are often referred to as Q format numbers.
What is great about fixed-point numbers is they are fast – particularly on processors that have direct HW support for them such as many DSPs. Furthermore while it’s possible to handle fixed-point variables in standard C, many compilers offer direct / intrinsic support for these data types. In addition, unlike many floating-point libraries, fixed-point support is usually fully re-entrant.
The main downside to fixed-point arithmetic is the limited dynamic range. However it’s my experience that many real world problems have a very limited dynamic range and so this limitation is more theoretical than practical.
The topic of fixed-point arithmetic is way too big for me to cover here today. All I can do is to make sure that you are aware of it and to point you to some links that will help you get started in understanding this invaluable tool which should be in your arsenal.
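To give you the flavor of Q format numbers, here is a minimal Q16.16 sketch (the type and helper names are my own, not from any particular library). The key idea is that multiply and divide must rescale, since the product of two Q16.16 values carries 32 fraction bits:

```c
#include <stdint.h>

typedef int32_t q16_16;     /* signed Q16.16: 16 integer bits, 16 fraction bits */

#define Q16_16_ONE          ((q16_16)1 << 16)
#define Q16_16_FROM_INT(x)  ((q16_16)(x) << 16)

/* Product of two Q16.16 values has 32 fraction bits, so shift back
 * down by 16; the 64-bit intermediate avoids overflow. */
static inline q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* Pre-scale the dividend so the quotient lands back in Q16.16. */
static inline q16_16 q_div(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a << 16) / b);
}
```

Addition and subtraction of same-format values need no rescaling at all, which is a large part of why fixed-point is so fast.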
Slide 28: Lookup Tables
Probably my favorite technique for speeding up a system is to use lookup tables. Although I’m sure most of you use lookup tables, it’s been my experience that lookup tables are woefully underused. For example, earlier in this webinar I presented code for computing the square root of a 16-bit integer. While the algorithm is incredible, the simple fact is that it would be destroyed by a lookup table approach. Nominally, such a table would require 16K of memory. Now back in the day when memory was expensive, having a 16K lookup table was absurd. Today, however, a 16K lookup table is quite reasonable on many systems. Despite this, as an industry there seems to be a reluctance to use lookup tables that are any larger than a few hundred bytes.
Furthermore, it’s been my experience that there are many cases where one can apply range reduction techniques to pare down the size of the table. Sticking with the square root problem, what happens if the variable whose square root you need is constrained by other limitations to exist over a smaller dynamic range than that offered by a uint16_t? If this is the case then you can reduce the size of the table accordingly.
Another good technique is to combine calculation with a lookup table. The calculation’s task is to effectively reduce the required lookup table range, such that one gets a good combination of calculation speed and memory usage.
This does bring up the point that even when using lookup tables, you still need to pay attention to your inputs. This comes about in two ways – firstly to make sure that you aren’t indexing beyond the table, and secondly to make sure that you aren’t violating any rules. For example, if I asked you to build a lookup table of log to the base 2 of all integers up to 256, what would you put in the table for an integer value of 0, since, as I’m sure you know, the log of zero is undefined?
Slide 29: Lookup Table Syntax
Before leaving the topic of lookup tables, I think it’s worth discussing how to implement them. The slide here shows a number of details that I think are important. Some things to note are as follows:
- Use a manifest constant to declare the table size. I do this because we’ll use it in several places.
- The index that is passed to the function is unsigned. On the assumption that your lookup table is zero based this means that you can eliminate the lower end bounds check.
- Next if you look at the table declaration, the lookup table is declared static and const. Static ensures that only this code can access the table – which is normally what you want. Const is used to indicate that it’s your intent to only read from the table. While there are cases where you want to have on the fly modifiable lookup tables, they are very rare in my experience.
- You’ll also note the presence of the __flash modifier. Most of the time I want my lookup tables to remain in flash and not be copied to another memory at run time. This is especially true for large tables on small systems. If by contrast you have a high-end system which normally runs out of RAM then by all means remove the flash designation. The bottom line though is to make sure that the lookup table ends up in the memory space that you want.
- You should also note that I’ve explicitly specified the table size in the array declaration. This is an excellent way of ensuring that you do actually have the correct number of elements in the table.
- Finally, I explicitly check that the index is within bounds before indexing into the table.
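Since the slide’s code isn’t reproduced in this transcript, here is a sketch that follows the bullets above; the table contents and error value are hypothetical. I’ve omitted the compiler-specific `__flash` qualifier for portability, but would add it back (or your compiler’s equivalent) on a real small target:

```c
#include <stdint.h>

#define TABLE_SIZE    (16u)     /* manifest constant, used in both places below */
#define LOOKUP_ERROR  (0u)      /* hypothetical out-of-range return value       */

/* static keeps the table private to this file; const declares our intent
 * to only read it. On compilers that support it, a memory-space qualifier
 * such as __flash also keeps the table out of RAM. The explicit
 * [TABLE_SIZE] makes the compiler complain if the initializer has the
 * wrong number of elements. */
static const uint16_t lookup_table[TABLE_SIZE] =
{
    0u,  1u,   4u,   9u,  16u,  25u,  36u,  49u,
    64u, 81u, 100u, 121u, 144u, 169u, 196u, 225u
};

/* Unsigned index: with a zero-based table, no lower bounds check needed. */
uint16_t table_lookup(uint16_t index)
{
    if (index < TABLE_SIZE)
    {
        return lookup_table[index];
    }
    return LOOKUP_ERROR;        /* upper bounds check before indexing */
}
```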
Slide 30: Optimal C Constructs – Restrict
OK, let’s now take a peek at some optimal C constructs. As part of my regular work I spend a lot of time looking at other people’s code. I’ve looked at code written by some of the biggest corporations in the world and I’ve looked at code written for startups. I’ve looked at code written in the USA, Canada, Europe and China and I’ve looked at code written by undergraduates and also code written by grizzled veterans. In all that time I have never seen the keyword restrict used. I’m guessing that most of you listening to this webinar have never heard of it, or if you have heard of it you only have a vague idea of what it is.
The usual syntax for restrict is shown on the slide. Restrict was added to the C99 language so as to allow you, the programmer, to inform the compiler that pointers are not being aliased. Without this information, whenever you pass a pointer to a function, the compiler must assume that it might alias another pointer or object that the function accesses. This has serious implications for the poor compiler writer. If however there is no aliasing, then the compiler can generate significantly faster code.
Thus if you want an immediate speed increase in your code, do the following:
- Make sure your compiler is set up for C99 compliance.
- Apply restrict to all pointers passed to functions – assuming of course that the pointer isn’t aliased.
- Turn on full optimization.
- Enjoy the results.
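As a concrete sketch of the idea, here is a small loop whose pointer parameters are declared restrict. The function name and the doubling operation are illustrative; the point is the promise to the compiler that dst and src never overlap:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Because both pointers are declared restrict, we promise the compiler
 * that dst and src never overlap, so it is free to reorder, unroll, or
 * vectorize the loop without worrying about aliasing. */
void scale_samples(uint16_t *restrict dst,
                   const uint16_t *restrict src,
                   size_t n)
{
    for (size_t i = 0U; i < n; ++i)
    {
        dst[i] = (uint16_t)(src[i] * 2U);
    }
}
```

Note that restrict is a promise, not something the compiler checks: if the caller passes overlapping buffers anyway, the behavior is undefined.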
Slide 31: Optimal C Constructs – Const
In a similar vein, const can also give you a performance boost – particularly when applied to pointers. In the example shown here I’ve combined const with restrict such that this pointer declaration reads that ptr is a restricted pointer to const uint16_t data.
As well as giving you a potential for a performance increase, const makes your code more readable and more maintainable with essentially no downside to its use.
The bottom line – apply const judiciously but as frequently as possible and enjoy the results.
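A minimal sketch of the declaration being described – a restricted pointer to const uint16_t data – might look like this (the function itself is a made-up example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ptr is a restricted pointer to const uint16_t: the function promises
 * never to write through it, and the caller promises it is not aliased.
 * Both promises help the compiler keep values in registers. */
uint32_t sum_samples(const uint16_t *restrict ptr, size_t n)
{
    uint32_t total = 0U;
    for (size_t i = 0U; i < n; ++i)
    {
        total += ptr[i];
    }
    return total;
}
```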
Slide 32: Optimal C Constructs – Static + Inline
As well as giving us restrict, C99 formalized the syntax for declaring inline functions. If you declare a function as both static and inline, then the chances of the compiler actually inlining it jump enormously. What this means is that I can write a top-level function that consists almost entirely of function calls – such that the structure of the function is as clear as day. Then I put all the implementation details into functions declared static and inline. At compile time these functions all get inlined, thus eliminating the overhead of the function calls. This means that I get the documentation advantage of function calls combined with the speed of a single monolithic function.
Try it – I think you’ll like the results.
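Here is a small sketch of the style being described; the processing steps themselves are invented, but the structure is the point – a readable top-level function built from static inline helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Implementation details live in small static inline helpers, which the
 * compiler is very likely to inline at their (single) call sites. */
static inline uint16_t remove_offset(uint16_t raw)
{
    return (uint16_t)(raw - 100U);   /* hypothetical sensor offset */
}

static inline uint16_t apply_gain(uint16_t value)
{
    return (uint16_t)(value * 3U);   /* hypothetical gain factor */
}

/* The top-level function reads like documentation; after inlining there
 * is no function-call overhead left. */
uint16_t process_sample(uint16_t raw)
{
    uint16_t centered = remove_offset(raw);
    return apply_gain(centered);
}
```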
Slide 33: Optimal C Constructs – PTR; ++PTR
While I could drone on for hours about various corners of the C language, I thought I’d share something that blew me away when I first came across it. If you’ve read K&R, as I’m sure most of you have, then you’ll find that it’s littered with post increment / decrement operations – particularly on pointers and array indices.
Well it turns out that post increment / decrement can generate all sorts of problems for compiler writers. If you want to understand why, then take a look at the link shown here to a paper on IAR’s web site. If you can’t be bothered to, then simply remember that post increment / decrement can slow your code, and that replacing the standard one-line form shown here with two separate statements will usually give you an immediate benefit – especially when the post increment is part of a looping construct.
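Since the slide’s code isn’t reproduced here, the transformation can be sketched as two versions of a byte-copy loop. Whether the split form actually wins depends on the compiler and target, so treat this as an experiment to benchmark, not a guarantee:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Typical K&R-style copy loop, littered with post increments: */
void copy_post_inc(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n-- > 0U)
    {
        *dst++ = *src++;
    }
}

/* The same loop with each post increment split into two separate
 * statements, which some compilers translate into tighter code: */
void copy_split(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n > 0U)
    {
        *dst = *src;
        ++dst;
        ++src;
        --n;
    }
}
```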
Slide 34: Parameter Passing
If you’ve ever done a lot of assembly language programming, then you undoubtedly found that it took a lot of code to pass parameters to functions. If the called function doesn’t do a great deal of work, this parameter passing overhead can be significant. One of the easiest ways to reduce this overhead is to look for functions that switch based upon passed parameters, and to replace them with multiple smaller functions where the switched parameter is now implicit. An example I like to give is the code for turning the LEDs on a board on and off. The slow approach is to define one large function that is passed both the LED to act upon and the desired action. If you replace this single large function with multiple small functions, each of which does one thing to one LED, you’ll find that the multiple-function approach is considerably faster.
Incidentally, this is not a recommendation to avoid parameter passing by using globals. Although this can lead to faster code, the downside is usually way too great to justify the approach.
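The LED example above can be sketched as follows. The port variable is a stand-in for a real memory-mapped register, and the bit assignments are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint8_t led_port;   /* stand-in for a memory-mapped LED port register */

/* Slow approach: one large function that decodes its parameters on
 * every call. */
void led_set(uint8_t led, bool on)
{
    uint8_t mask = (uint8_t)(1U << led);
    if (on)
    {
        led_port |= mask;
    }
    else
    {
        led_port &= (uint8_t)~mask;
    }
}

/* Faster approach: one tiny function per LED and action. The switched
 * parameters are now implicit, so there is nothing to pass or decode. */
void led0_on(void)  { led_port |= 0x01U; }
void led0_off(void) { led_port &= (uint8_t)~0x01U; }
void led1_on(void)  { led_port |= 0x02U; }
void led1_off(void) { led_port &= (uint8_t)~0x02U; }
```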
Slide 35: Be Wary of Formatted Output
Sometimes building high performance systems is about knowing what to avoid. High on the list of things to avoid is formatted output. If you benchmark printf and its brethren, then you’ll find that their execution time can be horrendous and highly variable – and their stack usage enormous. Furthermore, it’s not uncommon for the code to be non-reentrant.
The bottom line – formatted output is contra-indicated for high performance systems.
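When you do need to emit a number – say, over a debug UART – a hand-rolled converter for the one format you actually use avoids printf entirely. This is an illustrative sketch, not a drop-in library routine: the function name and buffer convention are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert an unsigned 16-bit value to decimal ASCII without printf.
 * The cost is small and bounded, no heap or format parsing is involved,
 * and the function is trivially reentrant. buf must hold >= 6 bytes
 * (up to 5 digits for 65535, plus the terminator). */
size_t u16_to_ascii(uint16_t value, char buf[6])
{
    char   tmp[5];
    size_t n = 0U;

    do
    {
        tmp[n++] = (char)('0' + (value % 10U));   /* digits, least significant first */
        value /= 10U;
    } while (value > 0U);

    for (size_t i = 0U; i < n; ++i)
    {
        buf[i] = tmp[n - 1U - i];   /* reverse into most-significant-first order */
    }
    buf[n] = '\0';
    return n;
}
```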
Slide 36: Interrupts
The easiest way I know to cripple the throughput of an embedded system is to have high frequency interrupts. High frequency interrupts suck up an enormous amount of CPU time. This comes about in the four ways I’ve listed here. The intrinsic interrupt overhead is the CPU cycles consumed in handling an interrupt request – for example saving and restoring the program counter. If you’re running on a system that uses cached memory, then an ISR will almost certainly cause the various caches to be flushed as the CPU suddenly starts executing completely different code that acts on completely different data.
On highly pipelined architectures, an interrupt will also usually result in a pipeline stall. Finally, the ISR will usually require a number of registers to be stacked. To illustrate this point, consider this innocuous two line interrupt service routine. It’s for a timer. All it does is stop the timer and then post an event to an event queue. Let’s take a look at what the code generated by the compiler looks like:
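The slide’s two-line ISR isn’t reproduced in this transcript, but based on the description it would look something like the sketch below. The register name, event constant, and queue function are all hypothetical stand-ins, and the interrupt-vector attribute is omitted since it varies by compiler:

```c
#include <assert.h>
#include <stdint.h>

#define EVENT_TIMER_EXPIRED 7U

static volatile uint16_t timer_ctrl = 1U;   /* stand-in for a timer control register */
static uint8_t last_event;

/* Stand-in for posting to an RTOS event queue. Because this is a real
 * function call, the compiler must assume it clobbers every caller-saved
 * register - which is why so many registers get stacked in the ISR. */
static void post_event(uint8_t event)
{
    last_event = event;
}

/* Hypothetical reconstruction of the slide's two-line ISR: stop the
 * timer, then post an event. */
void timer_isr(void)
{
    timer_ctrl = 0U;                    /* stop the timer */
    post_event(EVENT_TIMER_EXPIRED);    /* function call from the ISR */
}
```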
Slide 37: Register Stacking for Two Line ISR
The compiler has generated so much code that I suspect for most of you it’s barely readable. Hopefully what you can see is the note that 15 registers needed to be stacked and un-stacked for this trivial little ISR. The reason this has occurred is that the ISR makes a function call, and so the compiler is forced to assume that the called code will use all of the available registers.
Slide 38: Occasional Function Calls from an ISR
If you think this is bad, it’s even worse if the interrupt only results in a function call once every N interrupts. For example consider the code in the top left of this slide. Here you can see a counter that is counting down and only when it reaches zero is the function call made. Despite this, the compiler will usually stack all 15 registers – even though for most of the interrupt invocations they won’t actually be used.
One way around this is to replace the function call with a software interrupt, as shown in the second code block. Because a software interrupt is usually an intrinsic operation, there’s no need to stack any registers for it, and thus our high frequency interrupt only needs to stack those registers associated with decrementing the counter. Of course, within the software interrupt handler you do have to make the function call – and thus the registers are still stacked. However, at least they are now stacked only when they are actually needed.
Depending upon your perspective, this technique is either really neat or a complete kludge. It does however illustrate quite nicely the tradeoffs we sometimes have to make when building high performance systems.
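The counter-plus-software-interrupt structure just described can be sketched as below. The trigger function here is a plain C stub that just sets a flag; on real hardware it would typically be a single write to an interrupt-trigger register or a compiler intrinsic, so it costs the fast ISR almost nothing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PERIOD 10U   /* do the heavy work once every PERIOD ticks (hypothetical) */

static uint16_t countdown = PERIOD;
static bool     swi_pending = false;   /* stand-in for the SWI pending bit */

/* Stand-in for raising a software interrupt; on real hardware this is
 * typically a single register write, not a function call. */
static void trigger_software_interrupt(void)
{
    swi_pending = true;
}

/* High-frequency ISR body: nothing but counter bookkeeping on the hot
 * path, so very few registers need to be stacked. */
void tick_isr(void)
{
    if (--countdown == 0U)
    {
        countdown = PERIOD;
        trigger_software_interrupt();   /* defer the heavy work */
    }
}

/* Low-frequency software interrupt handler: the expensive function call
 * (and its register stacking) now happens only once per PERIOD ticks. */
void software_isr(void)
{
    swi_pending = false;
    /* do_heavy_work();  -- hypothetical expensive processing */
}
```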
Slide 39: Tips for ISRs
When it comes to ISRs, here are some tips to keep in mind. Minimize the interrupt frequency – for example by using FIFOs. Don’t use function calls. If you must make a function call, then try to make it to a function that’s declared in the same module as the ISR, as this will likely minimize the number of registers that need to be stacked. Consider the use of the software interrupt technique I’ve just shown you. Don’t use floating-point arithmetic, and finally, always use maximum compiler optimization.
Incidentally this is one area where size optimization may actually be better than speed. The reason is that size optimization may use fewer registers than speed optimization – and this could result in less stacking overhead.
Slide 40: Simplify – Then Add Lightness
Finally, let me introduce you to my personal mantra when it comes to writing code: simplify, then add lightness. This quote is actually from Colin Chapman, the founder of Lotus Cars. He was, of course, talking about building high performance cars. However, I think it’s equally applicable to building high performance software.
What it means to me is that one should strive to write code as simply and elegantly as possible. Doing so allows the compiler (and others) to easily understand your intentions, and consequently to generate code that is as efficient as possible.
The bottom line – high performance code should look like a Lotus and not a ten-ton truck!
Slide 41: Key Takeaways
To summarize – here’s your punch list for building high performance systems:
- Max out your use of the hardware.
- Choose objectively great algorithms.
- Use the best tools you can – and configure them to perform their best.
- Use full speed optimization combined with the correct memory spaces and integer types.
- Make extensive use of techniques such as fixed-point arithmetic and lookup tables.
- Use language constructs such as inline, const, restrict, static and volatile to allow the compiler to generate optimal code.
- Pay special attention to high frequency ISRs.
- Write clean, simple code.