How to Find and Fix the Most Common Embedded Software Bugs

Posted September 09, 2015

As if debugging traditional PC/server software or even smartphone apps wasn't hard enough, debugging embedded software adds significant challenges. This webinar introduces you to some of the most common embedded software bugs and teaches you how to find/fix/prevent them.

Many embedded systems have serious memory limitations and/or CPUs that are not powerful enough to support certain debugging activities. Many of the most difficult bugs to track down happen extremely infrequently and are borderline impossible to replicate. The main reason for this is that multiple things have to happen in the "right" sequence with very particular timing in order for these bugs to manifest. Introducing debugging code changes the execution timing so it is entirely possible that introducing debugging code makes a bug “disappear” (Heisenbug).

Slide 1: Debugging Embedded Software

Good afternoon and thank you for attending Barr Group’s webinar on Finding and Fixing the Most Common Embedded Software Bugs. My name is Jennifer and I will be moderating today’s webinar. Our presenters today will be Michael Barr, Chief Technical Officer at Barr Group, and Salomon Singer.

Today's presentation will be about 45 minutes long after which, there will be a moderated question and answer session. To submit a question during the event, please type your question into the rectangular space near the bottom right of your screen and then click the ‘Send’ button. Please hold your questions until the end when the presenters will be addressing them.

At the start of the webinar check to be sure that your speakers are on and that your volume is turned up. Please make sure that all background programs are turned off so they do not affect your audio feed.

I am pleased to present Michael Barr and Salomon Singer’s webinar on Finding and Fixing the Most Common Embedded Software Bugs.

Slide 2: Michael Barr, CTO

Thank you Jennifer, and welcome everyone to today's webinar. My name is Michael Barr; I’m the Chief Technical Officer and the Co-Founder of the Barr Group.

Slide 3: Barr Group - The Embedded Systems Experts

The Barr Group, as a company, is focused on really fundamentally two things: making embedded systems reliable and making embedded systems secure. Of course some systems have safety concerns because they could injure a person, in which case the word safety also applies. So in a nutshell: we help embedded developers and a large number of companies all across different industries to make their embedded systems safer, more reliable, and more secure. We don’t have products of our own; instead we have services that align with this mission.

Slide 4: Barr Group - Product Development

One of the services we provide is product development assistance. So we take on projects, typically a part of the project but also sometimes the whole. We have a large staff of embedded software experts who assist companies with different parts of their products, and we also have electrical engineers who design chips and boards as well, and we have mechanical capability for designing enclosures.

Slide 5: Barr Group - Engineering Guidance

In addition we offer consulting services where we consult typically with a Vice President of Engineering or Director of Engineering or Engineering Manager. And we address issues of architecture or re-architecture and software development process, in particular sometimes exploring the choices between modern methods such as Agile and Test Driven Development versus traditional waterfall models, helping companies move in either direction.

And we have some specific services like code audits, where we read code for different purposes and write a report or presentation about the code; and we also do design reviews, where again we’re providing independent advice and assessment. And finally we perform security audits of particular products.

Slide 6: Barr Group - Skills Training

The third service that we provide is training. So one of the ways that we’re able to reach as many engineers as possible is, through our training. We provide public training courses and also private on-site training courses. We have a whole catalog of firmware training courses that you can find on our website and also these courses are typically hands on courses and they’re always specific to working on embedded systems and the unique challenges that we encounter.

We also sometimes develop custom courses for companies which are typically a variation of one of the courses that we already offer but sometimes even a new course for a private company. Sometimes we put our courses together into training programs for younger engineers coming into the company and several large companies work with us doing that.

Slides 7-9: Upcoming Courses

Before we get to today’s course, I just want to alert you that we regularly do trainings both at private companies and you can see on our website a list of all the courses that we offer. And we also have public trainings including some upcoming boot camps. These are our weeklong, four and a half day, hands-on, intensive software training courses.

The course titles, dates, and other details for all of our upcoming public courses are available in the firmware training area of our website.

Slide 10: Salomon Singer, Principal Engineer

Now let’s get on to today’s webinar. Your instructor for today is Salomon Singer. He’s a Principal Engineer at the Barr Group and Salomon and I have been working together for about seven years now. He has more than 30 years of embedded systems design experience and a degree in computer science or computer engineering.

He’s worked in a number of different industries, from industrial controls and transportation equipment to telecommunications. As you can see here he has an extensive list of experience with different real time operating systems and processors and also programming languages. At the Barr Group he does software engineering every day, writing software.

He also consults with a number of different companies on a number of different projects and additionally as he’ll do here for you today in the webinar, he’s an excellent instructor and speaker and Salomon will be teaching the upcoming course on reliable multithreaded programming in November as well as our software boot camp in the fall in the United States in Minneapolis as he teaches a number of courses for us. So without any further ado I turn the course over to Salomon and thank you again for coming.

Slide 11: Debugging Embedded Software

Thank you Mike for the introduction and good afternoon ladies and gentlemen and welcome to one more of Barr Group’s series of webinars.

In this webinar we are going to talk about finding and fixing some of the most common embedded software bugs. In the next 45 minutes or so I will point out which bugs are these and how we go about finding them, fixing them, and maybe even preventing them so that we don’t have to be chasing them all over again.

Slide 12: Introduction

Debugging traditional PC or server software or even a smartphone app is already hard enough; but, debugging embedded software is even harder because there are some additional challenges that don’t exist in traditional software. So let’s take a look at some of the extra challenges brought in by embedded software.

The first one could be memory or CPU limitations; in other words, our system is resource constrained. It is entirely possible that we don’t have a lot of memory or that our CPU is not too powerful, or that both are true. If this is the case then tracing and/or logging is virtually impossible. In the case where we don’t have enough CPU, there are just not enough cycles to perform the tracing and the logging. And if we don’t have memory, then there is no place to put that tracing and logging. So tracing and logging will be a weapon or a tool that is now removed from our arsenal in that case.

Another good example of issues that we don’t have to worry about when we’re talking about traditional software are issues related to the hardware. Is the hardware malfunctioning? Did we not program the hardware correctly, and is that why we don’t get the results that we’re expecting? So some of the other complexities related to debugging embedded software are brought on by the fact that we have to deal with the hardware in the system.

A third family of problems that are unique to embedded systems are problems related to using a real time operating system. These problems can include race conditions, starvation, priority inversion, and deadlocks, amongst others. Some of these I will talk about in more detail later, but for the time being suffice it to say that these are problems that we don’t typically encounter in the PC software that we write.

Slide 13: Introduction (cont'd)

I want to talk about three more reasons why debugging embedded software is somewhat more challenging.

The first one is a combination of two things: the fact that most cross compilers do not support printf and the fact that many an embedded system has a very limited or a totally non-existent user interface. So there is no terminal, so to speak; there is no place where we can do a printf in the middle of the code. So debugging with printf is for the most part out of the question and not one of those things that we can do in embedded systems. In fact, many of the times all you have is an LED or a group of LEDs, and you can have those either on solid or blinking at a particular rate to signify that everything is okay or that a problem has occurred, and you will maximize the use of those LEDs to do your debugging.

The second issue, and probably the most interesting of the three, is the fact that inserting debugging code changes the timing. And so if we had a timing related bug, say a race condition, it is entirely possible that our debugging code is masking the problem. So we have a race condition; we now introduce some debugging code and the bug goes away. In reality it hasn’t gone away at all, but now it doesn’t manifest itself and so we can no longer debug it. We take away the debugging code, the bug comes back, and so forth and so on. This is a good example of a Heisenbug.

So the bug is there, it’s latent, but it doesn’t manifest itself once we insert the debugging code. It is entirely possible that we crafted this piece of code with the debugging code on, and now that it’s not on, for whatever reason, the bug appears. So this is a difficult type of bug to find and debug.

The last type of bug I want to talk about in this section is missing the volatile keyword in front of a data structure or a global variable. In this particular case the compiler or the optimizer might optimize some of your code away, making it look like there is a bug. It is not a bug per se, the omission is the bug, but what's happening is that code that should be there is being optimized away and it’s no longer there, and so the code is not doing what it’s supposed to be doing because the code has been removed. One way of debugging this type of problem is by looking at what the compiler is doing with our C code, which brings me to my next slide.
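As a minimal sketch of the missing-volatile situation, consider a flag that is set in an interrupt handler and polled elsewhere. The names gDataReady, dataReadyIsr, and waitForData are hypothetical and are not taken from the slides. If the volatile qualifier were dropped, an optimizer would be allowed to read the flag once and never look at memory again.

```c
#include <stdbool.h>

/* Hypothetical flag shared between an ISR and the polling code.
 * Without 'volatile', the optimizer may read the flag once, keep it in a
 * register, and never notice that the ISR changed it. */
static volatile bool gDataReady = false;

/* Stand-in for an interrupt service routine that sets the flag. */
void dataReadyIsr(void)
{
    gDataReady = true;
}

/* Polls the flag up to maxPolls times; returns true if it was seen set.
 * Because the flag is volatile, every iteration re-reads it from memory. */
bool waitForData(unsigned long maxPolls)
{
    while (maxPolls-- > 0UL)
    {
        if (gDataReady)
        {
            return true;
        }
    }
    return false;
}
```

On a host the "ISR" is simply called as a function; on a target it would be attached to a real interrupt.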

Slide 14: Looking at Assembly Listings

Take a look at what the compiler did with your C code. You have basically two options. The first option is to launch your debugger and then open the assembly or disassembly window. Your second option is to use one of the compiler options to generate a file that contains the C code and all the assembly that is generated for each particular statement. Whether you use one or the other, what you will typically see is a line of C followed by the lines of assembly that are required to execute that C statement.

Slide 15: Assembly Listing Uses

Assembly listings have multiple uses, so we can use them to do a little bit of debugging without a debugger or to try to understand what exactly the compiler does with the C code we write. One good application for the assembly listing is tracking down a missing volatile keyword. Within the assembly listing we would notice that a particular line or group of lines of C code has been translated into no code. So the compiler or the optimizer has optimized some of our code away.

It is also useful to see the effects of compiler optimization and so we can generate the assembly listing without optimization and then we can start cranking the optimization up. And as we go along we can see what the compiler and the optimizer are doing with our code or how much of it, how much gain are we getting from the optimization.

If we ever reach the conclusion that we want to hand optimize some code, the assembly listing is a good place to start, so we can look at what the compiler did and we can try to optimize on top of that. This conclusion should almost never be reached, but on rare occasions it is warranted.

One last thing that we can use the assembly listings for is to spot the very, very rare compiler bug. Once in a blue moon it will happen that the compiler contains a bug, and by inspecting the assembly that is generated by the compiler we can relatively easily spot the issue.

Slide 16: Simple C Code Example

In order to be able to see the differences between optimized and un-optimized code, I want you to look at this particular trivial example. In this example, all we have is main, which makes a call to the readRegister function. ReadReg is going to read from a location in hardware; in this particular example it is location 0x20000200. ReadReg will return the value read from the register, and in main we will store that into a volatile variable.
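The slide itself is not reproduced in this transcript, so the following is only a sketch of what such an example might look like. It assumes readReg is static (as mentioned later in the talk) and, so that it can be exercised on a host, it takes the register location as a parameter; sampleRegister and REG_ADDRESS are hypothetical names.

```c
#include <stdint.h>

/* Address used in the webinar example; only meaningful on that target. */
#define REG_ADDRESS  0x20000200UL

/* Reads a 32-bit register through a pointer-to-volatile so that the
 * compiler must actually perform the access. */
static uint32_t readReg(const volatile uint32_t *pReg)
{
    return *pReg;
}

/* On the target, main would do something like:
 *     volatile uint32_t value =
 *         readReg((const volatile uint32_t *)REG_ADDRESS);
 * This wrapper takes the location as a parameter so the sketch can be
 * pointed at an ordinary variable when run on a host. */
uint32_t sampleRegister(const volatile uint32_t *pReg)
{
    volatile uint32_t value = readReg(pReg);
    return value;
}
```

Compiling a file like this with and without optimization and comparing the listings is one way to see the inlining effect for yourself.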

Slide 17: ARM Example Code without Optimization

This is the assembly generated for the C code that I showed you in the previous slide. We have optimizations totally disabled and, as you can see, the flow is pretty much the same as in C. The constant $C$CON1 contains the value 0x20000200, and so that’s the substitution for that value. You can see how that value is moved first into R0 and then stored into the PU value variable, which is on the stack, and then it is retrieved from the stack in order to do the indirect read and store the result back into R0, which is what is going to be returned to main. And if we follow a little further down we’ll see how main is preparing to make the call to readRegister, then the call itself, and at the end, upon return, the extraction of the value from R0 and storing it into the variable in main.

Slide 18: ARM Example Code with Optimization

This is the output of the same compiler when optimization is full on, and as you can see readReg is nowhere to be found; all we have is code for main. What has happened is that the compiler or the optimizer has inlined readReg, so there really is no need to prepare for the branch to the subroutine, go into readReg, come back, pass the parameters in R0, etc. And so the compiler, in its infinite wisdom and very rightly so, has inlined this function and readReg is no longer there. Notice also how the constant $C$CON1 still contains the value 0x20000200, the hardware location we are reading from.

Slide 19: Why is the Linker Map File Important?

After the compiler is done compiling every single source module that makes up the system, the linker must step in and link all those object modules together to generate a single object module that includes all the code in the system.

One of the other outputs of the linker is the map file. This map file contains information that is very useful: things like the start address of the code segment and how big it is, and the start address of the data segment and how big it is. It also shows, for every single public symbol, where the function or the data starts and how big it is. You can use this map file as a validation strategy when comparing to the physical memory map, and you can also see how much room you have available. So you can see how much RAM is left and how much code space is left in your system.

Slide 20: Linker Map File Example

This is a small section from the map file generated for the code that I showed you earlier, so it’s that main and readReg and that is all there is to it. You can see how the map file shows us where the flash section starts, how big it is, how much of it is used, how much of it is still left unused, and what the attributes of each of these sections are. So you have information for the flash section and information for the RAM section as well.

Slide 21: Linker Map File Example (cont'd)

So this is another section of the same linker map file for the same exact piece of code. And here you can see that main starts at address 00000517 and the stack starts at 0x20000000 and is 200 bytes long.

So one interesting observation is that the address for the readReg function is nowhere to be found. So earlier we said that the map file will contain the addresses for all public symbols. If we look back at that slide the readReg function was declared static. That’s the reason it doesn’t appear in the linker map file.

Slide 22: What is Atomicity?

In the next few slides I want to talk about atomicity, because not knowing what atomicity is or not paying attention to it can create significant bugs that are incredibly difficult to debug. Atomicity is the property of instructions whose execution cannot be interrupted. In other words they are done atomically. And so test and set, compare and swap, and some load and store instructions must be atomic.

Those instructions are absolutely fundamental to machine synchronization, and some of the RTOS synchronization options like mutexes and semaphores are based upon some of those basic machine synchronization instructions. We have to be particularly careful with a load and/or store that can be “torn” between two threads of execution. In other words, we’re in the middle of a read or a write that cannot be done atomically, and then there is an interrupt or another task starts to run, and data will be trashed because the read and/or the write was not done atomically.

Slide 23: What are the Basic Requirements of Atomicity?

So how can we assure that load and store operations to an object are atomic? There are two basic requirements. The first one is that the size of the object must match the native word size of the processor; in other words, if it’s a 32 bit processor it should be a 32 bit piece of data. The second requirement is that the address of the object must be aligned to the native word boundary of the processor. In other words, a 32 bit object needs to be at a 32 bit aligned address. If this is not the case then multiple load and store instructions will take place just in order to read and/or write the object. On some processors this can even result in an unaligned data access exception.
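The two requirements above can be sketched as a simple check. This is only an illustration: NATIVE_WORD_BYTES is an assumed value for a 32 bit processor, the function names are hypothetical, and passing the check does not by itself guarantee atomicity on any particular CPU or compiler.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed native word size of the target (a 32 bit processor). */
#define NATIVE_WORD_BYTES  4U

/* Returns true if 'p' is aligned to 'size' bytes.
 * 'size' must be a power of two. */
static bool isAligned(const void *p, size_t size)
{
    return (((uintptr_t)p) & (size - 1U)) == 0U;
}

/* Checks the two requirements from the slide: the object is exactly one
 * native word wide, and its address sits on a native word boundary. */
bool meetsAtomicityRequirements(const void *p, size_t objectSize)
{
    return (objectSize == NATIVE_WORD_BYTES) &&
           isAligned(p, NATIVE_WORD_BYTES);
}
```

A check like this is mostly useful in an assert during bring-up, to catch a shared variable that ended up at an unexpected address.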

Slide 24: Alignment and Implications for Processor Performance

Even if the processor we are using does not throw an exception when trying to access unaligned data, the compiler will generate extremely inefficient code, which will result in multiple bus accesses for a single memory access, whether it be a read or a write. So it’s not only time that is wasted but also code space, because more code, and less efficient code, will be generated. It is also important to remember that DMA accesses have very strict alignment requirements.

Slide 25: Data Structure Padding

Another source of interesting bugs is data structure padding. The compiler wants to generate code that is extremely efficient, and in order to do that it wants to ensure that all the elements in a structure are aligned. And so it will automatically insert padding so that the elements are properly aligned and it can generate the most efficient code. In the example you can see that I’m declaring a structure whose members only use up 8 bytes, but the reality is that it will consume 12 bytes once the compiler has padded the structure such that all the elements are aligned and the compiler can generate code that’s very efficient when it accesses any of the members of that structure.

And so in this particular case we are sacrificing some memory space so that the compiler can create really efficient code. If you have memory constraints and your data structure does not map onto hardware, you can manually reorder the elements to completely bypass all the compiler padding.
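The slide is not reproduced here, so the following is a sketch of a structure with the same arithmetic: 8 bytes of members that occupy 12 bytes once padded. Only mode and counter are named in the talk; flags, status, and the type name are hypothetical. The sizes assume a typical ABI where uint16_t is 2-byte aligned and uint32_t is 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Members add up to 1 + 2 + 1 + 4 = 8 bytes. */
typedef struct
{
    uint8_t  flags;    /* offset 0, followed by 1 byte of padding  */
    uint16_t mode;     /* offset 2                                  */
    uint8_t  status;   /* offset 4, followed by 3 bytes of padding */
    uint32_t counter;  /* offset 8                                  */
} padded_t;

/* Helpers that report what the compiler actually did. */
size_t paddedSize(void)
{
    return sizeof(padded_t);
}

size_t paddedCounterOffset(void)
{
    return offsetof(padded_t, counter);
}
```

Printing or asserting sizeof and offsetof like this is a quick way to confirm the layout on your own compiler rather than assuming it.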

Slide 26: Data Structure Padding (cont'd)

Here you can see the same data structure that I had in the previous slide, but I have now reordered all the elements such that the first few elements are all the ones that are 8 bits wide. They align with any address. After that come all the elements that are 16 bits wide, in this particular case just one, and after that all the elements that are 32 bits wide; that way the compiler only has to do minimal or no padding at all.

In this particular example there was absolutely no need to do any padding, because the 16 bit element landed on a 16 bit boundary and the 32 bit element landed on a 32 bit boundary. If mode wasn’t there, the compiler would have had to insert some padding because counter would not have aligned with a 32 bit address.
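A sketch of the reordered layout, under the same assumptions as before (only mode and counter are names from the talk; the rest are hypothetical, and a typical ABI is assumed):

```c
#include <stddef.h>
#include <stdint.h>

/* Same members, ordered from narrowest to widest so that every member
 * already falls on its natural boundary and no padding is needed. */
typedef struct
{
    uint8_t  flags;    /* offset 0 */
    uint8_t  status;   /* offset 1 */
    uint16_t mode;     /* offset 2 */
    uint32_t counter;  /* offset 4 */
} reordered_t;

size_t reorderedSize(void)
{
    return sizeof(reordered_t);
}
```

Here the structure occupies exactly the 8 bytes its members need.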

Slide 27: Using #pragma pack

So what if you still want no padding but you can’t reorder the elements in the structure, say for example when you’re overlaying a data structure onto memory-mapped hardware or when you are moving data structures between different processors? The only weapon at your disposal is to use the preprocessor pragma pack, which tells the compiler that you don’t want it to insert any padding, and you will have to deal with misalignment if there is any. And so that is still something you need to worry about. You are now in charge: the compiler will not insert any padding, but if there are alignment exceptions then you are responsible for that.

Slide 28: Using #pragma pack (cont'd)

This is again the original structure from a couple of slides ago. And now with the preprocessor pragma pack option I’ve told the compiler that I don’t want any padding, and so the compiler is following my directions and there is no padding whatsoever. There are 8 bytes supposedly in my structure and my structure is indeed taking up 8 bytes. Note however that the variable mode, which is a 16 bit wide variable, is completely unaligned. It is entirely possible that we’ll have an access violation when we try to access the variable mode inside the structure. It is also possible that the compiler is smart enough to know that the processor would crash upon trying to do an unaligned access, and it might generate code that treats mode as two separate bytes sitting there; in other words, it might synthesize the read and the write from multiple load and store instructions.
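As a sketch of the packed version: the push/pop form of pragma pack shown here is accepted by GCC, Clang, and MSVC, but it is compiler-specific, so check your own toolchain. The member names other than mode and counter are hypothetical, and the memcpy-based accessors are just one cautious way to touch the misaligned member without forming a misaligned uint16_t pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push, 1)          /* no padding between members */
typedef struct
{
    uint8_t  flags;    /* offset 0                          */
    uint16_t mode;     /* offset 1: NOT on a 16 bit boundary */
    uint8_t  status;   /* offset 3                          */
    uint32_t counter;  /* offset 4                          */
} packed_t;
#pragma pack(pop)              /* restore the previous packing */

/* Copies the two bytes of 'mode' out of the structure byte by byte. */
uint16_t packedReadMode(const packed_t *p)
{
    uint16_t value;
    memcpy(&value, (const uint8_t *)p + offsetof(packed_t, mode),
           sizeof value);
    return value;
}

/* Copies a new 'mode' value into the structure byte by byte. */
void packedWriteMode(packed_t *p, uint16_t value)
{
    memcpy((uint8_t *)p + offsetof(packed_t, mode), &value, sizeof value);
}
```

Note that byte-wise access like this is not atomic, which ties back to the earlier discussion of torn reads and writes.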

Slide 29: Common Embedded Bugs

Now I want to turn your attention to some of the most common embedded bugs.

The first one is race condition. It happens when two tasks interrupt each other and manipulate the common global data or shared resource or in the case of an ISR firing while main is running and also manipulating some shared resource.

A second type of bug which is very common is stack overflow. In this particular case we didn’t allocate enough room for the stack and the stack grows past the end, trampling some of the data or falling off the edge of the RAM. In the case of a multithreaded system it’s entirely possible that overflowing the stack is going to go into another task’s data or stack, making this very fun to debug because the problem is in one task and yet you see a problem in a different task.

The third type of very common bug is priority inversion. This is a byproduct of multitasking and sharing and so to prevent race conditions we are going to use mutexes and mutexes will bring with them some baggage, in this particular case priority inversion and so we must find a way to prevent this.
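For the stack overflow case, one commonly used detection technique (not described in the talk itself) is to fill the stack with a known pattern at startup and later check how much of the pattern survives. The sketch below assumes a descending stack, where the stack grows from the high end of the region toward stackBase; the function names and the 0xA5 pattern are hypothetical choices.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN  0xA5U

/* Fills a stack region with a known pattern. This would be done once at
 * startup, before the task that owns the stack begins to run. */
void stackPaint(uint8_t *stackBase, size_t stackSize)
{
    for (size_t i = 0U; i < stackSize; i++)
    {
        stackBase[i] = STACK_FILL_PATTERN;
    }
}

/* Counts the bytes at the low end of the region that still hold the
 * pattern. A result of 0 means the stack has been used all the way to
 * its limit and has possibly overflowed. */
size_t stackUnusedBytes(const uint8_t *stackBase, size_t stackSize)
{
    size_t unused = 0U;

    while ((unused < stackSize) &&
           (stackBase[unused] == STACK_FILL_PATTERN))
    {
        unused++;
    }
    return unused;
}
```

Checking this periodically during testing gives a high-water mark per task, which helps size each stack before the overflow shows up as corruption in some other task.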

Slide 30: Race Condition - Description

Let’s take a look at race conditions in a little more detail. What are the necessary conditions for a race condition to happen? The first thing we must have is either two tasks or one task and an ISR. We need to have preemption of some kind; it can be an interrupt or a task switch done by the RTOS. And we need to have some kind of shared resource, and this shared resource can be either some global data or some registers in hardware.

In order for the actual race condition to happen, Task A must be in the middle of a non-atomic operation on the shared resource when it gets preempted, either by an interrupt or a task switch, and then Task B takes over and manipulates the same data or registers, thus trashing what Task A was doing. As you can already see, this type of bug would be extremely difficult to duplicate because Task B needs to interrupt Task A exactly at the right time; in other words, Task A needs to be in the middle of the manipulation of the shared resource when Task B gets to run, and that may take some very precise timing.

Slide 31: Real World Consequences: Therac-25

There can be some really nasty consequences in the real world to race conditions. The example that I want to talk about is the Therac-25, a radiation therapy machine. The symptom was that the machine ended up giving a dose of radiation two orders of magnitude greater than it should have. There were a total of six incidents documented; three of them ended up resulting in deaths. After analysis the root cause was deemed to be a rare counter overflow combined with a mode switch. The counter was sampled at the wrong time, in other words in the middle of a race condition.

Slide 32: Race Condition – One Writer Thread, One Reader Thread

I want to show you now a very simple example of how a race condition can happen even if you have only one writer task and one reader task.

So let’s take a look at a very simple piece of code. On the top left I have incremented ticks by 15 and then checking if ticks is greater than or equal to 100 and then setting ticks to 100 if that is the case.

But if the writer task gets preempted in between the increment and the if statement, then the reader task will get a value that is out of bounds or invalid. One way to prevent this race condition is to perform the critical section code with interrupts disabled. And so we no longer have a race condition: the writer task will disable interrupts, manipulate the ticks variable, and then enable interrupts, so that the ticks variable will never have an illegal value available to the reader task.

But we can do better than that. Assuming that ticks is a variable that can be written atomically, we can rewrite the code as you see in the bottom left such that ticks never has an out of bounds or illegal value, no overflow. And so the reader task will never get the wrong value. This actually is the preferred approach.
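The slide code is not included in the transcript, so here is a sketch of the two variants as described: one that briefly exposes an out-of-range value, and one that computes the new value in a local variable and publishes it with a single store. The names gTicks, writerRacy, writerSafe, and readerGetTicks are hypothetical, and the safe version assumes a single writer and that a 32 bit store is atomic on the target.

```c
#include <stdint.h>

#define TICKS_MAX   100U
#define TICKS_STEP  15U

static volatile uint32_t gTicks = 0U;

/* Racy version: if the reader runs between the increment and the check,
 * it can observe a value greater than TICKS_MAX. */
void writerRacy(void)
{
    gTicks += TICKS_STEP;
    if (gTicks >= TICKS_MAX)
    {
        gTicks = TICKS_MAX;
    }
}

/* Preferred version: do the arithmetic on a local copy and write the
 * shared variable exactly once, so it never holds an illegal value. */
void writerSafe(void)
{
    uint32_t next = gTicks + TICKS_STEP;

    if (next >= TICKS_MAX)
    {
        next = TICKS_MAX;
    }
    gTicks = next;
}

/* Reader side: a single read of the shared variable. */
uint32_t readerGetTicks(void)
{
    return gTicks;
}
```

The safe version needs no interrupt masking at all, which is why it is attractive when the atomic-write assumption holds.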

Slide 33: Race Condition Example

Let’s take a look at another pretty simple piece of code and show how there is a race condition here. And so on top we have an isr, this isr belongs to a timer and so every so often the timer goes off, we get an interrupt, we go execute the isr and within the isr all we do is we set uCounter and give it a value of 0.

In the meantime in main we have an infinite loop, and in this infinite loop we increment the value of uCounter, and then we do some processing depending on the value of uCounter. Well, the increment of uCounter is a read modify write, and so if we are in the middle of the read modify write when the interrupt triggers then we will go into the isr. The isr will set uCounter to zero, but immediately upon return main will write back the original value of uCounter incremented by one. And so now potentially uCounter is out of range, and if it’s not out of range now it will be soon, because the code was banking on the fact that this periodic isr was going to set uCounter to 0 and now it has completely missed the opportunity to do so.

Slide 34: Race Condition - How to Prevent/Fix

Here we have the same code as in the previous slide but we have now removed the race condition. In fact not only did we remove the race conditions, we’ve added a couple of other improvements.

The first thing we did is to add a ‘g’ prefix to the name. The ‘g’ is for global, and we want all globals to stand out so that it becomes obvious that they might be the reason for a race condition and that we have to pay attention to the way we manipulate them.

The second thing we did is to add an assert; this assert will allow us to see if the variable has ever gone beyond the values that it’s supposed to have.

The third thing we did was actually remove the race condition by putting disable and enable interrupts around the critical section and adding a comment to note that this is critical section code.
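A sketch of those three changes together is shown below. It is not the slide code: gCounter, COUNTER_LIMIT, timerIsr, mainLoopStep, and the DISABLE_INTERRUPTS/ENABLE_INTERRUPTS macros are hypothetical. On a real target the macros would mask and unmask interrupts; here they only track nesting so that the sketch can run on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical port layer. A real implementation would mask interrupts
 * (and ideally save and restore the previous interrupt state). */
static int gIrqDisableDepth = 0;
#define DISABLE_INTERRUPTS()  (gIrqDisableDepth++)
#define ENABLE_INTERRUPTS()   (gIrqDisableDepth--)

#define COUNTER_LIMIT  1000U   /* assumed legal upper bound */

/* 'g' prefix: a global shared between main and the timer ISR. */
static volatile uint32_t gCounter = 0U;

/* Periodic timer ISR: resets the counter. */
void timerIsr(void)
{
    gCounter = 0U;
}

/* One pass of the body of the infinite loop in main. */
uint32_t mainLoopStep(void)
{
    uint32_t snapshot;

    /* CRITICAL SECTION: read-modify-write of a variable shared with
     * timerIsr; the ISR must not run in the middle of it. */
    DISABLE_INTERRUPTS();
    gCounter++;
    snapshot = gCounter;
    ENABLE_INTERRUPTS();

    /* Catch the counter ever leaving its expected range. */
    assert(snapshot <= COUNTER_LIMIT);
    return snapshot;
}
```

Keeping the critical section this short matters, because every cycle spent with interrupts disabled adds to interrupt latency.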

Slide 35: Race Condition - How to Debug

So how do we go about debugging race conditions? It is always a good idea to start by looking at the code; anything that touches the shared resource should be inspected. Of course this is going to be a tremendous onus; it’s going to be tedious and extremely time and labor intensive. You could also set a breakpoint in the code that manipulates the shared resource and inspect the call stack. You might discover that you thought it was Task A manipulating the resource at this time, but actually Task B has snuck in and is now doing the manipulation. You should instrument your code to do some tracing and logging.

This will give you visibility into which task is the one that is manipulating the shared resource. One of the other things you could do is start a process of elimination: remove as much code as you can. What's left is what is causing you the problem. If you remove a piece of code and the problem goes away, it is quite possible that that is the offending code.
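One lightweight way to instrument the code is a small in-RAM trace buffer that records which task touched the shared resource and what it wrote. The sketch below is an assumption about how that might look, with hypothetical names throughout; keep in mind that even a logger this small changes timing, so it can still hide a race.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH  8U   /* power of two so the index can be masked */

typedef struct
{
    uint8_t  taskId;   /* who touched the shared resource */
    uint32_t value;    /* what value they wrote            */
} trace_entry_t;

static trace_entry_t gTrace[TRACE_DEPTH];
static uint32_t      gTraceCount = 0U;

/* Records one event, overwriting the oldest entry when full. On a real
 * system this must be called with interrupts disabled (or from a single
 * context), otherwise the logger itself has a race condition. */
void traceLog(uint8_t taskId, uint32_t value)
{
    trace_entry_t *entry = &gTrace[gTraceCount & (TRACE_DEPTH - 1U)];

    entry->taskId = taskId;
    entry->value  = value;
    gTraceCount++;
}

/* Returns the n-th most recent entry (0 is the newest), or NULL if that
 * entry was never written or has already been overwritten. */
const trace_entry_t *traceRecent(uint32_t n)
{
    if ((n >= gTraceCount) || (n >= TRACE_DEPTH))
    {
        return NULL;
    }
    return &gTrace[(gTraceCount - 1U - n) & (TRACE_DEPTH - 1U)];
}
```

After a failure, the buffer can be inspected from the debugger to see the last few accesses and which task made them.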

Slide 36: Race Condition - Difficult to Debug

By nature race conditions are difficult to debug; in fact they are difficult to duplicate. The smaller the critical section is, the more difficult it’s going to be to reproduce the bug, because the window of opportunity for the interrupting task is much smaller. As this is a type of bug that depends heavily on timing, it is entirely possible that adding or removing any kind of code, including debug code, will make the bug appear or disappear, sometimes making you think that somehow magically you’ve solved or removed the bug when in reality it isn’t so. It just isn’t manifesting itself because you changed the timing; the bug is still lurking.

Two tools that might be effective in trying to figure out this type of bug are a logic analyzer, since that will not affect timing, and a static analysis tool; some of them are capable of finding race conditions. The best approach overall is keeping race conditions out of the code. You do that by design and by inspection and by heavy testing.

Slide 37: Priority Inversion – Mars Pathfinder

I now want to switch topics to priority inversion. I want to talk about a real life example and so when NASA launched the Mars Pathfinder, the code for the Rover wasn’t ready. And so NASA decided that they had enough time while the mission was in transit to Mars to finish developing the code and that they would then remote download the code to the Rover and so they put in a sophisticated bootloader onto the Rover and started downloading the code once the Rover was in Mars.

However after several attempts of downloading the code and in the middle of the download the Rover rebooted, they started experimenting with the one Rover that they had left on earth and they discovered that they had a race condition between the tasks. And so one of the tasks wasn’t checking in on time and the software watch dog was making sure that the software rerouted and that why they weren’t able to download remotely.

Once they discovered what the problem was, they were able to fix the bootloader and upload it, and then upload the rest of the code, but they were pretty close to having to scrub the whole mission.

Slide 38: Debugging the Mars Pathfinder Priority Inversion

I’ve already mentioned that, luckily for NASA and JPL, they had built two exact replicas of the Rover: one that went to Mars and one that stayed on Earth. So when they started debugging the problem they were able to use the one on Earth. They knew that the RTOS had tracing capabilities, and so they were able to look at context switches, interrupts, and all kinds of system events.

When they replicated the problem they noticed that it was the input task that was starving due to priority inversion. They also noticed that even though the RTOS they were using had priority inheritance, it wasn’t enabled by default. So they enabled the feature, rebuilt the bootloader with priority inheritance, and downloaded that; using the new bootloader they were able to download the rest of the code to the Rover.

Slide 39: Priority Inversion - How to Debug

Debugging a priority inversion can be a real nightmare. It is extremely difficult to reproduce to begin with. One of the things that you might want to try is to do some corner testing and take the system to extremes, to increase your chances of the bug showing up. Collecting and analyzing trace data and logging might help.

In the case where it’s a reboot by the watchdog, it might be a good idea to connect the watchdog to an NMI instead of to reset, so that the NMI handler can do some further logging: for example, log which task didn’t check in on time and the reason why the watchdog expired. Reproducing the problem is going to be key here, and reproducing it is not going to be simple.
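
As a minimal sketch of that logging idea, assume a hypothetical scheme in which each task sets its own check-in flag once per watchdog period and the watchdog's NMI handler records which flags are still clear. The names `task_checkin`, `watchdog_nmi_log_late_tasks`, and `NUM_TASKS` are invented for illustration and do not come from any particular RTOS or from the Pathfinder code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 4u  /* hypothetical number of monitored tasks */

/* One flag per task, so each task writes only its own element and no
   read-modify-write on shared data is needed. */
static volatile bool g_checked_in[NUM_TASKS];

/* Bit n set means task n was late. In a real system this would have to
   live in memory that survives the reset (placement is toolchain-specific
   and not shown here). */
static volatile uint32_t g_late_task_mask;

/* Called by each task once per watchdog period. */
void task_checkin(unsigned task_id)
{
    assert(task_id < NUM_TASKS);
    g_checked_in[task_id] = true;
}

/* Body of a hypothetical watchdog NMI handler: record which tasks
   failed to check in before the system is reset. */
uint32_t watchdog_nmi_log_late_tasks(void)
{
    uint32_t late = 0u;
    for (unsigned i = 0u; i < NUM_TASKS; i++) {
        if (!g_checked_in[i]) {
            late |= (uint32_t)1u << i;
        }
    }
    g_late_task_mask = late;
    return late;
}
```

If tasks 0, 1, and 3 have checked in when the watchdog expires, the logged mask has only bit 2 set, which points directly at the starved task.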

Slide 40: Priority Inversion - How to Prevent/Fix

So how do you prevent priority inversions? Number one, by leveraging the RTOS workaround: the RTOS more than likely has already implemented one of the protocols designed to eliminate priority inversion, which can be priority inheritance, priority ceiling, or highest locker.

Make sure you never use a semaphore for mutual exclusion; use the right tool for the right job, which here is a mutex. You may also want to consider redesigning some of your code so that the use of mutexes is eliminated. You can go for a client-server type of architecture and serialize things through the use of semaphores and queues.
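
How you turn on priority inheritance depends entirely on your RTOS. As one illustration, here is a sketch that uses the POSIX threads API as a stand-in, assuming a platform that supports the `PTHREAD_PRIO_INHERIT` protocol; the helper name `init_prio_inherit_mutex` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Initialize *mutex so that a low-priority task holding it temporarily
   inherits the priority of the highest-priority task waiting on it.
   Returns 0 on success, or the pthread error code otherwise. */
int init_prio_inherit_mutex(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(mutex != NULL);

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }

    /* Priority inheritance is not the default protocol; it must be requested. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0) {
        rc = pthread_mutex_init(mutex, &attr);
    }

    (void)pthread_mutexattr_destroy(&attr);
    return rc;
}
```

The point of the sketch is the explicit request for the protocol: as in the Pathfinder story, the feature may exist in the kernel and still not be on unless you ask for it.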

Slide 41: Stack Overflow

The last thing I want to talk about today is stack overflow. It’s a problem that I see quite often. What happens is we have a stack of a pre-determined size; we get, for the most part, to decide what size that’s going to be. Eventually we start putting too much stuff on the stack and we go beyond the limits of the array that we’re using for the stack.

It is very difficult to estimate how much stack you’re going to need; if you are doing recursion, it’s even more difficult. Generally you need to allow for your maximum call depth plus all the nesting of interrupts: you must assume that it’s possible for all the interrupts to fire in such a way that they go from low priority to high priority and all nest on your stack.
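
The arithmetic behind that rule of thumb can be sketched as follows. The function name and the byte counts in the usage note are made up for illustration; real figures would come from your compiler's stack usage reports or from measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Worst-case stack need in bytes: the deepest call chain plus one frame
   for every interrupt, assuming they all nest from lowest to highest
   priority on the same stack. */
size_t worst_case_stack_bytes(size_t max_call_depth_bytes,
                              const size_t isr_frame_bytes[],
                              size_t num_isrs)
{
    size_t total = max_call_depth_bytes;

    assert(isr_frame_bytes != NULL || num_isrs == 0u);

    for (size_t i = 0u; i < num_isrs; i++) {
        total += isr_frame_bytes[i];
    }
    return total;
}
```

For example, a 512-byte deepest call chain with three nestable interrupts needing 64, 96, and 128 bytes gives 512 + 64 + 96 + 128 = 800 bytes before any safety margin is added.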

One other thing to remember: if you are running without an RTOS, your stack will grow in the opposite direction of the heap, and if you go beyond the limit of your stack you will start encroaching on your heap. So if you have any dynamic memory allocation, you will start trampling on some of those variables that were dynamically allocated. By the same token, when you assign values to some of those variables, those writes might be writing onto the stack, unbeknownst to you.

Slide 42: Stack Overflow - How to Find/Fix

In order to debug a stack overflow we must pre-fill the entire stack with a known pattern. At a breakpoint we must inspect the stack and see how far we can still discern the watermark. If no watermark is present we have definitely exhausted the stack and might even have gone beyond it. It is always advisable to leave a certain percentage of the stack free, because you don’t know what set of circumstances is going to require some additional stack; I would say at least 20% of the stack must be free in the worst-case scenario.

You can also write some code that does the inspection for you. It starts from the end of the stack and works backwards looking for the pattern that we used to pre-fill, so that piece of code can determine what percentage of the stack is still available and somehow report it to the user, or assert if there isn’t enough stack left. If you are using an RTOS, most RTOSs already have stack-checking code that you can just leverage; it’s already been written, so by all means use it.
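
A minimal sketch of such a checker is shown below. It assumes the stack is an array of 32-bit words that grows downward, so the far end of the stack is index 0. The fill value `0xDEADBEEF` and all function names are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEFu  /* arbitrary known pattern */
#define STACK_MIN_FREE_PERCENT 20u      /* margin suggested in the talk */

/* Pre-fill the whole stack region before the task starts running. */
void stack_prefill(uint32_t *stack, size_t num_words)
{
    for (size_t i = 0u; i < num_words; i++) {
        stack[i] = STACK_FILL_PATTERN;
    }
}

/* Count the words that still hold the pattern, scanning from the far
   end of the stack (lowest address, assuming a downward-growing stack)
   toward the first word that has been overwritten. */
size_t stack_free_words(const uint32_t *stack, size_t num_words)
{
    size_t free_words = 0u;
    while (free_words < num_words &&
           stack[free_words] == STACK_FILL_PATTERN) {
        free_words++;
    }
    return free_words;
}

/* Percentage (0..100) of the stack that has never been touched. */
unsigned stack_free_percent(const uint32_t *stack, size_t num_words)
{
    assert(num_words > 0u);
    return (unsigned)((stack_free_words(stack, num_words) * 100u) / num_words);
}

/* Assert if the high-water mark has eaten into the safety margin. */
void stack_check(const uint32_t *stack, size_t num_words)
{
    assert(stack_free_percent(stack, num_words) >= STACK_MIN_FREE_PERCENT);
}
```

Calling `stack_check()` periodically, or from an idle task, turns a silent overflow into an early assertion. Note that the result is a high-water mark: it reports the deepest usage seen so far, not the current stack depth.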

Slide 43: Key Takeaways

Let’s talk about the key takeaways for today’s webinar. There is no doubt that debugging embedded software is extremely hard, and therefore it is preferable to try to prevent bugs in the first place. It is much, much easier to keep the bugs out than to find them and remove them at a later stage. Not only is it easier, it is also significantly less expensive. Each type of bug, or each family of bugs, gets debugged differently, and you use different tools.

For some it is easier to just inspect the code; for others you might want to use a debugger, a logic analyzer, a static analysis tool, or whatever you might have. One thing to remember is that you must leverage what your toolchain is already providing, be it by way of assembly listings or the map file provided by the linker; those might be extremely helpful when chasing some of these bugs.

Question & Answer

Q: What coding standards do you prefer?

I use the Barr Group Embedded C Coding Standard.

Q: Can't the use of "volatile" be handy to avoid the race condition?

The use of the volatile keyword will not help prevent a race condition.

Q: Do you always need to declare data that is shared between two threads (e.g. ISRs) as "volatile"? If not, what are the cases where you do not need to?

You always have to declare data that is shared between two threads as volatile.

Q: How much does memory fragmentation contribute to stack overflow?

Memory fragmentation does not contribute to stack overflow.

Q: How do you debug stack trashing by a routine?

Examine the stack at different places throughout the routine and verify that what you see is what you expected to see.

Q: Is dynamic memory allocation recommended in embedded systems?

Dynamic allocation (e.g., malloc, in C) should be avoided as much as possible in embedded systems.

Q: Is there an existing tool that you have used to detect memory leaks?

Many static analysis tools can now help detect memory leaks.
