What's the difference between a mutex and a semaphore? Even very experienced firmware developers too often fail to fully appreciate the importance of using the correct tool for the job at hand. And, unfortunately, misuse of these two distinct types of synchronization primitives can lead to difficult to debug defects in embedded software, with potentially severe consequences in safety-critical medical devices, avionics and transportation equipment.
Learn how to use both mutexes and semaphores most safely and most effectively and why it is important to do so.
Slide 1: Mutexes and Semaphores Demystified
Good afternoon and thank you for attending Barr Group's webinar on Mutexes and Semaphores Demystified. My name is Jennifer and I will be moderating today's webinar. Our presenters today will be Michael Barr, Chief Technical Officer at Barr Group, and Salomon Singer.
Today's presentation will be about 45 minutes long, after which, there will be a moderated question and answer session. To send a question during the event, please type your questions into the rectangular space near the bottom right of your screen then click the send button. Please hold your technical questions until the end when the presenters will be addressing them.
At the start of the webinar please check to be sure that your speakers are on and your volume is turned up. Please make sure that all background programs are turned off so they do not affect your audio feed.
I am pleased to present Michael Barr and Salomon Singer's webinar on Mutexes and Semaphores Demystified.
Slide 2: Michael Barr, CTO
Thank you Jennifer, and welcome everyone to today's webinar. My name is Michael Barr; I'm the Chief Technical Officer and a cofounder of Barr Group.
Slide 3: About Barr Group
Barr Group, as a company, is focused on really fundamentally two things: making embedded systems reliable and making embedded systems secure. Now of course some systems have safety concerns because they could injure a person in which case the word safety also applies. So in a nutshell, we help embedded developers and a large number of companies all across different industries to make their embedded systems safer, more reliable, and more secure. We don’t have products of our own; instead we have services that align with this mission.
Slide 4: About Barr Group - Product Development
One of the services we provide is performing product development assistance. So we take on projects, typically a part of the project but also sometimes the whole. We have a large staff of embedded software experts who assist companies with different parts of their products and we also have electrical engineers who can design chips and boards as well and we have mechanical capability of designing enclosures.
Slide 5: About Barr Group - Consulting
In addition we offer consulting services where we consult typically with a Vice President of Engineering or Director of Engineering or Engineering Manager. And we address issues of architecture or re-architecture and software development process in particular, sometimes exploring the choices between modern methods such as Agile and Test Driven Development versus traditional Waterfall models helping companies move in each direction.
And we have some specific services like code audits where we read code for different purposes and write report or presentation about the code. We also do design reviews. We are, again we are providing independent advice and assessment and finally we perform security audits of particular products.
Slide 6: About Barr Group - Training
The third service that we provide is training. So, one of the ways that we are able to reach as many engineers as possible is through our training. We provide public training courses and also private onsite training courses. We have a whole catalogue of stock courses that you can find on our website and also these courses are typically hands-on courses and they are always specific to working on embedded systems and the unique challenges that we encounter.
We also sometimes develop custom courses for companies, which are typically a variation of one of the courses that we already offer but sometimes even a new course for private company. As soon as we put our courses together, it’s a training program for younger engineers coming into the company and several large companies work with us doing that.
Slides 7-9: Upcoming Public Courses
Before we get to today’s course, I just want to alert you that we regularly do trainings both at private companies and you can see on our website a list of all the courses that we offer. And we also have public trainings including some upcoming boot camps. These are our weeklong, four and a half day, hands-on, intensive software training courses.
The course titles, dates, and other details for all of our upcoming public courses are available in the Firmware Training Courses area of our website.
Slide 10: Salomon Singer, Principal Engineer
Now let’s get on to today’s webinar. Your instructor for today is Salomon Singer. He’s a Principal Engineer at the Barr Group and Salomon and I have been working together for about 7 years now. He has more than 30 years of embedded systems design experience and a degree in computer science or computer engineering. He’s worked on a number of different industries from industrial controls and transportation equipment to telecommunications. As you can see here he has an extensive list of experience with different real-time operating systems and processors and also programing languages.
At the Barr Group he does software engineering every day, writing software. He also consults with a number of different companies on a number of different projects. And additionally, as he’ll do here for you today in the webinar, he’s an excellent instructor and speaker and Salomon will be teaching the upcoming course on Reliable Multithreaded Programing in June as well as our Software Boot Camp in the Fall in the United States in Minneapolis as he teaches a number of courses for us.
So without any further ado I turn the course over to Salomon and thank you again for coming.
Mutexes Semaphores and Demystified
Thank you Mike for that introduction and good afternoon everybody. I would like to welcome you to Barr Group Mutexes and Semaphores Demystified. This webinar assumes that you are familiar with RTOS features and APIs. Maybe you don’t understand why we can’t use Mutexes and Semaphores interchangeably or maybe you have seen some weird behavior in your product and I am wondering if perhaps Mutexes and/or Semaphores are the root cause. This course will demystify Mutexes and Semaphores, it will explain when to use each one, how to use them, and hopefully clarify why they are not interchangeable.
Slide 11: Shared Resources
When we are using an RTOS it is entirely possible that two or more tasks may want to manipulate a global variable or I/O registers at about the same time. This registers or global variables are called Shared Resources since multiple tasks might want to manipulate them. Not only do this stuff may want to manipulate them, but they might want to use them at about the same time. The pieces of code that manipulate this variables and/or I/O registers are called Critical Sections and if they are left unprotected we might have a race condition.
A good example here is very simplistic, you have two tasks: X and Y and they both want to increment A, by doing A=A + 1.
Slide 12: Race Condition Example: Global Variable
So this is how the race condition happens. Task X is running and he’s doing a re-modified write operation on to A. In the middle of this re-modified write he has just read and modified but he hasn’t performed the write when there was the context switch and Y gets to run and performs the same kind of read-modified write operation that now leaves A with the wrong value in the temporary variable in R1.
Slide 13: Race Condition Play by Play
So this is how the race condition happens. To begin with A has the value of 3 and X is the tasks that is going to run first. And so X runs and when it is doing that the contents of A are moved into R1, A is now 3 so R1 is 3. R1 is incremented so now R1 is 4.
At this point there is a context switch and Y gets to run and so the contexts get saved and when Y run he grabs A and puts it into R1, A is still 3 so R1 is 3. It increments R1, so now R1 is 4 and now they write, will write R1 into A, so now A has a value of 4.
Eventually task Y finishes running and task X gets to run again upon restoring the context R1 is still 4 and now R1 is written to A so we write 4 again onto A. The desired value would have been 5, since the initial value was 3 and supposedly A was incremented twice.
Slide 14: Critical Section Analogy
We can say that a critical section is like cars approaching an intersection, the intersection being the shared resource. And so if two or more cars enter the intersection at about the same time, accidents will happen and this is exactly what happened with tasks X and Y when they were still trying to manipulate variable A.
So we need a way to prevent accidents from happening. Just like there are many protocols to protect an intersection, such as a stop sign or a stop light, in the same manner there are different protocols to protect a critical section such that only one task manipulates the shared resource at a given time. The one protocol I’m going to talk about this afternoon is Mutexes.
Slide 14: Mutual Exclusions
It is a fact that task are going to share resources, there is no doubt about it. The question is, can the OS can the RTOS do anything to prevent these critical sections? And the reality is that the RTOS has no way of identifying this critical section. So it’s up to the implementer to figure out where these critical sections of code are and do something about it.
The only thing that the RTOS can do, is provide API calls that look like wait and leave, wait being wait until it’s safe to enter this critical section. In other words no other task is currently doing anything with this shared resource and the other call would be leave to let the RTOS know and the RTOS will let all the tasks know that it is now safe to go manipulate that particular resource.
Slide 16: Mutex
So let me define Mutex. Mutex is an operating system data structure that is going to be used by tasks to make sure that you have exclusive access to a shared resource. It is short for MUTual EXclusion and it is going to be used to prevent race conditions.
Slide 15: Mutexes
So we said that a Mutex is an RTOS data structure, deep down, it is a multitasking safe binary flag with some other functionality that we will talk about a little later. The Pseudocode for using the Mutexes looks like this: First acquire the Mutex and that might involve some waiting on if some of the task is currently using that shared resource. Eventually, we will get the Mutex then we will perform the critical section code and upon finishing that critical section we will release the Mutex back to OS so that it can be given to another task if another task is waiting.
Each Mutex should be used to protect exactly one shared resource and all tasks that might want to use that particular resource must use that particular Mutex.
Slide 18: Bathroom Key Analogy
A good analogy for Mutex is the key for the bathroom in a gas station. So the Mutex is the bathroom key, the tasks are the customers, and the shared resource is the bathroom itself (there is only one of those) and only person can use it at a time.
And obviously they need the key to use the bathroom. And so customer X arrives nobody is using the bathroom so the key is at the counter, customer X gets the key, goes the bathroom. While using the bathroom, customer Y arrives she wants to use the bathroom but the bathroom is now being used so the key is not available.
Eventually, customer X will come out of the bathroom and return the key to the counter, now customer Y can grab the key and go use the bathroom.
Slide 19: Mutex Data Structure Details
This is a conceptual implementation of the Mutex Data Structure. It must contain two key elements. One is the value or the current Mutex stage: is it, for you available or is it busy or being used? And the other element is a pointer to a queue or tasks that are blocked and waiting for that Mutex. A value of one would mean available and a value of zero would mean unavailable. When there are no tasks waiting for that Mutex that list would be empty or the pointer will have no value.
Slide 18: Mutex Creation Details
Before a Mutex can be used it has to be created. So it is going to be created at the request of a task. The create API will return some kind of handle and this handle must be known to all the tasks that are going to use the Mutex.
And so this is the Pseudocode for the Mutex create. Upon creation, we must set the pointer to the list of waiting tasks to know, the value to available and then we will return a handle; the handle can be a pointer to a data structure, a Mutex index, or some other kind of ID that will uniquely identify each and every Mutex.
Slide 21: A Scalable Solution
Mutexes are a totally scalable solution and so this Mutex protocol will work for any number of tasks. If N tasks try to acquire the Mutex only the first one to actually issue the call will succeed the other N – 1 task will wait in the queue until the Mutex is available again.
When a task releases then the Mutex after finishing the critical section, one of these tasks in the queue will become ready to run because he has now acquired the Mutex. Any number of Mutexes can be created subject only to practical limitations.
Slide 22: Mutex Summary
So to summarize: a Mutex must be created in the available state, and we must always use the Mutex by protocol meaning that we are going to wait first, once we get the Mutex we will execute the critical section occurred and then we will release the Mutex back to the RTOS.
So we can think about this as a stop sign at an intersection. We approach the intersection when we get there; we wait until it is our turn to cross the intersection. Once we know it is our turn we go in, we do it safely no accidents and upon exit we signal that we are out of the intersection and it’s okay for somebody else to come into the intersection.
It is very important to remember never to release a Mutex that we didn’t previously acquire. This might cause serious problems to other tasks that might come behind you.
Slide 23: Reentrant Software
What if there are two or more tasks that want to call the same function at about the same time? So, each one of these functions must be reentrant. We have to remember that your typical C-Standard libraries are not reentrant including those that ship with the new compiler. If you are using new you should look for a newlib or some other kind of reentrant library. The C-Standard library from vendors of cross compliers like IAR, are already safe or reentrant. You need to remember also that your driver code needs also to be reentrant.
Non-reentrant code or non-reentrant functions are the source of many problems that are extremely difficult it is not borderline impossible to reproduce because it takes special timing. Two or more tasks need to be calling a particular non-reentrant function at about the same time for this problem to manifest itself.
Slide 24: Non-Reentrant Function Risks
So drivers and library routines are good example of functions that might get called from two or more tasks at about the same time and so they must be reentrant. If they are not reentrant, there is a very good possibility that if these functions are called at about the same time, the second caller might clobber values while the first caller was running.
So task A runs calls a library function, say MEM copy. MEM copy is in the middle of performing that copy when it gets interrupted and now task B runs, it calls the same function and upon the call, if the library is not reentrant, some of the values will get clobbered and when task A resumes the copy will be from the wrong place or to the wrong place or the length will be changed, in other words something will go awry.
Slide 25: Requirements For Reentrancy
So let’s talk about the requirements for reentrancy. And so for variables it is totally okay to use automatic variables because they will be allocated in the stack for each task. So there is no sharing there.
It is unsafe to use statics or dynamics unless you protect them with a mutex. You can use dynamics that are allocated and freed within the call but not dynamics that are allocated somewhere else.
If the function contains access to peripheral registers then this access must always be protected with some kind of mechanism, a Mutex or some other way of protecting the critical section.
Slide 26: Example: TCP/IP Networking
As an example of code that needs to be reentrant let’s take a look at TCP/IP Networking. The protocol stack is nothing but a library of functions and so to send a package from a user task, what we need to do is make a socket API call.
From within that call, a call will be made to a TCP layer function and that in turn will make an IP layer function and it will percolate all the way down to the network driver. When it gets there the driver controls exactly one chip and for that reason it is a shared resource. So because it is a shared resource, it needs to be protected just like with any other global variable; and so somehow this code needs to be protected and it needs to be treated as a critical section.
Slide 27: Example (cont’d): TCP/IP Networking
This graphic illustrates what we were just talking about. It is entirely possible that A is trying to send a packet and the code percolated all the way down to the driver and we are in the middle of sending a packet. Now, code gets interrupted, task B makes an API call, he wants to send a packet as well and it percolates down again all the way to the driver. If the driver is not reentrant, task B will now write things to the device that might jeopardize the sending of the packet that A was trying to achieve.
Slide 28: Reentrant Functions – The Very Best Use of Mutexes!
Reentrant functions are the best place to use Mutexes. Instead of having Mutex code distributed among all the task that might want to access a shared resource, it is preferable to have a reentrant function that contains the Mutex access and the critical section code all within it.
If we do it this way, the access on the critical sections are totally transparent to the application and implementers don’t need to remember that access to a certain global or a shared resource need to be protected with the Mutex.
Slide 29: When and Where to Use Mutexes
Two good rules to adhere to when we are using Mutexes: Rule number one, use Mutexes only in leaf node source files. Meaning in source files, meaning in source files that don’t make any calls to any other source files. A good example is device drivers with an API and a Mutex hidden within it. So the access, the critical section code, and the release are all in there. One extra bonus of doing it this way is that you can never attempt to get more than one Mutex at a time because there are no further calls from this present function.
Rule number two, avoid Mutexes in task level source files. It is always preferable to use directional inter-task communications, semaphores, mailboxes or message queues to serialize things and remove the Mutex completely from the equation. Mutex, failure to release a Mutex can cause all tasks that are attempting to use the Mutex to get lost forever. So a task level bar can bring the system to its knees.
Slide 30: What is a Semaphore?
Let’s take now about Semaphores. What exactly is a Semaphore? So it is an RTOS provided object and at its core it’s nothing but an integer, a global variable but unlike other global variables. It is totally safe to use by parallel running tasks. It is safe to use by dispersed because the RTOS is going to provide within its atomicity. In other words every time we go and manipulate this variable, it will be in an interruptible fashion.
Slide 31: Types of Semaphores
There are three types of Semaphores. The first one is a counting Semaphore. At it is core again, it’s a counter, it is multitasking safe, and it can be incremented or decremented atomically by making calls to the RTOS via the established API.
The second type is the Binary Semaphore and here it’s a multitasking safe binary flag; in other words, it can only have values of zeros and ones. It can be set to one or reset to zero atomically again via RTOS provided API calls.
The third type is the Mutual Exclusive Semaphore or Mutex something we already talked about and so it is a special case of a binary Semaphore but with some additional capabilities that do not exist in your regular Binary Semaphore.
Slide 32: Semaphore Data Structure
This is the conceptual implementation of the Semaphore data structure. At its core it has two elements. The first one is a pointer to a TCB or Task Control Block or a queue of blocked tasks waiting for the Semaphore to unlock, to be available. The other element is the value and the case of a counting Semaphore is any value, zero or any other integer in the case of a binary Semaphore the values could be zero or one.
If you remember conceptually the data structure for Mutexes looks extremely familiar and as an OS implementer the similarities don’t end just here. As an OS user, which most of you are, you need to forget that I ever showed you this because Semaphores and Mutexes have completely different uses.
Slide 33: Waiting for Semaphores
So who can wait or pend for a Semaphore? ISRs can only make non-blocking RTOS calls so it is not okay to attempt to pend or wait for a signal from Semaphore from inside an ISR. But it is totally okay to post or put or signal to a Semaphore from within an ISR because that call will never block.
So ISRs can very safely signal via a Semaphore to a task for example and let it know that an awaited event just happened. Say for example, somebody just pushed button number one. ISRs are restricted in the sense that they can post but not pend; tasks have no such restrictions. Tasks can both pend and post to a Semaphore.
Slide 34: Mutexes VS. Semaphores
I want to be very clear about the fact that the uses for Semaphores and Mutexes are totally distinct. Mutexes should be used only for Mutual Exclusion. They each control the access to a single shared resource. The value inside the data structure will always be initialized to one meaning available and we will always wait first and then leave and that’s the only order in which the API must be used.
Semaphores, on the other hand, are for one to one signaling, they can be initialized to any value although typically they will be initialized to zero. There is no prescribed order to use the API, you can actually signal or wait first, there is no necessary order here. You can actually do it in a loop or signal, signal, signal or wait, wait, wait it doesn’t really matter.
Slide 35: Examples: A/D Conversions Semaphores
This is a good example of how to use a Semaphore to let a task know that an event that it was waiting for has occurred. So we are going to handle an A/D conversion so the task goes in and selects the resource or sources of interest, starts the A/D conversion and now pends on a Semaphore.
Eventually, when the conversion is over, we will get an interrupt and we will jump into the ISR. The ISR will handle the end of the A/D conversion maybe retrieve the values if necessary and then will post a Semaphore.
Upon posting to the Semaphore and exiting from the ISR, the task will be unblocked and the task can now resume and handle whatever needs to be handled with the values are retrieved from the conversion.
Slide 36: Semaphores in Practice
So these are the many ways that you can use a Semaphore. You can use a Semaphore to post from an ISR to waken a task. You can post to a Semaphore from a task to awaken another task. Or you can even post to a Semaphore from a task to awaken multiple tasks.
Slide 37: Problem: Multiple Identical Resources
A few slides ago, we talked about the problem of a single bathroom in a gas station and how we could use the analogy for the Mutex. And so now if we have the same gas station with multiple identical bathrooms does the analogy extend? Can we now deal with the case of two plus bathrooms?
If you Google this problem you will see that many sources that say that this is the case but in reality they are all wrong. A Counting Semaphore cannot protect two or more resources; the biggest problem here is that the counting Semaphore has no knowledge whatsoever of which of the bathrooms is the one that is available and which one is not.
Slide 38: Solution: Distinct Mutexes
The right solution for this problem would be to use distinct Mutexes. Remember we said that one Mutex protects one shared resource, so each bathroom will have one key. When customer X grabs the key he knows if it is for bathroom A or for bathroom B; and if there is only one key available when customer Y arrives he knows exactly which bathroom is available because he only has the key to say bathroom B.
Slide 39: Another Problem: Priority Inversions
Here is another problem we have to deal with: the possibility of a priority inversion and it’s a pretty sticky problem. So let me show you how the inversion actually works. Low priority task wants to run, nobody higher wants to run, so low priority task runs, he runs for a while and eventually he wants a resource. He finds the resource available so now he’s still low priority task but he is running as dark blue.
While he is running with the resource, a higher priority task wants to run so there is a context switch and now the high priority task runs for a while. Eventually the high priority task wants the same resource that the low priority task is already holding. Because the resource is not available, the high priority task is going to block and the low priority task is going to resume.
And so low priority task resumes, runs for a little bit and now medium priority task wants to run and because he is higher priority than medium there is a context switch so he runs until he no longer wants the CPU or he blocks and only then can low priority task resume its work. So it does a little bit more work and eventually he is done with the resource at which point high priority task unblocks.
And so now high priority task acquires the resource, he runs with the resource for a while then he returns the resource, does some more work and eventually relinquishes the CPU or blocks or ask for a delay or something like that. And only then can low priority task be swapped in again and finish all the work that it needed to do.
Notice that the medium priority task doesn’t even want to use the resource. It’s interrupting the low priority task only because he has a higher priority than the lower priority task. So in essence, he is introducing a delay to the completion of a low priority task with the resource, which is in turn delaying the completion of the work that high priority task needs to do with the resource and thus maybe causing high priority task to miss a deadline.
Slide 40: Implication of Priority Inversion
So what is the big deal about Priority Inversions anyway? If priority inversion can happen in our system, then it is absolutely impossible to calculate how much time does it take a task to execute a path through each code.
And if we cannot measure that, then it is impossible to figure out how much CPU are we using and if we cannot figure out how much CPU are we using, then it is totally impossible to ascertain if certain tasks are going to meet your deadlines. And so in order to prevent all this from happening we need to prevent priority inversions from happening in our system.
Slide 41: Priority Inversion Workarounds
There are several ways to prevent priority inversions, I’m going to talk about two of them. The first one is called Highest Locker or Priority Sealing Protocol. This can be implemented above or outside of the RTOS and so in this protocol each shared resource is going to be given a priority that is just above the priority of any of the tasks that use this particular shared resource.
In this particular protocol, when a task requests a resource and it is granted, its priority will be raised to the priority of the shared resource until the task completes the use of the resource and returns it upon which its priority will be lowered to its original priority.
In the second protocol, which is Priority Inheritance Protocol, this can be implemented only inside the RTOS and the priority of the task that holds a resource is going to be increased automatically but only until a higher priority task that wants the resource is demanding that resource and it finds the resource not available. Upon return of the resource the task will be demoted to its original priority.
Slide 42: Example: Highest Locker
Let me show you how Priority Sealing Protocol works. Again in light blue is the low priority task without the resource, and dark blue the low priority task running with the resource, in yellow high priority task running without the resource, and in brown high priority task running with the resource.
And so to begin with the low priority tasks starts running he doesn’t need a resource, it runs for a while eventually he needs the resource. So immediately he gets promoted to priority high plus one. Sometime while low is running at high plus one, high wants to run but he will not swap in because there is a higher priority task running.
Eventually, low who has been promoted to high plus one, will be done with the resource. He will release it at which point high will be swapped in he’ll run for a while without the resource when he wants the resource he will acquire it and he will be promoted to priority high plus one. Run at that priority until he’s done with the resource at which point he will be demoted to his original priority of high, he’ll run until he’s done with his task and then low priority task will be swapped in so he can finish his duties.
Slide 43: Priority Inheritance
Let me show you how Priority Inheritance Protocol works. In light blue is the low priority task running without the resource, in dark blue it’s the low priority task running with the resource, in yellow it’s the high priority task running without the resource, and in beige it’s the high priority task running with the resource.
And so, to begin with low priority task is running; it runs for a while without a resource. Eventually he wants the resource and the resource is available so it is granted to him and so it runs for a while with the resource and now high priority task wants to run so there is a context switch and high priority task starts running and he runs for a while and eventually, high priority task wants the resource.
But the resource is not available because low has it and so immediately low gets promoted to high. So now low who has been promoted to high runs with the resource for a while until he no longer needs it, he returns it upon which it awakens task high and so now there is a context switch and high runs with the resource then eventually returns the resource, continues doing some more work until it finishes its work, and only then can low priority task get switched in again so it can finish the work he has remaining.
Slide 44: Priority Inheritance Scales Nicely
In the prior example I showed you how Priority Inheritance Protocol works for two tasks but I want to tell you that the protocol scales very nicely. So I’m going to show you an example with three tasks.
And so at the beginning low priority task starts to run without the resource, eventually he acquires the resource, and while he is using the resource, medium priority task wants to run. And so he gets switched in and medium priority runs for a while eventually he wants the resource and so the resource is not available. So he will block but low priority task will now be promoted to medium priority task. And he will run with the resource for a little bit and now high priority task wants to run so he gets switched in and high priority task runs without the resource for a while.
Now he want to resource and the resource is not available, so he will block. And now low priority task who had previously been promoted to medium priority is now promoted again to high. And so he runs again with the resource until he completes the use of the resource at which point he’ll get demoted to a low priority and high priority can acquire the resource and run.
So high runs with the resource for a while eventually gives up the resource and runs some more without the resource and then he is done with all the work that he had to do so medium can run. And so medium now acquires the resource successfully, runs with the resource for a while, eventually gives the resource up and it is done with all the work he had to do; only then can low priority task resume and finish its task.
Slide 45: Mars Pathfinder Story
This is an interesting priority inversion story. In 1997 NASA launched a Mars Pathfinder mission and when time came to launch, the software for the rover wasn’t ready. So instead of delaying the mission, NASA engineers decided that they would put a boot loader onto the rover and they would have time between the launch and the time that the vehicle gets to Mars to finish developing the code and so they built two of these rovers one went to Mars and the other one stayed on Earth.
And the engineers finally finished writing the code by the time the rover got to Mars and when it got to Mars they started downloading the new code onto the rover and midway through the download the rover rebooted and so they tried again and the same exactly thing happened, so they tried several times I don’t know how many exactly, but the bottom line is that they couldn’t download because in the middle of the download, the rover rebooted.
And so luckily for them they had another of those rovers down on Earth and so they started testing and trying to figure out what the problem was. And what it turned out to be is that they had a Priority Inversion and so the software watchdog was going off and resetting the system.
The reason was that Priority Inheritance Options was not on as default in VxWorks and so all they had to do was turn the option on and they stopped the Priority Inversions and then they had no resets and they were able to download the code onto the rover and so they saved the mission.
Slide 46: Mutexes VS. Semaphores
So let me recap. Semaphores don’t have priority inversion workarounds built into them and so they can be used by Mutex Protocol to prevent race conditions but they should never be used in real-time systems with hard deadlines because if you do priority inversions will happen.
Use Semaphores only for signaling. So you can signal from one task to another task or from ISR to a task to let the waiting task know that the awaited event has now happen or happened again etc. You need to think of Semaphores as messages without data. The actual post is the data itself.
Slide 47: Key Takeaways
We must remember that Mutexes and Semaphores are distinct. Mutexes are meant to prevent race conditions and to make sure that priority inversions do not happen; we also must remember that they cannot be used with an ISR.
Counting Semaphores can be used to signal events from either an ISR to a task or one task to another task; and we should always remember that a binary Semaphore is a special case of a Counting Semaphore.
Question & Answer
Q: Are there foolproof methods to test for re-entrancy?
Some static analysis tools such as PC-Lint version 9 have the capability to detect unsafe/unprotected access to a shared resource by more than one thread.
Q: How can I be sure no thread is accessing a shared resource without getting the mutex?
To do this properly, we recommend performing a code review. A good first step is to name all of the shared objects that require mutex-style protections in a distinctive style (e.g., prepending "g_" to the start of their variable names); if one isn't already named this way, you should be able to perform a global search and replace. Then go through each such variable, one at a time, reviewing all of the places in the code it is used to ensure that all such uses are surrounded by a mutex acquisition and release. Note that each unique such object* should have its own unique mutex to protect it.
Note that sometimes a pair of shared objects (e.g., an x, y coordinate pair) is really a single object that needs to be protected. In that case, the best practice is to combine the x coordinate and the y coordinate into a single struct and then to name the instance of that struct with a g_ and protect the pair of data points inside of it with a single mutex.
Q: How do I check if the code I use is reentrant?
If it’s code you bought and you don't have source code, you will probably need to check with the vendor. If the code in question was developed in-house, code inspection as above is your best bet. Note that a set of reentrant functions to manipulate a single global or memory-mapped register set (e.g., a device driver) should create an abstraction wherein the the mutex used to protect this object is declared inside the file as a static global.
Q: Is nesting possible in mutex use?
It is possible but not advisable. It could lead to a condition called a “deadly embrace” where two tasks block each other forever by holding a resource and waiting for another one that the other task has.
Q: Isn't a mutex a shared resource? How are writes to the mutex protected?
The RTOS does it internally, by disabling interrupts.
Q: Are mutexes preferred for priority inversions?
Mutexes provide the right granularity. Only other tasks waiting for that resource will be prevented from running if they want the resource but it is not currently available.
Q: Regarding mutexes - using shared memory/variable - is the locking of the mutex required for both read and write? Are there any caveats?
If it’s a single variable and there are no coherency issues with other variables and the read and the write are atomic then you don’t need a mutex.
Q: What is the point of using a mutex in a reentrant function?
A reentrant function is one that can be called by more than 1 task/thread safely, even with a preemptive RTOS. You can make a function with a shared variable re-entrant by encapsulating the manipulation of the shared valuable with the mutex. This technique will also eliminate mutex code at the task level.
Q: What happens if a task's semaphore signals but the other task has not yet reached that wait state?
When a task signals the value of the semaphore object will be set(binary case) or incremented (counting case). When the other task reaches the wait state the value will still be there and the call will not block and successfully return.
Q: What is a practical example use of a counting semaphore?
An ISR posting to a semaphore every time a button is pressed, the receiving task doesn’t run often enough to keep up with the ISR.
Q: What will happen if Task A acquires a mutex and Task B calls release?
First of all, this is a violation of the protocol as we discussed; only the owning task should release the mutex. A solid OS will not allow the non-owning task to release. That being said, if there is a task that is blocked waiting for the mutex it will unblock and become ready and if it’s priority is higher than that of Task A it will preempt Task A and start executing it’s critical section.
Q: With counting semaphores, how do most RTOS handle counter overflow?
That would depend on the RTOS implementation. But if your count is that high maybe something else has gone wrong?