I recently got a task to implement a use case in the Linux kernel. Due to some confidential info I cannot explain all the details, but I'll do my best to explain our goal, what is known, and what is not. We have a Linux distro that runs on an SMP machine. This is what we know about the system and the flow:
- SMP system, let's assume we have 8 cores
- Each core must be brought to idle for, let's say, 15ms out of every 50ms. By idle I mean a forced mwait state on x86. This applies to all cores: within every 50ms period, each of the 8 cores must be suspended for 15ms. During this time we consider IRQs to be disabled.
- Of course it's not possible to take all cores off at once, as that would block the system.
- We can group the cores in any combination to be taken down; however, a strict constraint is that the halt windows of the cores must not overlap partially within this interval. They are allowed to overlap only entirely, in which case the cores form a group.
- The constraint extends further: if we decide to split the cores into 2 groups (4 + 4), then all cores within one group must hit halt/mwait within 100us of each other. To make it clearer, the interval between the first core in a group hitting halt/mwait/wfi and the last one doing so must not be longer than 100us.
- We need a system optimized somewhat for real time, but at the same time one that lets us design a deterministic system (to the extent possible), so a desktop configuration is out of the question.
- As it looks now, we won't bring in the kernel RT patch. We expect an HZ value of at least 1000, most threads pinned to specific cores, and migration of threads between cores prevented or controlled.
- It is not yet known whether preemption will be enabled.
- The 15ms interval can be split into multiple windows, with a minimum window length of 1ms.
Summary: all cores in an SMP system must be forced to stop for 15ms out of every 50ms, in one window or multiple windows. The off windows of different cores may overlap either completely or not at all.
Problems here:
- I/O interrupts: we plan to migrate the interrupts away before taking cores down
- other interrupts: the IPIs for TLB flush and the smp_call_function_* API; here we don't know the impact. There are also rescheduling IPIs, triggered mainly by load balancing. Since most threads will be bound to cores and our threads will have high RT priority, we assume rescheduling of background threads is not an issue (?)
- timers: we plan to migrate the timers
- threads: give each thread affinity to 2 cores, one from each group, so it can migrate if needed.
I'm mainly concerned about the responsiveness of the system due to blocked IRQs, threads, timers, etc., and I'm looking for ways to implement the use case with minimal impact on the overall system. We are also considering keeping the CPUs stopped with IRQs enabled; this would improve responsiveness but would also affect our use case, which would then most likely need a much more complex design.
What I tried so far:
- Used the hotplug state machine. I implemented a piece of code that takes the CPU offline from the scheduler's point of view and migrates the threads + interrupts + timers to other cores. I stopped the state machine at CPUHP_AP_IDLE_DEAD. I could reach an offline transition time of ~10ms after removing the cpufreq governor callback from the state machine. However, that time is too high and most likely won't be accepted. Hotplug may also be unsuitable for other reasons, so I dropped this approach.
- Wrote a module that creates N + 1 real-time threads, where N is the number of cores: a supervisor thread with priority MAX_USER_RT_PRIO-1 and N other threads (let's call them idle threads) with MAX_USER_RT_PRIO-2 (set with sched_setscheduler on SCHED_FIFO). The cores are split into 2 groups. As 15ms is a fairly long time to keep a CPU completely off, I split the interval into 5 slots of 3ms each, so the 2 groups of 4 CPUs each run their idle threads 5 times every 50ms. The supervisor thread keeps track of time and calls wake_up_process for each CPU in the group whose turn it is. When the idle threads in one group complete, each sets its bit in a struct cpumask to signal completion.
However, one issue I'm still trying to figure out is how to force all threads mapped to the cores of one group to reach the halt/mwait point at more or less the same time (as written above, max 100us between the first and the last). The wake_up_process calls from the supervisor thread produce a delta of ~50us between two consecutive threads in a group. This is not acceptable; for 4 cores we end up around 200us.
I have to mention that my current kernel, 4.9, runs with HZ=1000 and CONFIG_PREEMPT disabled. Therefore, AFAIK, even though my idle threads run under the RT scheduler, as long as CONFIG_PREEMPT is disabled they must wait for threads running under the CFS scheduler to voluntarily relinquish the core, so no preemption happens. Is that true?
My concern here is how to make sure all my threads in a CPU group get launched simultaneously, so that they reach halt/mwait very close to each other.
Here below I tried to make a graphical representation:
Group1 Group2
____|___ ____|____
/ \ / \
C0 C1 C2 C3 C4 C5 C6 C7
-------------------------- TIME 0
| | | | <- OFF time group 1
| | | | <- OFF time group 2
| | | | ......
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
--------------------------- TIME 50ms
I'm also looking for other ideas on how to implement the concept.
I know it's a lot of text; I hope I managed to clarify the use case.
Any help/idea much appreciated.
Regards, Daniel.