Adventures in NUMA and HyperThreading

I was trying to configure Open vSwitch with DPDK support and it asked for CPU pinning parameters for optimal performance. On a small system, this isn't that bad, but on a system with a bunch of cores (and HyperThreads) this can be a bit daunting - and worse, some of the tutorials and blogs out there don't explain at all how they come up with their numbers. I'm going to muddy the waters a bit here and try to answer some questions you might have when trying to unwind this.

This is not a tutorial on how to configure DPDK for Open vSwitch - instead it's just my attempt to show how to do a few things that might come up while doing that, or perhaps when doing something else entirely.

Is NUMA Even Enabled?

$ dmesg | grep -i numa

This should show something like:

[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] NUMA: Node 0 [mem 0x00000000-0xdfffffff] + [mem 0x100000000-0x81fffffff] -> [mem 0x00000000-0x81fffffff]
[    0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.752569] pci_bus 0000:00: on NUMA node 0
[    0.756249] pci_bus 0000:40: on NUMA node 1

That output means NUMA is enabled. If you don't see anything, your kernel doesn't support it.
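
If the boot messages have already scrolled out of the dmesg buffer, you can also check sysfs (on kernels built with NUMA support). On the two-node box above it looks like this:

$ cat /sys/devices/system/node/online
0-1

If that only lists node 0, everything is on a single node and there's no NUMA locality to worry about.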

Determine How Many NUMA Nodes There Are

$ numactl --hardware

The "available:" line at the top of the output tells you how many NUMA nodes there are.

The "node N cpus:" lines tell you which CPUs belong to which node.
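
If numactl isn't installed, lscpu (from util-linux) reports the same information - the node count and the per-node CPU lists:

$ lscpu | grep -i numa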

Determine Which Threads are HyperThreads

My system has 32 threads (two 8-core CPUs with HyperThreading enabled). Numbering starts at 0, so 31 is the last one. Change the 31 below to match what you have.

for core in {0..31}; do
  echo -en "$core\t"
  cat /sys/devices/system/cpu/cpu$core/topology/thread_siblings_list
done

The output shows each core along with its hyperthread siblings. If you're going to schedule something to busy poll on a given core, it's best not to have anything else doing that on a hyperthread of the same core - this list tells you which thread IDs share a physical core so you can avoid that.
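
Another way to see the topology at a glance, if your util-linux is reasonably recent, is lscpu's extended view - the CPU, CORE, and NODE columns show which logical CPUs share a physical core and which node they live on:

$ lscpu --extended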

How do I tell what NUMA node a given PCI device is on?

First, find the devices you care about in the output of lspci (and, if you only know the interface name, ethtool -i will show you its PCI address). Pay attention to the PCI coordinates - the bus address in the first column of lspci - which tell you where the device sits on the bus. Then look at the output of:

 $ lspci -vmms <pci coordinates> | grep NUMANode

This is especially useful for NICs.
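
Sysfs exposes the same information, which is handy in scripts. Use the device's full address, including the leading domain (0000: on most systems); a result of -1 means the platform didn't report a node for that device:

$ cat /sys/bus/pci/devices/<pci coordinates>/numa_node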

How do I generate bitmasks for CPUs?

This is the money question. If you want to keep other things off a given core as well as off its hyperthread sibling, it's best to tell the kernel not to use them at boot by passing the isolcpus parameter on the kernel command line. This basically tells the kernel not to schedule random stuff on that list of cores - they're left alone for jobs that specifically request them. So that stuff I mentioned above about determining which hyperthreads are siblings? This is where you use it. Come up with a list of cores and their hyperthread siblings, then go read up on how to specify that on the kernel command line with isolcpus - there are lots of good documents that describe how to do this for the common distros. I'll assume you've done that and rebooted, and that you now have a handful of cores carved out and ready to use.
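
As a rough sketch only - the variable name, the file, and the update command all vary by distro, and the core list here is made up - on a GRUB-based system it usually amounts to adding isolcpus to the kernel command line and regenerating the GRUB config:

# in /etc/default/grub (core numbers are just an example)
GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=4,5,6,7,12,13,14,15"

$ sudo update-grub                                # Debian/Ubuntu
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg     # RHEL/CentOS/Fedora

After rebooting, cat /proc/cmdline will confirm the parameter actually took.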

You then have to generate bitmasks. That's just a big number where each binary digit (bit) represents a CPU. The least significant bit is CPU0 and then on up to whatever your max is. I usually figure this out using a calculator program that lets you enter binary data. I literally type out the 0s and 1s for what I want and then convert that to hexadecimal - and that's generally what you are asked to supply by whatever needs this data. You may or may not need to specify a leading 0x for it - that'll depend on the thing that consumes it.
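
If you'd rather let the shell do the conversion than type binary into a calculator, bash arithmetic works fine - here's a small sketch that builds the mask from a list of CPU numbers (the list is just an example):

mask=0
for cpu in 4 5 6 7; do
  mask=$(( mask | (1 << cpu) ))
done
printf '0x%x\n' "$mask"    # prints 0xf0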

You probably do not want to include both hyperthreads of a given core in this mask - the whole point is to let the job run without contention from its sibling. So the bitmask should probably only contain about half of what you reserved via isolcpus; the other hyperthreads go to waste so that their siblings can run unmolested. That's one of the costs of optimal performance.

Putting it all together, say your NUMA data looks like:

 $ numactl --hardware
 available: 2 nodes (0-1)
 node 0 cpus: 0 2 4 6 8 10 12 14
 node 1 cpus: 1 3 5 7 9 11 13 15

and you want to use CPUs 4 and 6 from node 0 and CPUs 5 and 7 from node 1 (I'm assuming those aren't hyperthread siblings of the same core, and that the threads that are their siblings were already reserved via isolcpus along with them). That means the mask covers 4, 5, 6, and 7. In binary that's 11110000 - reading from the right, four 0s for CPUs 0 through 3, then four 1s for CPUs 4 through 7. The CPUs above 7 would just be leading 0s, so you don't have to write them out; if you wanted higher-numbered CPUs, you'd have to include those bits too. Convert 11110000 to hex and you get 0xF0, and that's the value to use. Easy, right?
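
If you want to double-check a hand-typed binary string, bc will do the conversion (assuming bc is installed) - just remember to add the leading 0x yourself if whatever consumes the mask wants it:

$ echo 'obase=16; ibase=2; 11110000' | bc
F0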

You're welcome. Maybe.