atomic_loadstore(9) - NetBSD Manual Pages

ATOMIC_LOADSTORE(9)    NetBSD Kernel Developer's Manual    ATOMIC_LOADSTORE(9)


NAME

     atomic_load_relaxed, atomic_load_acquire, atomic_load_consume,
     atomic_store_relaxed, atomic_store_release -- atomic and ordered memory
     operations


SYNOPSIS

     #include <sys/atomic.h>

     T
     atomic_load_relaxed(const volatile T *p);

     T
     atomic_load_acquire(const volatile T *p);

     T
     atomic_load_consume(const volatile T *p);

     void
     atomic_store_relaxed(volatile T *p, T v);

     void
     atomic_store_release(volatile T *p, T v);


DESCRIPTION

     These type-generic macros implement memory operations that are atomic and
     that have memory ordering constraints.  Aside from atomicity and order-
     ing, the load operations are equivalent to *p and the store operations
     are equivalent to *p = v.  The pointer p must be aligned, even on archi-
     tectures like x86 which generally lack strict alignment requirements; see
     SIZE AND ALIGNMENT for details.

     Atomic means that the memory operations cannot be fused or torn:

     ·   Fusing is combining multiple memory operations on a single object
         into one memory operation, such as replacing
                 *p = v;
                 x = *p;
         by
                 *p = v;
                 x = v;
         since the compiler can prove that *p will yield v after *p = v.  For
         atomic memory operations, the implementation will not assume that
         -   consecutive loads of the same object will return the same value,
             or
         -   a store followed by a load of the same object will return the
             value stored, or
         -   consecutive stores of the same object are redundant.
         Thus, the implementation will not replace two consecutive atomic
         loads by one, will not elide an atomic load following a store, and
         will not combine two consecutive atomic stores into one.

         For example,

                 atomic_store_relaxed(&flag, 1);
                 while (atomic_load_relaxed(&flag))
                         continue;

         may be used to set a flag and then busy-wait until another thread
         clears it, whereas

                 flag = 1;
                 while (flag)
                         continue;

         may be transformed into the infinite loop

                 flag = 1;
                 while (1)
                         continue;

     ·   Tearing is implementing a memory operation on a large data unit such
         as a 32-bit word by issuing multiple memory operations on smaller
         data units such as 8-bit bytes.  The implementation will not tear
         atomic loads or stores into smaller ones.  Thus, as far as any inter-
         rupt, other thread, or other CPU can tell, an atomic memory operation
         is issued either all at once or not at all.

         For example, if a 32-bit word w is written with

               atomic_store_relaxed(&w, 0x00010002);

         then an interrupt, other thread, or other CPU reading it with
         atomic_load_relaxed(&w) will never witness it partially written,
         whereas

               w = 0x00010002;

         might be compiled into a pair of separate 16-bit store instructions
         instead of one single word-sized store instruction, in which case
         other threads may see the intermediate state with only one of the
         halves written.

     Atomic operations on any single object occur in a total order shared by
     all interrupts, threads, and CPUs, which is consistent with the program
     order in every interrupt, thread, and CPU.  A single program without
     interruption or other threads or CPUs will always observe its own loads
     and stores in program order, but another program in an interrupt handler,
     in another thread, or on another CPU may issue loads that return values
     as if the first program's stores occurred out of program order, and vice
     versa.  Two different threads might each observe a third thread's memory
     operations in different orders.

     The memory ordering constraints make limited guarantees of ordering rela-
     tive to memory operations on other objects as witnessed by interrupts,
     other threads, or other CPUs, and have the following meanings:

     relaxed  No ordering relative to memory operations on any other objects
              is guaranteed.  Relaxed ordering is the default for ordinary
              non-atomic memory operations like *p and *p = v.

              Atomic operations with relaxed ordering are cheap: they are not
              read/modify/write atomic operations, and they do not involve any
              kind of inter-CPU ordering barriers.

     acquire  This memory operation happens before all subsequent memory oper-
              ations in program order.  However, prior memory operations in
              program order may be reordered to happen after this one.  For
              example, assuming no aliasing between the pointers, the imple-
              mentation is allowed to treat

                      int x = *p;
                      if (atomic_load_acquire(q)) {
                              int y = *r;
                              *s = x + y;
                              return 1;
                      }

              as if it were

                      if (atomic_load_acquire(q)) {
                              int x = *p;
                              int y = *r;
                              *s = x + y;
                              return 1;
                      }

              but not as if it were

                      int x = *p;
                      int y = *r;
                      *s = x + y;
                      if (atomic_load_acquire(q)) {
                              return 1;
                      }

     consume  This memory operation happens before all memory operations on
              objects at addresses that are computed from the value returned
              by this one.  Otherwise, no ordering relative to memory opera-
              tions on other objects is implied.

              For example, the implementation is allowed to treat

                      struct foo *foo0, *foo1;

                      struct foo *f0 = atomic_load_consume(&foo0);
                      struct foo *f1 = atomic_load_consume(&foo1);
                      int x = f0->x;
                      int y = f1->y;

              as if it were

                      struct foo *foo0, *foo1;

                      struct foo *f1 = atomic_load_consume(&foo1);
                      struct foo *f0 = atomic_load_consume(&foo0);
                      int y = f1->y;
                      int x = f0->x;

              but loading f0->x is guaranteed to happen after loading foo0
              even if the CPU had a cached value for the address that f0->x
              happened to be at, and likewise for f1->y and foo1.

              atomic_load_consume() functions like atomic_load_acquire() as
              long as the memory operations that must happen after it are lim-
              ited to addresses that depend on the value returned by it, but
              it is almost always as cheap as atomic_load_relaxed().  See
              ACQUIRE OR CONSUME? below for more details.

     release  All prior memory operations in program order happen before this
              one.  However, subsequent memory operations in program order may
              be reordered to happen before this one too.  For example, assum-
              ing no aliasing between the pointers, the implementation is
              allowed to treat

                      int x = *p;
                      *q = x;
                      atomic_store_release(r, 0);
                      int y = *s;
                      return x + y;

              as if it were

                      int y = *s;
                      int x = *p;
                      *q = x;
                      atomic_store_release(r, 0);
                      return x + y;

              but not as if it were

                      atomic_store_release(r, 0);
                      int x = *p;
                      int y = *s;
                      *q = x;
                      return x + y;

   PAIRING ORDERED MEMORY OPERATIONS
     In general, each atomic_store_release() must be paired with either
     atomic_load_acquire() or atomic_load_consume() in order to have an effect
     -- it is only when a release operation synchronizes with an acquire or
     consume operation that any ordering is guaranteed between memory
     operations before the release operation and memory operations after the
     acquire/consume operation.

     For example, to set up an entry in a table and then mark the entry ready,
     you should:

     1.   Perform memory operations to initialize the data.

                  tab[i].x = ...;
                  tab[i].y = ...;

     2.   Issue atomic_store_release() to mark it ready.

                  atomic_store_release(&tab[i].ready, 1);

     3.   Possibly in another thread, issue atomic_load_acquire() to ascertain
          whether it is ready.

                  if (atomic_load_acquire(&tab[i].ready) == 0)
                          return EWOULDBLOCK;

     4.   Perform memory operations to use the data.

                  do_stuff(tab[i].x, tab[i].y);

     Similarly, if you want to create an object, initialize it, and then pub-
     lish it to be used by another thread, then you should:

     1.   Perform memory operations to initialize the object.

                  struct mumble *m = kmem_alloc(sizeof(*m), KM_SLEEP);
                  m->x = x;
                  m->y = y;
                  m->z = m->x + m->y;

     2.   Issue atomic_store_release() to publish it.

                  atomic_store_release(&the_mumble, m);

     3.   Possibly in another thread, issue atomic_load_consume() to get it.

                  struct mumble *m = atomic_load_consume(&the_mumble);

     4.   Perform memory operations to use the object's members.

                  m->y &= m->x;
                  do_things(m->x, m->y, m->z);

     In both examples, assuming that the value written by
     atomic_store_release() in step 2 is read by atomic_load_acquire() or
     atomic_load_consume() in step 3, this guarantees that all of the memory
     operations in step 1 complete before any of the memory operations in
     step 4 -- even if they happen on different CPUs.

     Without both the release operation in step 2 and the acquire or consume
     operation in step 3, no ordering is guaranteed between the memory opera-
     tions in steps 1 and 4.  In fact, without both release and acquire/con-
     sume, even the assignment m->z = m->x + m->y in step 1 might read values
     of m->x and m->y that were written in step 4.
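
     The table-entry pattern above can be sketched in portable C11 for
     experimentation outside the kernel.  This is only an illustration under
     stated assumptions: it uses <stdatomic.h> as a userland stand-in for
     <sys/atomic.h>, and the names struct entry, entry_publish(), and
     entry_try_use() are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical table entry; only `ready' is accessed atomically. */
struct entry {
        int             x, y;
        atomic_int      ready;
};

static struct entry tab[4];

/* Steps 1 and 2: initialize the data, then mark the entry ready. */
static void
entry_publish(unsigned i, int x, int y)
{
        tab[i].x = x;
        tab[i].y = y;
        /* Userland stand-in for atomic_store_release(&tab[i].ready, 1). */
        atomic_store_explicit(&tab[i].ready, 1, memory_order_release);
}

/* Steps 3 and 4: check whether the entry is ready, then use the data. */
static int
entry_try_use(unsigned i, int *sum)
{
        /* Userland stand-in for atomic_load_acquire(&tab[i].ready). */
        if (atomic_load_explicit(&tab[i].ready, memory_order_acquire) == 0)
                return EWOULDBLOCK;
        *sum = tab[i].x + tab[i].y;
        return 0;
}
```

     If entry_try_use() returns 0, the acquire load has read the value
     written by the release store, so the reads of x and y are ordered after
     the writes in entry_publish() even when the two run on different CPUs.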

   ACQUIRE OR CONSUME?
     You must use atomic_load_acquire() when subsequent memory operations in
     program order that must happen after the load are on objects at addresses
     that might not depend arithmetically on the resulting value.  This
     applies particularly when the choice of whether to do the subsequent mem-
     ory operation depends on a control-flow decision based on the resulting
     value:

             struct gadget {
                     int ready, x;
             } the_gadget;

             /* Producer */
             the_gadget.x = 42;
             atomic_store_release(&the_gadget.ready, 1);

             /* Consumer */
             if (atomic_load_acquire(&the_gadget.ready) == 0)
                     return EWOULDBLOCK;
             int x = the_gadget.x;

     Here whether to load the_gadget.x is a control-flow decision that depends
     on the value loaded from the_gadget.ready, and loading the_gadget.x must
     happen after loading the_gadget.ready.  Using atomic_load_acquire()
     guarantees that the compiler and CPU do not conspire to load
     the_gadget.x before we have ascertained that it is ready.

     You may use atomic_load_consume() if all subsequent memory operations in
     program order that must happen after the load are performed on objects at
     addresses computed arithmetically from the resulting value, such as load-
     ing a pointer to a structure object and then dereferencing it:

             struct gizmo {
                     int x, y, z;
             };
             struct gizmo null_gizmo;
             struct gizmo *the_gizmo = &null_gizmo;

             /* Producer */
             struct gizmo *g = kmem_alloc(sizeof(*g), KM_SLEEP);
             g->x = 12;
             g->y = 34;
             g->z = 56;
             atomic_store_release(&the_gizmo, g);

             /* Consumer */
             struct gizmo *g = atomic_load_consume(&the_gizmo);
             int y = g->y;

     Here the address of g->y depends on the value of the pointer loaded from
     the_gizmo.  Using atomic_load_consume() guarantees that we do not witness
     a stale cache for that address.

     In some cases it may be unclear which one is required.  For example:

             int x[2];
             bool b;

             /* Producer */
             x[0] = 42;
             atomic_store_release(&b, 1);

             /* Consumer 1 */
             int y = atomic_load_???(&b) ? x[0] : x[1];

             /* Consumer 2 */
             int y = x[atomic_load_???(&b) ? 0 : 1];

             /* Consumer 3 */
             int y = x[atomic_load_???(&b) ^ 1];

     Although the three consumers seem to be equivalent, by the letter of C11
     consumers 1 and 2 require atomic_load_acquire() because the value deter-
     mines the address of a subsequent load only via control-flow decisions in
     the ?: operator, whereas consumer 3 can use atomic_load_consume().  How-
     ever, if you're not sure, you should err on the side of
     atomic_load_acquire() until C11 implementations have ironed out the kinks
     in the semantics.

     On all CPUs other than DEC Alpha, atomic_load_consume() is cheap -- it is
     identical to atomic_load_relaxed().  In contrast, atomic_load_acquire()
     usually implies an expensive memory barrier.

   SIZE AND ALIGNMENT
     The pointer p must be aligned -- that is, if the object it points to is
     2^n bytes long, then the low-order n bits of p must be zero.

     All NetBSD ports support cheap atomic loads and stores on units of data
     up to 32 bits.  Some ports additionally support cheap atomic loads and
     stores on 64-bit quantities if __HAVE_ATOMIC64_LOADSTORE is defined.  The
     macros are not allowed on larger quantities of data than the port sup-
     ports atomically; attempts to use them for such quantities should result
     in a compile-time assertion failure.
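
     The alignment rule and the size limit above can be illustrated with a
     small standalone sketch.  It does not show how <sys/atomic.h> performs
     its checks; MAX_ATOMIC_LOADSTORE_SIZE, ATOMIC_SIZE_OK(), and
     is_naturally_aligned() are hypothetical names, and the 4-byte limit
     assumes a port without __HAVE_ATOMIC64_LOADSTORE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a port with no cheap 64-bit atomic loads and stores. */
#define MAX_ATOMIC_LOADSTORE_SIZE       4

/* True if objects of type T are small enough for atomic load/store. */
#define ATOMIC_SIZE_OK(T)       (sizeof(T) <= MAX_ATOMIC_LOADSTORE_SIZE)

/* A compile-time check in the spirit of the one described above. */
_Static_assert(ATOMIC_SIZE_OK(uint32_t),
    "32-bit quantities must fit the atomic load/store limit");

/*
 * For an object of 2^n bytes, return true if the low-order n bits of
 * its address are zero.  The size must be a power of two.
 */
static bool
is_naturally_aligned(const volatile void *p, size_t size)
{
        return ((uintptr_t)p & (size - 1)) == 0;
}
```

     For instance, a 4-byte object at an address ending in binary 10 fails
     the check, because its two low-order bits are not both zero.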

     For example, as long as you use atomic_store_*() to write a 32-bit quan-
     tity, you can safely use atomic_load_relaxed() to optimistically read it
     outside a lock, but for a 64-bit quantity it must be conditional on
     __HAVE_ATOMIC64_LOADSTORE -- otherwise it will lead to compile-time
     errors on platforms without 64-bit atomic loads and stores:

             struct foo {
                     kmutex_t        f_lock;
                     uint32_t        f_refcnt;
                     uint64_t        f_ticket;
             };

             if (atomic_load_relaxed(&foo->f_refcnt) == 0)
                     return 123;
     #ifdef __HAVE_ATOMIC64_LOADSTORE
             if (atomic_load_relaxed(&foo->f_ticket) == ticket)
                     return 123;
     #endif
             mutex_enter(&foo->f_lock);
             if (foo->f_refcnt == 0 || foo->f_ticket == ticket)
                     ret = 123;
             ...
     #ifdef __HAVE_ATOMIC64_LOADSTORE
             atomic_store_relaxed(&foo->f_ticket, foo->f_ticket + 1);
     #else
             foo->f_ticket++;
     #endif
             ...
             mutex_exit(&foo->f_lock);

     Some ports support expensive 64-bit atomic read/modify/write operations,
     but not cheap 64-bit atomic loads and stores.  For example, the armv7
     instruction set includes 64-bit ldrexd and strexd loops (load-exclusive,
     store-conditional) which are atomic on 64-bit quantities.  But the cheap
     64-bit ldrd / strd instructions are only atomic on 32-bit accesses at a
     time.  These ports define __HAVE_ATOMIC64_OPS but not
     __HAVE_ATOMIC64_LOADSTORE, since they do not have cheaper 64-bit atomic
     load/store operations than the full atomic read/modify/write operations.

C11 COMPATIBILITY

     These macros are meant to follow C11 semantics, in terms of
     atomic_load_explicit() and atomic_store_explicit() with the appropriate
     memory order specifiers, and are meant to make future adoption of the C11
     atomic API easier.  Eventually it may be mandatory to use the C11 _Atomic
     type qualifier or equivalent for the operands.
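
     As a rough illustration of that correspondence, the following userland
     sketch spells each operation with its C11 memory order.  It is not the
     definition used in <sys/atomic.h>: the my_ prefixed macro names, struct
     mumble, mumble_publish(), and mumble_get() are hypothetical, and the
     operand is declared with _Atomic as C11 requires.

```c
#include <stdatomic.h>

/* Hypothetical C11 spellings of the five operations. */
#define my_load_relaxed(p) \
        atomic_load_explicit((p), memory_order_relaxed)
#define my_load_acquire(p) \
        atomic_load_explicit((p), memory_order_acquire)
#define my_load_consume(p) \
        atomic_load_explicit((p), memory_order_consume)
#define my_store_relaxed(p, v) \
        atomic_store_explicit((p), (v), memory_order_relaxed)
#define my_store_release(p, v) \
        atomic_store_explicit((p), (v), memory_order_release)

struct mumble {
        int     x, y;
};

/* The published pointer; null until something is published. */
static struct mumble *_Atomic the_mumble;

/* Publish a fully initialized object. */
static void
mumble_publish(struct mumble *m)
{
        my_store_release(&the_mumble, m);
}

/* Fetch the published object; members are reached through the result. */
static struct mumble *
mumble_get(void)
{
        return my_load_consume(&the_mumble);
}
```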


LINUX ANALOGUES

     The Linux kernel provides two macros READ_ONCE(x) and WRITE_ONCE(x, v)
     which are similar to atomic_load_consume(&x) and
     atomic_store_relaxed(&x, v), respectively.  However, while Linux's
     READ_ONCE and WRITE_ONCE prevent fusing, they may in some cases be torn
     -- and therefore fail to guarantee atomicity -- because:

     ·   They do not require the address &x to be aligned.

     ·   They do not require sizeof(x) to be at most the largest size of
         available atomic loads and stores on the host architecture.


MEMORY BARRIERS AND ATOMIC READ/MODIFY/WRITE

     The atomic read/modify/write operations in atomic_ops(3) have relaxed
     ordering by default, but can be combined with the memory barriers in
     membar_ops(3) for the same effect as an acquire operation and a release
     operation for the purposes of pairing with atomic_store_release() and
     atomic_load_acquire() or atomic_load_consume().  If atomic_r/m/w() is an
     atomic read/modify/write operation in atomic_ops(3), then

             membar_release();
             atomic_r/m/w(obj, ...);

     functions like a release operation on obj, and

             atomic_r/m/w(obj, ...);
             membar_acquire();

     functions like an acquire operation on obj.

     On architectures where __HAVE_ATOMIC_AS_MEMBAR is defined, all the
     operations in atomic_ops(3) imply release and acquire operations, so the
     membar_acquire(3) and membar_release(3) barriers are redundant.

     The combination of atomic_load_relaxed() and membar_acquire(3) in that
     order is equivalent to atomic_load_acquire(), and the combination of
     membar_release(3) and atomic_store_relaxed() in that order is equivalent
     to atomic_store_release().
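
     A common use of these combinations is dropping a reference count.  The
     sketch below is a userland C11 approximation, not kernel code:
     atomic_thread_fence() stands in for membar_release(3) and
     membar_acquire(3), atomic_fetch_sub_explicit() with relaxed ordering
     stands in for an atomic_ops(3) read/modify/write, and struct obj and
     obj_release() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct obj {
        atomic_uint     refcnt;
        int             payload;
};

/*
 * Drop one reference.  Returns true if it was the last one, in which
 * case the caller may destroy the object.
 */
static bool
obj_release(struct obj *o)
{
        /* Like membar_release(): prior accesses precede the decrement. */
        atomic_thread_fence(memory_order_release);

        /* Relaxed r/m/w; atomic_fetch_sub returns the previous value. */
        if (atomic_fetch_sub_explicit(&o->refcnt, 1,
            memory_order_relaxed) != 1)
                return false;

        /* Like membar_acquire(): the decrement precedes destruction. */
        atomic_thread_fence(memory_order_acquire);
        return true;
}
```

     The release side orders each thread's use of the object before its
     decrement, and the acquire side orders the final decrement before
     whatever the last holder does to tear the object down.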


EXAMPLES

     Maintaining lossy counters.  These may lose some counts, because the
     read/modify/write cycle as a whole is not atomic.  But this guarantees
     that the count will increase by at most one each time.  In contrast,
     without atomic operations, in principle a write to a 32-bit counter might
     be torn into multiple smaller stores, which could appear to happen out of
     order from another CPU's perspective, leading to nonsensical counter
     readouts.  (For frequent events, consider using per-CPU counters instead
     in practice.)

             unsigned count;

             void
             record_event(void)
             {
                     atomic_store_relaxed(&count,
                         1 + atomic_load_relaxed(&count));
             }

             unsigned
             read_event_count(void)
             {
                     return atomic_load_relaxed(&count);
             }

     Initialization barrier.

             int ready;
             struct data d;

             void
             setup_and_notify(void)
             {
                     setup_data(&d.things);
                     atomic_store_release(&ready, 1);
             }

             void
             try_if_ready(void)
             {
                     if (atomic_load_acquire(&ready))
                             do_stuff(d.things);
             }

     Publishing a pointer to the current snapshot of data.  (Caller must
     arrange that only one call to take_snapshot() happens at any given time;
     generally this should be done in coordination with pserialize(9) or simi-
     lar to enable resource reclamation.)

             struct data *current_d;

             void
             take_snapshot(void)
             {
                     struct data *d = kmem_alloc(sizeof(*d), KM_SLEEP);

                     d->things = ...;

                     atomic_store_release(&current_d, d);
             }

             struct data *
             get_snapshot(void)
             {
                     return atomic_load_consume(&current_d);
             }


CODE REFERENCES

     sys/sys/atomic.h


SEE ALSO

     atomic_ops(3), membar_ops(3), pserialize(9)


HISTORY

     These atomic operations first appeared in NetBSD 9.0.


CAVEATS

     C11 formally specifies that all subexpressions, except the left operands
     of the `&&', `||', `?:', and `,' operators and the kill_dependency()
     macro, carry dependencies for which memory_order_consume guarantees
     ordering, but most or all implementations to date simply treat
     memory_order_consume as memory_order_acquire and do not take advantage of
     data dependencies to elide costly memory barriers or load-acquire CPU
     instructions.

     Instead, we implement atomic_load_consume() as atomic_load_relaxed() fol-
     lowed by membar_datadep_consumer(3), which is equivalent to
     membar_consumer(3) on DEC Alpha and __insn_barrier(3) elsewhere.


BUGS

     Some idiot decided to call it tearing, depriving us of the opportunity to
     say that atomic operations prevent fusion and fission.

NetBSD 10.99                   February 11, 2022                  NetBSD 10.99