From a software point of view, what is the latency between an instruction that dirties a memory page and when the core actually marks the page dirty in the Page Table Entry (PTE)?
In other words, if an instruction dirties a page, can the very next instruction read the PTE and see the dirty bit set?
I don't care about the actual elapsed cycles, only whether there is a software-visible window in which the dirty bit is not yet set. I can't seem to find any guarantees in the reference manuals.
From AMD's manual (circa 2005), Volume 2: System Programming:
Ditto from Intel (circa 2006), Volume 3-A: System Programming Guide, Part 1:
UPDATE:
From the latest Intel manual (vol 3A, System Programming Guide):
From the rest of the text in sections 8.1 and 8.2 it follows that once the CPU sets the dirty bit using the locked operation, the other CPUs should start seeing the updated value.
Of course, you can still have a race condition: you first read the dirty bit as 0 on one CPU (or in one of its threads), and later another CPU (or another thread on the same CPU) causes the bit to be set to 1. But that isn't anything unusual.
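To make the race a bit more concrete, here is a minimal user-space sketch in C11. It is only a model: `pte_t`, `model_hw_mark_dirty`, `pte_is_dirty` and `pte_test_and_clear_dirty` are hypothetical names, and the atomic OR merely stands in for the locked update the processor performs on the real PTE. The bit positions (Accessed = bit 5, Dirty = bit 6) are the usual x86 PTE layout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entry flag bits: bit 5 = Accessed, bit 6 = Dirty. */
#define PTE_ACCESSED (UINT64_C(1) << 5)
#define PTE_DIRTY    (UINT64_C(1) << 6)

/* Hypothetical stand-in for a PTE; a real one lives in the page tables. */
typedef _Atomic uint64_t pte_t;

/* Models the processor marking the page dirty with a locked
 * read-modify-write. This is an assumption of the sketch, not real
 * hardware behavior captured in software. */
static void model_hw_mark_dirty(pte_t *pte)
{
    atomic_fetch_or(pte, PTE_ACCESSED | PTE_DIRTY);
}

/* Plain read of the dirty bit. The answer can be stale as soon as it is
 * returned, because another CPU may set the bit right afterwards. */
static bool pte_is_dirty(pte_t *pte)
{
    return (atomic_load(pte) & PTE_DIRTY) != 0;
}

/* Reads and clears the dirty bit in one atomic step, so an update that
 * lands between a separate "read" and "clear" cannot be lost. */
static bool pte_test_and_clear_dirty(pte_t *pte)
{
    uint64_t old = atomic_fetch_and(pte, ~PTE_DIRTY);
    return (old & PTE_DIRTY) != 0;
}
```

The point of `pte_test_and_clear_dirty` is that software which wants to act on the dirty bit should not read it and then clear it in two separate steps. On real hardware, clearing the bit would also need the corresponding TLB entry invalidated, which this model leaves out.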