GHC forkIO bimodal performance

I was benchmarking forkIO with the following code:

import System.Time.Extra
import Control.Concurrent
import Control.Monad
import Data.IORef

n = 200000

main :: IO ()
main = do
    bar <- newEmptyMVar
    count <- newIORef (0 :: Int)
    (d, _) <- duration $ do
        replicateM_ n $ do
            forkIO $ do
                v <- atomicModifyIORef' count $ \old -> (old + 1, old + 1)
                when (v == n) $ putMVar bar ()
        takeMVar bar
    putStrLn $ showDuration d

This spawns 200K threads, counts how many have run using an IORef, and finishes once they have all run. When run with GHC 8.10.1 on Windows, compiled and launched as ghc --make -O2 Main -threaded && main +RTS -N4, the performance varies remarkably. Sometimes it takes over 1s (e.g. 1.19s) and sometimes under 0.1s (e.g. 0.08s). It lands in the faster bucket about 1/6th of the time. Why the difference in performance? What causes it to go faster?

When I scale n up to 1M the effect goes away and it's always in the 5+s range.


I can confirm the same behavior on Ubuntu as well, except that when I set n=1M the effect does not go away: the runtime ranges from 2 to 7 seconds for me.


atomicModifyIORef' is implemented with CAS (compare-and-swap), so depending on how threads are scheduled, the function old + 1 may be recomputed more or fewer times. In other words, if thread B updates the count ref after thread A has read the old value but before A's swap lands, A has to start its update over from the beginning: it reads the newly updated value from the ref and recomputes old + 1 once again.
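The retry behavior can be pictured as a CAS loop. Below is a minimal sketch, not GHC's actual implementation (the real one is a primop in the RTS); casIORef here is simulated on top of atomicModifyIORef' itself, which is circular, but it shows the control flow and why the update function gets re-run under contention:

```haskell
import Data.IORef

-- Simulated compare-and-swap: succeeds only if the ref still holds the
-- expected value. (Illustration only; GHC really uses the casMutVar# primop.)
casIORef :: Eq a => IORef a -> a -> a -> IO Bool
casIORef ref expected new = atomicModifyIORef' ref $ \cur ->
    if cur == expected then (new, True) else (cur, False)

-- The shape of the retry loop behind atomicModifyIORef': read, compute,
-- try to swap; if another thread got in first, recompute from the new value.
casLoop :: Eq a => IORef a -> (a -> (a, b)) -> IO b
casLoop ref f = do
    old <- readIORef ref
    let (new, ret) = f old          -- this work is redone on every retry
    ok <- casIORef ref old new
    if ok then return ret
          else casLoop ref f        -- lost the race: start over

main :: IO ()
main = do
    count <- newIORef (0 :: Int)
    v <- casLoop count $ \old -> (old + 1, old + 1)
    print v                         -- prints 1
```

With many capabilities hammering the same ref, the else branch fires often and the whole read/compute/swap cycle repeats, which is where the extra time goes.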

If you run main +RTS -N1, you will see that not only does the program take far less time to run, but the runtime is also quite consistent between executions. I suspect this is because only one thread can run at a time and there is no preemption until atomicModifyIORef' completes, so the swap never fails and the update function is never recomputed.

Hopefully someone with deeper knowledge of the Haskell RTS can shed more light on this behavior, but that's my take.