GHC forkIO双峰性能

I was benchmarking forkIO with the following code:

import System.Time.Extra
import Control.Concurrent
import Control.Monad
import Data.IORef


n = 200000

main :: IO ()
main = do
    bar <- newEmptyMVar
    count <- newIORef (0 :: Int)
    (d, _) <- duration $ do
        replicateM_ n $ do
            forkIO $ do
                v <- atomicModifyIORef' count $ \old -> (old + 1, old + 1)
                when (v == n) $ putMVar bar ()
        takeMVar bar
    putStrLn $ showDuration d

This spawns 20K threads, counts how many have run with an IORef, and when they have all started, finishes. When run on GHC 8.10.1 on Windows with the command ghc --make -O2 Main -threaded && main +RTS -N4 the performance varies remarkably. Sometimes it takes > 1s (e.g. 1.19s) and sometimes it takes < 0.1s (e.g. 0.08s). It seems that it is in the faster bucket about 1/6th of the time. Why the difference in performance? What causes it to go faster?

When I scale n up to 1M the effect goes away and it's always in the 5+s range.