Enhanced rep movsb stosb. [PATCH 1/5] x86_64: use REP MOVSB in copy_page()

Discussion in '2018' started by Majin, Wednesday, February 23, 2022 12:59:05 AM.

  1. Nikokazahn

    Nikokazahn

    Messages:
    61
    Likes Received:
    5
    Trophy Points:
    5
    ERMS is extremely slow for small sizes. There are far more efficient ways to move data. The correct area is obtained by starting with the original data in current, applying all the triplets in the current set using the memcpy provided by the C library, and copying the current area to correct. If you want the higher speeds you are seeing from memcpy, you can dig up the source for it.
     
  2. Goltijind

    Goltijind

    Messages:
    631
    Likes Received:
    21
    Trophy Points:
    2
    An instance of enhanced REP STOSB with ECX= is decoded as a long micro-op flow provided by hardware, but retires as one instruction. So it is both a software maintenance benefit (no need to change source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).
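To make the REP STOSB form concrete, here is a minimal sketch of a byte fill that issues one REP STOSB through GCC/Clang inline assembly. The name `fill_rep_stosb` is hypothetical, the code assumes the direction flag is clear on entry (the usual ABI guarantee at function boundaries), and it falls back to `memset` on non-x86 targets. It makes no claim about being faster than the library `memset` on any particular CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store (unsigned char)value into n bytes at dst. */
static void *fill_rep_stosb(void *dst, int value, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *d = dst;                        /* RDI/EDI is advanced by the instruction */
    __asm__ volatile("rep stosb"
                     : "+D"(d), "+c"(n)   /* destination pointer and count are consumed */
                     : "a"(value)         /* AL holds the fill byte */
                     : "memory");
    return dst;
#else
    return memset(dst, value, n);         /* portable fallback */
#endif
}
```

The whole fill is a single instruction in the binary, which is the maintenance point made above: the same code picks up whatever the hardware's REP STOSB implementation provides.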
     
  3. Mazulrajas

    Mazulrajas

    Messages:
    957
    Likes Received:
    17
    Trophy Points:
    1
    For example, the Intel manual has Figure
     
  4. Dudal

    Dudal

    Messages:
    915
    Likes Received:
    4
    Trophy Points:
    6
    This question is about assembly code. Many people believe that the implementation of 'enhanced rep movsb/stosb' obtains an aligned address. It certainly still uses a non-RFO protocol, at least for large copies, since it gets performance that is really only possible with non-RFO (this is most obvious for stosb, but it applies to the mov variants too).
     
  5. Kazirr

    Kazirr

    Messages:
    835
    Likes Received:
    11
    Trophy Points:
    1
    Jim seems to have answered your question.
     
  6. Nikojora

    Nikojora

    Messages:
    349
    Likes Received:
    26
    Trophy Points:
    0
    Enhancement availability is indicated by CPUID.(EAX=07H, ECX=0):EBX[bit 9] (Enhanced REP MOVSB/STOSB). See the Intel 64 and IA-32 SDM and the Intel Optimization Manual.
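The feature-bit check can be sketched in user space as follows. This is a minimal example, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid_count`; the function name `cpu_has_erms` is hypothetical, and on non-x86 targets it simply reports false.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* Hypothetical helper: true when CPUID leaf 7, subleaf 0 reports EBX bit 9 (ERMS). */
static bool cpu_has_erms(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Returns 0 when the CPU does not support leaf 7 at all. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ebx >> 9) & 1u) != 0;
#else
    return false;   /* not an x86 target */
#endif
}
```

A caller would typically evaluate this once at startup and cache the result, since CPUID is a serializing and comparatively expensive instruction.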
     
  7. Basho

    Basho

    Messages:
    101
    Likes Received:
    26
    Trophy Points:
    4
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit. So on other platforms you might find that NT stores are less useful (at least when you care about single-threaded performance), and perhaps rep movsb wins there if it gets the best of both worlds.
     
  8. Samuramar

    Samuramar

    Messages:
    114
    Likes Received:
    15
    Trophy Points:
    1
    REP STOS/MOVS is the right way to move memory now, and its operands are implicit (SI, DI and CX). The guidance now mentions 'implementing memcpy using Enhanced REP MOVSB and STOSB'. Is this related to this bit from the Intel Optimization Manual?
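Because the operands are implicit registers, a C wrapper has to pin them with register constraints. The following is a minimal sketch, assuming GCC or Clang extended asm on x86 and a clear direction flag on entry; the name `copy_rep_movsb` is hypothetical, and non-x86 builds fall back to `memcpy`. Like `memcpy`, it is only meant for non-overlapping buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes from src to dst with a single REP MOVSB. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    /* "D" = (R/E)DI destination, "S" = (R/E)SI source, "c" = (R/E)CX count.
       All three are read and modified by the instruction, hence "+". */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    return memcpy(dst, src, n);   /* portable fallback */
#endif
}
```

Whether this beats the C library's `memcpy` depends on the CPU and the copy size, as the rest of the thread discusses; the sketch only shows the mechanics.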
     
  9. Mezizahn

    Mezizahn

    Messages:
    865
    Likes Received:
    9
    Trophy Points:
    7
    x86 memcpy: use REP MOVSB instead of REP MOVS{Q,D,W} for inline copies when the Enhanced REP MOVSB and STOSB operation (ERMSB) is available. Here are my results on the same system from tinymembench.
     
  10. Gur

    Gur

    Messages:
    594
    Likes Received:
    21
    Trophy Points:
    7
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit.
     
  11. Goltibar

    Goltibar

    Messages:
    760
    Likes Received:
    12
    Trophy Points:
    1
    On my Broadwell-era Xeon, copying a page with REP MOVSB is ~% faster than with REP MOVSQ. Some CPUs are adding enhanced REP MOVSB/STOSB instructions.
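The two page-copy variants being compared can be sketched in user space like this. It is an illustration only, not the kernel's copy_page(): the names `copy_page_movsq`, `copy_page_movsb` and `SKETCH_PAGE_SIZE` are hypothetical, a 4 KiB page is assumed, the inline assembly is only built for x86-64 with GCC or Clang, and other targets fall back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Variant 1: copy the page as 512 quadwords with REP MOVSQ. */
static void copy_page_movsq(void *to, const void *from)
{
#if defined(__GNUC__) && defined(__x86_64__)
    size_t qwords = SKETCH_PAGE_SIZE / 8;
    __asm__ volatile("rep movsq"
                     : "+D"(to), "+S"(from), "+c"(qwords)
                     :
                     : "memory");
#else
    memcpy(to, from, SKETCH_PAGE_SIZE);
#endif
}

/* Variant 2: copy the page as 4096 bytes with REP MOVSB. */
static void copy_page_movsb(void *to, const void *from)
{
#if defined(__GNUC__) && defined(__x86_64__)
    size_t bytes = SKETCH_PAGE_SIZE;
    __asm__ volatile("rep movsb"
                     : "+D"(to), "+S"(from), "+c"(bytes)
                     :
                     : "memory");
#else
    memcpy(to, from, SKETCH_PAGE_SIZE);
#endif
}
```

Both functions produce the same result; any speed difference between them comes from how the CPU's microcode handles each form, so it has to be measured on the hardware in question.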
     
  12. Sazuru

    Sazuru

    Messages:
    575
    Likes Received:
    32
    Trophy Points:
    5
    Enhanced REP MOVSB/STOSB: Enables or disables Enhanced REP MOVSB/STOSB support. This setting can affect performance, depending on the application being run. It draws on the above and introduces a few new ideas.
     
  13. Mole

    Mole

    Messages:
    429
    Likes Received:
    10
    Trophy Points:
    7
    Table 1.
     
  14. Tataxe

    Tataxe

    Messages:
    583
    Likes Received:
    14
    Trophy Points:
    6
    Here you might be forced to use only older instruction sets, which rules out any AVX, etc.
     
  15. Vill

    Vill

    Messages:
    493
    Likes Received:
    30
    Trophy Points:
    2
    If you study your library's implementation, you will probably be amazed.
     
  16. JoJor

    JoJor

    Messages:
    265
    Likes Received:
    24
    Trophy Points:
    1
     
  17. Doukree

    Doukree

    Messages:
    592
    Likes Received:
    26
    Trophy Points:
    3
    For example, architectures may have wider internal data paths than the ISA exposes, and rep movs could use that internally.
     
  18. Nizragore

    Nizragore

    Messages:
    138
    Likes Received:
    19
    Trophy Points:
    7
    GCC does exactly this see e.
     
  19. Teramar

    Teramar

    Messages:
    751
    Likes Received:
    16
    Trophy Points:
    6
    The combination of fairly low memory latency and modest 2-channel bandwidth means this particular chip happens to be able to saturate its memory bandwidth from a single thread, which changes the behavior dramatically.
     
  20. Kebar

    Kebar

    Messages:
    333
    Likes Received:
    26
    Trophy Points:
    3
    That has to happen to make room in the ROB for following uops.
     
  21. Gujar

    Gujar

    Messages:
    297
    Likes Received:
    14
    Trophy Points:
    6
    I want to hear more about that.
     
  22. Tarr

    Tarr

    Messages:
    322
    Likes Received:
    7
    Trophy Points:
    5
    It would be interesting to see tinymembench on an Ivy Bridge system.
     
  23. Akijar

    Akijar

    Messages:
    619
    Likes Received:
    11
    Trophy Points:
    2
    Why am I going on and on about this?
     
  24. Nijar

    Nijar

    Messages:
    460
    Likes Received:
    31
    Trophy Points:
    3
    As copy sizes get much larger, however, the relative importance of this diminishes rapidly.
     
  25. Dabar

    Dabar

    Messages:
    19
    Likes Received:
    19
    Trophy Points:
    3
    Here are my results on the same system from tinymembench.
     
  26. Yozshurr

    Yozshurr

    Messages:
    755
    Likes Received:
    22
    Trophy Points:
    4
     
  27. Dulrajas

    Dulrajas

    Messages:
    916
    Likes Received:
    21
    Trophy Points:
    4
    However, even here there are reports of the opposite result on earlier hardware like Ivy Bridge.
     
  28. Voodoorr

    Voodoorr

    Messages:
    719
    Likes Received:
    3
    Trophy Points:
    6
    Yes, exactly.
     
  29. Mazukus

    Mazukus

    Messages:
    503
    Likes Received:
    22
    Trophy Points:
    6
    Code size: The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine.
     
  30. Maramar

    Maramar

    Messages:
    218
    Likes Received:
    28
    Trophy Points:
    4
    LGTM too.
     
  31. Zologrel

    Zologrel

    Messages:
    925
    Likes Received:
    5
    Trophy Points:
    3
    This turns turbo back on so I had to disable turbo after.
     
  32. Zuluzuru

    Zuluzuru

    Messages:
    605
    Likes Received:
    11
    Trophy Points:
    0
    When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred.
     
