AMD's Ryzen CPUs (Ryzen/TR/Epyc) & Vega/Polaris GPUs

Discussion in 'Hardware Components and Aftermarket Upgrades' started by Rage Set, Dec 14, 2016.

  1. ajc9988

    ajc9988 Death by a thousand paper cuts

    Reputations:
    1,501
    Messages:
    5,605
    Likes Received:
    7,933
    Trophy Points:
    681
Seeing the 21% uplift in multi-core performance on Epyc, plus the I/O die standardizing memory latency, plus the simplification of routing all CCX-to-CCX traffic through the I/O die (see Anandtech's Epyc article for that latency discussion), there is no way it has the regression. Windows sees the chip as a single NUMA node now instead of 4 nodes, and the scheduler no longer shuffles tasks around fighting for core 0, so there's no need for core priority tweaks anymore.
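If you want to verify the single-node claim yourself on Linux, the kernel exposes the online NUMA nodes through sysfs. Here is a small sketch; the sysfs path is the standard Linux one, but the helper names are mine and the expected strings are illustrative:

```python
# Sketch: check how many NUMA nodes the OS reports (Linux sysfs).
# On Rome/Zen 2 in its default mode you would expect a single node
# ("0"); a 2990WX-style layout reports several ("0-3").

def parse_node_list(s: str) -> list:
    """Expand a sysfs range string like '0-3' or '0,2-3' into node IDs."""
    nodes = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(range(int(lo), int(hi) + 1))
        else:
            nodes.append(int(part))
    return nodes

def online_numa_nodes(path="/sys/devices/system/node/online"):
    with open(path) as f:
        return parse_node_list(f.read())

print(parse_node_list("0"))    # [0] -- one node, like Rome's default mode
print(parse_node_list("0-3"))  # [0, 1, 2, 3] -- four nodes, like a 2990WX
```

On Windows the equivalent information comes from `GetNumaHighestNodeNumber` / Task Manager's NUMA view rather than sysfs.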

Basically, everything that affected the WX chips is now fixed. And if you doubt the memory bandwidth can handle the core count, look at the 64-core Epyc: it has the same amount of memory bandwidth per core as the 32-core TR.
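The bandwidth-per-core claim is easy to sanity-check with back-of-the-envelope numbers. The per-channel figure below assumes DDR4-3200 at roughly 25.6 GB/s per channel; the exact figure doesn't matter since it cancels out of the comparison:

```python
# Back-of-the-envelope check of the bandwidth-per-core claim.
# Assumption: DDR4-3200 moves about 25.6 GB/s per channel.
per_channel_gbps = 25.6

epyc_64c = 8 * per_channel_gbps / 64   # 8-channel Epyc, 64 cores
tr_32c   = 4 * per_channel_gbps / 32   # quad-channel TR, 32 cores

print(epyc_64c, tr_32c)  # both work out to 3.2 GB/s per core
```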

    It's fine to wait for reviews, but just watch them quickly because I believe there will be large day one demand.
     
    jaybee83 and hmscott like this.
  2. TANWare

    TANWare Just This Side of Senile, I think. Super Moderator

    Reputations:
    2,479
    Messages:
    9,275
    Likes Received:
    4,614
    Trophy Points:
    431
That holds so long as the 32-core is a 2+2 CCX configuration. It is possible it could be 1+1+1+1, in which case there could be memory issues like the WX had. As I said, I need to wait and see, as they are not giving any info out.
     
    hmscott likes this.
  3. ajc9988

    ajc9988 Death by a thousand paper cuts

No, that is impossible! The memory controllers are on the I/O die and are not tied to any specific CCX. There is no extra hop and no limited access; it is shared access for all CCXs, and no CCX gets priority on any memory channel!

    Edit: Here is more info on the design-
    upload_2019-8-9_13-28-43.png
    https://www.servethehome.com/amd-epyc-7002-series-rome-delivers-a-knockout/4/

    upload_2019-8-9_13-31-41.png
    https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/2

    upload_2019-8-9_13-35-26.png
    https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/7

    upload_2019-8-9_13-37-53.png
    https://www.servethehome.com/amd-epyc-7002-series-rome-delivers-a-knockout/6/
    That one shows the topology as seen by the system, with ALL memory as a SINGLE DOMAIN!!!

    Edit 2:
    upload_2019-8-9_13-51-2.png
This explains a bit about how you can set the NPS modes on the new Epyc CPUs. The important part is the quote from the AMD engineering team on accessing near or far DRAM channels on the I/O die: the nearest channels are 6-8ns, the second closest are 8-10ns, and the two farthest pairs are in the 20-25ns range (the closer of the far pairs likely 20-23ns and the farthest 22-25ns). But it is still seen as a single memory domain, so UMA instead of NUMA unless you tell it to go into a NUMA-type mode.

But even with that, the added latency between the near and far channels tops out around 25ns, a spread of only 19ns across the channel pairs. You do not get the WX problem, where nearby memory was in the 60-80ns range while the far channels were 180-240ns, and two dies ALWAYS paid 180-240ns to read from memory.

    "In NPS4, the NUMA domains are reported to software in such a way as it chiplets always access the near (2 channels) DRAM. In NPS1 the 8ch are hardware-interleaved and there is more latency to get to further ones. It varies by pairs of DRAM channels, with the furthest one being ~20-25ns (depending on the various speeds) further away than the nearest. Generally, the latencies are +~6-8ns, +~8-10ns, +~20-25ns in pairs of channels vs the physically nearest ones."

    https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/8
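Taking the quoted tiers at the midpoints of their ranges, the spread and the interleaved average both come out small compared to the WX numbers. A quick illustrative calculation (midpoints are my choice, not measurements):

```python
# Rough math on the AMD quote above: extra latency per pair of DRAM
# channels relative to the physically nearest pair, using midpoints
# of the quoted ranges (0, +6-8, +8-10, +20-25 ns).
extra_ns = [0.0, 7.0, 9.0, 22.5]

worst_case_spread = max(extra_ns) - min(extra_ns)
# NPS1 hardware-interleaves evenly across all channel pairs:
avg_extra_interleaved = sum(extra_ns) / len(extra_ns)

print(worst_case_spread)      # 22.5 ns at the midpoints (25 ns worst case)
print(avg_extra_interleaved)  # 9.625 ns average penalty under interleaving
```

Compare that ~10ns average penalty against the 100ns+ deltas the 2990WX's memory-less dies had to eat on every access.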
     
    Last edited: Aug 9, 2019
    bennyg likes this.
  4. TANWare

    TANWare Just This Side of Senile, I think. Super Moderator

Flow charts will not convince me. I want an actual Threadripper white paper, or better yet tests of the silicon. Even the Epyc diagram shows each of the four dual-channel RAM paths to the I/O chiplet closely tied to 2 CCX units.

So I am not convinced; in fact, I am even more confused. They may have lowered the latency and hop delays, but those may still be there. Although I do not see it happening, to save money the 16-core could even be a 1+1+1+1 CCX layout, but I doubt they would risk crippling the chip against the Ryzen 9 3950X.

Edit: I also believe AMD should not keep TR info under NDA. I do not expect TR to be as powerful, clock for clock, as Epyc, but releasing details would alleviate confusion for HEDT consumers.
     
    Last edited: Aug 9, 2019
    tilleroftheearth and ajc9988 like this.
  5. ajc9988

    ajc9988 Death by a thousand paper cuts

    So, more information will come from Hot Chips in 2 weeks. That is where the deep dive is scheduled.

    Now, let me explain what is going on a bit more. AMD has, by default, the memory controllers set to Unified Memory Architecture, with the system seeing all controllers as 1. But, there is the ability, at least for Epyc, to break that UMA into four NUMA nodes, with them seeing each controller as its own node (with the two closest chiplets being part of one node).

There are still variable latencies for the four controllers. But with them all centralized, every CCX pays the same latency to reach the I/O die regardless of which CCX makes the memory call, so the variance, depending on which memory controller must be accessed, shrinks to at most 25ns and at least 6ns of travel on the I/O die. This greatly reduces the worst-case latencies seen on the WX processors while slightly increasing the best-case ones. Under my theory that stale data was a huge part of the problem on the TR 2970WX and 2990WX, that means memory calls will be, for the most part, standardized going forward.

Now, because of the added latency on the fastest memory paths, they doubled the L3 cache. That actually slowed down L3 cache hits, but it also reduced the number of memory calls. Because fewer calls go out to DRAM, overall CPU performance improves: efficiency through inefficiency.
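The "bigger but slower L3" trade can be put in average-memory-access-time terms. The numbers below are made up purely for illustration; only the direction of the trade matches the argument above:

```python
# Illustrative average-memory-access-time (AMAT) model for the
# "bigger but slower L3" trade-off. All latencies and hit rates
# here are invented for illustration, not Zen 2 measurements.
def amat(l3_hit_ns, dram_ns, hit_rate):
    # Average cost of a memory access that either hits L3 or goes to DRAM.
    return hit_rate * l3_hit_ns + (1 - hit_rate) * dram_ns

old = amat(l3_hit_ns=8.0,  dram_ns=90.0, hit_rate=0.80)  # smaller, faster L3
new = amat(l3_hit_ns=10.0, dram_ns=90.0, hit_rate=0.90)  # doubled, slower L3

print(old, new)  # 24.4 vs 18.0: the larger cache wins despite slower hits
```

As long as the extra hit rate saves more DRAM trips than the slower hit time costs, the doubled cache comes out ahead.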

Now, the reason I hope they use all 4 controllers instead of just 2 is so that each potential "node" has a nearby memory controller for the lowest possible latency, while the average latency of memory calls is equalized. It also potentially lets each of the four controllers clock faster or tolerate more stress, because at most you have 2 ranks per controller (4 ranks with high-capacity memory like quad-rank 32GB DIMMs) instead of 4 ranks (or 8 with those same high-capacity DIMMs). Reducing the stress on each controller could allow better clocks, although this would need to be tested. AMD says each controller, by default, is set to interleaving, meaning reads and writes alternate across which controller gets hit. But this also means TR would have to use the full Epyc I/O die with half of the PCIe fused off, which seems wasteful of I/O dies.
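The interleaving AMD describes can be sketched as consecutive blocks of the physical address space rotating across the controllers, so traffic hits them round-robin. The 256-byte granule below is an assumption for illustration; the real interleave granularity is a platform detail:

```python
# Sketch of channel interleaving: consecutive fixed-size blocks of the
# physical address space alternate across the four memory controllers,
# so reads and writes spread round-robin. The 256-byte granule is an
# assumed value for illustration only.
GRANULE = 256
CONTROLLERS = 4

def controller_for(addr: int) -> int:
    return (addr // GRANULE) % CONTROLLERS

# Four consecutive granules land on four different controllers:
print([controller_for(a) for a in range(0, 4 * GRANULE, GRANULE)])  # [0, 1, 2, 3]
```

This is why interleaved (NPS1) mode evens out bandwidth across controllers but makes every CCX pay the average latency rather than the nearest-channel latency.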

Now, for Epyc, it is actually the 1+1+1+1 CCX layout that is listed as the top 16-core SKU, the one with the most cache. If they did that for Threadripper, it would quadruple the L3 cache on the chip while likely using only the top 2 cores per CCX, which also spreads out the heat even further, allowing higher clock speeds. Where that would hurt is in lightly threaded workloads, where anything needing more than 2 cores has to go through the I/O die to another CCX. So for heavily multi-threaded workloads the L3 cache would be a huge boon, but you take a hit on lightly threaded stuff. For server chips, that explains exactly why the top 16-core is set up with 4 dies, as servers usually run heavier multi-threaded workloads. It is less clear what the benefit is for a workstation (although I'm sure AMD tested it before making whatever decision was made).
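The cache arithmetic behind that SKU choice is straightforward if you assume the commonly reported 32 MB of L3 per Rome CCD, which is present regardless of how many cores on the die are fused off:

```python
# Cache arithmetic for a 16-core part, assuming the commonly reported
# 32 MB of L3 per Rome CCD (full L3 even with cores fused off).
L3_PER_CCD_MB = 32

def total_l3(ccd_count):
    return ccd_count * L3_PER_CCD_MB

# A 16-core from four half-populated CCDs vs. two fully populated ones:
print(total_l3(4), total_l3(2))  # 128 MB vs 64 MB total L3
print(total_l3(4) / 16)          # 8 MB of L3 per core on the 4-die layout
```

Same core count, double the cache, and the heat spread over twice the silicon area.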

I see 4 dies not as a way to save money, but as a way to spread out the heat and increase the cache size of the workstation chips, while also using up dies where a core or two was damaged. You would still need the remaining two cores per CCX to clock FAST! But spreading out the heat that much would likely let them clock significantly faster, which could make up for some of the drawbacks.

    I'm sure we will see more information on TR as we approach the launch.
     
  6. hmscott

    hmscott Notebook Nobel Laureate

    Reputations:
    6,397
    Messages:
    19,705
    Likes Received:
    24,504
    Trophy Points:
    931
    AMD released the Epyc Horizon Event in full with improved audio (no echo), and it's much easier to listen to than the previously posted 3rd party coverage:

    AMD EPYC™ Horizon Event - Full Show
    AMD
    Published on Aug 9, 2019
    AMD EPYC™ Horizon event full run of show Aug. 7, 2019. For more information including endnotes, please visit: https://www.amd.com/en/processors/epyc-7002-series


    bgtubber 2 hours ago
    2:08:36 "You came for this." <== Continue from here for Lisa's closing with Google.

    Dennis C 54 minutes ago
    "Great presentation, great closing statement by Dr. Lisa Su at the end. AMD EYPC is the new standard."
     
    Last edited: Aug 9, 2019
    ajc9988 likes this.
  7. TANWare

    TANWare Just This Side of Senile, I think. Super Moderator

    Epyc;
     
    jaybee83, ajc9988 and hmscott like this.
  8. hmscott

    hmscott Notebook Nobel Laureate

This man is a benchmarking demon! 900+ benchmark runs over three 16-hour days for this comparison review of the 3900X and 3600. It's nice to see Ryzen 3000 / Zen 2 has caught up close enough to Intel in the gaming benchmarks that the gap no longer matters enough for most of us to care.

    Ryzen 5 3600 vs. Ryzen 9 3900X vs. Core i9-9900K: GPU Scaling Benchmark
    Hardware Unboxed
    Published on Aug 10, 2019


Linus gets animated with excitement about AMD's Rome launch. After seeming conflicted earlier in the same show about his Intel sponsorship losing its luster due to the bad timing, he goes gaga over Rome. [Intel isn't even in the competition any longer in Enterprise]

    Userbench CPU score DRAMA - WAN Show Aug 9, 2019
    Linus Tech Tips
    38:45 AMD EPYC 7002 Rome delivers a knockout
    42:00 AMD stock analysis?
    45:25 AMD EPYC technobabble
     
    Last edited: Aug 10, 2019
    ajc9988 likes this.
  9. hmscott

    hmscott Notebook Nobel Laureate

The new AMD Epyc CPUs pack so much performance into a single socket that they outperform the 2-socket systems that have been the standard. Given that, coupled with much lower price points, Lenovo thinks it's time to bring out a range of 1U and 2U servers using the new AMD EPYC 7002 "Rome" family of processors.

These might be good choices for home (garage / closet) service as much as, or more than, datacenters. ServeTheHome seems to think so as well; both are included below:

    Lenovo ThinkSystem SR635 Video Walkthrough
    Lenovo Data Center
    Published on Aug 7, 2019
    The Lenovo ThinkSystem SR635 is a 1-socket 1U server that features the new AMD EPYC 7002 "Rome" family of processors. With up to 64 cores per processor and support for the new PCIe 4.0 standard for I/O, the SR635 offers the ultimate in single-socket server performance in a space-saving 1U form factor.

    In this Lenovo Press walk-through, David Watts and Russ Resnick take you through the server and describe the major components.

    To learn more about the server, check these resources:
    * Datasheet: https://lenovopress.com/ds0099
    * Product guide: https://lenovopress.com/lp1160
    * Interactive 3D Tour: https://lenovopress.com/lp1182


    Lenovo AMD EPYC 7002 Servers Make a Splash
    Patrick Kennedy, August 11, 2019
    https://www.servethehome.com/lenovo-amd-epyc-7002-servers-make-a-splash/

    Here are some images of the 1U (thin) profile server, and the article continues with images for the 2U (double height) thick server.
    Lenovo-ThinkSystem-SR635-Front.jpg
    Lenovo-ThinkSystem-SR635-Internal-Diagram.jpg
    Lenovo-ThinkSystem-SR635-Rear-Options.jpg

    Here's the walk through by Lenovo for the 2U (double height) version of the new AMD server, for applications where more internal storage and/or more PCIe slots are required:

    Lenovo ThinkSystem SR655 Video Walkthrough
    Lenovo Data Center
    Published on Aug 7, 2019
    The Lenovo ThinkSystem SR655 is a 1-socket 2U server that features the new AMD EPYC 7002 "Rome" family of processors. With up to 64 cores per processor and support for the new PCIe 4.0 standard for I/O, the SR655 offers the ultimate in single-socket server performance.

    In this Lenovo Press walk-through, David Watts and Russ Resnick take you through the server and describe the major components.

    To learn more about the server, check these resources:

    * Datasheet: https://lenovopress.com/ds0103
    * Product guide: https://lenovopress.com/lp1161
    * Interactive 3D Tour: https://lenovopress.com/lp1183
     
    Last edited: Aug 12, 2019
  10. ole!!!

    ole!!! Notebook Prophet

    Reputations:
    2,100
    Messages:
    5,562
    Likes Received:
    3,498
    Trophy Points:
    431
Put the memory controller back onto the chiplet, every chiplet, or somehow reduce memory latency altogether. That'll give us the lowest latency.
     
    tilleroftheearth likes this.