Thursday, August 12, 2010

PCI Express And SLI Scaling: How Many Lanes Do You Need?

Are the most elaborate platforms really required to host the fastest GPUs, or can you get away with P55's lane-splitting scheme? As Nvidia’s latest graphics processors push 3D performance to new heights, we examine the interfaces needed to support them.
A mere seven months have passed since our most recent PCI Express scaling article showed modest performance differences between PCIe x8 and PCIe x16 slots. But it has been a very busy seven months!
The first salvo came when Nvidia’s much-delayed GeForce GTX 480 smoked AMD’s Radeon HD 5870 as the fastest single-GPU card on the market, and the mid-priced solution that followed showed the highest multi-GPU performance scaling we’ve ever seen.

Unfortunately, such an elevated degree of technological achievement is difficult to swallow for a motherboard reviewer, as it makes my earlier findings irrelevant to most users.

Today’s question centers on you, the PC owner. Do you actually need an X58 platform to support the latest graphics technologies, or will something with fewer lanes suffice? MSI helped us answer it with a single product: an X58 motherboard that also has the x8 and x4 modes found on some P55 solutions.

We’ve already seen how X58 and P55 motherboards offer similar gaming performance when using a single x16 slot. And limiting ourselves to a single board allows us to focus exclusively on PCI Express lane width by eliminating every other variable. The name of that product is, of course, the Big Bang-XPower.

While it certainly doesn’t represent the P55 market’s moderate pricing, the XPower’s biggest liability becomes an asset for the purpose of today’s test. Its two PCIe 2.0 x16 links are each shared among up to three x16-length slots, changing to x8-x0-x8-x8-x8-x0 mode when slots three and five are filled, and then to x8-x4-x4-x8-x4-x4 mode when slots two and six are filled. Thanks to MSI, we can now check x16, x8, and x4 transfer modes on a single motherboard, without using little fingers of tape to reduce the number of connections on the card itself.
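To put those lane counts in perspective, here is a minimal sketch of the theoretical bandwidth each mode offers. It assumes the standard PCIe 2.0 figures of 5 GT/s per lane with 8b/10b encoding; the function name is ours, and real-world throughput will be lower than these ceilings.

```python
# PCIe 2.0 signals at 5 GT/s per lane and uses 8b/10b encoding,
# so 8 of every 10 transferred bits carry payload data.
TRANSFERS_PER_SECOND = 5.0e9
ENCODING_EFFICIENCY = 8 / 10


def pcie2_bandwidth_gbs(lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe 2.0 link in GB/s."""
    payload_bits_per_second = TRANSFERS_PER_SECOND * ENCODING_EFFICIENCY * lanes
    return payload_bits_per_second / 8 / 1e9  # bits -> bytes -> gigabytes


# The three link widths tested on the XPower.
for lanes in (16, 8, 4):
    print(f"x{lanes}: {pcie2_bandwidth_gbs(lanes):.1f} GB/s per direction")
```

Each halving of the lane count halves the ceiling, which is why the x4 mode is the one to watch in the charts that follow.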

Test System Configuration
CPU: Intel Core i7-920 (2.66 GHz, 8 MB Shared L3 Cache), Overclocked to 4.00 GHz at 1.40 V, 160 MHz BCLK
Motherboard: MSI Big Bang-XPower, BIOS V1.2 (06/09/2010), Intel X58 Express, LGA 1366
RAM: Kingston KHX16000D3ULT1K3/6GX (6 GB), DDR3-2000 at DDR3-1600 CAS 7-7-7-21
GTX 480 Graphics: MSI GeForce GTX 480 1.5 GB, 700 MHz GPU, GDDR5-3696
OS Hard Drive: Western Digital VelociRaptor WD3000HLFS, 300 GB, 10,000 RPM, SATA 3Gb/s, 16 MB cache
Sound: Integrated HD Audio
Network: Integrated Gigabit Networking
Power: OCZ-Z1000 1000 W Modular, ATX12V v2.2, EPS12V, 80 PLUS Gold
Software
OS: Microsoft Windows 7 Ultimate 64-bit
GeForce Graphics: Nvidia ForceWare 258.96
Chipset: Intel INF 9.1.1.1020

Our Core i7-920 is overclocked to 4.00 GHz in an attempt to remove the “CPU cap” on 3D performance.

Thermalright’s MUX-120 keeps our overclocked CPU cool enough to pass stability tests.

With a mid-load efficiency of around 91% and an 80 PLUS Gold rating, OCZ’s Z1000 power supply provides optimal “full system” power testing. Because its efficiency curve dips to around 89% at its ends, readers can multiply today’s input power readings by 0.90 to calculate output power within ±1%.
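The conversion is simple enough to sketch. The 0.90 factor and the 89% to 91% band come from the efficiency figures above; the function names and the 400 W wall reading are hypothetical, chosen only to illustrate the arithmetic.

```python
def estimate_output_watts(input_watts: float, efficiency: float = 0.90) -> float:
    """Estimate DC output power from an AC input (wall) reading."""
    return input_watts * efficiency


def output_bounds_watts(input_watts: float, low: float = 0.89, high: float = 0.91):
    """Bracket the estimate using the ends of the efficiency band."""
    return input_watts * low, input_watts * high


reading = 400.0  # hypothetical wall reading in watts, not a measured result
estimate = estimate_output_watts(reading)
lower, upper = output_bounds_watts(reading)
print(f"Input {reading:.0f} W -> about {estimate:.0f} W output "
      f"(between {lower:.0f} W and {upper:.0f} W)")
```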

Benchmark Configuration
3D Games
Aliens Vs. Predator Benchmark
  Alien Vs Predator Benchmark Tool
  Test Set 1: Highest Settings, No AA
  Test Set 2: Highest Settings, 4x AA
Call of Duty: Modern Warfare 2
  Campaign, Act III, Second Sun (45 sec. FRAPS)
  Test Set 1: Highest Settings, No AA
  Test Set 2: Highest Settings, 4x AA
Crysis
  Patch 1.2.1, DirectX 10, 64-bit executable, benchmark tool
  Test Set 1: Highest Quality, No AA
  Test Set 2: Highest Quality, 4x AA
DiRT 2
  Run with -benchmark example_benchmark.xml
  Test Set 1: Highest Settings, No AA
  Test Set 2: Highest Settings, 4x AA
S.T.A.L.K.E.R.: Call Of Pripyat
  Call Of Pripyat Benchmark version
  Test Set 1: Highest Settings, No AA
  Test Set 2: Highest Settings, 4x MSAA
Synthetic Benchmarks and Settings
3DMark Vantage
  Version: 1.0.1, GPU and CPU scores

3DMark does a great job of testing GPU and CPU performance, but we’re not yet certain how relevant its results will be in a bandwidth comparison.

The PCIe 2.0 x8 slot performs only around 1% slower than a 16-lane slot at the benchmark’s 1280x1024 “Performance” preset, while the x4 slot drops behind by another 3%.

The performance difference between x16 and x4 slots narrows to 2% at 3DMark’s 1920x1200 “Extreme” preset.

Experience tells us that Crysis is usually GPU-limited, and it appears that bandwidth limits are far less of a problem as resolution is increased.

The x4 slot suffers a 9% performance handicap at 1680x1050, while the x8 slot allows the GPU to reach 98% of its performance potential. That is to say, the mid-sized slot looks like an acceptable option for Crysis.

While most games show only modest differences between the various slot configurations, Call of Duty: Modern Warfare 2 varies unusually widely, and as one of our five games it accounts for 20% of our benchmark totals.

Builders can expect an average performance loss of 8% when going from a x16 to a x8 slot. That could be an important consideration when using a platform that has a limited number of PCI Express lanes, such as an LGA 1156 platform in SLI mode. But before we move on to the SLI tests, let’s see what effect these configurations have on power, heat, and efficiency.

Dropping PCIe lanes can reduce power consumption, but not enough to matter to most high-end PC owners.

We wouldn’t expect a difference in heat simply from using a different slot, so we weren’t surprised to find that none existed.

Losing moderate performance without a similarly-sized reduction in power is a recipe for an efficiency disaster, since the calculation compares performance to power.
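As a rough sketch of that calculation, efficiency can be expressed as relative performance divided by relative power, both measured against the x16 baseline. The 8% performance loss is the x8 average noted earlier; the 1% power saving is an assumed figure for illustration only, not one of our measurements, and the function name is ours.

```python
def relative_efficiency(performance_ratio: float, power_ratio: float) -> float:
    """Efficiency relative to a baseline: performance divided by power.

    Both arguments are ratios against the baseline configuration,
    so the baseline itself scores exactly 1.0.
    """
    return performance_ratio / power_ratio


x8_performance = 0.92  # 8% average loss versus x16 (from the single-card results)
x8_power = 0.99        # assumed 1% power saving, purely illustrative

efficiency = relative_efficiency(x8_performance, x8_power)
print(f"x8 efficiency versus x16: {(efficiency - 1) * 100:+.1f}%")
```

Unless power falls by the same proportion as performance, the ratio drops below 1.0, which is the "disaster" described above.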

Now that we know to expect an 8% average performance loss when moving a single GeForce GTX 480 from a x16 to a x8 PCIe 2.0 slot, let’s see how that difference translates to SLI. Do we really need more than sixteen PCIe lanes to support two high-end graphics cards?

We didn’t see much of a difference between the x8 and x16 slots in Aliens Vs. Predator from our single-card tests, so we don’t expect a big difference in SLI. It’s nice, however, to see how well SLI scales compared to a single card, with a peak SLI performance gain of 92%.

An oddity occurs as resolution is increased, with the dual x8 slots outpacing the dual x16 slots at 2560x1600. We can’t say for certain why that happens. Some inefficiency attributable to SLI in this title is one candidate, and the motherboard itself is another.

DiRT 2 is slightly CPU-bound when using dual GeForce GTX 480s at medium resolutions, gaining “only” 72% from the use of two cards at 1680x1050. Once again, GPU dependence increases as resolutions are increased, so the SLI advantage grows to 91% at 2560x1600.
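For readers who want to run the same comparison on their own results, the SLI gain is just the dual-card frame rate measured against the single-card frame rate. In this minimal sketch the frame rates are hypothetical values picked to reproduce the 72% and 91% gains quoted above; they are not our measured DiRT 2 numbers, and the function name is ours.

```python
def sli_gain_percent(single_card_fps: float, sli_fps: float) -> float:
    """Percentage gain of a two-card result over the single-card result."""
    return (sli_fps / single_card_fps - 1.0) * 100.0


# Hypothetical frame rates chosen to match the quoted percentages.
examples = {
    "1680x1050": (50.0, 86.0),
    "2560x1600": (40.0, 76.4),
}

for resolution, (single_fps, dual_fps) in examples.items():
    gain = sli_gain_percent(single_fps, dual_fps)
    print(f"{resolution}: {gain:.0f}% gain from the second card")
```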

DiRT 2 wasn’t very bandwidth-dependent with a single card, so the fact that the dual-x8 slot configuration trails the dual-x16 configuration by only 2-5% is no surprise here, either.

Everyone who has followed today’s scaling article to this point should be completely aware of the pattern that has emerged. Testing the GeForce GTX 480 in SLI has not yet made sense at medium resolutions, because the 4.00 GHz Core i7 CPU hasn’t been able to keep up with the cards at anything less than 2560x1600. S.T.A.L.K.E.R.: Call of Pripyat is proving to be an exception, only because the cards can’t keep up with the game at that resolution.

We still see the performance gain for SLI shoot up from 75% at 1680x1050 to 94% at 2560x1600, yet because the game isn’t completely playable at our highest test settings (with a 16 FPS minimum frame rate not shown), most users will be forced to sacrifice resolution, details, or anti-aliasing to regain a more fluid experience.

Our list of SLI-based benchmarks showed excellent scaling, but only at the highest-tested 2560x1600 resolution. A serious CPU “bottleneck” is the most likely cause for decreased SLI scaling at lower resolutions. For most games, it doesn’t even make sense to test a pair of GeForce GTX 480 graphics cards at anything less than 2560x1600, and one benchmark was completely crippled by the performance of our 4.00 GHz CPU, even at 1920x1200. Let’s see what effect this CPU limit had on our overall scaling performance:

While two cards outperform a single card by 90% or more in most games, that only happened at our highest test resolution. Poor scaling at lower resolutions dropped our average gain to only 63%. Moreover, the one game that was most bandwidth-dependent in our single-card tests was the same game that became almost completely CPU-bound in SLI, obliterating the 8% performance difference previously noted in our single-card PCIe evaluation.

While the performance gain of SLI exceeded the increased power consumption of today’s system, we again note that it happened only at high resolutions. A net loss in SLI power efficiency can be attributed exclusively to the inclusion of 1680x1050 in today’s tests.

One other peculiarity of today’s test was that our x8/x8 SLI configuration required the card coolers to be adjacent to each other, while the x16/x16 configuration had one empty space between cards. Yet, we never saw a card overheat. How much of a problem did shoving the cards together create?

Nvidia puts a hole in the back of its GTX 480 graphics card, behind the fan, so that the fan can take air in from both sides. The result is that we didn’t see a big difference in temperature between cards that were placed closer together. We expect this design to be less effective for the center card in three-way configurations, and we plan to scale our tests to even greater heights in future articles.

The big question today was whether or not we needed more than 16 lanes to feed multiple high-end graphics cards in SLI, and the answer is a solid “perhaps not.”

OK, let’s call it a conditional “no.”

While we did see a fairly large difference between x8 and x16 slots when a single card was used, adding a second card shifted our limit to CPU performance. That is to say, for most of today’s tests, a faster CPU would be far more important than dual x16 slots in achieving the ultimate SLI performance.

That answer presents its own set of questions, since our high-flying Core i7 CPU was already pushed to 4.00 GHz. Most builders simply can’t go much higher with a daily-use gaming machine.

Another part of that conditional answer pertains to test resolution. GPU dependence increases with resolution, to the point that two cards eventually become a “bottleneck” far tighter than the CPU. Yet at that level of GPU dependence, the cards themselves hold performance back more than a PCIe x8 link does.
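The shifting bottleneck can be illustrated with a toy model in which the frame rate is whichever is lower: the CPU's ceiling or the combined throughput of the graphics cards. This is a deliberate simplification rather than how we produced our results. Every number in it is hypothetical, the 0.95 scaling factor is an assumption, and PCIe bandwidth is left out entirely.

```python
def frame_rate(cpu_limit_fps: float, single_gpu_fps: float,
               num_gpus: int, scaling: float = 0.95) -> float:
    """Toy model: the slower of the CPU ceiling and the GPU throughput wins.

    `scaling` is the assumed fraction of a full card's throughput
    contributed by each card after the first.
    """
    gpu_limit_fps = single_gpu_fps * (1 + (num_gpus - 1) * scaling)
    return min(cpu_limit_fps, gpu_limit_fps)


def toy_sli_gain(cpu_limit_fps: float, single_gpu_fps: float) -> float:
    """Percentage gain from a second card under the toy model."""
    one_card = frame_rate(cpu_limit_fps, single_gpu_fps, 1)
    two_cards = frame_rate(cpu_limit_fps, single_gpu_fps, 2)
    return (two_cards / one_card - 1) * 100


CPU_LIMIT_FPS = 100.0  # hypothetical ceiling imposed by the CPU

# Hypothetical single-card frame rates: higher resolution loads the GPU more.
for resolution, gpu_fps in (("1680x1050", 70.0), ("2560x1600", 35.0)):
    gain = toy_sli_gain(CPU_LIMIT_FPS, gpu_fps)
    print(f"{resolution}: second card adds {gain:.0f}%")
```

At the lower resolution the second card runs into the CPU ceiling and much of its potential is wasted, while at the higher resolution the cards stay below that ceiling and the gain approaches the assumed scaling factor.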

In the end, we simply needed a faster CPU to apply everything we learned about single-card bandwidth to multi-GPU configurations. This finding should come as some comfort to owners of “high-end” P55-based systems who might be considering an SLI upgrade for their GeForce GTX 480 graphics cards. If you have a high-performance processor in that motherboard, and you're overclocking to 4 GHz+, the extra CPU horsepower will have a more profound impact than an upgrade to an X58-based machine. Making X58 truly worthwhile requires an even faster processor and resolutions beyond 2560x1600.
