Message boards : Number crunching : First Skylake CPUs hit the streets
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
Very tempted to pull the trigger on that new micro tower Dell box I mentioned above, now that a price drop has come in the form of a promotion that knocks 30% off the original price. I've been looking at the page for it off and on for a couple of days, to the point where most sites I visit now have their Google ads hinting at me to go back and buy this thing. On that note, if I do pull the trigger, I will go to it via an ad posted on boincstats so that site can get a little support. :) |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,627,225 RAC: 9,274 |
Skylake is undoubtedly more efficient, but the lower temperatures on the CPU are due in part to moving the voltage regulator off-chip. That doesn't save power; it only lowers the CPU temperature. I believe Skylake's power savings over Broadwell (i.e., excluding the efficiency gains from the move to the 14nm process, which Broadwell already has) come from clocking down faster and throttling more intelligently, using PWM between the most efficient frequency and off rather than running constantly at lower frequencies. I don't think any of that applies when running Rosetta, as the machine won't be clocking down. I've just ordered a Skylake G4500, so I will see how that performs. It's essentially half of a 6600K, but without the overclocking, and at 1/4 the price. |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Hopefully the Zen architecture from AMD will spice things up in the single-threaded performance department for both companies. The last few years, mostly all we've seen is lower-power CPUs + integrated GPUs. Not much improvement in raw single-threaded performance/watt since the release of Intel's Tri-Gate transistor. EDIT: But first... Rosetta should implement the usage of SSE. Pentium III technology. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,627,225 RAC: 9,274 |
Hopefully the Zen architecture from AMD will spice things up in single-threaded performance department from both companies. Last few years all we've seen mostly is lower consumption CPUs + integrated GPUs. Not much improvement regarding raw single-threaded performance/watt ever since the release of the TriGate transistor from Intel.
Hopefully it will, but Broadwell/Skylake/Kaby Lake are a node ahead of Zen, so it's unlikely to compete on power efficiency, unfortunately. Hopefully I'm wrong! |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
This CPU may be a good one for Rosetta. This Broadwell CPU plugs right into the X99 motherboards shipped for Haswell. It clocks at a 3.0/3.5GHz CPU rate for 10/20 Cores/Threads, so it can run 20 Rosetta tasks simultaneously. 20 threads x 3.5GHz = 70 GHz of turbo compute (plus overclock).
My Gigabyte motherboard allows me to tweak the CPU frequencies without REBOOTING, so I am running my 6/12 C/T i7-5930K CPU @ 3.5/3.7GHz currently at 4.1GHz (water cooled). Intel sells an 8/16 C/T version too. 12 threads x 3.7GHz = 44.4 GHz of turbo compute (plus overclock); 12 threads x 4.1GHz = 49.2 GHz today (current overclock).
The 2011-pin v3 socket X99 motherboard will also take a Xeon E5-2695 v3 CPU today, with 14/28 C/T at 2.3/3.3GHz standard/turbo: 28 threads x 3.3GHz = 92.4 GHz of compute (plus overclock).
Pricing may be an issue too. The Extreme parts are historically priced in the $1,000-per-CPU range and the Xeon parts at 2x or 3x that. BUT, my D-1540 motherboard, which I plugged into an old chassis, cost me $900 for 8/16 C/T at 2.0/2.6GHz, for 41.6 GHz of compute. Of course the 1536-pin motherboard is not compatible with anything except the software.
Since Rosetta is unable to take advantage of any vector operations, Rosetta performance is likely to scale pretty closely with CPU frequency. |
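The threads-times-clock arithmetic in the post above can be sketched as a quick check. This is a rough ceiling, not a measured throughput: it assumes every hardware thread sustains the quoted clock on a full Rosetta task. The figures are the ones quoted in the post.

```python
# Back-of-envelope "aggregate GHz" as used in the post: threads x clock.
# An upper bound, not a benchmark result.

def aggregate_ghz(threads: int, clock_ghz: float) -> float:
    """Aggregate compute in GHz: thread count times per-core clock."""
    return threads * clock_ghz

print(aggregate_ghz(20, 3.5))   # Broadwell-E 10/20 C/T at 3.5 GHz turbo -> 70 GHz
print(aggregate_ghz(12, 4.1))   # i7-5930K at the 4.1 GHz overclock -> ~49.2 GHz
print(aggregate_ghz(28, 3.3))   # Xeon E5-2695 v3 at 3.3 GHz turbo -> ~92.4 GHz
```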
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
As mentioned in a previous post, I am forced to stick with ultra-small-form-factor boxes (see this picture of my crunching farm) so as to maintain the approval of the wife, who doesn't want my hobby taking up space or creating noise. I gave in and pulled the trigger on buying one of these boxes XD. It won't arrive for 2~3 weeks, but when it does I'll be sure to benchmark it, and specifically I'll post about its power consumption and noise profile. This will be a good way to heat the house for the winter (or not, if its efficiency is where I'm hoping it is). Again, the reason I'm limited to this ultra-small form factor is that it literally has to be 'out of sight' of the wife [right now my 'farm' of crunchers is tucked away under a table in the back room] and it can't make any noise. |
Dr. Merkwürdigliebe Send message Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
Mhhh, according to my knowledge, only one core will run at 3.5GHz in turbo mode. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 7,594 |
This Broadwell CPU plugs right into the X99 motherboards shipped for Haswell. It clocks at a 3.0/3.5GHz CPU rate for 10/20 Cores/Threads. It can run 20 Rosetta simultaneously.
Even if it's 3GHz, the FLOPS are 20 x 3 x 4 = 240 GFLOPS (theoretical). A LOT of computational power, like high-end Xeon/Opteron CPUs. (I remember my first CPU on this project, a Pentium IV 1.7GHz single core...) But I keep coming back to the fact that my GPU (an AMD R7 260X, $100 or so) gives me 1.2 TFLOPS in SP and 123 GFLOPS in DP. It's a pity not to use it on R@H :-( |
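The theoretical-peak arithmetic above (units x clock x FLOPs/cycle) can be written out. The 4 DP FLOPs/cycle figure is the post's assumption; the scalar line is an illustration of why Rosetta's usable ceiling is much lower than the vector peak.

```python
# Theoretical peak FLOPS: execution units x clock x FLOPs per cycle.
# Scalar code, like Rosetta's, effectively retires one FLOP per cycle
# per thread, so its ceiling is a fraction of the vector peak.

def peak_gflops(units: int, clock_ghz: float, flops_per_cycle: float) -> float:
    return units * clock_ghz * flops_per_cycle

print(peak_gflops(20, 3.0, 4))   # the 240 GFLOPS estimate from the post
print(peak_gflops(20, 3.0, 1))   # scalar-only ceiling: 60 GFLOPS
```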
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
As mentioned in a previous post, I am forced to stick with ultra-small-form-factor boxes (see this picture of my crunching farm) as to maintain approval of the wife who doesn't want my hobby taking up space or creating noise.
Lots of cache as well. It does, though, have only a SINGLE RAM stick. You might consider adding a second stick of RAM to get dual-channel going. RAM is inexpensive compared to the total of what you just spent anyway. |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
If someone here is running R@H on a new Intel Skylake (i7 6700K, i5 6600K, etc.), do share your "R@H benchmarks" and your comments, e.g. average credits for various R@H jobs and the run durations. Do let us know other related items too, e.g. overclock GHz, RAM speed (e.g. DDR4), CPU core temperatures at full throttle / overclocked, CPU cooler/heatsink, etc. :D R@H certainly deserves to be grouped into the hall of CPU benchmarks, being a *real job* (not some synthetic job) that close to a million users/hosts around the world are running :D |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
The "K" on the end of the Intel CPU part means "KILL me if you can". You can raise the CPU frequency ABOVE THE MAX SINGLE-CORE TURBO until it just overheats and the only option left to the CPU is to shut down. I did that yesterday ... 8-} The manufacturer ratings reflect a guaranteed range; if you operate outside that, YOYO. BOINC reports the rated frequency & turbo and doesn't seem to reflect the overclocked frequency. The Gigabyte board comes with an "App Center" that allows you to play with and adjust the frequency without a REBOOT, and I routinely overclock the system so all the CPUs run at 4.10GHz, which is well above the 3.5/3.7GHz rated frequency. I bought the system with a liquid-cooled heatsink. It is MUCH!!! quieter than air fans and worth the extra $100-$150 they charge for it. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
This Broadwell CPU plugs right into the X99 motherboards shipped for Haswell. It clocks at a 3.0/3.5GHz CPU rate for 10/20 Cores/Threads. It can run 20 Rosetta simultaneously.
If I plugged that CPU into my liquid-cooled board, I would probably run it at 3.9GHz to 4.3GHz ... most likely 4.1GHz, like I do my Haswell-E. Rosetta is a "cool running" project because most of the CPU is idle, waiting on the low portion of the vector register to execute (scalar mode). Most projects I have seen use DP, so SP GPU computation would not be usable on them. If you cannot vectorize a program, a GPU, which is just a wide vector processor, will be of no use. When it can be applied, it is pleasing to watch the credits roll up! If the project developer resorts to frequent use of complex functions like pow(), exp(), ... those function calls are tough to vectorize, and the remaining code outside those calls would not likely benefit from vectorization. The DENIS project code is very much like that: the code gets throttled as SCALAR floating-point code, and computation is limited by memory and FP instruction latency. |
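The scalar-vs-vector contrast described above can be sketched with a toy issue-slot model (a pure-Python illustration, not Rosetta's actual code): a 4-wide vector unit retires four elements per issue slot, while scalar code uses one lane and leaves the other three idle.

```python
# Toy model of scalar vs 4-wide vector execution: count the issue slots
# needed to add two arrays element-wise.

VECTOR_WIDTH = 4  # e.g. four 64-bit DP lanes in a 256-bit AVX register

def scalar_add(a, b):
    """One element per issue slot, like scalar FP code."""
    out, slots = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        slots += 1
    return out, slots

def vector_add(a, b):
    """Up to VECTOR_WIDTH elements per issue slot."""
    out, slots = [], 0
    for i in range(0, len(a), VECTOR_WIDTH):
        out.extend(x + y for x, y in zip(a[i:i + VECTOR_WIDTH],
                                         b[i:i + VECTOR_WIDTH]))
        slots += 1
    return out, slots

a, b = list(range(100)), list(range(100))
s_out, s_slots = scalar_add(a, b)
v_out, v_slots = vector_add(a, b)
assert s_out == v_out           # same results either way
print(s_slots, v_slots)         # 100 slots scalar vs 25 slots vectorized
```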
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 7,594 |
Most projects I have seen use DP so SP GPU computation would not be used on them. If you cannot vectorize a program, a GPU, which is just a wide vector processor, will be of no use. When it can be applied, it is pleasing to watch the credits roll up!
Over two years ago, William Sheffler (a member of the Baker Lab) published two PDFs with tests on GPUs (the PDFs are no longer accessible). I don't remember if they used PyOpenCL or another language for their tests. They said that GPUs are very important in HPC, but "GPU code for Rosetta is not being actively pursued in our lab. Will posted lots of stuff about his work for GPUs a few days ago, but as he says, what he did ends up being very specialized and is not of general use." |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Off-topic: a muse on vectorized computing (SSE/AVX/AVX2/OpenCL, e.g. GPU). There has been quite a bit of talk about vectorized computing, i.e. the use of GPUs and AVX2, e.g. the OpenCL thread on the Ralph@home message board http://ralph.bakerlab.org/forum_thread.php?id=440#4716 and this thread. I actually did a little experiment. I'm running a Haswell i7 4771 (non-K), ASUS H87-PRO motherboard & 16GB 1600MHz RAM. I tried OpenBLAS http://www.openblas.net/ https://github.com/xianyi/OpenBLAS and ran ./dlinpack.goto. These are the benchmarks:

 SIZE        Residual       Decompose           Solve             Total
  100 : 4.529710e-14     503.14 MFlops   2000.00 MFlops    514.36 MFlops
  500 : 1.103118e-12    8171.54 MFlops   3676.47 MFlops   8112.38 MFlops
 1000 : 5.629275e-12   45060.27 MFlops   2747.25 MFlops  43075.87 MFlops
 5000 : 1.195055e-11  104392.81 MFlops   3275.04 MFlops 102495.20 MFlops
10000 : 1.529443e-11  129210.71 MFlops   3465.54 MFlops 127819.77 MFlops

OK, quite impressive: ~128 GFLOPS on a Haswell i7 desktop PC running at only 3.7GHz! That almost compares to an 'old' supercomputer, the Numerical Wind Tunnel in Japan https://en.wikipedia.org/wiki/Numerical_Wind_Tunnel_%28Japan%29. But what becomes immediately apparent is that only very large matrices (10,000 x 10,000) benefit from the vectorized code (i.e. AVX2) in the *decompose* part. If you have tiny matrices, say 100x100 in size, that gives a paltry 514.36 MFlops, less than 1/200 of the 10,000 x 10,000 speed. The other apparent thing is the *solve* part of the computation: while the decompose part, which involves matrix multiplication (e.g. DGEMM), can reach ~128 GFLOPS, the *solve* part *did not benefit* from the AVX2 vectorized code, showing little improvement across matrix sizes!
This has major implications: whether you have a good CPU with AVX2, or a large GPU that can process, say, thousands of vectorized/parallel floating-point calcs per clock cycle, if your problems are small (e.g. 100x100), or cannot benefit from vectorized code, much of that GPU capacity (and, in this instance, even AVX2) may simply go *unused*, and you will *not benefit* from all that expensive vector hardware (e.g. AVX2, and GPU cards with thousands of SIMD cores). I'd guess this reflects, in a way, Amdahl's law https://en.wikipedia.org/wiki/Amdahl%27s_law. Gene Amdahl passed away recently, and perhaps this could be a little tribute to him for having 'seen so far ahead' back then. http://www.latimes.com/local/obituaries/la-me-gene-amdahl-20151116-story.html |
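Amdahl's law, which the post invokes, puts a number on this: if a fraction p of the work can be vectorized and that part is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A quick sketch (the fractions below are illustrative, not measured from Rosetta):

```python
# Amdahl's law: the serial (non-vectorizable) fraction caps the speedup
# no matter how wide the vector unit is.

def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when fraction p of the work is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Even an 8-wide vector unit helps little if only half the work vectorizes:
print(round(amdahl_speedup(0.5, 8), 2))    # 1.78
print(round(amdahl_speedup(0.95, 8), 2))   # 5.93
```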
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
off-topic: a muse on vectorized computing SSE/AVX/AVX2/OpenCL e.g. GPU
The one problem that I see with your measurement and analysis is: all your tests are running vector code, even the 100x100 matrix. The AVX2 version of the DP Linpack will crunch 4 64-bit DP values in each operation. Increasing the matrix size increases the loop counts and decreases the impact of the setup/breakdown overhead by spreading it over more AVX2 operations. Rosetta would be equivalent to a 1x1 matrix operation, since Rosetta is scalar. Being able to process 2 DP operations in parallel would DOUBLE the performance of the inner loop. The front end of the dlinpack.goto results curve is the part that is of interest. |
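The overhead-amortization point above can be sketched with a toy model. The overhead and width numbers here are made up, purely to show the shape of the curve: small problems are dominated by the fixed setup/breakdown cost, large ones amortize it away.

```python
# Fraction of vector peak achieved when a fixed scalar setup/breakdown
# cost wraps n elements processed WIDTH at a time.

WIDTH = 4       # AVX2 DP lanes, as in the post above
OVERHEAD = 50   # hypothetical setup/breakdown cost, in cycles

def vector_efficiency(n: int) -> float:
    """Vector cycles as a fraction of vector cycles plus fixed overhead."""
    vector_cycles = n / WIDTH
    return vector_cycles / (vector_cycles + OVERHEAD)

for n in (100, 10_000, 1_000_000):
    print(n, round(vector_efficiency(n), 3))
```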
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
I'd think you have a point there. I'd think an issue could in part be related to the complexity of the algorithms and problems; it may be a compromise between usability (e.g. highly flexible models and analysis) vs pushing for speed, as after all GPU (or SSE/AVX/AVX2) accelerated molecular dynamics is 'not impossible': http://en.wikipedia.org/wiki/Molecular_modeling_on_GPUs but that functionality may be different, or more difficult to do, compared to what Rosetta can achieve with possibly less modelling code. E.g. it is very easy to vectorize bitcoin mining (every core simply does a different hash), which is naively parallel, vs molecular dynamics, which could be much harder to parallelize as a 'real world problem' (i.e. lots of dependencies & complexities). I'd guess the developers and scientists using Rosetta / Rosetta@home could shed light on that. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 7,594 |
This is a VERY interesting thread, but I'm afraid it's only an "academic discussion"... unless one of the project admins gives us more info, or gives some volunteers the source code to study. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 7,594 |
as after all GPU (or SSE/AVX/AVX2) accelerated molecular dynamics is 'not impossible':
Yeah, we know that all Rosetta projects (Rosetta@home, the Robetta server, Foldit, etc.) share a common "framework", and Rosetta@home crunches different simulations (folding, ab initio, etc.) on top of it. So: optimize the framework, or create "special/optimized" applications on top of it? |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
In the arena of molecular dynamics @ home, there is Folding@home https://en.wikipedia.org/wiki/Folding@home: "Folding@home is one of the world's fastest computing systems, with a speed of approximately 40 petaFLOPS[6]: greater than all projects running on the BOINC distributed computing platform combined." |