Rosetta@home using AVX / AVX2?

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0
Message 87778 - Posted: 1 Dec 2017, 7:19:59 UTC - in response to Message 87773.

> Hi everybody, I just wanted to ask if there are plans to use AVX or AVX2 or possibly even the coming AVX-512 in Rosetta?

> AVX-512 seems not so good: Avx512

The gcc8 -march=skylake-avx512 compiled code doesn't do very well when compared with the gcc8 -march=skylake compiled code. My guess is that -march=skylake turns on full AVX2 and the other new instructions added through Broadwell. They say that the skylake-avx512 option also enables the AVX512F, AVX512VL, AVX512BW, AVX512DQ, and AVX512CD instruction families, but it is unclear whether the benchmark code has any code sequences that can take advantage of those AVX512 instructions beyond AVX512BW. I would expect that the FFT code would, and it saw an 8% increase in performance using AVX512BW.

If the benchmark code happened to be written like Rosetta, I could see the AVX512 code running slower. The compiler can only operate on 8 single-precision or 4 double-precision values in parallel if the code is written to allow it.

For Rosetta, it is a moot point until the project developers see it as a need. The design of Rosetta dictates that it process all the floating-point operations in SCALAR mode rather than in VECTOR mode. I guess I need to download and look at the code sequences again to see if they are doing anything performance related.

Interestingly, there is a World Community Grid project that is using Rosetta code too.
Project Name: Microbiome Immunity Project
wcgrid_mip1_rosetta_7.11_windows_intelx86.exe
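To illustrate the point above that a compiler can only use vector instructions "if the code is written to allow it", here is a minimal sketch contrasting a scalar-style routine with a loop shape that compilers can typically auto-vectorize. The names `Point3`, `add_point` and `add_flat` are hypothetical and are not taken from the Rosetta source; whether a given compiler actually vectorizes the second loop depends on its version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3-component point, standing in for a scalar-style
// coordinate object. This is NOT Rosetta's actual type.
struct Point3 {
    double x, y, z;
};

// Scalar style: one point per call, three independent additions.
inline Point3 add_point(const Point3& a, const Point3& b) {
    return Point3{a.x + b.x, a.y + b.y, a.z + b.z};
}

// Vector-friendly style: one flat loop over contiguous doubles with no
// dependency between iterations, which is the shape auto-vectorizers
// generally look for.
inline std::vector<double> add_flat(const std::vector<double>& a,
                                    const std::vector<double>& b) {
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}
```

Both functions compute the same sums; the difference is only in how much independent, contiguous work is visible to the compiler at once.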

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0
Message 87783 - Posted: 1 Dec 2017, 20:53:29 UTC

My answer to all of this up to this point: FAHCore a7

I don't mean to be rude but most likely I am. Certain programmers don't know what the hell they're doing. This is a badly managed project. I draw my conclusions from that.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 87791 - Posted: 2 Dec 2017, 13:50:32 UTC - in response to Message 87783. Last modified: 2 Dec 2017, 13:53:54 UTC

> This is a badly managed project. I draw my conclusions from that.

During my years of participation (and from talking with rjs5), it seems to me that the development of this large code base by many people from different institutions may have led to "confusion" in the code. rjs5 has read the source code and said that it is like a "jumble" where every developer added what interested him, without strict centralized rules. So it's difficult to optimize, and moreover it seems that they are not even interested in doing so...

I'm curious to see if the new "C++ wave" (conversion of all libraries to this language) will bring an improvement in how the code is written.

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0
Message 87792 - Posted: 2 Dec 2017, 14:29:39 UTC - in response to Message 87791.

Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets... 4-, 6- and even 8-core AVX-capable CPUs will be ubiquitous in a few years with a TDP in the 90-140W range.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 87799 - Posted: 3 Dec 2017, 10:30:23 UTC - in response to Message 87792.

> Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets...

Although I'm crunching (usually on Ralph) with my phone, I agree with you. For example, a Ryzen 1700 (only 65W) will outclass a top-level smartphone by some orders of magnitude.
...not to mention a high-end GPU... :-P

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0
Message 87852 - Posted: 7 Dec 2017, 20:41:32 UTC - in response to Message 87792.

> Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets... 4-, 6- and even 8-core AVX-capable CPUs will be ubiquitous in a few years with a TDP in the 90-140W range.

I would suspect that the cost of supporting phones and tablets is about the same as supporting Windows, Linux or macOS. They just recompile the code, and the execution is limited by scalar floating point and call/return chains. That porting effort is likely independent of the main developers.

Going from scalar to 2 floating-point computations in parallel is likely just a TYPEDEF change (add a 4th dimension) that is sprinkled over the code. That has to be done before ANY reasonable parallel operation can be done. Going from 2-wide to 4-wide parallel operation (AVX 128-bit to AVX 256-bit) would be a recompile and would require the Rosetta server to recognize a subset of machines to steer the 256-bit binary to. Alternatively, some compilers support runtime detection of the machine type and fire up different versions of the compiled code. These "fat binaries" will run on an old CPU that only supports SSE or on a new one with AVX2.

I am not sure how well or poorly the project is managed. There is substantial evidence that comments from this group of crunchers carry little weight ... other than the valuable ... "it is broken" when problems develop.
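The runtime-detection approach mentioned above can be sketched roughly as follows. This assumes GCC or Clang on x86, where the `__builtin_cpu_supports` built-in is available; on other compilers or architectures the sketch simply falls back to the scalar path. `CodePath`, `detect_code_path`, `sum_generic` and `select_sum_kernel` are hypothetical names, and the "kernels" are placeholders: in a real fat binary each entry would be a separately compiled variant of the same routine.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical labels for the code paths a fat binary might carry.
enum class CodePath { Scalar, SSE2, AVX, AVX2 };

// Ask the CPU at run time which instruction sets it supports.
// Only GCC/Clang on x86 provide __builtin_cpu_supports; elsewhere we
// conservatively report the scalar path.
inline CodePath detect_code_path() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return CodePath::AVX2;
    if (__builtin_cpu_supports("avx"))  return CodePath::AVX;
    if (__builtin_cpu_supports("sse2")) return CodePath::SSE2;
#endif
    return CodePath::Scalar;
}

// Placeholder kernel. A real build would compile this routine several
// times with different -march settings and keep all variants.
inline double sum_generic(const double* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

using SumKernel = double (*)(const double*, std::size_t);

// Map the detected code path to a kernel. Every entry points at the same
// placeholder here; that is an assumption made to keep the sketch short.
inline SumKernel select_sum_kernel(CodePath path) {
    switch (path) {
        case CodePath::AVX2:   return &sum_generic;  // would be the AVX2 build
        case CodePath::AVX:    return &sum_generic;  // would be the AVX build
        case CodePath::SSE2:   return &sum_generic;  // would be the SSE2 build
        case CodePath::Scalar: return &sum_generic;
    }
    return &sum_generic;
}
```

The selection happens once at startup, so the project server would not need to steer different binaries to different machines.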

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 89143 - Posted: 25 Jun 2018, 7:37:20 UTC

Some interesting videos about C++ and Rosetta (from 4 years ago): RosettaAccademy

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 89480 - Posted: 3 Sep 2018, 8:31:45 UTC

A very interesting conference (plus some courses) about C++: CppCon

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 90316 - Posted: 8 Feb 2019, 7:28:46 UTC

Optimization of the code.
In the Acoustics@Home project, a volunteer optimized the code, and these are the results:

Here are results from Xeon E5-2683 v3 (Haswell):
Original            39,788
SSE2                21,283
SSE4.1              20,943
AVX                 19,658
AVX2                19,043

Xeon W-2102:
Original-Linux      51,988
Original-Windows    46,707
SSE2                30,681
SSE4.1              28,680
AVX                 24,659
AVX2                23,882
AVX512              21,569

Ryzen 2700x running 64-bit Linux, task completion times:
project app:        3250 seconds
sse2 opt app:       ~1380 seconds
AVX opt app:        ~1080 seconds
AVX2 opt app:       ~1080 seconds

But the R@H guys are not interested...
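To put the Acoustics@Home figures quoted above into perspective, the speedup of each optimized build can be computed as the original figure divided by the optimized one. This is a small sketch that assumes all the quoted figures are run times where lower is better (the Xeon figures are given without units in the post); `speedup` is a hypothetical helper name.

```cpp
#include <cassert>

// Speedup of an optimized build relative to the original build,
// assuming both figures are run times (lower is better).
inline double speedup(double original_time, double optimized_time) {
    assert(optimized_time > 0.0);  // guard against division by zero
    return original_time / optimized_time;
}
```

Under that assumption, `speedup(39788, 19043)` gives roughly 2.1x for AVX2 on the Haswell Xeon, `speedup(51988, 21569)` roughly 2.4x for AVX512 on the Xeon W-2102 (against the original Linux build), and `speedup(3250, 1080)` roughly 3.0x for AVX on the Ryzen 2700x.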

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0
Message 90321 - Posted: 8 Feb 2019, 21:47:50 UTC - in response to Message 90316.

My Skylake-X system with AVX-512 support is scheduled to be delivered today. I think I might download the Rosetta source and see what the developers have done to the code in the last year. 8-)

Intel Core X (Socket 2066) w/ Core i9 9980XE (18 Core 36 Thread @ 4.5GHz)
Asus ROG-Strix-X299-E
XL Liquid Cooling Package
32GB DDR4-2666 (2x16GB Kit)
1TB Solid State Drive (SSD) - M.2 - NVMe
Antec Silent Mid-Tower Case
850W - Power Supply w/ Active PFC
DVD +/- RW Drive - Internal (SATA)

> Optimization of the code.
> In the Acoustics@Home project, a volunteer optimized the code, and these are the results:
>
> Here are results from Xeon E5-2683 v3 (Haswell):
> Original            39,788
> SSE2                21,283
> SSE4.1              20,943
> AVX                 19,658
> AVX2                19,043
>
> Xeon W-2102:
> Original-Linux      51,988
> Original-Windows    46,707
> SSE2                30,681
> SSE4.1              28,680
> AVX                 24,659
> AVX2                23,882
> AVX512              21,569
>
> Ryzen 2700x running 64-bit Linux, task completion times:
> project app:        3250 seconds
> sse2 opt app:       ~1380 seconds
> AVX opt app:        ~1080 seconds
> AVX2 opt app:       ~1080 seconds
>
> But the R@H guys are not interested...

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 90333 - Posted: 10 Feb 2019, 16:30:51 UTC - in response to Message 90321.

> I think I might download the Rosetta source and see what the developers have done to the code in the last year. 8-)

I'm not so optimistic. The latest version of the R@H code (4.07) is almost one year old...

P.S. Great PC!!!

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0
Message 90426 - Posted: 25 Feb 2019, 19:29:33 UTC - in response to Message 90333. Last modified: 25 Feb 2019, 20:01:53 UTC

> I think I might download the Rosetta source and see what the developers have done to the code in the last year. 8-)

> I'm not so optimistic. The latest version of the R@H code (4.07) is almost one year old...
> P.S. Great PC!!!

The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing. When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance. I am just going to try to get some "empirical" data that might interest the developers enough to introduce similar changes.

BTW, the new machine is still accumulating Rosetta RAC; it is around 21,800 and topping out. Top machine #34. With the liquid cooler, it is running at about 65 degrees C. The only BIOS change I made was to tell the CPU not to exceed 80 degrees, but the Linux tools say that the CPU is running at 3.8 GHz. I am running predominantly Rosetta with a random WCG "Help TB" thrown in and GPU WUs too.
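The change described above, padding the "vector" object from 3 FP numbers to 4, might look something like the following sketch. `Xyz3` and `Xyz4` are hypothetical stand-ins and are not Rosetta's real classes; the 32-byte alignment and the use of `double` are assumptions chosen so that one padded vector fills exactly one 256-bit register or two 128-bit ones.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; Rosetta's real coordinate class is not shown here.
struct Xyz3 {
    double v[3];  // x, y, z
};

struct alignas(32) Xyz4 {
    double v[4];  // x, y, z, plus one padding lane kept at zero
};

static_assert(sizeof(Xyz3) == 3 * sizeof(double), "three packed doubles");
static_assert(sizeof(Xyz4) == 4 * sizeof(double), "four packed doubles");
static_assert(alignof(Xyz4) == 32, "aligned for 256-bit loads");

// Widen a 3-component vector to the padded 4-component layout.
inline Xyz4 pad(const Xyz3& a) {
    return Xyz4{{a.v[0], a.v[1], a.v[2], 0.0}};
}

// Fixed-length loop over all four lanes. The padding lane is computed too,
// which is what lets a compiler cover the vector with whole SIMD registers.
inline Xyz4 add(const Xyz4& a, const Xyz4& b) {
    Xyz4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out.v[i] = a.v[i] + b.v[i];
    }
    return out;
}

// Read one of the three meaningful components; lane 3 is padding only.
inline double component(const Xyz4& a, std::size_t i) {
    assert(i < 3);
    return a.v[i];
}
```

The cost of this layout is a third more memory per vector, which is part of what any empirical measurement would have to weigh against the reduced instruction count.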

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0
Message 90428 - Posted: 25 Feb 2019, 21:06:06 UTC - in response to Message 79322.

Vote with your feet

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 90432 - Posted: 26 Feb 2019, 9:24:46 UTC - in response to Message 90426.

> The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing.

When I said I'm not optimistic, I didn't mean about you. I know you have the knowledge to work on the optimization side.
You said "what the developers have done to the code"... I think that, during this year, the R@H developers have not worked much on the code (because the exe is still the same).

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0
Message 90433 - Posted: 26 Feb 2019, 9:33:11 UTC - in response to Message 90426.

> When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance. I am just going to try to get some "empirical" data that might interest the developers enough to introduce similar changes.

My understanding is that AVX-512 is not likely to be widely adopted by Intel on future chips, due to space and power requirements. It seems to have been a special case for the Skylake generation.

So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, so it should be widely available before long. Good luck.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 90434 - Posted: 26 Feb 2019, 9:44:40 UTC - in response to Message 90426.

> The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing. When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance.

An example of your knowledge: you noticed, some time ago, that they are using a very old version of the GCC compiler.
Are you using the latest version? Are THEY using the latest version??

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 90435 - Posted: 26 Feb 2019, 10:40:33 UTC - in response to Message 90433.

> So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, so it should be widely available before long. Good luck.

+1
AVX seems to be enough.
An example: Acustics

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0
Message 90439 - Posted: 26 Feb 2019, 16:08:48 UTC - in response to Message 90435.

> So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, so it should be widely available before long. Good luck.

> +1
> AVX seems to be enough.
> An example: Acustics

Interesting link.

Since Rosetta works only on 3-dimensional vectors (three FP values each), it currently performs 3 loads, 3 operations and then 3 stores. Changing to 4 dimensions will have SSE2 do 2 loads, 2 operations and 2 stores. AVX-512 would not make much difference, and few could run the binary.

I think from Skylake forward, Intel was looking at not stalling on software prefetches. If the code issued a software prefetch and all the read/write buffers were busy, the software prefetch was not executed.

32-bit has a smaller code footprint and a smaller data footprint, and therefore makes the on-chip caches more effective. IMO, a 32-bit code version with a 4-dimensional vector, so that SSE2 does only 2 operations instead of 3, would probably be the fastest.

Rosetta probably measures performance running one copy on one of their servers with large caches. The optimizations they chose bloat the runtime size, and running multiple Rosetta binaries stresses the hardware and slows all copies down. They over-tune the binary using a single execution.
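The operation counts in the post above follow from simple arithmetic on register width. The sketch below works them out under the assumption that the values are 64-bit doubles (consistent with the earlier remark about using "both halves" of a 128-bit SSE2 register); `lanes` and `ops_needed` are hypothetical helper names.

```cpp
#include <cstddef>

// Number of elements that fit into one SIMD register.
constexpr std::size_t lanes(std::size_t register_bits,
                            std::size_t element_bits) {
    return register_bits / element_bits;
}

// Number of register-wide operations needed to cover `elements` values,
// rounding up when the last register is only partly used.
constexpr std::size_t ops_needed(std::size_t elements,
                                 std::size_t lanes_per_op) {
    return (elements + lanes_per_op - 1) / lanes_per_op;
}

// 3-component vector handled one value at a time: 3 operations.
static_assert(ops_needed(3, 1) == 3, "scalar xyz");
// Padded 4-component vector in 128-bit SSE2 registers (2 doubles each): 2.
static_assert(ops_needed(4, lanes(128, 64)) == 2, "SSE2");
// The same vector in one 256-bit AVX register (4 doubles): 1.
static_assert(ops_needed(4, lanes(256, 64)) == 1, "AVX");
// A 512-bit register (8 doubles) still needs 1, with half the lanes idle.
static_assert(ops_needed(4, lanes(512, 64)) == 1, "AVX-512");
```

The last two lines show why, for a single padded 4-component vector, going from 256-bit to 512-bit registers does not reduce the operation count any further.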

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2199 Credit: 13,720,774 RAC: 73
Message 90609 - Posted: 4 Apr 2019, 7:14:26 UTC - in response to Message 90439.

> Since Rosetta works only on 3-dimensional vectors (three FP values each), it currently performs 3 loads, 3 operations and then 3 stores. Changing to 4 dimensions will have SSE2 do 2 loads, 2 operations and 2 stores.

It seems that they have some problems with 3-dimensional vectors:

60692 Make the "Cannot normalize xyzVector of length() zero" error more informative.
The "Cannot normalize xyzVector of length() zero" error is a pain in the neck, because it's hard to know exactly what vector was tripping it up, and in what context. @everyday847 had the idea a while back of adding more try/catch blocks around calls to xyzVector::normalize() which would themselves re-throw after adding more information about the context to the error message, to aid debugging. This is a first pass at that.
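The try/catch-and-re-throw idea described in that change note can be sketched as follows. This is only an illustration of the general pattern, not the code from the actual Rosetta change: `Vec3`, `normalized` and `bond_direction` are hypothetical names, and the exception type and message format are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <string>

// Hypothetical 3-component vector; not Rosetta's xyzVector.
struct Vec3 {
    double x, y, z;
};

// Low-level routine: knows that the vector is degenerate, but not why.
inline Vec3 normalized(const Vec3& v) {
    const double len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    if (len == 0.0) {
        throw std::runtime_error("Cannot normalize vector of length zero");
    }
    return Vec3{v.x / len, v.y / len, v.z / len};
}

// Caller: catches the low-level error and re-throws it with a description
// of what was being computed, so the final message identifies the context.
inline Vec3 bond_direction(const Vec3& from, const Vec3& to,
                           const std::string& context) {
    assert(!context.empty());  // callers are expected to describe the call site
    try {
        return normalized(Vec3{to.x - from.x, to.y - from.y, to.z - from.z});
    } catch (const std::runtime_error& e) {
        throw std::runtime_error(std::string(e.what()) +
                                 " (while computing " + context + ")");
    }
}
```

A call such as `bond_direction(a, a, "bond direction for atom pair 1-2")` would then fail with a message that names the atom pair instead of only reporting a zero-length vector.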

G.L.I.S. Joined: 25 Dec 08 Posts: 26 Credit: 3,067,073 RAC: 0
Message 90698 - Posted: 20 Apr 2019, 22:37:45 UTC - in response to Message 90609.

Without wishing to flame or sound rude, we regret to see projects in 2019 that do not exploit the potential of modern processors.
I don't necessarily mean AVX; any form of SIMD would mainly benefit the project itself.

Best regards