Rosetta@home using AVX / AVX2 ?

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 5,361
Message 87778 - Posted: 1 Dec 2017, 7:19:59 UTC - in response to Message 87773.  

Hi everybody,
I just wanted to ask if there are plans to use AVX or AVX2 or possibly even the coming AVX-512 in Rosetta?


AVX-512 seems not so good
AVX512


The gcc8 -march=skylake-avx512 compiled codes don't do very well when compared with the gcc8 -march=skylake compiled codes.

My guess is that the -march=skylake codes turn on full AVX2 and other new instructions added through Broadwell. They say that the skylake-avx512 option also enables the AVX512F, AVX512VL, AVX512BW, AVX512DQ, and AVX512CD instruction families, but it is unclear whether the benchmark code has any code sequences that can take advantage of those AVX512 instructions beyond AVX512BW. I would expect that the FFT code would and it saw an 8% increase in performance using AVX512BW.
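
One way to check what a given -march setting actually turned on is to look at the macros the compiler predefines. The following is a minimal sketch, assuming GCC or Clang targeting x86; the macro names are the ones those compilers define, while the function name `enabled_isa_extensions` is made up for illustration.

```cpp
#include <string>
#include <vector>

// Returns the names of the SIMD extensions this translation unit was
// compiled for, based on macros predefined by GCC/Clang on x86.
// On other compilers or targets the list may simply be empty.
std::vector<std::string> enabled_isa_extensions() {
    std::vector<std::string> isa;
#ifdef __SSE2__
    isa.push_back("SSE2");
#endif
#ifdef __AVX__
    isa.push_back("AVX");
#endif
#ifdef __AVX2__
    isa.push_back("AVX2");
#endif
#ifdef __FMA__
    isa.push_back("FMA");
#endif
#ifdef __AVX512F__
    isa.push_back("AVX512F");
#endif
#ifdef __AVX512VL__
    isa.push_back("AVX512VL");
#endif
#ifdef __AVX512BW__
    isa.push_back("AVX512BW");
#endif
#ifdef __AVX512DQ__
    isa.push_back("AVX512DQ");
#endif
#ifdef __AVX512CD__
    isa.push_back("AVX512CD");
#endif
    return isa;
}
```

Built with -march=skylake the list should stop at AVX2/FMA, and with -march=skylake-avx512 it should also contain the five AVX512 entries. Having the macros defined only means the compiler is allowed to use those instructions, not that it found any place to use them.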

If the benchmark code happened to be written like Rosetta, I could see the AVX512 code running slower. The compiler can only operate on 8 single precision or 4 double precision values in parallel (the width of a 256-bit AVX register) if the code is written to allow it. For Rosetta, it is a moot point until the project developers see a need for it. The design of Rosetta dictates that it processes all floating point operations in SCALAR mode rather than in VECTOR mode.
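
To illustrate the scalar-versus-vector point, here is a rough sketch that is not taken from the Rosetta source; the names `dot3` and `scale_add` are made up. The first function does one tiny 3-element computation per call, which a compiler will usually leave as scalar multiplies and adds. The second walks long contiguous arrays, which is the kind of loop compilers can typically turn into wide AVX operations when optimization is enabled.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One 3-element dot product per call: too little work per call to fill
// a wide register, so this usually stays scalar.
double dot3(const double a[3], const double b[3]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// The same kind of multiply-add applied across long contiguous arrays.
// Each iteration is independent, so an optimizing compiler can process
// several elements per instruction (whether it does depends on flags).
void scale_add(const std::vector<double>& x, const std::vector<double>& y,
               double k, std::vector<double>& out) {
    const std::size_t n = std::min(x.size(), y.size());
    out.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = k * x[i] + y[i];
    }
}
```

Code built mostly from calls like `dot3` gives the compiler very little to vectorize, no matter which -march is selected.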

I guess I need to download and look at the code sequences again to see if they are doing anything performance related.

Interestingly, there is a World Community Grid project that is using Rosetta code too.
Project Name: Microbiome Immunity Project
wcgrid_mip1_rosetta_7.11_windows_intelx86.exe
ID: 87778 · Rating: 0
Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 87783 - Posted: 1 Dec 2017, 20:53:29 UTC

My answer to all of this up to this point: FAHCore a7

I don't mean to be rude, but most likely I am. Certain programmers don't know what the hell they're doing.

This is a badly managed project. I draw my conclusions from that.
ID: 87783 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 87791 - Posted: 2 Dec 2017, 13:50:32 UTC - in response to Message 87783.  
Last modified: 2 Dec 2017, 13:53:54 UTC

This is a badly managed project. I draw my conclusions from that.


During my years of participation (and from talking with rjs5), it seems to me that the development of this large code base by many people from different institutions may have led to "confusion" in the code. Rjs5 has read the source code and said that it is like a "jumble" where every developer added whatever interested them, without strict centralized rules.
So it's difficult to optimize, and moreover it seems that they are not even interested in doing so....
I'm curious to see whether the new "C++ wave" (conversion of all libraries to this language) will bring an improvement in how the code is written.
ID: 87791 · Rating: 0
Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 87792 - Posted: 2 Dec 2017, 14:29:39 UTC - in response to Message 87791.  

Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets...

4, 6 and even 8-core AVX-capable CPUs will be ubiquitous in a few years with a TDP in the 90-140W range.
ID: 87792 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 87799 - Posted: 3 Dec 2017, 10:30:23 UTC - in response to Message 87792.  

Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets...


Even though I'm crunching (usually on Ralph) with my phone, I agree with you.
For example, a Ryzen 1700 (only 65W) will outclass a top-level smartphone by some orders of magnitude.

....not to mention a high-end GPU.... :-P
ID: 87799 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 5,361
Message 87852 - Posted: 7 Dec 2017, 20:41:32 UTC - in response to Message 87792.  

Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets...

4, 6 and even 8-core AVX-capable CPUs will be ubiquitous in a few years with a TDP in the 90-140W range.


I would suspect that the cost of supporting phones and tablets is about the same as supporting Windows, Linux or macOS. They just recompile the code and the execution is limited by scalar floating point and call/return chains. That porting effort is likely independent of the main developers.

Going from scalar to 2 floating point computations in parallel is likely just a TYPEDEF change (adding a 4th dimension) that is sprinkled over the code. That has to be done before ANY reasonable parallel operation can be done.
Going from 2-wide to 4-wide parallel operation (AVX 128-bit to AVX 256-bit) would be a recompile, and would require the Rosetta server to recognize the subset of machines to steer the 256-bit binary to.
Alternatively, some compilers support runtime detection of the machine type and fire up different versions of the compiled code. These "fat binaries" will run on an old CPU that only supports SSE or on a new one that supports AVX2.
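
The runtime-detection idea can be sketched roughly as follows. This assumes GCC or Clang on x86, where `__builtin_cpu_init` and `__builtin_cpu_supports` are available; the function name `pick_kernel` and the returned labels are made up for illustration, and a real fat binary would select a function pointer rather than a string.

```cpp
#include <string>

// Picks a code path name based on what the CPU running the binary
// supports. __builtin_cpu_supports is a GCC/Clang builtin for x86;
// on other compilers or targets this sketch falls back to "baseline".
std::string pick_kernel() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // populate the CPU feature data before querying
    if (__builtin_cpu_supports("avx2")) return "avx2";
    if (__builtin_cpu_supports("avx"))  return "avx";
    if (__builtin_cpu_supports("sse2")) return "sse2";
#endif
    return "baseline";
}
```

The check runs once at startup, so a single download can serve both old and new machines without the server having to know which is which. GCC also documents a `target_clones` function attribute that generates the per-ISA versions and the dispatch automatically, which may be less work than doing it by hand.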

I am not sure how well or poorly the project is managed. There is substantial evidence that comments from this group of crunchers carry little weight ... other than the valuable ... "it is broken" when problems develop.
ID: 87852 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 89143 - Posted: 25 Jun 2018, 7:37:20 UTC

Some interesting videos about C++ and Rosetta (from 4 years ago)

RosettaAccademy
ID: 89143 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 89480 - Posted: 3 Sep 2018, 8:31:45 UTC

Very interesting conference (plus some courses) about C++
CppCon
ID: 89480 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 90316 - Posted: 8 Feb 2019, 7:28:46 UTC

Optimization of the code.
In the Acoustics@Home project, a volunteer optimized the code, and these are the results:
Here are results from

Xeon E5-2683 v3 (Haswell):
Original 39,788
SSE2 21,283
SSE4.1 20,943
AVX 19,658
AVX2 19,043

Xeon W-2102
Original-Linux 51,988
Original-Windows 46,707
SSE2 30,681
SSE4.1 28,680
AVX 24,659
AVX2 23,882
AVX512 21,569

Ryzen 2700x running 64-bit linux:
Task completion times:
project app: 3250 seconds
sse2 opt app: ~1380 seconds
AVX opt app: ~1080 seconds
AVX2 opt app: ~1080 seconds


But R@H guys are not interested...
ID: 90316 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 5,361
Message 90321 - Posted: 8 Feb 2019, 21:47:50 UTC - in response to Message 90316.  

My Skylake-X system with AVX-512 support is scheduled to be delivered today. I think I might download the rosetta source and see what the developers have done to the code in the last year. 8-)

Intel Core X (Socket 2066)
w/ Core i9 9980XE (18 Core 36 Thread @ 4.5GHz)
Asus ROG-Strix-X299-E
XL Liquid Cooling Package
32GB DDR4-2666 (2x16GB Kit)
1TB Solid State Drive (SSD) - M.2 - NVMe
Antec Silent Mid-Tower Case
850W - Power Supply w/ Active PFC
DVD +/- RW Drive - Internal (SATA)


ID: 90321 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 90333 - Posted: 10 Feb 2019, 16:30:51 UTC - in response to Message 90321.  

I think I might download the rosetta source and see what the developers have done to the code in the last year. 8-)


I'm not so optimistic. The latest version of the R@H code (4.07) is almost one year old...

P.S. Great pc!!!
ID: 90333 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 5,361
Message 90426 - Posted: 25 Feb 2019, 19:29:33 UTC - in response to Message 90333.  
Last modified: 25 Feb 2019, 20:01:53 UTC

I think I might download the rosetta source and see what the developers have done to the code in the last year. 8-)


I'm not so optimistic. The latest version of the R@H code (4.07) is almost one year old...

P.S. Great pc!!!


The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing. When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance.
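
The 3-to-4 element change described above could look roughly like this. It is only a sketch with made-up types (`Vec3`, `Vec4`, `make_vec4`, `add`), not Rosetta's actual vector class, and it assumes double precision.

```cpp
#include <cstddef>

// Three packed doubles: 24 bytes, so consecutive vectors in an array
// do not line up with 16- or 32-byte SIMD register boundaries.
struct Vec3 {
    double x, y, z;
};

// Same data plus one unused lane, aligned to 32 bytes. The four doubles
// fill two 128-bit SSE2 registers or one 256-bit AVX register.
struct alignas(32) Vec4 {
    double v[4];  // v[3] is padding and is kept at 0.0
};

inline Vec4 make_vec4(double x, double y, double z) {
    return Vec4{{x, y, z, 0.0}};
}

// Element-wise add over all four lanes. The fixed trip count of 4 lets
// the compiler emit packed adds instead of three scalar ones.
inline Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) {
        r.v[i] = a.v[i] + b.v[i];
    }
    return r;
}

static_assert(sizeof(Vec3) == 24, "three packed doubles");
static_assert(sizeof(Vec4) == 32, "four doubles in one aligned block");
```

The trade-off is a third more memory per vector, which matters for cache footprint, so it needs to be measured rather than assumed to be a win.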

I am just going to try to get some "empirical" data that might interest the developers to introduce similar changes.

BTW, the new machine is still accumulating Rosetta RAC and is around 21,800 and topping out. Top machine #34. With the liquid cooler, it is running about 65 degrees C. The only BIOS change I made was to tell the CPU to not exceed 80 degrees, but Linux tools say that the CPU is running at 3.8 GHz.
I am running predominantly Rosetta with a random WCG "Help TB" thrown in and GPU WUs too.
ID: 90426 · Rating: 0
Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 90428 - Posted: 25 Feb 2019, 21:06:06 UTC - in response to Message 79322.  

Vote with your feet
ID: 90428 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 90432 - Posted: 26 Feb 2019, 9:24:46 UTC - in response to Message 90426.  

The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing.


When I said I'm not optimistic, I didn't mean about you. I know you have the knowledge to work on the optimization side.
You said "what developers have done to the code"... I think that, during this year, the R@H developers have not worked much on the code (because the exe is still the same).
ID: 90432 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 90433 - Posted: 26 Feb 2019, 9:33:11 UTC - in response to Message 90426.  

When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance.

I am just going to try to get some "empirical" data that might interest the developers to introduce similar changes.

My understanding is that AVX-512 is not likely to be widely adopted by Intel on future chips, due to space and power requirements. It seems to have been a special case for the Skylake generation.
So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, and so should be widely available before long. Good luck.
ID: 90433 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 90434 - Posted: 26 Feb 2019, 9:44:40 UTC - in response to Message 90426.  

The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing. When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance.


An example of your knowledge.
You noticed, some time ago, that they are using a very old version of the GCC compiler.
Are you using the latest version? Are THEY using the latest version??
ID: 90434 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 90435 - Posted: 26 Feb 2019, 10:40:33 UTC - in response to Message 90433.  

So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, and so should be widely available before long. Good luck.


+1
AVX seems to be enough
An example: Acoustics
ID: 90435 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 5,361
Message 90439 - Posted: 26 Feb 2019, 16:08:48 UTC - in response to Message 90435.  

So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, and so should be widely available before long. Good luck.


+1
AVX seems to be enough
An example: Acoustics


Interesting link.
Since Rosetta is working on only 3-dimensional arrays, it currently performs 3-loads, 3-operations and then 3-stores.
Changing to 4-dimensions will have SSE2 do 2-loads, 2-operations and 2-stores.
AVX-512 would not make much difference and few could run the binary.
I think on Skylake forward, Intel was looking at not stalling on software prefetches. If the code issued a software prefetch and all the read/write buffers were busy, the software prefetch was not executed.
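
The load/operation/store counts above can be made concrete with a small sketch. It uses the GCC/Clang `vector_size` extension (not standard C++) to stand in for one 128-bit SSE2 register holding two doubles; the names `v2d` and `add4` are made up, and the inputs are assumed to be padded to 4 doubles.

```cpp
#include <cstring>

// GCC/Clang vector extension: a pack of two doubles, the width of one
// SSE2 register. This is a compiler extension, not standard C++.
typedef double v2d __attribute__((vector_size(16)));

// Adds two padded 4-element vectors. Each operand needs two 128-bit
// loads, followed by two packed adds and two stores, versus three
// scalar loads per operand, three adds and three stores for the
// unpadded 3-element case.
void add4(const double* a, const double* b, double* out) {
    v2d a_lo, a_hi, b_lo, b_hi;
    std::memcpy(&a_lo, a,     sizeof a_lo);   // lanes 0-1 of a
    std::memcpy(&a_hi, a + 2, sizeof a_hi);   // lanes 2-3 of a
    std::memcpy(&b_lo, b,     sizeof b_lo);   // lanes 0-1 of b
    std::memcpy(&b_hi, b + 2, sizeof b_hi);   // lanes 2-3 of b
    const v2d r_lo = a_lo + b_lo;             // packed add 1
    const v2d r_hi = a_hi + b_hi;             // packed add 2
    std::memcpy(out,     &r_lo, sizeof r_lo); // store 1
    std::memcpy(out + 2, &r_hi, sizeof r_hi); // store 2
}
```

With a 256-bit AVX register the same four doubles would fit in a single load, add and store, which is why going wider than that buys little for vectors this short.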

32-bit has a smaller code footprint and smaller data footprint and therefore makes the on-chip caches more effective.

IMO, a 32-bit code version with a 4-dimensional vector, so that SSE2 does only 2 operations instead of 3, would probably be the fastest. Rosetta probably measures performance running one copy on one of their servers with large caches. The optimizations they chose bloat the runtime size, and running multiple Rosetta binaries stresses the hardware and slows all copies down. They over-tune the binary using a single running copy.
ID: 90439 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 90609 - Posted: 4 Apr 2019, 7:14:26 UTC - in response to Message 90439.  

Since Rosetta is working on only 3-dimensional arrays, it currently performs 3-loads, 3-operations and then 3-stores.
Changing to 4-dimensions will have SSE2 do 2-loads, 2-operations and 2-stores.


It seems that they have some problems with 3-dimensional arrays: 60692

Make the "Cannot normalize xyzVector of length() zero" error more informative. The "Cannot normalize xyzVector of length() zero" error is a pain in the neck, because it's hard to know exactly what vector was tripping it up, and in what context. @everyday847 had the idea a while back of adding more try/catch blocks around calls to xyzVector::normalize() which would themselves re-throw after adding more information about the context to the error message, to aid debugging. This is a first pass at that.
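
The try/catch-and-re-throw approach described in that change note might look something like the sketch below. `Vec3d` and `normalize_with_context` are hypothetical stand-ins, not Rosetta's xyzVector or its actual error handling.

```cpp
#include <cmath>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a 3-element vector class.
struct Vec3d {
    double x, y, z;

    double length() const { return std::sqrt(x * x + y * y + z * z); }

    // Scales the vector to unit length; throws if the length is zero.
    void normalize() {
        const double len = length();
        if (len == 0.0) {
            throw std::runtime_error("Cannot normalize vector of length() zero");
        }
        x /= len;
        y /= len;
        z /= len;
    }
};

// Calls normalize() and, on failure, re-throws with a description of
// what the caller was doing appended to the original message.
void normalize_with_context(Vec3d& v, const std::string& context) {
    try {
        v.normalize();
    } catch (const std::runtime_error& e) {
        throw std::runtime_error(std::string(e.what()) + " (while " + context + ")");
    }
}
```

A caller would pass something like "computing bond direction" as the context, so the message says where the zero-length vector came from instead of only that one existed.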

ID: 90609 · Rating: 0
G.L.I.S.
Joined: 25 Dec 08
Posts: 26
Credit: 2,304,594
RAC: 5,216
Message 90698 - Posted: 20 Apr 2019, 22:37:45 UTC - in response to Message 90609.  

Without wishing to flame or sound rude, we regret to see projects in 2019 that do not exploit the potential of modern processors.
I don't necessarily mean AVX, but some form of SIMD; it would mainly benefit the project itself.

Best regards
ID: 90698 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org