Message boards : Number crunching : Rosetta v4.08 x86_64-pc-linux-gnu or Rosetta v4.07 i686-pc-linux-gnu
Author | Message |
---|---|
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 7,218 |
Minirosetta and Rosetta v4.07 i686-pc-linux-gnu are 32-bit binaries. I ran objdump on the binaries and the only difference is the "i686" and "x86_64" strings. One binary is 2 bytes larger. Rosetta v4.08 x86_64-pc-linux-gnu is a 64-bit binary. I was downloading Minirosetta and Rosetta 4.08 (64-bit) WUs, and then my machine stopped downloading the 4.08 binaries and started downloading Rosetta 4.07 (32-bit) WUs. Does the researcher specify Rosetta 4.07 or 4.08 ... or does the project code run some detection test on my machine to determine what is supported? |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
"Does the researcher specify Rosetta 4.07 or 4.08 ... or does the project code run some detection test on my machine to determine what is supported?" The server will send you both to see which one performs better. After 10 valid WUs for each application, most WUs will be assigned to the faster application and only very few to the slower one (just to check that it is still slower). |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 7,218 |
"Does the researcher specify Rosetta 4.07 or 4.08 ... or does the project code run some detection test on my machine to determine what is supported?" Thanks. Do you know how I can get it to run the benchmark process and choose again, OR explicitly override it to use 64-bit? It made the wrong choice because I was messing with the new machine and changing settings that affected the runs. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Well, you can't re-run the benchmark, since the benchmark is simply all of your completed WUs. Whatever you did will sort itself out with time. You can force 64-bit only with <no_alt_platform> in cc_config.xml, but I would hold off on that and see whether the 64-bit application really is faster than the 32-bit one; that's not always the case. Just watch the GFLOPS values. If it really is faster, the server will send most WUs to that application anyway, so there is no real need to do anything. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 7,218 |
"Well, you can't re-run the benchmark, since the benchmark is simply all of your completed WUs. Whatever you did will sort itself out with time." I was surprised to see that Rosetta 4.08 is the only 64-bit binary for Linux. Both copies of Minirosetta are 32-bit. The 32-bit binaries end up with a smaller code footprint and pass parameters on the stack. The parameter list quickly spills to the stack, but the parameters will be in the L1 cache, which has a 1-cycle access time. A 64-bit version will always be faster if compiled properly, unless you aggressively inline functions. The larger code footprint is causing front-end icache miss stalls. I want to see how hard it is to modify the source to redefine their 3-dimensional "vector" object as a 4-dimensional vector so they can use packed SSE or AVX. Right now all Rosetta computation uses scalar operations. Thanks again. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 8,387 |
"I was surprised to see that Rosetta 4.08 is the only 64-bit binary for Linux. Both copies of Minirosetta are 32-bit." I noted this (thanks to you) almost 2 years ago. Nothing has changed. Today 98% of Windows hosts run a 64-bit version. "A 64-bit version will always be faster if compiled properly, unless you aggressively inline functions. The larger code footprint is causing front-end icache miss stalls." You are doing what, in Italy, we call "opera meritoria" (something like "meritorious work"). But sometimes, here at Rosetta@Home, it seems like being Don Quixote of La Mancha, "tilting at windmills". |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 7,218 |
"I was surprised to see that Rosetta 4.08 is the only 64-bit binary for Linux. Both copies of Minirosetta are 32-bit." minirosetta_3.78_x86_64-pc-linux-gnu is a 32-bit binary even though the name implies it is 64-bit. It seems curious that they would build and deploy TWO binaries that differ only in name. The only difference between the 2 binaries is a different text name inside the binary. I wonder if someone goofed on the compile options. It is beginning to appear that BOINC is becoming too successful and the projects are having a hard time utilizing the compute power. The server infrastructure is creaking under the pressure. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"It is beginning to appear that BOINC is becoming too successful and the projects are having a hard time utilizing the compute power. The server infrastructure is creaking under the pressure." That is my observation too, especially with GPU projects. But any number of CPU projects are having a hard time getting the work out the door, or returned again as the case may be. That may be one reason why the Rosetta people are not chomping at the bit for code optimizations; they can barely handle what they have already. It is causing me to think twice about my own upgrade plans. The Ryzen 3000 looks very nice, but maybe I will wait for the 4000; similarly when upgrading my GPUs. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 8,387 |
"That may be one reason why the Rosetta people are not chomping at the bit for code optimizations; they can barely handle what they have already." I don't know whether their infrastructure is stressed or not, but it seems strange to me. They changed the servers less than 2 years ago. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"I don't know whether their infrastructure is stressed or not, but it seems strange to me." Their servers may be OK, but there is more to it than that. They need the work generators and, more importantly, the scientists. There may be times when they just run out of work to do. There is nothing wrong with that. They don't exist to provide us with work; we are here to help them. But it is just another illustration that there may be more crunchers than needed at any given moment. We don't really know where the limitations are, by the way. They don't bother to tell us. So one speculation is as good as another. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 8,387 |
"We don't really know where the limitations are, by the way. They don't bother to tell us. So one speculation is as good as another." I agree with you. Maybe it's a problem of work generation, or something else. And I don't like this project's lack of communication. Returning to the topic of the thread: why not create a native 64-bit app? From reading rjs5, it seems this is not SO difficult. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 7,218 |
I frequently find myself thinking ... "why did the project devote time to that instead of ...?" Maybe it makes sense if you have a better view of what they are doing. Maybe not. 8-) LACK OF COMMUNICATION ... It seems like all their moderators are closely related to the project. It should be fairly easy to recruit some volunteer MODERATORS or people who agree to handle many of the routine comments on the boards. Some of these volunteers could even generate some extra financial support for the project. Many US companies have employee benefits that match cash donations to schools and charities (like U of Washington). Many of these companies will also match volunteer time with cash, called a "Matching Volunteer Grant". I retired from a company that extends these benefits to retirees. In theory, I could submit the number of hours that I contribute to Rosetta for a "matching gift". My company matches volunteer time at a rate of $10/hour. A $10 to $20 hourly match rate is common. Apple currently matches $50 per volunteer hour. https://doublethedonation.com/matching-gifts/apple-inc APPLICATION BINARIES ... It seems to me the biggest problem is how the Rosetta Project "spends" its limited human resources. It may be a problem of matching their "people skill sets" with the "development wish list" and with the available time of those people. Their efforts to "hyper optimize" the binary by pulling functions "inline" are based on running 1 copy on a large, idle machine. The result is "sub optimized" performance when running 2 or more WUs on a machine, which strains a critical resource .. like the instruction cache. I am running 36 copies on a machine and the negative impact of inlining functions is pretty obvious. Climateprediction@home has WUs that run for hundreds of hours, but they "checkpoint" and trickle up the results 12 times during the run. You get partial credit even if the WU aborts deep into execution. That seems like an execution model Rosetta could consider. Rosetta chops up a long-running WU and broadcasts the pieces to many machines. If they ran multiple pieces of that one WU on the same machine in parallel, you would only need 1 database file to share and the overall size of the execution footprint would be smaller. Etc, etc, ... "We don't really know where the limitations are, by the way. They don't bother to tell us. So one speculation is as good as another." |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
"Well, you can't re-run the benchmark, since the benchmark is simply all of your completed WUs. Whatever you did will sort itself out with time." I have nothing to add, but I started reading your post and when it started going over my head I went "yo this dude knows his stuff...", then I saw the username and said "of course, it's rjs5" lol |
dcdc Joined: 3 Nov 05 Posts: 1831 Credit: 119,627,225 RAC: 10,243 |
My guess is that the researchers are funded to work on specific research outcomes, rather than the infrastructure (in this case, BOINC), so BOINC/R@H falls between the gaps, with no one having any allotted time to work on the infrastructure or interact with the community etc. I'd just really like to know what their ideal situation would be: less computer power, more power, more RAM, more funding for the platform, less computer power most of the time but with occasional peaks? Also, although like most people I don't generally comment on the science updates posted by the lab members on here, I really like reading them to get a feel for what's being run on my machines. From their side it probably looks like no one is interested, but for me that's definitely not the case. |
dcdc Joined: 3 Nov 05 Posts: 1831 Credit: 119,627,225 RAC: 10,243 |
There's only mod.Sense, who is a volunteer and so isn't paid by the bakerlab but has done an incredible job of keeping this place in order for many years, with waaay more patience dealing with all types of people on the forum than I would ever have had. David Kim also posts occasionally, but less so recently and usually only about tech stuff like hardware issues. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 7,218 |
"There's only mod.Sense, who is a volunteer and so isn't paid by the bakerlab but has done an incredible job of keeping this place in order for many years, with waaay more patience dealing with all types of people on the forum than I would ever have had." Agree about mod.Sense and David. I have worked with David offline several times and he has been quite knowledgeable and helpful. I have usually talked about performance issues, but the last time was to give him a fix for the Ubuntu 18.04 glibc problem that caused WUs to abort. David cleaned up my fix and built the Linux Rosetta 4.08 64-bit version. The Rosetta developers have been repeatedly skeptical about my performance improvement estimates. That is not a surprise. Developers are sensitive about their work and frequently think they know more than they do. I had to explain to many compiler developers why their "really neat improvement" was not going to make the impact they forecast. The application developers are farther away from performance problems than the compiler developers. I thought I would look at the Rosetta code, see what they have done during the last two years, and maybe run some experiments. ... Chilean ... you know more than you think. As with most topics, terminology is the barrier. Start another thread if you have a question and I will see if I can answer. |
Jean-David Beyer Joined: 2 Nov 05 Posts: 188 Credit: 6,431,332 RAC: 4,992 |
"Their efforts to 'hyper optimize' the binary by pulling functions 'inline' are based on running 1 copy on a large, idle machine. The result is 'sub optimized' performance when running 2 or more WUs on a machine, which strains a critical resource .. like the instruction cache. I am running 36 copies on a machine and the negative impact of inlining functions is pretty obvious." I guess it really depends on what compiler and optimizer are used. Long ago, a friend and I worked for Bell Labs, writing a post-compiler, assembler-level optimizer for their C compiler. One of the optimizations we did was to expand functions inline, though not if they were "too big" or obviously recursive. By itself, this could save the call and return overhead, which really mattered only in short, fast functions. But sometimes it also gave the optimizer a better view of what was going on. In one benchmark, a function was called 10,000 times, but the loop was outside that function. With the function expanded inline, the optimizer noticed that everything inside the loop had the same value each time around, so all those instructions were moved outside the loop, greatly speeding up execution. Then the live-dead analysis eliminated the single computation because the values were never used. Even the loop overhead and the function call and return overhead were removed. As for running more than one instance of a program at the same time, actual RAM use could be reduced because only one instance of the code need be in RAM, independent of the number of processes using that code. And if the working sets were comparable, these days with large instruction caches (my 4-core Xeon processor has 10 MB SmartCache) the working set of the instances could well be pretty much the same, so execution time for both might not degrade at all, compared to running the programs sequentially. For short programs, this might not matter, but for programs like climateprediction.net that can take weeks or months to run, this could be quite significant. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 8,387 |
"As for running more than one instance of a program at the same time, actual RAM use could be reduced because only one instance of the code need be in RAM, independent of the number of processes using that code. And if the working sets were comparable, these days with large instruction caches (my 4-core Xeon processor has 10 MB SmartCache) the working set of the instances could well be pretty much the same, so execution time for both might not degrade at all, compared to running the programs sequentially. For short programs, this might not matter, but for programs like climateprediction.net that can take weeks or months to run, this could be quite significant." Uhh, from what I see, we have 2 good developers (you and rjs5) with experience in code optimization. It would be a shame if the R@H developers didn't use this knowledge. But I think it will happen. |
Jean-David Beyer Joined: 2 Nov 05 Posts: 188 Credit: 6,431,332 RAC: 4,992 |
"The Rosetta developers have been repeatedly skeptical about my performance improvement estimates. That is not a surprise. Developers are sensitive about their work and frequently think they know more than they do. I had to explain to many compiler developers why their 'really neat improvement' was not going to make the impact they forecast. The application developers are farther away from performance problems than the compiler developers." When I was working on optimizers, another part of my department was working on hardware design for a new 32-bit processor. The hardware designers were even farther away from performance problems than the compiler developers. The hardware guys found that in a benchmark program the marketing department thought was important, there was often a multiplication by two, so they were going to design in a special floating-point multiply-by-two instruction. I pointed out that in a normal workload, multiplying floating-point numbers by two was seldom done and, furthermore, that due to the construction of the benchmark program, I could guarantee that the compiler-optimizer would never generate the floating-point multiply-by-two instruction. (The value 2 was in an external variable that could not be seen by the compiler-optimizer.) I suggested that a much better use of the chip area would be a larger instruction cache, which would be much more useful. But they would not do that; they designed their fancy new instruction, and we never generated it. |
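
A quick cross-check of the objdump comparison in the opening post: the word size of a Linux binary is recorded in its ELF header, independent of the file name. The sketch below is a minimal, hypothetical helper (the names `describe_elf` and `describe_elf_file` are illustrative and not part of any Rosetta or BOINC tool) that reads the `EI_CLASS` byte and the `e_machine` field, assuming the file is a standard ELF executable.

```python
import struct

ELF_MAGIC = b"\x7fELF"
# e_ident[EI_CLASS] is the byte at offset 4 of every ELF file.
EI_CLASS_NAMES = {1: "32-bit (ELFCLASS32)", 2: "64-bit (ELFCLASS64)"}
# e_machine values for the two architectures discussed in this thread.
E_MACHINE_NAMES = {3: "i386", 62: "x86-64"}


def describe_elf(header: bytes) -> str:
    """Summarise word size and target machine from the first 20 header bytes."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    word_size = EI_CLASS_NAMES.get(header[4], "unknown class")
    # e_ident[EI_DATA] (offset 5): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18 in both ELF32 and ELF64.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return f"{word_size}, machine={E_MACHINE_NAMES.get(machine, machine)}"


def describe_elf_file(path: str) -> str:
    """Read just enough of the file at `path` to describe it."""
    with open(path, "rb") as f:
        return describe_elf(f.read(20))
```

Calling `describe_elf_file("minirosetta_3.78_x86_64-pc-linux-gnu")` on a downloaded copy would show whether the file is 32-bit despite its name, as rjs5 reports. The path here is only an example; the real location depends on where the BOINC client stores project files.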
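
Link's description of how the server chooses between the 4.07 and 4.08 applications can be pictured with a toy model. This is not BOINC scheduler code: the function name, the `explore_fraction` parameter and the data layout are assumptions made for illustration, and only the threshold of 10 valid WUs comes from the thread.

```python
import random

MIN_VALID_RESULTS = 10  # threshold quoted by Link in this thread


def pick_app_version(stats, explore_fraction=0.05, rng=random):
    """Pick an application version name for the next WU (toy model).

    `stats` maps a version name to (valid_results, measured_gflops).
    `explore_fraction` is a made-up knob for "only very few to the slower one".
    """
    # Phase 1: a version without enough valid results still needs samples.
    unproven = [name for name, (valid, _) in stats.items()
                if valid < MIN_VALID_RESULTS]
    if unproven:
        return rng.choice(unproven)

    # Phase 2: mostly send the fastest version, occasionally re-test the rest.
    fastest = max(stats, key=lambda name: stats[name][1])
    others = [name for name in stats if name != fastest]
    if others and rng.random() < explore_fraction:
        return rng.choice(others)
    return fastest
```

Under this model, a run of WUs measured while the machine was being reconfigured would skew `measured_gflops` for one version, which is consistent with Link's advice that the choice corrects itself as more results come in.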
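
The benchmark story in Jean-David Beyer's earlier post (inline expansion, then moving invariant code out of the loop, then live-dead analysis) can be traced step by step. Python does not perform these optimizations itself, so each stage below is written out by hand to show what the optimizer effectively produced. The function names and the arithmetic are hypothetical, and the returned `ops` count is bookkeeping added only so the stages can be compared.

```python
def work(a, b):
    # Small function whose result does not depend on the loop counter.
    return a * b + 1


def stage0_call_in_loop(a, b, n=10_000):
    # Original shape: the loop sits outside the function, so the optimizer
    # sees only an opaque call repeated n times.
    ops = 0
    for _ in range(n):
        unused = work(a, b)
        ops += 1
    return ops


def stage1_inlined(a, b, n=10_000):
    # After inline expansion the function body is visible inside the loop.
    ops = 0
    for _ in range(n):
        unused = a * b + 1
        ops += 1
    return ops


def stage2_hoisted(a, b, n=10_000):
    # The expression has the same value on every iteration, so it is
    # computed once, outside the loop, leaving an empty loop behind.
    unused = a * b + 1
    ops = 1
    for _ in range(n):
        pass
    return ops


def stage3_dead_code_removed(a, b, n=10_000):
    # The value is never read, so the computation is removed, and the
    # now-empty loop goes with it.
    return 0
```

The counts fall from `n` evaluations to one and then to none, which matches the post's point that inlining sometimes matters less for the saved call overhead than for what it lets later passes see.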
©2024 University of Washington
https://www.bakerlab.org