Message boards : Number crunching : Rosetta 4.0+
Author | Message |
---|---|
spRocket Send message Joined: 23 Mar 20 Posts: 22 Credit: 3,008,018 RAC: 0 |
(Reposting here, since this is a 4.08 issue) I'm finding that I get signal 11 issues with a couple of older AMD processors, an Athlon II X4 630 and a Phenom II X2 550 Black Edition (the latter running with two unlocked cores). Both of these systems are running on ASUS M4A785-M motherboards with 4 GB of ECC RAM. It seems that Rosetta Mini works OK, but the full Rosetta consistently gets errors on tasks. An example from Task 1133622372: <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process got signal 11</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 7hp5zr7e_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 7hp5zr7e_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3696211 Starting watchdog... Watchdog active. </stderr_txt> ]]> Both of these CPUs are shown as "Family 16" in the CPU type listing. In the meantime, I've shifted both of these systems over to World Community Grid, which is working as it should. On the other hand, my Ryzen 7/1700 is happily devouring Rosetta tasks, as is an old ThinkPad with an i7 L 640. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 42 |
Just an observation. I was getting the download problem almost daily on my machines, but have not had one for 5 days now. What I can see in my task list, is a failure with insufficient memory, both machines, 4 core 8 thread, have 16GB with 90% use figures. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,013 |
"ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose ERROR:: Exit from: ......srcprotocolsprotein_interface_designfiltersRmsdFilter.cc line: 323 22:47:47 (7828): called boinc_finish(0)" https://boinc.bakerlab.org/rosetta/result.php?resultid=1134099093 Is this an error? The work unit is validated. Is the result usable? I see this across 2 Ryzen hosts. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 9,701 |
I just saw a 24.01 KB zip file being downloaded. 24.0 KB appeared to download at normal speed, then it was several seconds before it downloaded the last 0.01 KB. Confirmed here too, many times |
Peti Send message Joined: 17 Mar 20 Posts: 5 Credit: 142,053 RAC: 0 |
"ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose Hi, I'd think it's an error if it says so. But it's inside the Rosetta software or data, some tasks are getting this very same error message at my PC, too. for example, this: https://boinc.bakerlab.org/rosetta/result.php?resultid=1134374428 ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose And to note, my PC was not overclocked at that time, and I did not reboot the PC or stop boinc in any way around that time. so it must be the software.... |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,548 |
"ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose Getting credit with an error reported depends on how the validator was written. It may have been written to accept task output as valid if two different computers report identical errors for the same workunit. |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,013 |
I did reboot the PC at least once while that WU ran. The error appears only in the Rosetta log, as the server says the unit validated successfully. And because the server validated the WU, there was no other copy sent to another host. These COVID-19 WUs are heavy on RAM, so I try not to use other programs that use lots of RAM. It would be a pity if they weren't even working properly on my PCs. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Work units are comprised of a number of models ("decoys"). Credit is issued by the number of completed models. A fast machine completes more models per hour than a slower machine, and is granted credit per completed model, so higher credit per hour. The error report must relate to the last model that was attempted in the work unit. Any prior completed models still report in and get credit. Some number of failures is to be expected. Every model is a combination of things that no one has tried before. So as we collectively navigate the search space, some of the models can get lost. Observation of failure is a part of the scientific process. Your machines report the failures back to the Project Team, and they can then be studied for details on why they fail and how to modify the program to work better in the future. It is important that everyone realize that things BOINC calls "failures" are more aptly described as "learning experiences". Until your machine came across the combination of factors that caused it to fail, no one knew the program needed improvement. Keep 'em coming. Rosetta Moderator: Mod.Sense |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,013 |
Thanks for the reply, Mod.Sense. A quick forum search shows this ERROR: Assertion issue is new, with the first report a mere 5 days ago. And IIRC from looking at the work units, it is specific to the Rosetta 4.07 COVID-19 work units. Haven't seen it with Rosetta Mini. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,548 |
Some of you may want to extend each task to get more work done during the current shortage of tasks. If so, try this: If you are using the Simple view, click on View near the top line, then Advanced view.... Click on Projects, then Rosetta@home, then Your account. Under Preferences, click on Rosetta@home preferences. In each preferences section, click on Edit preferences. Click on the V for Target CPU run time. Click on a value just above your current setting. Increasing this value too fast causes problems. Click on Update preferences. Click on the X at the top right corner of the Rosetta@home preferences window to close it. Click on Projects, then Rosetta@home, then Update. If you want to go back to the Simple view, click on View, then Simple view.... You might repeat the above every few days until you reach the maximum value for Target CPU run time. |
[AF>Le_Pommier] Jerome_C2005 Send message Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
Hi, I have minirosetta tasks running on a Linux machine, like this one, where I realized they had been running a long time with almost no CPU used. In the slot file I found some errors, so I decided to cancel them: No heartbeat from core client for 30 sec - exiting It seems that I have others going the same way: the CPU time is almost zero despite a substantial run time... Only 2 Rosetta Mini tasks have succeeded. My rosetta tasks seem to be all OK. What should I do? Completely stop Rosetta Mini from being sent to that machine? Thanks |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
24 processors, and 8GB of RAM is not going to work very well for Rosetta@home. You probably have more than half of the tasks in a "waiting for memory" status. Rosetta Moderator: Mod.Sense |
[AF>Le_Pommier] Jerome_C2005 Send message Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
I have limited it to 4 rosetta + 4 mini using an app_config. The rest is running TN-Grid. The system is currently only using 4 GB out of 8, so plenty of RAM left. The mini tasks keep having the same issue; I have some running over 30 hours without any CPU use or completion. I have now limited mini to 1 task and rosetta to 6. I currently have trouble accessing the machine except from a Linux ssh command line, I have loads of mini tasks waiting, and I don't know how to bulk cancel them all... |
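For bulk-cancelling from an ssh command line, one option is BOINC's `boinccmd` tool, which can list tasks (`boinccmd --get_tasks`) and abort one task at a time (`boinccmd --task URL task_name abort`). Below is a minimal Python sketch that turns the task listing into a list of abort commands. It assumes the listing prints a `name:` line followed later by a `project URL:` line for each task, and the `name_filter` substring used to pick out the mini tasks is hypothetical; check the real output and task names on your host before running anything.

```python
import shlex


def parse_tasks(get_tasks_output):
    """Return (project_url, task_name) pairs from `boinccmd --get_tasks` text.

    Assumption: each task block has a 'name:' line and, after it,
    a 'project URL:' line. 'WU name:' lines are ignored.
    """
    tasks = []
    name = None
    for raw in get_tasks_output.splitlines():
        line = raw.strip()
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("project URL:") and name is not None:
            url = line.split(":", 1)[1].strip()
            tasks.append((url, name))
            name = None
    return tasks


def abort_commands(tasks, name_filter):
    """Build one abort command per task whose name contains name_filter.

    name_filter is a hypothetical substring; nothing is executed here.
    """
    return [
        "boinccmd --task {} {} abort".format(shlex.quote(url), shlex.quote(name))
        for url, name in tasks
        if name_filter in name
    ]
```

The sketch only builds the command strings, so they can be reviewed first and then pasted into the shell (or run via `subprocess`) once they look right.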
[AF>Le_Pommier] Jerome_C2005 Send message Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
I managed to access that BOINC client using boinctasks from another machine now, and I aborted all pending mini tasks (BT is great for managing many tasks at once). I'll let it run for some days to see how it goes; for the moment there are enough rosetta (normal) tasks for some time, I think. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 42 |
The post a couple back about memory, I fully concur. When I build a system, I always try to have at least 2GB per thread. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
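The 2 GB per thread rule of thumb can be checked with simple arithmetic. The sketch below is only an illustration: the function names and the `reserved_gb` parameter are made up for this example, and 2 GB per task is the rule of thumb from the post above, not a measured Rosetta figure.

```python
def ram_needed_gb(threads, ram_per_thread_gb=2.0):
    """RAM suggested by the rule of thumb for a given thread count."""
    return threads * ram_per_thread_gb


def max_tasks_for_ram(total_ram_gb, ram_per_task_gb=2.0, reserved_gb=0.0):
    """How many tasks fit in RAM, after reserving some for the OS (hypothetical knob)."""
    usable = max(total_ram_gb - reserved_gb, 0.0)
    return int(usable // ram_per_task_gb)


# The 24-processor, 8 GB host mentioned earlier in the thread:
print(ram_needed_gb(24))      # RAM the rule of thumb would call for
print(max_tasks_for_ram(8))   # tasks that fit in 8 GB at 2 GB each
```

By this estimate a 24-thread host would want 48 GB, while 8 GB covers about 4 concurrent tasks, fewer if some RAM is reserved for the system.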
[AF>Le_Pommier] Jerome_C2005 Send message Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
It is a dedicated server hosted by a foreign provider, cheap and not recent: I cannot upgrade memory or anything. All rosetta tasks are running fine (biggest use of RAM) and finishing successfully, even with 6 concurrent tasks running, and all mini tasks are ending in error (except one), even limited to 1 at a time, so it cannot be a lack of RAM (rosetta uses more than mini, and I double-checked that I still have a fair amount of free RAM at any given time). I realize I cannot select applications to exclude mini in the Rosetta preferences (unlike all other BOINC projects)! And in app_config I cannot set the max number to 0 because it is ignored; I have to set it to 1 for the max limit to be honored by BOINC... Do I have any other way to completely exclude mini and avoid wasting processing time? |
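For reference, the per-application limits described here are normally expressed in an `app_config.xml` file in the project directory, using `<app>`, `<name>` and `<max_concurrent>` elements. A minimal Python sketch that generates such a file is below. The application names `rosetta` and `minirosetta` are assumptions; use the names your client actually reports. As noted in the post, a value of 0 does not exclude an application, so the sketch uses 1 as the lowest limit.

```python
import xml.etree.ElementTree as ET


def build_app_config(limits):
    """Return app_config XML text for a {app_name: max_concurrent} mapping."""
    root = ET.Element("app_config")
    for app_name, max_concurrent in limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# Assumed app names; limits match the 6 rosetta + 1 mini setup described above.
print(build_app_config({"rosetta": 6, "minirosetta": 1}))
```

The printed text can be saved as `app_config.xml` in the Rosetta project folder, after which the client needs to re-read its config files for the limits to take effect.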
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Being able to see mini tasks that have failed without being aborted would be the best way to see why they aren't working on that machine. Please post with links to the host and problem WUs. Rosetta Moderator: Mod.Sense |
[AF>Le_Pommier] Jerome_C2005 Send message Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
I had posted these details in my message above (April 12), but it seems the tasks have now been purged from the website; my first link no longer shows the example task I had given (how long do they remain visible? This was only 2 days ago). But I posted examples of the error messages I could find in the slot directory at that time in that same message above. |
[AF>Le_Pommier] Jerome_C2005 Send message Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
I realize the last mini task I had has been stuck for 3 days without using any CPU (at least no progress is being made on the task); none of the files in the slot have been updated for 3 days. I don't know how to extract the err file from the Linux-hosted machine, so I made screenshots, because I'm going to abort this task and, as we said, Rosetta doesn't keep the task log on the server after one or two days. I was given a solution to exclude all mini on that machine by using an app_info config file (re-describe all rosetta apps, and no mini app). The rosetta tasks continue to run normally on that machine... |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,548 |
[snip] I realize the last mini task I had has been stuck for 3 days without using no CPU (at least no advancement is done on the task), all the files in the slot have not been updated since 3 days. Upgrading to BOINC 7.16.5 makes that error much less likely. |
©2024 University of Washington
https://www.bakerlab.org