Message boards : Number crunching : Problems with Rosetta version 5.98
Author | Message |
---|---|
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
RamonS, if you could post a link to the task that failed, that would be great. Hi Mod Sense. I'm not RamonS, but I had a look and could only find one 5.98 task that errored, and all hosts that ran it failed. This rig https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=837145 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=209585268 He also has a lot of lock file errors with mini 1.54. This rig https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=881461 pete. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
A couple of tasks ( 245331636 and 245251599) failed on Mac in a way similar to that reported by ramostol. Rosetta@home Macintosh Stack Size checker. Original size: 8388608. Maximum size: 0. RLIM_INFINITY 67108864 # cpu_run_time_pref: 14400 No heartbeat from core client for 31 sec - exiting Rosetta@home Macintosh Stack Size checker. Original size: 8388608. Maximum size: 0. RLIM_INFINITY 67108864 Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 0 cpu seconds This process generated 0 decoys from 0 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... </stderr_txt> <message> <file_xfer_error> <file_name>Rossmann2X2_033_11257_11463_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> </message> |
robertmiles Send message Joined: 16 Jun 08 Posts: 1233 Credit: 14,284,221 RAC: 1,121 |
A recent version of acemd from GPUGRID and version 5.98 of rosetta_beta from Rosetta@home may have a compatibility problem; if not, the rosetta_beta graphics portion appears to have frozen by itself. 9/24/2009 4:39:22 PM CUDA device: GeForce 9800 GT (driver version 19038, compute capability 1.1, 1024MB, est. 60GFLOPS) 9/24/2009 4:39:35 PM rosetta@home Restarting task Rossmann2X3_002_14911_14657_0 using rosetta_beta version 598 9/24/2009 4:39:38 PM GPUGRID Restarting task PMEno54-OTTO_HERG4-10-40-RND5579_0 using acemd version 671 Today, I saw the graphics portion of a rosetta_beta workunit freeze in a way that kept it from ending its screensaver function when I used the keyboard and mouse. Some information above about which workunits resumed after I rebooted the computer. The rosetta_beta workunit resumed at essentially the same point shown in the frozen graphics before the reboot. I'd like to see the rosetta_beta graphics portion modified to show the complete workunit and program names - but here's what I copied off the frozen screen: denova design of Rossmann2X3; 70.74% Complete CPU time: 8 hr 29 min 21 sec Stage: Ab initio + relax Model: 43 Step: 77427 Rosetta@home v5.98 Currently using Nvidia driver version 190.38; no word yet on whether the 190.62 version now available is likely to be more reliable. 64-bit Vista SP2 BOINC version 6.6.36 |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
This workunit 285247863 failed on Mac OSX 10.6.1 <core_client_version>6.6.36</core_client_version> <![CDATA[ <stderr_txt> Rosetta@home Macintosh Stack Size checker. Original size: 8388608. Maximum size: 0. RLIM_INFINITY 67104768 # cpu_run_time_pref: 10800 sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range # random seed: 3155889 Rosetta@home Macintosh Stack Size checker. Original size: 8388608. Maximum size: 0. RLIM_INFINITY 67104768 plus similar messages |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
svincent, the task indicates that the actual cause of death was too many restarts without progress. This can mean several things, including perhaps a bug in the application or model. But most often it means that you restarted your machine several times in a row, or that the task got suspended several times in a row, perhaps to run other projects. Or, if you only run when the computer is not in use, perhaps someone came up and used it for brief periods several times in a row. Hence the recommendation in the message to keep tasks in memory when preempted. The "memory" in such a case ends up being the swap space. This will preserve the work (unless the machine is actually powered off, or BOINC is completely exited) and let the task pick up where it left off, regardless of checkpoints. Otherwise the task has to crunch long enough to reach and complete a checkpoint, which can take over an hour for some types of work units. Do you happen to know if all of that happened on the first start of the task? I see it only recorded a fraction of a second of CPU time. But this does not count any prior runs that were not able to checkpoint. Rosetta Moderator: Mod.Sense |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Thanks for the explanation. Ever since upgrading to Snow Leopard, Excel 2004 has been constantly crashing on me, causing a return to the log-in screen. After reading your explanation, I suspect that this is the cause of the failed task. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1233 Credit: 14,284,221 RAC: 1,121 |
One problem with using the leave in memory option - it restricts the participation in multiple BOINC projects with high memory requirements on the same computer, especially if some of them have a memory leak. I no longer consider it a suitable option to use when including Rosetta@home and/or Ralph@home in the mix of projects. I haven't yet found a version of BOINC that's very good at actually moving much of what's in memory into the swap file, especially when what needs to be moved is the results of minirosetta's known memory leak. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1233 Credit: 14,284,221 RAC: 1,121 |
I just found a Rosetta Beta workunit with frozen graphics covering the whole screen again. They wouldn't go away when I used the mouse and keyboard. Two Rosetta Beta 5.98 workunits could have been running on that machine at the time of the graphics freeze; not enough evidence left to tell which one was responsible for this: Rossmann2X3_001_14908_12080_1 Rossmann2X3_027_15080_10154_0 Both had just a little less CPU time than in the frozen graphics after I rebooted. That machine has 64-bit BOINC 6.10.3 under Vista SP2; that BOINC version is recommended if I want to continue using the GPU on that machine under GPUGRID. That version often displays the graphics for any workunits in progress, even if I don't ask for any graphics. One of those workunits is now running again; the other one is waiting for its turn on a CPU core. The frozen graphics showed Model 2, Step 287738, with CPU time 0:24:07. Is there an option to disable Rosetta Beta workunits on that machine, but continue running minirosetta workunits? Or would it be better to just discontinue Rosetta@home participation at all until this 5.98 problem is fixed? |
robertmiles Send message Joined: 16 Jun 08 Posts: 1233 Credit: 14,284,221 RAC: 1,121 |
I've now seen the same problem with a workunit from a different BOINC project - QMC@home. It also had graphics covering the whole screen. This leads me to suspect that the problem is with BOINC 6.10.3 dealing with situations where it decides to move the graphics around on the screen, and finds that the graphics don't leave any empty space to move them to. GPUGRID now needs the newer versions of BOINC, and I don't plan to stop participating there, so I expect a number of people would also like the option to stop receiving 5.98 workunits, and a few BOINC alpha testers to want the option to receive only 5.98 workunits from Rosetta@home for a while. At least part of the problem apparently occurs inside the Nvidia driver, though. I'm already using the newest Nvidia driver GPUGRID recommends (190.38). |
bill Johnson@GMU Send message Joined: 5 Aug 09 Posts: 5 Credit: 1,356,008 RAC: 0 |
I have been getting some Rosetta Beta 5.98 work units. They have been having problems downloading, and if they do download, my computer simply refuses to start work on them, so they just sit there untouched. I have had to delete a few of them to make way for Rosetta Mini 1.97 work units that do actually get worked on. Is there a problem with my preferences that is causing this, or is it just the Rosetta Beta 5.98 work units? The Beta work units are all Rossmann2X3 units. |
ramostol Send message Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
This workunit 285247863 failed on Mac OSX 10.6.1 All Rossmann tasks, successful or not, report these errors, for instance this task run on MacOS 10.5 on a computer working quite undisturbed by human activity: CPU time 21761.01 stderr out <core_client_version>6.10.11</core_client_version> <![CDATA[ <stderr_txt> Rosetta@home Macintosh Stack Size checker. Original size: 8388608. Maximum size: 0. RLIM_INFINITY 67104768 # cpu_run_time_pref: 21600 sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range # random seed: 2994865 ====================================================== DONE :: 1 starting structures 21760.5 cpu seconds This process generated 10 decoys from 10 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... </stderr_txt> ]]> Validate state Valid Claimed credit 145.750455203617 Granted credit 74.7162632779638 |
MarcoA Send message Joined: 2 Sep 08 Posts: 9 Credit: 777,433 RAC: 0 |
Here is another rossmann-task with the same [-1,+1]-Error: https://boinc.bakerlab.org/rosetta/result.php?resultid=288301200 |
Gen_X_Accord Send message Joined: 5 Jun 06 Posts: 154 Credit: 279,018 RAC: 0 |
I just found a Rosetta Beta workunit with frozen graphics covering the whole screen again. They wouldn't go away when I used the mouse and keyboard. It would be better to disable the graphics and not allow BOINC as your screensaver. Set your computer to no screensaver and have the video power down after 10 minutes or so, and shut the monitors off when you are done. Not only will you save a little on power, but you will no longer have a problem with frozen graphics. Rosetta doesn't need the graphics to run the work unit. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it? ------ # cpu_run_time_pref: 10800 sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range # random seed: 3714901 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x755A194B Engaging BOINC Windows Runtime Debugger... followed by a bunch of Windows debugging info. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it? ...because it had only used 4038.959 seconds of CPU time. Your machine must have had some other higher priority work going on during that time period. Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Good point transient. I believe at that time it was when you reach 4x the runtime preference. But, as I pointed out, the task wasn't getting much CPU time. The newer BOINC clients show "elapsed time" now, not CPU time. Rosetta Moderator: Mod.Sense |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it? There is a mismatch between the 4,038 seconds of CPU time reported in the Task Details and the 25+ hours it actually took (I decided to let it continue running). The only other tasks going on were Rosetta tasks using the second core. Could it be a System 7 issue? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
OK, so when you say it actually took 25 hrs, this information came from what source? Rosetta Moderator: Mod.Sense |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
ok, so when you say it actually took 25 hrs, this information came from what source? The elapsed time field in the BOINC manager. (My run time preference is set to 3 hours) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
So the question becomes, why would 25hrs elapse, with only 4000 seconds of low priority CPU being available to BOINC? This is why I made the comment that your machine must have been busy doing something else that day. Rosetta Moderator: Mod.Sense |
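The arithmetic behind the last few posts can be written out as a short sketch. It assumes, as Mod.Sense's replies suggest but do not confirm, that the watchdog compares accumulated CPU time (not elapsed wall-clock time) against 4x the run time preference. The function names are hypothetical and are not taken from the Rosetta or BOINC source; the numbers are the ones quoted in the thread (run time preference 10800 s, 4038.959 s of CPU time, roughly 25 hours elapsed).

```python
# Hypothetical sketch of the watchdog arithmetic discussed above.
# Assumption: the watchdog counts CPU time and trips at 4x the run time
# preference, as Mod.Sense recalls ("I believe ... 4x the runtime preference").

def watchdog_limit_seconds(run_time_pref_s: float, multiplier: float = 4.0) -> float:
    """CPU-time budget before the (assumed) watchdog would end the task."""
    return run_time_pref_s * multiplier


def watchdog_would_abort(time_s: float, run_time_pref_s: float,
                         multiplier: float = 4.0) -> bool:
    """True if the measured time has reached the watchdog limit."""
    return time_s >= watchdog_limit_seconds(run_time_pref_s, multiplier)


def cpu_share(cpu_time_s: float, elapsed_s: float) -> float:
    """Fraction of wall-clock time the task actually spent on a CPU."""
    return cpu_time_s / elapsed_s


if __name__ == "__main__":
    run_time_pref_s = 10800      # cpu_run_time_pref from the stderr output (3 hours)
    cpu_time_s = 4038.959        # CPU time reported in the task details
    elapsed_s = 25 * 3600        # elapsed time shown in the BOINC manager

    print("limit (s):", watchdog_limit_seconds(run_time_pref_s))
    print("abort on CPU time:", watchdog_would_abort(cpu_time_s, run_time_pref_s))
    print("abort if elapsed time were used:", watchdog_would_abort(elapsed_s, run_time_pref_s))
    print("CPU share (%):", round(100 * cpu_share(cpu_time_s, elapsed_s), 1))
```

Under these assumptions the limit is 43,200 CPU seconds, so a task that had only accumulated about 4,039 CPU seconds would not be stopped, even though 25 hours of wall-clock time would exceed the same limit. The task received only about 4.5% of one core over that period, which is the mismatch Mod.Sense is asking about.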
©2024 University of Washington
https://www.bakerlab.org