Problems with Rosetta version 5.98

P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 60022 - Posted: 8 Mar 2009, 5:11:33 UTC - in response to Message 60021.  
Last modified: 8 Mar 2009, 5:16:49 UTC

RamonS, if you could post a link to the task that failed, that would be great.


Hi Mod Sense.

I'm not RamonS, but I had a look and could only find one 5.98 task that errored, and all hosts that ran it failed. This rig: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=837145

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=209585268

He also has a lot of lock file errors with mini 1.54.
This rig https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=881461

pete.
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 60770 - Posted: 21 Apr 2009, 22:08:10 UTC

A couple of tasks (245331636 and 245251599) failed on a Mac in a way similar to that reported by ramostol.


Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67108864
# cpu_run_time_pref: 14400
No heartbeat from core client for 31 sec - exiting
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67108864
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 0 cpu seconds
This process generated 0 decoys from 0 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>Rossmann2X2_033_11257_11463_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>


robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,284,221
RAC: 995
Message 63447 - Posted: 25 Sep 2009, 3:34:05 UTC

A recent version of acemd from GPUGRID and version 5.98 of rosetta_beta from Rosetta@home may have a compatibility problem; if not, the rosetta_beta graphics portion appears to have frozen by itself.


9/24/2009 4:39:22 PM CUDA device: GeForce 9800 GT (driver version 19038, compute capability 1.1, 1024MB, est. 60GFLOPS)

9/24/2009 4:39:35 PM rosetta@home Restarting task Rossmann2X3_002_14911_14657_0 using rosetta_beta version 598
9/24/2009 4:39:38 PM GPUGRID Restarting task PMEno54-OTTO_HERG4-10-40-RND5579_0 using acemd version 671

Today, I saw the graphics portion of a rosetta_beta workunit freeze in a way that kept it from ending its screensaver function when I used the keyboard and mouse.

The log lines above show which workunits resumed after I rebooted the computer.

The rosetta_beta workunit resumed at essentially the same point shown in the frozen graphics before the reboot.

I'd like to see the rosetta_beta graphics portion modified to show the complete workunit and program names - but here's what I copied off the frozen screen:

denova design of Rossmann2X3;
70.74% Complete
CPU time: 8 hr 29 min 21 sec
Stage: Ab initio + relax
Model: 43 Step: 77427
Rosetta@home v5.98

Currently using Nvidia driver version 190.38; no word yet on whether the 190.62 version now available is likely to be more reliable.

64-bit Vista SP2
BOINC version 6.6.36
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 63591 - Posted: 4 Oct 2009, 7:36:46 UTC

This workunit 285247863 failed on Mac OSX 10.6.1

<core_client_version>6.6.36</core_client_version>
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768
# cpu_run_time_pref: 10800
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
# random seed: 3155889
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768

plus similar messages
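
A note on the repeated sin_cos_range lines: the message says the value being checked was NaN (not a number), and NaN fails any range test because every ordered comparison with NaN is false. The Python sketch below only illustrates that behavior. The function names are hypothetical and this is not Rosetta's actual code; only the message text is taken from the log above.

```python
import math


def in_sin_cos_range(x: float) -> bool:
    """Return True when x is a legal sine/cosine value, i.e. within [-1, +1]."""
    # Every ordered comparison involving NaN is False, so NaN fails this test.
    return -1.0 <= x <= 1.0


def check(x: float) -> str:
    """Hypothetical checker that mimics the wording seen in the stderr output."""
    if in_sin_cos_range(x):
        return "ok"
    return f"sin_cos_range ERROR: {x} is outside of [-1,+1] sin and cos value legal range"


# One common way a NaN appears: an undefined operation such as inf - inf.
bad_value = float("inf") - float("inf")

print(math.isnan(bad_value))  # the result of inf - inf is NaN
print(check(0.5))             # a legal value passes
print(check(bad_value))       # NaN is reported as out of range
```

Once a NaN enters a calculation it tends to propagate through later arithmetic, which would be consistent with the same line being printed several times in a row.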


Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63595 - Posted: 4 Oct 2009, 14:02:43 UTC

svincent, the task indicates that the actual cause of death was too many restarts without progress. This can mean several things, including perhaps a bug in the application or model. But most often it means you restarted your machine several times in a row? Or that the task got suspended several times in a row perhaps to run other projects or if you only run when computer not in use, perhaps someone came up and used it for brief periods several times in a row.

Hence the recommendation in the message to keep tasks in memory when preempted. The "memory" in such a case ends up being the swap space. This will preserve the work (unless the machine is actually powered off, or BOINC completely exited) and let the task pick up where it left off, regardless of checkpoints. Otherwise the task has to crunch long enough to reach and complete a checkpoint, which can take over an hour for some types of work units.

Do you happen to know if all of that happened on the first start of the task? I see it only recorded a fraction of a second of CPU time. But this does not count any prior runs that were not able to checkpoint.
Rosetta Moderator: Mod.Sense
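
For anyone who prefers to set the keep-in-memory option locally rather than through the web preferences, here is a minimal sketch. It assumes the BOINC client reads a global_prefs_override.xml file from its data directory and honors a leave_apps_in_memory element; both names are assumptions to verify against your BOINC version, and the helper function is hypothetical. The script only builds the XML text and does not write any file.

```python
import xml.etree.ElementTree as ET


def build_override(leave_in_memory: bool) -> str:
    """Build the text of a preferences override with the keep-in-memory flag."""
    # Root element name assumed to match BOINC's global preferences format.
    root = ET.Element("global_preferences")
    flag = ET.SubElement(root, "leave_apps_in_memory")
    flag.text = "1" if leave_in_memory else "0"
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


# Print the fragment so it can be reviewed before being saved by hand.
print(build_override(True))
```

After saving such a file, the client has to re-read its local preferences (or be restarted) before the setting takes effect.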
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 63596 - Posted: 4 Oct 2009, 15:40:52 UTC

Thanks for the explanation. Ever since upgrading to Snow Leopard, Excel 2004 has been constantly crashing on me, causing a return to the log-in screen. After reading your explanation, I suspect that this is the cause of the failed task.
robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,284,221
RAC: 995
Message 63600 - Posted: 4 Oct 2009, 21:27:17 UTC

One problem with using the leave in memory option - it restricts the participation in multiple BOINC projects with high memory requirements on the same computer, especially if some of them have a memory leak. I no longer consider it a suitable option to use when including Rosetta@home and/or Ralph@home in the mix of projects.

I haven't yet found a version of BOINC that's very good at actually moving much of what's in memory into the swap file, especially when what needs to be moved is the results of minirosetta's known memory leak.
robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,284,221
RAC: 995
Message 63603 - Posted: 4 Oct 2009, 23:17:21 UTC
Last modified: 4 Oct 2009, 23:21:54 UTC

I just found a Rosetta Beta workunit with frozen graphics covering the whole screen again. They wouldn't go away when I used the mouse and keyboard.

Two Rosetta Beta 5.98 workunits could have been running on that machine at the time of the graphics freeze; not enough evidence left to tell which one was responsible for this:

Rossmann2X3_001_14908_12080_1

Rossmann2X3_027_15080_10154_0

Both had just a little less CPU time than in the frozen graphics after I rebooted.

That machine has 64-bit BOINC 6.10.3 under Vista SP2; that BOINC version is recommended if I want to continue using the GPU on that machine under GPUGRID. That version often displays the graphics for any workunits in progress, even if I don't ask for any graphics.

One of those workunits is now running again; the other one is waiting for its turn on a CPU core.

The frozen graphics showed Model 2, Step 287738, with CPU time 0:24:07.

Is there an option to disable Rosetta Beta workunits on that machine, but continue running minirosetta workunits? Or would it be better to just discontinue Rosetta@home participation at all until this 5.98 problem is fixed?
robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,284,221
RAC: 995
Message 63611 - Posted: 5 Oct 2009, 15:30:54 UTC

I've now seen the same problem with a workunit from a different BOINC project - QMC@home. Also had graphics covering the whole screen. This leads me to suspect that the problem is with BOINC 6.10.3 dealing with situations where it decides to move the graphics around on the screen, and finds that the graphics don't leave any empty space to move them to.

GPUGRID now needs the newer versions of BOINC, and I don't plan to stop participating there, so I expect a number of people would also like the option to stop receiving 5.98 workunits, and a few BOINC alpha testers to want the option to receive only 5.98 workunits from Rosetta@home for a while.

At least part of the problem apparently occurs inside the Nvidia driver, though. I'm already using the newest Nvidia driver GPUGRID recommends (190.38).
bill Johnson@GMU
Joined: 5 Aug 09
Posts: 5
Credit: 1,356,008
RAC: 0
Message 63620 - Posted: 6 Oct 2009, 12:51:25 UTC
Last modified: 6 Oct 2009, 12:53:44 UTC

I have been getting some Rosetta Beta 5.98 work units. They have had problems downloading, and if they do download, my computer simply refuses to start work on them, so they just sit there untouched. I have had to delete a few of them to make way for Rosetta Mini 1.97 work units that do actually get worked on.

Is there a problem with my preferences that is causing this or just the Rosetta Beta 5.98 work units?

The Beta work units are all Rossmann2X3 units.
ramostol

Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 63638 - Posted: 9 Oct 2009, 9:21:36 UTC - in response to Message 63591.  

This workunit 285247863 failed on Mac OSX 10.6.1

<core_client_version>6.6.36</core_client_version>
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768
# cpu_run_time_pref: 10800
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
# random seed: 3155889
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768

plus similar messages



All Rossmann tasks, successful or not, report these errors, for instance this task run on MacOS 10.5 on a computer working quite undisturbed by human activity:

CPU time 21761.01
stderr out

<core_client_version>6.10.11</core_client_version>
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768
# cpu_run_time_pref: 21600
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
# random seed: 2994865
======================================================
DONE :: 1 starting structures 21760.5 cpu seconds
This process generated 10 decoys from 10 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>

Validate state Valid
Claimed credit 145.750455203617
Granted credit 74.7162632779638
MarcoA

Joined: 2 Sep 08
Posts: 9
Credit: 777,433
RAC: 0
Message 63714 - Posted: 16 Oct 2009, 13:05:29 UTC

Here is another Rossmann task with the same [-1,+1] error:

https://boinc.bakerlab.org/rosetta/result.php?resultid=288301200
Gen_X_Accord
Joined: 5 Jun 06
Posts: 154
Credit: 279,018
RAC: 0
Message 63722 - Posted: 16 Oct 2009, 22:54:28 UTC - in response to Message 63603.  

I just found a Rosetta Beta workunit with frozen graphics covering the whole screen again. They wouldn't go away when I used the mouse and keyboard.

Two Rosetta Beta 5.98 workunits could have been running on that machine at the time of the graphics freeze; not enough evidence left to tell which one was responsible for this:

Rossmann2X3_001_14908_12080_1

Rossmann2X3_027_15080_10154_0

Both had just a little less CPU time than in the frozen graphics after I rebooted.

That machine has 64-bit BOINC 6.10.3 under Vista SP2; that BOINC version is recommended if I want to continue using the GPU on that machine under GPUGRID. That version often displays the graphics for any workunits in progress, even if I don't ask for any graphics.

One of those workunits is now running again; the other one is waiting for its turn on a CPU core.

The frozen graphics showed Model 2, Step 287738, with CPU time 0:24:07.

Is there an option to disable Rosetta Beta workunits on that machine, but continue running minirosetta workunits? Or would it be better to just discontinue Rosetta@home participation at all until this 5.98 problem is fixed?


It would be better to disable the graphics and not use BOINC as your screensaver. Set your computer to no screensaver and have the video power down after 10 minutes or so, and shut the monitors off when you are done. Not only will you save a little on power, but you will no longer have a problem with frozen graphics. Rosetta doesn't need the graphics to run the work unit.
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 63809 - Posted: 25 Oct 2009, 2:03:08 UTC

Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it?

------

# cpu_run_time_pref: 10800
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
# random seed: 3714901


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x755A194B

Engaging BOINC Windows Runtime Debugger...

followed by a bunch of Windows debugging info.


Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63810 - Posted: 25 Oct 2009, 2:31:27 UTC - in response to Message 63809.  

Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it?


...because it had only used 4038.959 seconds of CPU time. Your machine must have had some other higher priority work going on during that time period.

Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63818 - Posted: 25 Oct 2009, 15:04:17 UTC

Good point transient. I believe at that time it was when you reach 4x the runtime preference. But, as I pointed out, the task wasn't getting much CPU time. The newer BOINC clients show "elapsed time" now, not CPU time.
Rosetta Moderator: Mod.Sense
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 63822 - Posted: 25 Oct 2009, 15:45:59 UTC

Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it?


...because it had only used 4038.959 seconds of CPU time. Your machine must have had some other higher priority work going on during that time period.

____________
Rosetta Moderator: Mod.Sense



There is a mismatch between the 4,038 seconds of CPU time reported in the Task Details and the 25+ hours it actually took (I decided to let it continue running). The only other tasks going on were Rosetta tasks using the second core. Could it be a System 7 issue?

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63829 - Posted: 26 Oct 2009, 2:40:03 UTC

ok, so when you say it actually took 25 hrs, this information came from what source?
Rosetta Moderator: Mod.Sense
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 63840 - Posted: 26 Oct 2009, 14:08:03 UTC

ok, so when you say it actually took 25 hrs, this information came from what source?


The elapsed time field in the BOINC manager.

(My run time preference is set to 3 hours)

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63843 - Posted: 26 Oct 2009, 16:45:07 UTC

So the question becomes, why would 25hrs elapse, with only 4000 seconds of low priority CPU being available to BOINC? This is why I made the comment that your machine must have been busy doing something else that day.
Rosetta Moderator: Mod.Sense
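
To put numbers on the discussion above, here is a rough worked check in Python. It assumes, as suggested earlier in the thread, that the watchdog ends a task once its CPU time reaches about 4x the run time preference, and that it counts CPU time rather than elapsed time. The function names are hypothetical; the figures (10800 s preference, 4038.959 s CPU time, 25 hours elapsed) come from the posts above.

```python
def watchdog_cpu_limit(run_time_pref_s: float, multiplier: float = 4.0) -> float:
    """CPU-time ceiling in seconds, assuming a 4x-run-time-preference rule."""
    return run_time_pref_s * multiplier


def watchdog_would_fire(time_s: float, run_time_pref_s: float,
                        multiplier: float = 4.0) -> bool:
    """True when the measured time has reached the assumed watchdog ceiling."""
    return time_s >= watchdog_cpu_limit(run_time_pref_s, multiplier)


run_time_pref_s = 10800      # "# cpu_run_time_pref: 10800" (3 hours)
cpu_time_s = 4038.959        # CPU time reported in the task details
elapsed_s = 25 * 3600        # elapsed time shown in the BOINC manager

limit_s = watchdog_cpu_limit(run_time_pref_s)
cpu_share = cpu_time_s / elapsed_s

print(f"assumed watchdog limit: {limit_s:.0f} s of CPU time")
print(f"fires on CPU time?      {watchdog_would_fire(cpu_time_s, run_time_pref_s)}")
print(f"fires on elapsed time?  {watchdog_would_fire(elapsed_s, run_time_pref_s)}")
print(f"CPU share of elapsed:   {cpu_share:.1%}")
```

Under these assumptions the task was far below the CPU-time ceiling even after 25 hours of wall-clock time, which would explain why the watchdog stayed quiet. It does not explain why the task received so little CPU time.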
©2024 University of Washington
https://www.bakerlab.org