Long run and failure.

Message boards : Number crunching : Long run and failure.

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 28
Message 73722 - Posted: 31 Aug 2012, 8:31:00 UTC
Last modified: 31 Aug 2012, 8:31:26 UTC

This morning I noticed that this wu was acting a little strangely. It has been running for more than 24 hours now, and in the short while I've been watching it, it doesn't seem to be advancing, i.e. the % done is not moving. Also, when I try to fire up the graphics from that wu, they don't run: I get the initial black rectangular window, but nothing more. When I try to stop the graphics, Windows gives me a "not running, abort or retry" type message.

Digging a little deeper, I found another wu, this one, with an extended run time and an eventual error.

I have suspended the wu, pending comments. This is my usual machine, which I use almost all day (i.e. not a dedicated BOINC cruncher). I've not had any other BOINC problems or issues with anything else. It is a fully patched, up-to-date Win XP machine.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,177,195
RAC: 3,176
Message 73726 - Posted: 31 Aug 2012, 11:18:18 UTC - in response to Message 73722.  

I don't remember the time frame, but there seems to be a problem with SOME of the units not recognizing that the time has elapsed and that they need to stop. Try exiting BOINC and then restarting it, and see whether the unit picks back up at a checkpoint or whether it is just time to cancel it and move on to the next one.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 73727 - Posted: 31 Aug 2012, 20:12:58 UTC

There were some issues with some BOINC versions (I'm not sure specific versions were ever isolated) where the BOINC Manager shows the task as "running" but doesn't give it any CPU time, and so no progress is made. A simple litmus test is to open Windows Task Manager and see whether you have an active thread per CPU or whether System Idle is getting the CPU.
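As a rough illustration of that litmus test, the check boils down to comparing how much CPU time a task accumulated against how much wall-clock time passed. The sketch below is only an assumption-laden example: the function name `looks_starved` and the 5% threshold are hypothetical, not anything BOINC or Rosetta provides, and the two CPU-time readings would have to be taken by hand (for example from the CPU Time column in Task Manager).

```python
def looks_starved(cpu_before_s, cpu_after_s, wall_interval_s, min_fraction=0.05):
    """Return True if a task used less than min_fraction of one core.

    cpu_before_s / cpu_after_s: the task's accumulated CPU seconds at the
    start and end of the observation window (read manually).
    wall_interval_s: length of the observation window in seconds.
    min_fraction: hypothetical threshold; 0.05 means "under 5% of a core".
    """
    if wall_interval_s <= 0:
        raise ValueError("wall_interval_s must be positive")
    cpu_used_s = cpu_after_s - cpu_before_s
    return cpu_used_s / wall_interval_s < min_fraction


# A task shown as "running" that gained 0.2 CPU seconds in a minute
# is effectively getting no CPU; one that gained 58 seconds is fine.
stalled = looks_starved(100.0, 100.2, 60.0)
healthy = looks_starved(100.0, 158.0, 60.0)
```

If a check like this reports a starved task while System Idle is taking the CPU, that matches the symptom described above.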

Exiting (not just closing) and restarting BOINC causes it to get its head straight, and the task will be given CPU to finish its work.

When tasks actually get CPU as planned, I don't believe I've ever seen any that were not automatically ended by the watchdog once they exceeded the target runtime by 4hrs (see your Rosetta-specific preferences; the default runtime preference is 3hrs). So if you run default 3hr tasks and one runs more than 7hrs, the watchdog will catch that and end it for you... so long as BOINC is giving it CPU time to do so.
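The arithmetic above can be written out as a small sketch. The 3hr default and 4hr margin come from this post; the function names are hypothetical, and this is not the actual Rosetta watchdog code, just the rule as described here. Whether the real watchdog counts CPU time or wall-clock time is not stated, so the sketch simply takes "hours run" as an input.

```python
# Margin beyond the target runtime after which the watchdog steps in,
# as described in the post (an assumption about the real implementation).
WATCHDOG_MARGIN_HOURS = 4.0


def watchdog_cutoff_hours(target_runtime_hours=3.0):
    """Hours after which the watchdog is expected to end a task."""
    return target_runtime_hours + WATCHDOG_MARGIN_HOURS


def watchdog_should_end(hours_run, target_runtime_hours=3.0):
    """True if a task has run past the cutoff for its target runtime."""
    return hours_run > watchdog_cutoff_hours(target_runtime_hours)


cutoff = watchdog_cutoff_hours()            # default 3hr preference
overdue = watchdog_should_end(24.0)         # the 24-hour task in this thread
still_ok = watchdog_should_end(6.5)         # inside the 7hr window
```

By this rule a task at the default preference that has been running for more than 24 hours is far past the 7hr cutoff, which suggests it was not actually being given CPU time.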
Rosetta Moderator: Mod.Sense