Long-running and failing rb_06_21_* work units

Message boards : Number crunching : Long-running and failing rb_06_21_* work units

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 75804 - Posted: 25 Jun 2013, 2:50:25 UTC

Recently Rosetta@home downloaded a bunch of workunits with names starting with rb_06_21_ (this is on a Linux system with a 6-hour runtime preference).

Of these, 29 have finished: 22 completed successfully, while 7 resulted in a Computation Error.

Of these 7, 2 (588693210 and 588692717) failed early, after 1.5 hours and 3.5 hours, with an already reported error about torsion_angle_dof_id, and no output file appeared in the transfer window.

The remaining 5 failures all ceased computation after 10 hours, 10 minutes, and some number of seconds, but did produce an output file (like 588693276). The task reports state that they successfully completed some number of decoys but finished with a segmentation violation after calling boinc_finish.

Of the 22 successful tasks, most ran for the allotted time of 6 hours or thereabouts. However, 5 had a completion time of 10 hours and 10.xx minutes (like 588693293), the same as most of the failed tasks. I really don't see how valid tasks would also take precisely this amount of time: it seems a bit suspicious. Too suspicious, in fact.

Hope this helps someone figure out what's going on with these failed tasks.
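
The counts reported above can be cross-checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and the numbers are taken directly from the post, not from the project's database.

```python
# Outcome counts as reported in the post above (29 finished tasks).
finished = 29
succeeded = 22
failed = 7

failed_early = 2         # torsion_angle_dof_id error, no output file
failed_at_10h10m = 5     # segmentation violation after calling boinc_finish
succeeded_at_10h10m = 5  # valid results with the same ~10 h 10 min runtime

# The sub-counts should add up to the totals given in the post.
assert succeeded + failed == finished
assert failed_early + failed_at_10h10m == failed

tasks_at_10h10m = failed_at_10h10m + succeeded_at_10h10m
failure_rate = failed / finished
share_at_10h10m = tasks_at_10h10m / finished

print(f"failure rate: {failure_rate:.1%}")
print(f"tasks ending at ~10 h 10 min: {tasks_at_10h10m} of {finished} ({share_at_10h10m:.1%})")
```

In other words, roughly a third of the finished tasks stopped at the same ~10 h 10 min mark, whether or not they were reported as valid.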
ID: 75804 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 75828 - Posted: 8 Jul 2013, 17:59:58 UTC
Last modified: 8 Jul 2013, 18:04:32 UTC

I am getting much the same thing on my Ubuntu Linux cruncher with the rb_06 and rb_07 WUs.
Not all WUs generate an error.
I have run memtest and done other checks, with no problems found.
My Win 7 crunchbox does not have this problem.

process got signal 11
SIGSEGV: segmentation violation

They all get credit after a second pass through the validator [or whatever it is].
My runtime preference is 12 hours.
This problem has been going on for a long time.
I googled SIGSEGV and it looks like it is an application error,
so I think the app needs a bugfix for Linux.
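
For readers who want to confirm what "signal 11" means on their own machine, here is a small sketch using Python's standard signal module. The helper name describe_signal is made up for illustration; the module calls are standard.

```python
import signal

def describe_signal(number):
    """Return the symbolic name for a signal number, e.g. from a task's stderr log."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return f"unknown signal {number}"

# The task log above reports "process got signal 11".
print(describe_signal(11))
```

On Linux this prints SIGSEGV, matching the "segmentation violation" line in the log. The number alone does not say whether the fault lies in the application or in the hardware.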
ID: 75828 · Rating: 0
svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 75830 - Posted: 9 Jul 2013, 16:40:18 UTC

Seems like this is a general problem. Maybe the way to go on a Linux system is to use BOINC under Wine: does anyone have any experience using this?
ID: 75830 · Rating: 0
dskagcommunity
Joined: 6 Apr 11
Posts: 10
Credit: 4,208,914
RAC: 0
Message 75862 - Posted: 23 Jul 2013, 18:55:06 UTC
Last modified: 23 Jul 2013, 18:58:53 UTC

I had Rosetta running on my 100% stable file server. But after three Windows resets, which is totally unacceptable on this system, I deactivated the project temporarily. I thought it could be a new app running on Rosetta and looked at the page. Voilà: since the 18th there are new ones. It seems these rb units are running unstably.
DSKAG Austria Research Team: http://www.research.dskag.at



ID: 75862 · Rating: 0
dskagcommunity
Joined: 6 Apr 11
Posts: 10
Credit: 4,208,914
RAC: 0
Message 75866 - Posted: 24 Jul 2013, 15:20:15 UTC
Last modified: 24 Jul 2013, 15:25:39 UTC

Oh, it seems to have been a false alarm from me (it looks like an unlucky coincidence that they released a new kind of unit on exactly these days ^^), because the machine reset yesterday on LHC too, after I had stopped Rosetta. After years of running and running in a "dust protected" place, I looked into the server and there was a little compact, fine dust on the outer top and outer left of the CPU cooler. That is no problem for the CPU and the temperature-control software, but it seems to be a big one for the voltage controllers/stabilizers near the CPU, "under" the cooler in exactly that place. It is an older Intel board that runs stable like hell, but I presume the cooling design was made without much tolerance (for dust) in summer and with silent parts O.o, or the stabilizers are getting old and have lost tolerance over the years. It has run until now with no resets; I will see over the next days.
DSKAG Austria Research Team: http://www.research.dskag.at



ID: 75866 · Rating: 0
alan
Joined: 21 Mar 13
Posts: 3
Credit: 4,784
RAC: 0
Message 75868 - Posted: 25 Jul 2013, 17:36:41 UTC

I recently joined Rosetta@home. It is sending tasks, but every one fails to download.
What is going wrong?
ID: 75868 · Rating: 0
alan
Joined: 21 Mar 13
Posts: 3
Credit: 4,784
RAC: 0
Message 75869 - Posted: 25 Jul 2013, 17:39:20 UTC

I recently joined Rosetta@home. It is sending tasks, but every one fails to download.
What is going wrong?
ID: 75869 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,664,803
RAC: 11,191
Message 75870 - Posted: 25 Jul 2013, 20:12:46 UTC - in response to Message 75869.  

I recently joined Rosetta@home. It is sending tasks, but every one fails to download.
What is going wrong?


What antivirus are you running?
ID: 75870 · Rating: 0
svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 76176 - Posted: 7 Nov 2013, 1:01:02 UTC

Not wishing to tempt fate but this problem seems to have gone away with 3.48.
ID: 76176 · Rating: 0
Stefan_Strauß
Joined: 10 Sep 13
Posts: 4
Credit: 2,206
RAC: 0
Message 76212 - Posted: 23 Nov 2013, 17:40:16 UTC - in response to Message 76176.  
Last modified: 23 Nov 2013, 17:44:24 UTC

Not wishing to tempt fate but this problem seems to have gone away with 3.48.


Not for me. I'm using BOINC (version 7.2.7) on my Ubuntu 13.10 (64-bit) machine and every WU starting with "rb_" is failing. My runtime preference was set to two hours for various reasons. I recently had two of the "rb_" workunits running and they didn't finish. The difference between those WUs and the ones my computer finishes is the checkpoints. Every WU that sets checkpoints finishes, while most of the "rb_" ones don't set any checkpoints and keep working and working and working, even if the computer is not shut down after starting those WUs.

So I tried something out: I cancelled the two recent WUs that wouldn't end and set the runtime preference up to four hours. Now I got a new "rb_" WU, but this time, it sets checkpoints, so it's likely going to finish.



So it seems to be a checkpointing problem in my case. The marked zone on the screenshot was empty on the other two (failing) workunits.
ID: 76212 · Rating: 0
Stefan_Strauß
Joined: 10 Sep 13
Posts: 4
Credit: 2,206
RAC: 0
Message 76213 - Posted: 23 Nov 2013, 17:42:16 UTC - in response to Message 76176.  
Last modified: 23 Nov 2013, 17:43:17 UTC

Edit: Sorry for double-posting. :D
ID: 76213 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,664,803
RAC: 11,191
Message 76217 - Posted: 24 Nov 2013, 20:12:05 UTC

I think your run-time is causing the problem. I could be wrong, but I believe that BOINC (or maybe Rosetta in this instance?) will cancel the task if it runs for double the selected target run-time. I'd recommend increasing that to at least 4 hrs.

Danny
ID: 76217 · Rating: 0
Stefan_Strauß
Joined: 10 Sep 13
Posts: 4
Credit: 2,206
RAC: 0
Message 76219 - Posted: 25 Nov 2013, 8:43:45 UTC - in response to Message 76217.  
Last modified: 25 Nov 2013, 8:43:55 UTC

I think your run-time is causing the problem. I could be wrong, but I believe that BOINC (or maybe Rosetta in this instance?) will cancel the task if it runs for double the selected target run-time. I'd recommend increasing that to at least 4 hrs.

Danny


That's what I did (see my comment) and now the "rb_" WUs work just fine. :)
ID: 76219 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 76249 - Posted: 6 Dec 2013, 0:41:11 UTC

The runtime dependency for Rosetta is based on the Rosetta runtime preference. If this is exceeded by 4 hours, the "watchdog" will end the task and report it completed. By increasing the target runtime, you give some breathing room for Rosetta to decide if additional models can be completed or not, within the runtime target, and typically get a more consistent runtime.
Rosetta Moderator: Mod.Sense
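
The rule described above (target runtime plus 4 hours) can be compared with the "double the target runtime" guess made earlier in the thread. The sketch below is illustrative only: the function names are hypothetical and this is not the actual watchdog code, just the arithmetic as stated in the post.

```python
WATCHDOG_GRACE_HOURS = 4  # "exceeded by 4 hours", per the post above

def watchdog_cutoff(target_hours):
    """Cutoff under the rule described above: target runtime plus 4 hours."""
    return target_hours + WATCHDOG_GRACE_HOURS

def double_runtime_cutoff(target_hours):
    """Cutoff under the earlier guess in this thread: twice the target runtime."""
    return 2 * target_hours

# Runtime preferences mentioned in this thread: 2, 4, 6 and 12 hours.
for target in (2, 4, 6, 12):
    print(f"target {target:>2} h -> watchdog at {watchdog_cutoff(target):>2} h, "
          f"doubling rule would give {double_runtime_cutoff(target):>2} h")
```

With a 6-hour preference, the watchdog rule gives 10 hours, which is close to the 10 hours and 10 minutes reported in the first post, whereas the doubling rule would give 12 hours. The extra ten minutes or so is not explained anywhere in this thread.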
ID: 76249 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,664,803
RAC: 11,191
Message 76250 - Posted: 6 Dec 2013, 12:11:33 UTC - in response to Message 76249.  

The runtime dependency for Rosetta is based on the Rosetta runtime preference. If this is exceeded by 4 hours, the "watchdog" will end the task and report it completed. By increasing the target runtime, you give some breathing room for Rosetta to decide if additional models can be completed or not, within the runtime target, and typically get a more consistent runtime.

Ah - thanks for the clarification ;)
ID: 76250 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 28
Message 76346 - Posted: 13 Jan 2014, 20:25:39 UTC
Last modified: 13 Jan 2014, 21:05:20 UTC

I have this WU, an rb_01_10... etc. WU, on one of my systems. So far it has run for 50:09:53 and is showing 41.811% complete. The remaining time is showing "---". I'm fairly sure it is not actually doing anything, as I see the Windows idle process "using" 25% of my quad core. I've suspended it pending advice; the deadline is a week away.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 76346 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 28
Message 76347 - Posted: 14 Jan 2014, 16:32:16 UTC

Okay, no responses, so time for experiments. I removed the suspended status and suspended enough other projects so that the job would start again; my intention was to see what happened on suspending and then reactivating. It started again, but the elapsed time dropped from the previous high to 3:20:17, and the % complete to 30.495%. I'll leave it running but watch it more carefully.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
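
To put a number on how far the counters dropped, the two elapsed-time readings quoted in these posts can be compared as below. This is a sketch that assumes the readings are in H:MM:SS form; the helper names are made up for illustration, and the thread does not establish why the counters went backwards.

```python
def parse_elapsed(text):
    """Convert an 'H:MM:SS' elapsed-time string into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def format_elapsed(total_seconds):
    """Convert seconds back into an 'H:MM:SS' string."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

before = parse_elapsed("50:09:53")  # reading before the task was suspended
after = parse_elapsed("3:20:17")    # reading after it was resumed

print("elapsed time dropped by", format_elapsed(before - after))
print("percent complete dropped by", round(41.811 - 30.495, 3))
```

The elapsed time fell by nearly 47 hours while the percent complete fell by only about 11 points, so the two counters did not move back in proportion to each other.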
ID: 76347 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org