Message boards : Number crunching : Long-running and failing rb_06_21_* work units
Author | Message |
---|---|
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Recently Rosetta@home downloaded a bunch of workunits with names starting with rb_06_21_ (this on a Linux system with a 6 hr runtime preference). Of these, 29 have finished: 22 completed successfully, while 7 resulted in a Computation Error. Of the 7 failures, 2 (588693210 and 588692717) failed early, after 1.5 hours and 3.5 hours, with an already reported error about torsion_angle_dof_id, and no file appeared in the transfer window. The remaining 5 failures all ceased computation after 10 hours, 10 minutes, and some number of seconds but did produce an output file (like 588693276). The task report states that they successfully completed some number of decoys but finished with a segmentation violation after calling boinc_finish. Of the 22 successful tasks, most ran for the allotted time of 6 hours or thereabouts. 5, however, had a completion time of 10 hours and 10.xx minutes (like 588693293), the same as most of the failed tasks. I really don't see how valid tasks would also take precisely this amount of time: it seems a bit suspicious. Too suspicious, in fact. Hope this helps someone figure out what's going on with these failed tasks. |
.clair. Send message Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
I am getting much the same thing on my Ubuntu Linux cruncher with the rb_06 and rb_07 WUs, though not every WU generates an error. I have run memtest and done other checks with no problems found, and my Win 7 crunchbox does not have this problem. The error is: process got signal 11 SIGSEGV: segmentation violation. They all get credit after a second pass through the validator [or whatever it is]. My run time pref is 12 hours. This problem has been going on for a long time. I googled SIGSEGV and it looks like it is an application error, so I think the app needs a bugfix for Linux. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Seems like this is a general problem. Maybe the way to go on a Linux system is to use BOINC under Wine: does anyone have any experience using this? |
dskagcommunity Send message Joined: 6 Apr 11 Posts: 10 Credit: 4,208,914 RAC: 0 |
I had been running Rosetta on my 100% stable fileserver. But after three Windows resets, which is totally unacceptable on this system, I deactivated the project temporarily. I thought it could be a new app running on Rosetta and looked at the page. Voilà: since the 18th there are new ones. It seems these rb units are running unstable. DSKAG Austria Research Team: http://www.research.dskag.at |
dskagcommunity Send message Joined: 6 Apr 11 Posts: 10 Credit: 4,208,914 RAC: 0 |
Oh, it seems to have been a false alarm from me (looks like a bad coincidence that they released a new kind of unit on exactly these days ^^), because the machine reset yesterday on LHC too, after I had stopped Rosetta. After years of running and running in a "dust protected" place, I looked into the server and there was a little compact, fine dust on the outer top and outer left of the CPU cooler. No problem for the CPU and the temperature-controlling software, but it seems to be a big one for the voltage controllers/stabilizers near the CPU, "under" the cooler in exactly that place. It is an older Intel board that runs stable like hell, but I presume the cooling design was made without much tolerance (for dust) in summer and with silent parts O.o, or the stabilizers are getting old and have lost tolerance over the years. It has run until now with no resets; I will see over the next days. DSKAG Austria Research Team: http://www.research.dskag.at |
alan Send message Joined: 21 Mar 13 Posts: 3 Credit: 4,784 RAC: 0 |
I recently joined Rosetta@home. It is sending tasks but every one fails to download. What is going wrong? |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,664,803 RAC: 11,191 |
"I recently joined Rosetta@home. It is sending tasks but every one fails to download" What antivirus are you running? |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Not wishing to tempt fate but this problem seems to have gone away with 3.48. |
Stefan_Strauß Send message Joined: 10 Sep 13 Posts: 4 Credit: 2,206 RAC: 0 |
"Not wishing to tempt fate but this problem seems to have gone away with 3.48." Not for me. I'm using BOINC (version 7.2.7) on my Ubuntu 13.10 (64-bit) machine and every WU starting with "rb_" is failing. My runtime preference was set to two hours for various reasons. I recently had two of the "rb_" workunits running and they didn't finish. The difference between those WUs and the ones my computer finishes is the checkpoints. Every WU that sets checkpoints gets finished, while most of the "rb_" ones don't set any checkpoints and keep working and working and working, even if the computer is not shut down after starting those WUs. So I tried something out: I cancelled the two recent WUs that wouldn't end and raised the runtime preference to four hours. Now I got a new "rb_" WU, but this time it sets checkpoints, so it's likely going to finish. So it seems to be a checkpointing problem in my case. The marked zone on the screenshot was empty on the other two (failing) workunits. |
Stefan_Strauß Send message Joined: 10 Sep 13 Posts: 4 Credit: 2,206 RAC: 0 |
Edit: Sorry for double-posting. :D |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,664,803 RAC: 11,191 |
I think your run-time is causing the problem. I could be wrong, but I believe that BOINC (or maybe Rosetta in this instance?) will cancel the task if it runs for double the selected target run-time. I'd recommend increasing that to at least 4 hrs. Danny |
Stefan_Strauß Send message Joined: 10 Sep 13 Posts: 4 Credit: 2,206 RAC: 0 |
"I think your run-time is causing the problem. I could be wrong, but I believe that BOINC (or maybe Rosetta in this instance?) will cancel the task if it runs for double the selected target run-time. I'd recommend increasing that to at least 4 hrs." That's what I did (see my comment) and now the "rb_" WUs work just fine. :) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The runtime dependency for Rosetta is based on the Rosetta runtime preference. If this is exceeded by 4 hours, the "watchdog" will end the task and report it completed. By increasing the target runtime, you give some breathing room for Rosetta to decide if additional models can be completed or not, within the runtime target, and typically get a more consistent runtime. Rosetta Moderator: Mod.Sense |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,664,803 RAC: 11,191 |
"The runtime dependency for Rosetta is based on the Rosetta runtime preference. If this is exceeded by 4 hours, the "watchdog" will end the task and report it completed. By increasing the target runtime, you give some breathing room for Rosetta to decide if additional models can be completed or not, within the runtime target, and typically get a more consistent runtime." Ah - thanks for the clarification ;) |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28 |
I have this WU, an rb_01_10... etc WU, on one of my systems. So far it has run for 50:09:53 and is showing 41.811% complete. The remaining time is showing "---". I'm fairly sure it is not actually doing anything, as I see the Windows idle process "using" 25% of my quad core. I've suspended it pending advice; the deadline is a week away. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28 |
Okay, no responses, so time for experiments. I removed the suspended status and suspended enough other projects so the job would start again; my intention was to see what happened with the suspend then reactivate. It started again, but the elapsed time dropped from the previous high to 3:20:17, and the % complete to 30.495%. I'll leave it running but watch it more carefully. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
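
Editorial note: Mod.Sense's description of the watchdog (the task is ended once the runtime preference is exceeded by 4 hours) can be checked against the run times svincent reported. The sketch below is only arithmetic on the numbers quoted in this thread. The function names and the 15-minute slack are assumptions made for illustration; nothing here is taken from the Rosetta or BOINC source code.

```python
from datetime import timedelta

# Margin quoted by Mod.Sense in this thread; not read from the Rosetta source.
WATCHDOG_MARGIN = timedelta(hours=4)


def watchdog_cutoff(target_runtime: timedelta) -> timedelta:
    """Elapsed time after which the watchdog would end a task (hypothetical helper)."""
    return target_runtime + WATCHDOG_MARGIN


def looks_like_watchdog_stop(
    elapsed: timedelta,
    target_runtime: timedelta,
    slack: timedelta = timedelta(minutes=15),  # assumed tolerance, not a documented value
) -> bool:
    """True if the elapsed time sits just past the cutoff, within the assumed slack."""
    cutoff = watchdog_cutoff(target_runtime)
    return cutoff <= elapsed <= cutoff + slack


if __name__ == "__main__":
    target = timedelta(hours=6)  # svincent's runtime preference
    print(watchdog_cutoff(target))  # 10:00:00
    # svincent's tasks stopped at about 10 hours 10 minutes.
    print(looks_like_watchdog_stop(timedelta(hours=10, minutes=10), target))  # True
```

With a 6 hour preference the cutoff comes out at 10 hours, which is consistent with the tasks that stopped at roughly 10 hours 10 minutes. With a 2 hour preference the same rule gives 6 hours, and with a 12 hour preference 16 hours.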
©2024 University of Washington
https://www.bakerlab.org