Message boards : Number crunching : Problems with Rosetta version 5.78
Author | Message |
---|---|
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Not too much different in this app from previous version. Thanks for continuing to post problems! |
m.mitch Joined: 10 Feb 06 Posts: 34 Credit: 1,928,904 RAC: 0 |
Work unit 94392699 on computer 551987 has been stuck at 97.756% finished with about 00:9:54 to go for most of today. Unlike the last time this occurred to me, the CPU is at 100% use. However, the CPU time (done) is still only showing a bit over 7 hours. Is this a real problem? Click here to join the #1 Aussie Alliance on Rosetta |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Mike, is that task still working on model 1? What is your work unit runtime preference? (the default is 3hrs). ...sounding normal so far. Rosetta Moderator: Mod.Sense |
m.mitch Joined: 10 Feb 06 Posts: 34 Credit: 1,928,904 RAC: 0 |
No, that one finished after I went to bed. :-) No other problems so far. I expect a bit of a pause around the 10-minutes-to-go mark; this one just seemed to go on longer. Perhaps it snuck in a work unit from another project while I wasn't looking. Didn't see any in the messages though. Click here to join the #1 Aussie Alliance on Rosetta |
M.L. Joined: 21 Nov 06 Posts: 182 Credit: 180,462 RAC: 0 |
Result ID 104053613 Name profilin2_BOINC_MFR_ABRELAX_PICKED_2062_29191_0 Workunit 94455620 Created 3 Sep 2007 9:01:12 UTC Sent 3 Sep 2007 9:01:24 UTC Received 4 Sep 2007 12:25:47 UTC Server state Over Outcome Client error Client state Compute error Exit status 1 (0x1) Computer ID 510574 Report deadline 13 Sep 2007 9:01:24 UTC CPU time 0 stderr out <core_client_version>5.10.13</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # cpu_run_time_pref: 21600 ERROR:: Unable to obtain total_residue & sequence. start pdb file must be provided. ERROR:: Exit from: .input_pdb.cc line: 2956 </stderr_txt> ]]> Validate state Invalid Claimed credit 0 Granted credit 0 application version 5.78 AMD4800 dual core on W SP2 Home |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
I had two W.U.s of the same type finish short of time on my two systems. Both have the runtime set for 8hrs, and both stopped after only 4hrs. I have the projects switch every 2hrs; anyway, they haven't U/L'ed yet. Edit/ added: 1gidA_BOINC_MG_CHAINBREAK5_LRSCOREFIX_RNA_********** https://boinc.bakerlab.org/rosetta/workunit.php?wuid=94629940 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=94566211 Pete. |
drghughes Joined: 27 Apr 07 Posts: 7 Credit: 6,346 RAC: 0 |
I also have a work unit 94604566 that has been stuck at around 97.2% progress for several hours of CPU time. It has now been running for 5:47 compared to a normal run time of about 3 hours. I suspended it when the Rosetta problems started. Should I start it up again and let it run or should I abort it? |
mikus Joined: 7 Nov 05 Posts: 58 Credit: 700,115 RAC: 0 |
Had a problem with <https://boinc.bakerlab.org/rosetta/workunit.php?wuid=94715507> (not reported yet, since Rosetta is not yet accepting uploads). Noticed in gkrellm that one of my CPUs was idle (though boincmgr said that the workunit on that CPU was "running"). (If you can tell me where to send it, I have a tar of the slot directory.) Here is a copy of the stderr.txt from that slot directory: Graphics are disabled due to configuration... # cpu_run_time_pref: 28800 # random seed: 1285195 SIGSEGV: segmentation violation Stack trace (12 frames): [0x8d45107] [0x8d3fefc] [0x40000420] [0x8bb4bb4] [0x8c96f34] [0x84b6ee1] [0x80d8665] [0x85efeb3] [0x871f807] [0x871f8b2] [0x8da9454] [0x8048111] Exiting... SIGABRT: abort called Stack trace (23 frames): [0x8d45107] [0x8d3fefc] [0x40000420] [0x8db0514] [0x8dc53df] [0x8dca445] [0x8dca723] [0x8d9b171] [0x8d9cb99] [0x83f92c1] [0x8db0a5f] [0x8d45152] [0x8d3fefc] [0x40000420] [0x8bb4bb4] [0x8c96f34] [0x84b6ee1] [0x80d8665] [0x85efeb3] [0x871f807] [0x871f8b2] [0x8da9454] [0x8048111] Exiting... Would prefer it if applications which terminated abnormally would go away, rather than making the boinc client (Linux 32-bit 5.10.8) believe they are still "running". . |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Peter & drghughes: Some of the recent tasks sent out have long run times per model, some up to about 4 hours on 3GHz machines. So if your runtime preference is 8hrs, and your first model took 4.5hrs to complete, then beginning a second model would be predicted to take you over the 8hr preference by a significant amount, so Rosetta ends that task early rather than beginning the next model, which would almost certainly take longer. So Peter, it is normal for it to end early. drghughes, it is normal for tasks to sometimes take longer than your shorter runtime preference, because a task can't be marked as finished until it completes at least one model. The time to completion is really just an estimate based on your 3hr preference. Once tasks get down to <10min left, they just try to keep showing about 10min remaining, because they have no more accurate idea of when that first model will complete. Please let it run. mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well. It appears from the number of tasks outstanding that the project is accepting uploads and issuing downloads. I just had an upload go through about an hour ago. Keep in mind there are about 50,000 PCs out there that are all trying to report completed results and get more work. We just have to let it keep chugging and working through the backlog. Thanks for your patience. Rosetta Moderator: Mod.Sense |
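The behaviour described in the post above can be sketched in a few lines. This is only an illustration of the rule as the moderator explains it, not Rosetta's actual code: the function names, the use of the average time per completed model as the prediction, and the exact 10-minute floor are all assumptions.

```python
def should_start_next_model(elapsed_s, models_done, runtime_pref_s):
    """Decide whether to begin another model (hypothetical rule).

    At least one model must always be completed. After that, the next
    model is predicted to take as long as the average so far, and it is
    only started if it would still fit inside the runtime preference.
    """
    if models_done == 0:
        return True  # a task cannot finish before its first model is done
    predicted_next_s = elapsed_s / models_done
    return elapsed_s + predicted_next_s <= runtime_pref_s


def displayed_remaining(elapsed_s, runtime_pref_s, floor_s=600):
    """Estimated time to completion, held at about 10 minutes (assumed
    floor) once the estimate based on the preference would drop below it."""
    return max(runtime_pref_s - elapsed_s, floor_s)


if __name__ == "__main__":
    hour = 3600
    # 8hr preference, first model took 4.5hrs: do not start a second one.
    print(should_start_next_model(4.5 * hour, 1, 8 * hour))
    # 3hr preference, first model still running after 5hrs: estimate is held.
    print(displayed_remaining(5 * hour, 3 * hour))
```

Under these assumptions the first call reports that no second model is started, and the second shows the estimate held at the floor, which matches the task that appears stuck with about ten minutes to go.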
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Mod Sense. Fair enough answer, thanks. Pete. |
mikus Joined: 7 Nov 05 Posts: 58 Credit: 700,115 RAC: 0 |
Mod.Sense wrote: "mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well." From this, should I conclude that Rosetta is not interested in one of its applications experiencing a segmentation violation, nor in the fact that, whereas I expected a failing task to go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)? . |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Quoted: "mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well." Not at all. I was simply trying to point out that BOINC manager controls which tasks run and when, so I'm not sure there is anything that the Rosetta team can do to improve things. It's probably going to require a fix to the BOINC code so that it can pick up and schedule other work accordingly. Rosetta Moderator: Mod.Sense |
mikus Joined: 7 Nov 05 Posts: 58 Credit: 700,115 RAC: 0 |
Quoted (Mod.Sense): "Not at all. I was simply trying to point out that BOINC manager controls which tasks run and when, so I'm not sure there is anything that the Rosetta team can do to improve things. It's probably going to require a fix to the BOINC code so that it can pick up and schedule other work accordingly." Quoted (Mod.Sense, earlier): "mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well." Quoted (mikus): "From this, should I conclude that Rosetta is not interested in one of its applications experiencing a segment violation, nor that whereas my expectation is that a failing task would go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks.)" It may well be that BOINC code needs to be upgraded to handle this unusual situation: an application task "dispatched" by BOINC which does not use any CPU. BUT it is likely that the existing BOINC code expected that an application task which (according to the task's stderr.txt) had received (SIGSEGV + SIGABRT) would perform a "final exit". My question is: did the Rosetta application task do that? (If yes, then BOINC dropped the ball; but if no, then it was the application that did not do what BOINC expected.) That is why I would like to send the snapshot of the slot directory to someone at Rosetta (if I knew where to send it), so Rosetta people can check how far the application had gotten. mikus p.s. By the way, I now see that when I "aborted" the task to get it out of the ready queue, only the "abort" shows in the result's stderr field, overwriting the task's previously accumulated stderr output. Also, I believe boincmgr is merely the 'GUI' to the BOINC client; the client can (and does) run perfectly well if boincmgr has been closed.
So while the BOINC manager *can* control the application tasks (I issued the "abort" from boincmgr), it is the client which performs the details of task scheduling. Unfortunately, I believe the principal means the client has to keep track of what the tasks are doing is to track their CPU consumption. When faced with a task that does not consume CPU, I think the current BOINC *will* lose track. . |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
mikus, you can EMail your files to me at the moderator contact EMail address, and I will forward them to the project team for you. Yes, my terminology needs a little refinement. Most users do not know the difference between the two BOINC pieces, so they don't notice my misuse of terms. Two questions for you, perhaps just include them in the EMail. What is your runtime preference? (actually that probably shows in the output file), and do you have any idea how long it was in the "running" state, but not using CPU time? Rosetta Moderator: Mod.Sense |
drghughes Joined: 27 Apr 07 Posts: 7 Credit: 6,346 RAC: 0 |
Mod.Sense, Thanks. I let it run and it finished at about 5 h 57 mins. Perhaps you could include a sticky note telling people about the "10 minutes to completion" rule. That would have been useful to know. Also, the latest work unit that I've received has an initial "To completion" of 5 h 57 mins. Is this coincidence or do new work units take the CPU Time of the last work unit as their initial To completion estimate? Again, this would be useful to know since it would explain why the actual run time might not match the estimate. |
M.L. Joined: 21 Nov 06 Posts: 182 Credit: 180,462 RAC: 0 |
Result ID 104434245 Name t030__BOINC_CAPRI14_DOCK_FIXBACKBONE-t030_-nosillyloop_nodimerloop_plexinmonomer__2066_697_0 Workunit 94766131 Created 9 Sep 2007 23:57:14 UTC Sent 10 Sep 2007 0:01:53 UTC Received 10 Sep 2007 14:05:17 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 510574 Report deadline 20 Sep 2007 0:01:53 UTC CPU time 13821.375 stderr out <core_client_version>5.10.20</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 21600 # random seed: 1280434 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score -218.075 for 900 seconds ********************************************************************** GZIP SILENT FILE: .xxt030.out </stderr_txt> ]]> Validate state Valid Claimed credit 56.4278965225222 Granted credit 20 application version 5.78 |
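The stderr output above shows a watchdog ending the run after the score stayed at the same value for 900 seconds. A minimal sketch of that kind of check follows. It is an assumption about how such a watchdog could work, not the Rosetta implementation: the class name, the `observe` method, and the exact-equality comparison of scores are all hypothetical; only the 900-second limit comes from the log.

```python
class ScoreWatchdog:
    """Flags a run whose score has not changed for `limit_s` seconds."""

    def __init__(self, limit_s=900.0):
        self.limit_s = limit_s      # 900 s is the limit shown in the log
        self.last_score = None      # most recent distinct score seen
        self.last_change = None     # time at which that score first appeared

    def observe(self, score, now):
        """Record a score sample taken at time `now` (seconds).

        Returns True when the score has been unchanged for at least
        `limit_s` seconds, i.e. when the run should be ended.
        """
        if self.last_score is None or score != self.last_score:
            self.last_score = score
            self.last_change = now
            return False
        return now - self.last_change >= self.limit_s


if __name__ == "__main__":
    wd = ScoreWatchdog()
    for t in (0.0, 300.0, 600.0, 900.0):
        if wd.observe(-218.075, t):
            print(f"Stuck at score {wd.last_score} for {t - wd.last_change:.0f} seconds")
```

In this sketch the caller supplies the clock value, which keeps the check easy to exercise without waiting; a real application would presumably sample a monotonic clock from a separate thread.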
Christoph Jansen Joined: 6 Jun 06 Posts: 248 Credit: 267,153 RAC: 0 |
Same here too: "Rosetta score is stuck or going too long. Watchdog is ending the run!" On these WUs: wuid=94910696 wuid=94910692 wuid=94910691 wuid=94770968 |
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
Is this sort of thing supposed to be happening frequently as, at the moment, my four machines are doing quite a bit of work < 6.5 hrs and then coming up with <core_client_version>5.10.20</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 28800 # random seed: 1276748 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score 303.464 for 900 seconds ********************************************************************** GZIP SILENT FILE: .xx1he8.out </stderr_txt> ]]> Taken from Here and giving next to nothing in credit (not that that bothers me, just wondering if there's something amiss !!) Anyone else ? Now message has been moved I see there are others. |
BitSpit Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=104493983 https://boinc.bakerlab.org/rosetta/result.php?resultid=104493982 https://boinc.bakerlab.org/rosetta/result.php?resultid=104511814 https://boinc.bakerlab.org/rosetta/result.php?resultid=104511813 Watchdog killed these after the score got stuck for 900 seconds. It only seemed to affect the Windows machines. The Linux ones ran just fine. |
Zxian Joined: 17 May 07 Posts: 18 Credit: 1,173,075 RAC: 0 |
I've also had several WU's come out with only 20 granted credit, regardless of how long the WU actually ran for. This is on several different computers with different versions of Windows (XP, 2003). |
©2025 University of Washington
https://www.bakerlab.org