Message boards : Number crunching : Report problems with Rosetta version 5.34
Author | Message |
---|---|
Tobie Joined: 1 Sep 06 Posts: 2 Credit: 11,856 RAC: 0 |
My result - 44001776 - exceeded preferred time. <core_client_version>5.4.11</core_client_version> <stderr_txt> # random seed: 1263112 # cpu_run_time_pref: 10800 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 45890.3 seconds. Greater than 4X preferred time: 10800 seconds ********************************************************************** GZIP SILENT FILE: .xx1hz6.out </stderr_txt> and ... 44015673 ... had -161 error. <core_client_version>5.4.11</core_client_version> <stderr_txt> # random seed: 1341101 No heartbeat from core client for 31 sec - exiting # random seed: 1341101 No heartbeat from core client for 31 sec - exiting # random seed: 1341101 No heartbeat from core client for 31 sec - exiting # random seed: 1341101 No heartbeat from core client for 31 sec - exiting # random seed: 1341101 No heartbeat from core client for 31 sec - exiting Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 0 starting structures built 9 (nstruct) times This process generated 0 decoys from 0 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... </stderr_txt> <message> <file_xfer_error> <file_name>1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_JUMP_RELAX_SAVE_ALL_OUT__1306_5595_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> </message> |
Edwin Joined: 29 Mar 06 Posts: 4 Credit: 69,961 RAC: 0 |
My system is constantly freezing up with the new versions. It started happening more frequently from version 5.32. I don't mind giving unused processing time to Rosetta, but it must not make my system unstable / unreliable. Can somebody help me with this? If there is no solution, I am considering removing the Rosetta program. Tony, how do I reveal my PC to you so you can see what I'm doing? About the freezing up, to be more precise: Rosetta is freezing. I get these sorts of reports very often (this is the last one that occurred): 27/10/2006 11:22:57|rosetta@home|Unrecoverable error for result FRA_2rio_154E_hom001_1_2rio_1_1zy4A_IGNORE_THE_REST_149_1305_21_0 ( - exit code 1073807364 (0x40010004)) Perhaps we can have contact via e-mail? Please tell me what kind of information you need from me to help me. Edwin |
Mats Petersson Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
My system is constantly freezing up with the new versions. It started happening more frequently from version 5.32. I don't mind giving unused processing time to Rosetta, but it must not make my system unstable / unreliable. Can somebody help me with this? If there is no solution, I am considering removing the Rosetta program. Your machine is already visible through the profile.... Judging by this: - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x7C901230 - Registers - eax=00000000 ebx=00000001 ecx=53cd7a8b edx=b5e6008f esi=00000000 edi=00000010 eip=7c901230 esp=039bfd04 ebp=0000000f cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202 - Callstack - ChildEBP RetAddr Args to Child 039bfd00 009a54ce 50568787 00000000 205f0140 7c802442 ntdll!_DbgBreakPoint@0+0x0 FPO: [0,0,0] 039bfdd8 009a596e 039bff34 2019be88 ffffffff 7c802442 rosetta_5.32_windows_intelx86!get_the_hell_out+0x6 (c:cygwinhomechurosettarosetta_c++boincrosetta_5.32watchdog.cc:150) 039bff6c 009a59a7 00000000 009d6c91 00000000 50568527 rosetta_5.32_windows_intelx86!main_watchdog+0x1b (c:cygwinhomechurosettarosetta_c++boincrosetta_5.32watchdog.cc:252) 039bff74 009d6c91 00000000 50568527 010b0778 0287f8f8 rosetta_5.32_windows_intelx86!main_watchdog_windows+0x7 (c:cygwinhomechurosettarosetta_c++boincrosetta_5.32watchdog.cc:57) 039bffac 009d6d36 7c91056d 7c80b683 0287f8f8 010b0778 rosetta_5.32_windows_intelx86!_callthreadstartex+0x6 (f:rtmvctoolscrt_bldself_x86crtsrcthreadex.c:348) 039bffb4 7c80b683 0287f8f8 010b0778 7c91056d 0287f8f8 rosetta_5.32_windows_intelx86!_threadstartex+0x5 (f:rtmvctoolscrt_bldself_x86crtsrcthreadex.c:326) 039bffec 00000000 009d6cb7 0287f8f8 00000000 00000008 kernel32!_BaseThreadStart@8+0x0 (f:rtmvctoolscrt_bldself_x86crtsrcthreadex.c:326) It looks like the watchdog killed your process by executing an "int3" instruction. Why this happens, I don't know. 
There are two possibilities: there's a bug in Rosetta, or your machine is for some reason not operating correctly. All the other threads appear to be in a "reasonable" state, but I'm not an expert on the debugging and internal workings of Rosetta... -- Mats |
Edwin Joined: 29 Mar 06 Posts: 4 Credit: 69,961 RAC: 0 |
My system is constantly freezing up with the new versions. It started happening more frequently from version 5.32. I don't mind giving unused processing time to Rosetta, but it must not make my system unstable / unreliable. Can somebody help me with this? If there is no solution, I am considering removing the Rosetta program. Thanks Mats, I have no problem at all with my PC. All other programs run perfectly without any error. It seems to me that Rosetta is causing some problems. They started occurring from 5.32. Before that (5.24) everything was fine. Edwin |
anders n Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
This WU was stopped by the watchdog. https://boinc.bakerlab.org/rosetta/result.php?resultid=44047361 The thing is that it was not stuck. It made sloooow progress, at about 20 steps/hour. Is the watchdog supposed to work like this? Anders n |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
...Is the watchdog supposed to work like this? Yes. The watchdog has (at least) two reasons it might end a task. One is that it has maintained the same score for an extended period of time (i.e. it is "stuck", and typically the graphic would show the steps are NOT progressing). Another is that the task runs for more than 4x your runtime preference. WU name was: 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES_ALL_BOND_DISTANCES_SAVE_ALL_OUT__1306_10496_0 That WU says: CPU time: 88223.5 seconds. Greater than 4X preferred time: 21600 seconds So, it ran for a little over 24hrs, and your preference is 6hrs, so it was ended. This happens more for people with a very short runtime preference. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
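The 4x rule described above can be checked against the numbers quoted in this thread. The following is only a rough sketch, not Rosetta's actual watchdog code: the function names and the regular expression are made up for illustration, and the only facts taken from the posts are the wording of the stderr message and the factor of 4.

```python
import re

# Factor taken from the "Greater than 4X preferred time" message in the thread.
WATCHDOG_FACTOR = 4


def parse_watchdog_message(stderr_text):
    """Return (cpu_seconds, preferred_seconds) from a watchdog message, or None.

    Hypothetical helper: the pattern only mirrors the message text quoted above.
    """
    match = re.search(
        r"CPU time:\s*([\d.]+)\s*seconds\.\s*"
        r"Greater than 4X preferred time:\s*(\d+)\s*seconds",
        stderr_text,
    )
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def exceeded_time_limit(cpu_seconds, preferred_seconds, factor=WATCHDOG_FACTOR):
    """True when the CPU time is beyond factor x the runtime preference."""
    return cpu_seconds > factor * preferred_seconds


message = "CPU time: 88223.5 seconds. Greater than 4X preferred time: 21600 seconds"
cpu, pref = parse_watchdog_message(message)
print(f"ran {cpu / 3600:.1f} h, limit {WATCHDOG_FACTOR * pref / 3600:.0f} h")
print(exceeded_time_limit(cpu, pref))
```

The same check applied to the first report in the thread (45890.3 seconds against a 10800 second preference) also comes out over the limit, since 4 x 10800 = 43200.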
Christoph Joined: 10 Dec 05 Posts: 57 Credit: 1,512,386 RAC: 0 |
...Is the watchdog supposed to work like this? Exactly the same... My WU (1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES_SAVE_ALL_OUT__1306_12333_0) has now been running for 6 hours and the steps are increasing very slowly. I'll abort it now. |
BiloxiPete Joined: 27 Jun 06 Posts: 1 Credit: 515,091 RAC: 0 |
My 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES_ALL_BOND_DISTANCES_SAVE_ALL_OUT__1306_14368_0 is also advancing extremely slowly. At 9hrs 5min: Model 3, Step 396211. At 10hrs 18min: Model 3, Step 396237. Time to completion is increasing 5-7 seconds per update interval, and % complete is stuck at 4.40%. |
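For comparison with the "about 20 steps/hour" figure reported earlier, the two readings above can be turned into a rate. This is a small illustrative calculation; the function name is made up and the only inputs are the step counts and times from the post.

```python
def steps_per_hour(step_start, step_end, minutes_start, minutes_end):
    """Average step rate between two (step, elapsed-minutes) readings."""
    elapsed_hours = (minutes_end - minutes_start) / 60
    return (step_end - step_start) / elapsed_hours


# Readings from the post: step 396211 at 9h05m, step 396237 at 10h18m.
rate = steps_per_hour(396211, 396237, 9 * 60 + 5, 10 * 60 + 18)
print(f"{rate:.1f} steps/hour")  # 26 steps in 73 minutes -> 21.4 steps/hour
```

So this work unit is crawling at roughly the same pace as the one the watchdog stopped, which suggests the same slow-progress behaviour rather than a true stall.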
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This result is odd: it finished well inside my runtime pref. It still had 3hrs to go. I don't know why, but it is still valid. A bug? https://boinc.bakerlab.org/rosetta/result.php?resultid=44099174 |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
This task ran for less than a second (on a 667MHz cpu) and in the output file seems to have restarted itself 4x in that time. No complaints as I got the credit I claimed for it |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
This result is odd it finished well inside my runtime pref. Still Did it *actually* run for ~7hrs or ~10hrs - did you happen to notice? What is odd is that it says this in stderr: DONE :: 1 starting structures built 39 (nstruct) times This process generated 28 decoys from 28 attempts and usually the nstruct number matches the number of decoys. If you multiply the run time by 39/28 you get close to your pref. This may be an irrelevant coincidence or may be a significant clue, but I am far from clear what it might mean. |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
River, that is the amount of time it ran for: CPU time 25234 sec. It came out of preempt, ran for about a minute, then finished. As you can see from my results, before I changed my runtime they were always close to the 8hrs. |
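River's 39/28 observation can be worked through with the CPU time given here. This is just the arithmetic, under the assumption (not confirmed in the thread) that run time scales linearly with the number of structures; the variable names are illustrative.

```python
cpu_seconds = 25234   # CPU time reported for the result
nstruct = 39          # "built 39 (nstruct) times" from stderr
decoys = 28           # "generated 28 decoys from 28 attempts"

actual_hours = cpu_seconds / 3600
# Hypothetical projection: what the run time would be if all 39 had been built.
projected_hours = cpu_seconds * nstruct / decoys / 3600

print(f"actual:    {actual_hours:.2f} h")     # 7.01 h
print(f"projected: {projected_hours:.2f} h")  # 9.76 h
```

The projected figure of just under 10 hours fits the "~10hrs" River mentions and the "3hrs to go" in the original report, though as River says this may still be a coincidence.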
netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
Rosetta 5.34 has a few new features to allow us to test more accurate energy functions and more interesting variations in the protein's bond geometry. Let us know if you see any problems -- especially if they are reproducible! Here is a bit more of the stuff I am seeing with 5.34 ... It's another of the 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES jobs https://boinc.bakerlab.org/rosetta/result.php?resultid=44140906 Below is part of a stdout.txt file from Linux-2.6 ... The message that concerns me is below the URL... (the stdout URL lists many of these).... I am wondering... are these bad workunits????? and BTW.. It's past its bedtime... http://web.hotiron.net/pics/johng/38941838-partial-stdout.txt WARNING:: cant find phi but not a chainbreak? ====================================================== ====================================================== ====================================================== DANGER!!!! DANGER!!!! DANGER!!!! DANGER!!!! DANGER!!!! ====================================================== ====================================================== ====================================================== DANGER!!!! DANGER!!!! DANGER!!!! DANGER!!!! DANGER!!!! ====================================================== ====================================================== ====================================================== pose_minimize:: Big score_delta when turning on the nblist: 1.43719 Looking for a team ??? Join BoincSynergy!! |
Keith Akins Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
This WU appears to be progressing much better. 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_JUMP_RELAX_SAVE_ALL_OUT__1306_40278_0 |
EW-3 Joined: 1 Sep 06 Posts: 27 Credit: 2,561,427 RAC: 0 |
curious, do you want us to report problems, or does the server see the problems and can fetch our results? Thanks, |
Conan Joined: 11 Oct 05 Posts: 151 Credit: 4,244,078 RAC: 128 |
>> Since turning off the graphics I can now process work units to completion and not get lock ups with the workunit freezing and the cpu dropping back to zero. This started happening on Ralph 5.32 and then Rosetta 5.32 and has continued into 5.34 on both. If I leave the computer running and the Boinc screensaver comes on then it is only a matter of time (can be as short as 4% completed and less than 30 minutes) before the screen/computer no longer does anything. > With 5.32 the job often but not always kept running in the background and when I was able to release the screensaver (often needed a reboot for this), the workunit then errored out. > Now with 5.34 it shows in Task Manager as 'not responding' and I have to end the process, which of course causes the workunit to error out. > With the Boinc Screensaver off all is working well and the work units are finishing (both Ralph and Rosetta), even if a standard screensaver comes on (not the Boinc one), so now I can turn off the monitor knowing that by moving the mouse again I can get my screen back and not have to reboot. > I like the graphics and would like to use them (I also run other projects on the same machine, so have lost the graphics for them as well now they are turned off), so can this problem please be looked at? It has been occurring for about a month now, since 5.32. |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
River, that is the amount of time it ran for CPU time 25234 sec. Yes, sorry to ask again, but was the run time *stuck* at 25234 for maybe 3 hrs before it went into pre-empt? I am sure the cpu time shown was around the 7hr mark, but was it stuck not using the cpu for longer? When I have seen these they have typically been stuck for a while. If stuck they will not upload at all unless pre-empted (with remove-from-memory) or with BOINC stopped and restarted. I am not sure if you and I have seen the same issue, or different issues. This task of mine was showing 2hrs 04min in BOINCview, and BV indicated that it was running slow (usually means stopped). It stuck at the same time for a few min, and a look in the messages showed it had been restarted several hours previously, so should have been over 5hrs cpu by then. I stopped and started BOINC. The task then showed 1hr 59min 44sec, stuck for a minute or so, then went to 1hr 58min and some sec and uploaded. I think this is the same behaviour as you saw after your task restarted - so I am wondering was it the same before it was pre-empted, was it stuck for hours being the running task but not actually using the cpu and not incrementing the clock? You would only know if you happened to notice when the task was previously loaded, or if you'd looked in the messages to see. Like yours, my task shows a lot more nstruct than decoys - a lot more in my case. One more thing would be useful - on noticing that the clock has stopped it is useful to use top (linux) or task manager (win) to see if the task is using the cpu. Sadly I forgot to do that here :( R~~ |
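On Linux, the check River suggests (is the task actually using the cpu?) can also be scripted by sampling the process's accumulated CPU time twice. This is a minimal sketch under stated assumptions: it relies on the standard /proc/<pid>/stat layout, the function names and the 0.1 second threshold are illustrative choices, and you would need to find the Rosetta task's pid yourself (e.g. from top).

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Total user + system CPU seconds from one /proc/<pid>/stat line (Linux)."""
    # The command name sits in parentheses and may contain spaces,
    # so split on the last ')' before breaking the rest into fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    utime_ticks = int(fields[11])  # field 14 of the full line (user time)
    stime_ticks = int(fields[12])  # field 15 of the full line (system time)
    return (utime_ticks + stime_ticks) / ticks_per_second


def read_cpu_seconds(pid):
    """Read the current CPU seconds for a pid; Linux only."""
    with open(f"/proc/{pid}/stat") as handle:
        return cpu_seconds_from_stat(handle.read(), os.sysconf("SC_CLK_TCK"))


def is_using_cpu(before_seconds, after_seconds, min_delta=0.1):
    """True if CPU time advanced by at least min_delta between two samples."""
    return (after_seconds - before_seconds) >= min_delta
```

A typical use would be to call `read_cpu_seconds(pid)`, wait a minute with `time.sleep(60)`, call it again, and pass both values to `is_using_cpu`. If that returns False while BOINC lists the task as running, the task is in the "stuck but loaded" state described above.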
Adam Gajdacs (Mr. Fusion) Joined: 26 Nov 05 Posts: 13 Credit: 2,884,155 RAC: 1,191 |
Got four of these: 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES... currently cached on my client which is struggling to finish just the first one for at least a day now, stuck at 48.4%, Model 5, AB Initio (jumping). It's not actually stuck, progressing about 1 step in every few minutes, but at this rate it doesn't seem to be able to reach the next checkpoint in the 8-10 hours during which my computer is on on an average day, so it's effectively stuck. My CPU time preference is at 3 hours, and yet, one of those WUs ran about 7 hours just today without moving an inch ahead. Seeing the same warnings in the stdout.txt as netwraith in a few posts below. I guess I'll have to discard them eventually. |
Aidan Sonoda Joined: 4 Mar 06 Posts: 2 Credit: 3,505,865 RAC: 0 |
Since the upgrade to 5.34 my error rate has gone through the roof, across four of my five hosts. Validate Errors ("Rosetta score is stuck or going too long" eg result) seem to be the most prevalent, though compute errors ("Access Violation" eg result and "cannot find the path specified" result) and WUs that get single digit credit after 8 hours cpu time have also occurred (result). I am not overly concerned about the credit, but feel I am not contributing as much as I might be to the project. Please advise as to whether there is something I need to fix at my end. Aidan Sonoda ~Coimhéad fearg fhear na foighde.~ In necessariis unitas, in dubiis libertas, in omnibus caritas. AidanSonoda@ftml.net |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I am not overly concerned about the credit, but feel I am not contributing as much as I might be to the project. Please advise as to whether there is something I need to fix at my end. Aidan, others will advise if you need to change anything at your end - thanks for asking. However I'd like to reassure you that task errors do not mean there is a lack of contribution. One of the main missions of this project is to improve algorithms used in protein structure computing. That mission is *more* important than crunching the structures. So if you get tasks that self destruct, so long as you let them report back to the project (as you are doing) that is all valuable and on-mission. And, although you say you are not mainly after the credit, the team behind this project recognise that credit is important to some people, so most errored tasks end up being given manual credit based on the value of the debugging feedback to the project and the time your boxes spent. River~~ |
©2025 University of Washington
https://www.bakerlab.org