Message boards : Number crunching : Problems with Minirosetta 1.80
Author | Message |
---|---|
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. Another problem task, it seemed to be in a loop going nowhere. I aborted it after 4 hrs, and another of the same type. real_core_1.5_low200_beta_low200_start_hb_t308__IGNORE_THE_REST_13186_100_0. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=238909195 Model:0 Step:52800 pete. |
gazzawazza Joined: 4 May 07 Posts: 28 Credit: 297,648 RAC: 0 |
Hi all. I'm still getting the odd computation error (please see previous thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4933). However, this has been the only WU failure since 21st June 2009. The symptoms are that a task repeatedly restarts (having exited with zero status but no 'finished' file), then when it completes the output file is absent (or at least that's what is being reported in the BOINC client logs). My other projects seem to be running without issue. My current setup is BOINC 6.6.36 (running as a service) on Vista Home Premium SP2 (32-bit), running Rosetta 1.80. I do have Kaspersky Antivirus 2009 installed, but real-time scanning was disabled for the entire time this latest WU was running (I only mention this because I know that A/V programs have been implicated in other crunching problems, e.g. files getting locked). Regards, Gary |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
Another problem task, it seemed to be in a loop going nowhere. I have also had a real_core going in a loop to nowhere so I have aborted it. 261825496 |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I was just randomly checking the tasks I have lined up and went to look at the graphics of lb_cutback_all_multi_hb_t326__IGNORE_THE_REST_2GK3A_3_12956_21_0. The native structure and low energy structure windows were working fine, but none of the other windows had any structures or plots showing. On one occasion the search window showed a graphic for all of a second or two. It also says "stage unknown" for the kind of work it is doing. The line representations in the two working windows move and change position. Also, the accepted energy value is not a number but 1.#QNAN, and for accepted RMSD it shows 1.#QO. Here is a screenshot: |
ByRad Joined: 12 Apr 08 Posts: 8 Credit: 15,865,146 RAC: 1,078 |
BOINC Manager message: 2009-06-28 23:07:24 rosetta@home task lr_score12_snase_run02_rlbn_yfsong_3BDC-ASN100LYS_SAVE_ALL_OUT_NATIVE_NOCON_12975_3093_0 aborted by user. I aborted this task because after about 1.5 h of work it was still at 5.3% (normally it is about 40). I then checked the graphics for this task: model:2 step:70. I checked again about an hour later and it was still at model:2, step:70... An infinite loop... (Normally I crunch 50 to 100 models in about 3 h!) |
WinterWasp Joined: 16 Jun 09 Posts: 2 Credit: 11,905 RAC: 0 |
Is it normal, that a task completes successfully, gets verified as ok and grants almost double the asked credits despite the log being almost flooded with not a number and value out of range errors? wRMSF_1_5_core_jumps_mixcst2_hb_t374__IGNORE_THE_REST_12929_921_1 is the task in question. |
Venturini Dario[VENETO] Joined: 25 May 07 Posts: 22 Credit: 245,028 RAC: 0 |
dom 28 giu 2009 22:30:23 CEST|rosetta@home|Output file real_core_1.5_low200_beta_low200_start_hb_t322__IGNORE_THE_REST_13290_313_0_0 for task real_core_1.5_low200_beta_low200_start_hb_t322__IGNORE_THE_REST_13290_313_0 absent. This WU errored out after 8 hours of crunching (it was supposed to take 4). To me it seems like the "real_core" ones have a fairly high failure rate... |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I've had a number of real_core_1.5_low200_beta_low200_start_ WUs go 4 hours past my runtime and they were presumably ended by the watchdog. They all claim 1 decoy and were marked invalid. https://boinc.bakerlab.org/rosetta/result.php?resultid=261815816 https://boinc.bakerlab.org/rosetta/result.php?resultid=261768023 https://boinc.bakerlab.org/rosetta/result.php?resultid=261765649 https://boinc.bakerlab.org/rosetta/result.php?resultid=261722487 |
Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0 |
Been getting errors on my Mac too with 1.80. |
xsc2 Joined: 9 Jul 08 Posts: 4 Credit: 62,354 RAC: 0 |
This WU crashed with exit status: 1 (0x1) https://boinc.bakerlab.org/rosetta/result.php?resultid=262078278 |
Steve Dodd Joined: 13 Dec 05 Posts: 7 Credit: 3,811,680 RAC: 2,283 |
Just adding to the rest of the comments here. I'm also experiencing issues with WUs that begin with "lb_cutback_all_multi...". It seems that the app is ignoring the preferences file for maximum time per WU. Mine's set at 4 hours, but these are running over 8 hrs and still going. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Steve, thanks for the info. Just to clarify, the setting in the Rosetta preferences is not a maximum time per work unit. It is a target runtime. Having said that, the program checks periodically to ensure the task seems to be progressing normally, and at the end of each model it checks whether the runtime would allow another model or not. The "watchdog" should take action on any task that runs longer than the runtime preference plus 4 hours. Since it doesn't waste time checking this all of the time, it may take another 15 min. or so after that. So, your task just reached the point where the system should have taken action itself. With all of these reports, it sounds like there are some new tasks that have lengthy models, and perhaps some new issues with the watchdog as well. Keep the details coming. Rosetta Moderator: Mod.Sense |
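The runtime rule described in the post above can be sketched in Python. This is only an illustration of the rule as stated (target runtime plus 4 hours), not the actual Rosetta watchdog code. The function names are hypothetical, and the end-of-model check assumes the next model takes the average time of the models so far, which the post does not specify.

```python
# Illustrative sketch only; names and the averaging heuristic are assumptions.
WATCHDOG_GRACE_SECONDS = 4 * 3600  # "runtime preference plus 4 hours"


def watchdog_should_end(cpu_seconds: float, target_runtime_seconds: float) -> bool:
    """Return True once a task has run longer than target runtime + 4 hours."""
    return cpu_seconds > target_runtime_seconds + WATCHDOG_GRACE_SECONDS


def can_start_another_model(cpu_seconds: float, models_done: int,
                            target_runtime_seconds: float) -> bool:
    """Hypothetical end-of-model check.

    Assumes the next model will take the average time of the models
    finished so far, and allows it only if it fits in the target runtime.
    """
    if models_done == 0:
        return True  # nothing to average yet, so try at least one model
    average_model_seconds = cpu_seconds / models_done
    return cpu_seconds + average_model_seconds <= target_runtime_seconds


if __name__ == "__main__":
    target = 4 * 3600  # a 4-hour target runtime, as in Steve's report
    print(watchdog_should_end(8 * 3600 + 60, target))       # just past 8 hours
    print(can_start_another_model(3 * 3600, 3, target))     # three 1-hour models
```

With a 4-hour target, a task only crosses the watchdog threshold after 8 hours of CPU time, which matches the point Steve's tasks had just reached.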
lusvladimir Joined: 18 Oct 05 Posts: 12 Credit: 1,784,854 RAC: 0 |
Errors for tasks: real_core_1.5_low200_beta_low200_start_hb https://boinc.bakerlab.org/result.php?resultid=261781005 https://boinc.bakerlab.org/result.php?resultid=261750967 https://boinc.bakerlab.org/result.php?resultid=261750701 https://boinc.bakerlab.org/result.php?resultid=261750699 Ended by the watchdog. Marked invalid. |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
task 262080735 ended after 16 hours, which happens to be my cpu_run_time_pref + 4 hours. And then there was a <file_xfer_error>. BOINC:: CPU time: 57669.2s, 14400s + 43200s[2009- 6-29 14:48:55:] :: BOINC Output exists: default.out.gz InternalDecoyCount: 0 (GZ) ====================================================== DONE :: 1 starting structures 57670.2 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== called boinc_finish </stderr_txt> <message> <file_xfer_error> <file_name>real_core_1.5_low200_beta_low200_start_hb_t286__IGNORE_THE_REST_13040_508_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> |
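The numbers in that log line can be checked with a few lines of Python. This is a sketch under one assumption taken from AdeB's own description: the two values after the CPU time are read as the 4-hour watchdog allowance (14400 s) and the cpu_run_time_pref (43200 s, i.e. 12 hours). The function name and the regular expression are made up for this example and are not part of BOINC or Rosetta.

```python
import re

# The relevant fragment of the stderr output quoted in the post above.
LOG_LINE = "BOINC:: CPU time: 57669.2s, 14400s + 43200s"


def parse_watchdog_line(line: str) -> tuple[float, int, int]:
    """Extract (cpu_seconds, first_term, second_term) from the log fragment."""
    match = re.search(r"CPU time:\s*([\d.]+)s,\s*(\d+)s\s*\+\s*(\d+)s", line)
    if match is None:
        raise ValueError("line does not look like a watchdog CPU time report")
    return float(match.group(1)), int(match.group(2)), int(match.group(3))


cpu_seconds, grace_seconds, pref_seconds = parse_watchdog_line(LOG_LINE)
limit_seconds = grace_seconds + pref_seconds

if __name__ == "__main__":
    print(f"limit: {limit_seconds} s = {limit_seconds / 3600:.1f} h")
    print(f"CPU time over the limit: {cpu_seconds - limit_seconds:.1f} s")
```

Read this way, the limit is 57600 s (16 hours) and the task was stopped about 69 s past it, consistent with the task ending at cpu_run_time_pref + 4 hours.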
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This one only ran for 1 sec, and has errored for others. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=237150212 calbindin_BOINC_ABRELAX_4xBIN_1xCYCLES_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--calbindin-_12935_707_2 |
Path7 Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
The next task: real_core_1.5_low200_beta_low200_start_hb_t308__IGNORE_THE_REST_13046_407_0. Didn't switch to another application after 1 hour – ran on for over 7 hours. Didn't stop after the runtime preference of 6 hours – was ended by the watchdog after 10 hours. Didn't checkpoint regularly – after a reboot at 9 hours of runtime, the WU restarted from 2 hours of runtime. The good thing: Outcome: Success. Path7. |
Venturini Dario[VENETO] Joined: 25 May 07 Posts: 22 Credit: 245,028 RAC: 0 |
2 more real_core ran far over the 4 hours boundary, both ended after 8 hours, one successful, the other one errored out: mar 30 giu 2009 09:51:33 CEST|rosetta@home|Output file real_core_1.5_low200_beta_low200_start_hb_t368__IGNORE_THE_REST_13036_638_0_0 for task real_core_1.5_low200_beta_low200_start_hb_t368__IGNORE_THE_REST_13036_638_0 absent Error is always code -161 |
Venturini Dario[VENETO] Joined: 25 May 07 Posts: 22 Credit: 245,028 RAC: 0 |
Another real_core with strange behaviour: This is when I turned the PC on this morning (54% completed because it ran yesterday for some hours). And this is 2 minutes later (5% because somehow it reset itself, including CPU time). Btw, now it's at 6% after 37 minutes, which means it will need some 16 x 37 minutes to reach 100%, which means more than 8 hours, when the target time is set at 4. I'm having these errors both on my laptop (Core2Duo 7700, Vista Home Premium, BOINC 6.4.5) and my desktop (AMD 3800x2, Ubuntu 9.04 64bit, BOINC 6.6.28) |
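The "16 x 37 minutes" estimate in the post above is a straight-line extrapolation from the progress percentage. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the result is only as trustworthy as the percentage shown by the BOINC Manager, which is itself an estimate.

```python
def projected_total_minutes(elapsed_minutes: float, percent_done: float) -> float:
    """Linearly extrapolate total runtime from elapsed time and percent complete."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in the range (0, 100]")
    return elapsed_minutes * 100.0 / percent_done


if __name__ == "__main__":
    total = projected_total_minutes(37, 6)  # 6% after 37 minutes
    print(f"projected total: {total:.0f} min = {total / 60:.1f} h")
```

For 6% after 37 minutes this gives roughly 617 minutes, a little over 10 hours; the poster's rounded 16 x 37 figure comes out slightly lower, but both are well beyond the 4-hour target.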
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Folks, please do not take % complete and time to completion as any indication of a Rosetta problem. It is simply an estimate that your BOINC manager is making. This takes a number of factors into account, including the speed of your machine and the time it took your last task to complete. So if your last task ran long, the % on the next task MAY (or MAY NOT) reflect that, or part of that information. BOINC tries not to presume all tasks are the same and sometimes looks at the runtimes of the last several tasks as a frame of reference. If you restart a task, you should be looking at the change in elapsed time as the indication of what checkpoint (if any) the task was able to restart from. Rosetta Moderator: Mod.Sense |
Venturini Dario[VENETO] Joined: 25 May 07 Posts: 22 Credit: 245,028 RAC: 0 |
Folks, please do not take % complete and time to completion as any indication of a Rosetta problem. It is simply an estimate that your BOINC manager is making. This takes a number of factors into account, including the speed of your machine and the time it took your last task to complete. So if your last task ran long, the % on the next task MAY (or MAY NOT) reflect that, or part of that information. BOINC tries not to presume all tasks are the same and sometimes looks at the runtimes of the last several tasks as a frame of reference. Agreed with that, but I think I have enough experience to understand when there is a problem and when not. I'll write some more details down: 1) The WU arrived yesterday at 12:52. 2) All of my WUs are started within a few hours of their arrival because I don't have any cache and the PC is set to be always connected. Therefore 3) that WU started being crunched yesterday in the middle of the afternoon. 4) I turned off the PC for the night when that WU had reached 54% completion (yes, I'm a nerd and I check how work is going on my PC). 5) I restarted it today and saw that WU being crunched but making no progress. 6) I checked the graphics and saw nothing (see posted image #1 in my previous post). 7) I waited a few minutes and saw the WU's percentage drop to 5%. I checked the CPU time and it said 25 minutes (while it ran for hours the day before). 8) I reported to your thread. Also 9) the WU is still running; the percentage is increasing but the time is long overdue. It should have been 4 hours, it's already 5 1/2 and the progress bar indicates 55.22%. As you can see, I (and BOINC) made a fairly accurate prediction, because at this speed it will end in 9 hours. Of course the watchdog will kill it after 8, but hey, not that I can do anything about it. 10) I am trying to see the graphics of that WU, but the window pops up without syncing to the WU. The graphics window freezes and I have to terminate it from the task manager. So now 11) I'm going to let that WU run until completion and hope that you will find something useful in the output, be it for medicine or for the improvement of the application. P.S. Oh, and about the checkpoint thing: the elapsed time for that WU changed from 5 hours to 25 minutes. Is it meant to be this way? |