Message boards : Number crunching : Minirosetta v1.32 bug thread
Author | Message |
---|---|
Philosopher2 Send message Joined: 28 Mar 06 Posts: 3 Credit: 111,037 RAC: 0 |
Please post bugs/issues with minirosetta v1.32 here. I downloaded and installed v1.32 and the WU came up! WU 2reb_JUMPRELAX_PREDJUMP_FROMPREDFRAG_SAVE_ALL_OUT-2re_-_4420_740_0 has been running. This WU completed 95 percent of processing in approx. 3 to 4 hours. From 95.360 percent, I have observed that it has taken 9 hours of processing to move up to 98.809 percent! This WU is targeted to complete on 8/22/08! At this rate of progress I wonder if it will! Is this the predicted behaviour? The time to completion has moved from 00:09:51 to 00:09:54 during these last five days. I am running two other BOINC applications, hence time is available sequentially for 50 minutes to each application. Please advise: should I abort or let it carry on till the whatever? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Philosopher2, it all sounded normal up until you said it has been running for 5 days. But since you are running other projects, we would need to look at the CPU time used to understand how much of that time this task was actually running. You see, the initial runtime is just an estimate. The actual runtime is related to the runtime preference in your Rosetta preferences (3 hrs is the default). And this is a target, not a hard and fast limit. Rosetta will do its best to complete within that target time if possible. But it is not always possible. In cases where it is not possible to complete in your desired runtime, the time estimate will get down to about 11 minutes and then move exponentially slower until it completes and jumps to 100%. There is a watchdog thread that will check in on the tasks every 15 minutes or so and see if it thinks things are running normally or not. I suggest the following: If that task has 5 or more days of CPU time (120 hours), which is pretty unlikely, then abort it. If it has run for 9 hours, let it run. If not, take a look at what your Rosetta preferences have configured for the venue of that PC, and post back here with the details of both your preference and the actual CPU time you see for the task. The watchdog thread will end the task if it runs longer than 5 times your runtime preference. So that would be 15 hours with the defaults. You should follow the same guideline. I would also suggest you review your computing preferences and check the box to keep tasks in memory while suspended. Since you are switching projects every 50 minutes, you will be losing a lot of work if you do not keep the tasks in memory. Oh, and you shouldn't have to download anything manually. You said you downloaded 1.32. I wasn't positive if you meant that you did this manually, or if BOINC did this during the normal file transfers. Rosetta Moderator: Mod.Sense |
lusvladimir Send message Joined: 18 Oct 05 Posts: 12 Credit: 1,784,854 RAC: 0 |
Please post bugs/issues with minirosetta v1.32 here. Debian Linux;Boinc Manager 6.2.14 errors from 1.32 tasks https://boinc.bakerlab.org/rosetta/result.php?resultid=185499939 https://boinc.bakerlab.org/rosetta/result.php?resultid=185493038 https://boinc.bakerlab.org/rosetta/result.php?resultid=185493027 https://boinc.bakerlab.org/rosetta/result.php?resultid=185493026 https://boinc.bakerlab.org/rosetta/result.php?resultid=185489375 <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> # cpu_run_time_pref: 3600 needs psipred_ss2 to run filters needs psipred_ss2 to run filters SIGSEGV: segmentation violation Stack trace (19 frames): [0x8926f8f] [0x89514e0] [0xb7f19400] [0x880c924] [0x834349c] [0x88bcc81] [0x880c5d6] [0x829591c] [0x85f20f6] [0x8072b5d] [0x807e2e7] [0x8165bee] [0x80abecc] [0x80a9ea4] [0x80d7044] [0x80d8651] [0x804b9f8] [0x89acfdc] [0x8048111] Exiting... </stderr_txt> ]]> |
Philosopher2 Send message Joined: 28 Mar 06 Posts: 3 Credit: 111,037 RAC: 0 |
Thank you, Moderator. I have changed the preference to 120 minutes per application run. The application will remain in memory as suggested. This WU has already been running (CPU time) for 15 hours and it is only 98.900 percent done! Should I let it go on? I am a bit curious whether it will complete by 22 Aug. Take care. Philosopher2 |
joergent Send message Joined: 17 Feb 08 Posts: 1 Credit: 32,031 RAC: 0 |
Every morning for the last week, I have found the computer frozen, with minirosetta on the task bar. I have detached from the project to see what happens with the other projects, just for info. In addition, every time my screen saver is running with minirosetta and the screen has turned black, the PC (Windows XP SP3) cannot be returned to its previous state. The Rosetta screen appears, but the mouse does not work and I can barely use the keyboard to shut down the PC, which goes very slowly until Rosetta is killed. Rosetta has been disabled on my PC!!! |
Terrasapiens Send message Joined: 25 Apr 08 Posts: 15 Credit: 368,919 RAC: 0 |
I've been having a lot of WU errors with minirosetta ever since version 1.28 and have now had 5 in the past two days. Here's the link to my WUs showing the recent failures: https://boinc.bakerlab.org/rosetta/results.php?userid=254884 I've also had to do a hard shutdown and reboot several times recently after the RAH screen saver apparently locked up the machine. Not sure if v1.32 or 5.98 was running at the time. This seemed to happen after I changed the options setting so the screen would go to black after a few minutes. I undid the setting and have had no application crashes since then, but I'm not totally sure the crashes were due to that change. |
David Emigh Send message Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
{...} There is a workaround for this: press Ctrl + Shift + Esc to force the Task Manager open, then carefully move the mouse around until you can "find" it in the Task Manager window, at which point you can kill the screensaver process without having to shut down the computer. As always, YMMV, but I've not lost any crunching time using this method. But the simplest solution is to not use the BOINC screensaver... Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Terrasapiens Send message Joined: 25 Apr 08 Posts: 15 Credit: 368,919 RAC: 0 |
I disabled the screensaver and will see if this minimizes the WUs crashing as well. Thanks for the info. |
Terrasapiens Send message Joined: 25 Apr 08 Posts: 15 Credit: 368,919 RAC: 0 |
I decided to disable the BOINC screensaver yesterday to see if that would have any effect on the number of failed WUs, but it didn't seem to make any difference. Two more failed today: https://boinc.bakerlab.org/rosetta/result.php?resultid=186050178 https://boinc.bakerlab.org/rosetta/result.php?resultid=186005807 |
Jim Wilkins Send message Joined: 5 Feb 08 Posts: 1 Credit: 4,513 RAC: 0 |
I successfully completed a 1.32 run but had a lot of this message in my stderr file: needs psipred_ss2 to run filters Is that a problem? Thanks, Jim |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
Compute error in this workunit. stderr out: <core_client_version>5.10.45</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> # cpu_run_time_pref: 43200 ERROR: NANs occured in hbonding! ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763 called boinc_finish </stderr_txt> ]]> |
mitrichr Send message Joined: 23 May 07 Posts: 44 Credit: 1,005,660 RAC: 0 |
The graphic is freezing, meaning, I assume, that the WU is a dead fish. I am getting this on three machines so far. I have to abort too often. >>RSM http://sciencespringe.wordpress.com http://facebook.com/sciencesprings |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Just found this one bombed in my results https://boinc.bakerlab.org/rosetta/result.php?resultid=184820750 1ughI_BOINC_ABINITIO_IGNORE_THE_REST-S25-13-S3-3--1ughI-_4309_84_1 Exit status 1 (0x1) CPU time 1.328125 stderr out <core_client_version>6.2.16</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> ERROR: Cannot find file 'minirosetta_databasechemical/residue_type_sets/fa_standard/CSD_ATOM_TYPE_SET fa_standard' ERROR:: Exit from: ....srccorechemicalresidue_io.cc line: 132 called boinc_finish </stderr_txt> ]]> |
David Emigh Send message Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
The graphic is freezing, meaning, I assume, that the WU is a dead fish. {...} Not necessarily. Try this workaround. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
lusvladimir Send message Joined: 18 Oct 05 Posts: 12 Credit: 1,784,854 RAC: 0 |
<core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) I observed that the Rosetta model I was processing failed with this error after an ntp daemon resynch on my Linux machine. The system clock, when adjusted during a routine resynch, caused the running model to fail because its understanding of time steps changed outside of the model. I temporarily stopped the ntp daemon and no longer see this error. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I temporarily stopped the ntp daemon and no longer see this error. Have you seen it fail during a resynch before? Consistently? Am I correct to presume that if the resynch did not cause a change to the clock, then there is no problem? Do you have any perspective on whether resynch caused failures on older BOINC releases? Which Linux distribution are you running? Do you configure the machine to run at 100% of CPU? Or less? Rosetta Moderator: Mod.Sense |
lusvladimir Send message Joined: 18 Oct 05 Posts: 12 Credit: 1,784,854 RAC: 0 |
I temporarily stopped the ntp daemon and no longer see this error. Thanks, and sorry for my English; it is not my native language. Debian Linux, kernel 2.6.26-1-686 SMP. BOINC Manager 6.2.14. The machine is configured to run at 100% CPU. In the Linux system log, the ntp time change: Aug 23 03:09:12 alpha ntpd[13389]: time reset -0.175490 s Aug 23 03:09:33 alpha ntpd[13389]: synchronized to 77.234.200.98, stratum 4 Aug 23 03:10:29 alpha ntpd[13389]: synchronized to 87.236.24.179, stratum 2 And in the BOINC stderr.txt at the same time (I have task_debug set on): 23-Aug-2008 03:07:27 [rosetta@home] Started download of boinc_homfrags_aa1pxuA03_05.200_v1_3.gz 23-Aug-2008 03:08:05 [rosetta@home] [task_debug] result abinitio_only62_A_1bq9A_4438_2605_0 checkpointed 23-Aug-2008 03:08:44 [rosetta@home] [task_debug] result abinitio_only62_A_1vcc__4438_3676_0 checkpointed 23-Aug-2008 03:08:45 [rosetta@home] [task_debug] result abinitio_only62_A_1vcc__4438_3676_0 checkpointed 23-Aug-2008 03:09:23 [rosetta@home] [task_debug] result abinitio_only62_A_2chf__4434_6914_0 checkpointed 23-Aug-2008 03:09:38 [rosetta@home] [task_debug] result abinitio_homfrag_71_A_2hboA_4443_1214_0 checkpointed 23-Aug-2008 03:09:52 [rosetta@home] Finished download of boinc_homfrags_aa1pxuA03_05.200_v1_3.gz 23-Aug-2008 03:09:52 [rosetta@home] Started download of boinc_homfrags_aa1pxuA09_05.200_v1_3.gz 23-Aug-2008 03:10:32 [rosetta@home] [task_debug] result abinitio_only62_A_1bq9A_4438_2605_0 checkpointed 23-Aug-2008 03:10:44 [rosetta@home] [task_debug] result abinitio_only62_A_1bq9A_4438_2605_0 checkpointed 23-Aug-2008 03:10:56 [rosetta@home] [task_debug] result abinitio_only62_A_1vcc__4438_3676_0 checkpointed 23-Aug-2008 03:11:28 [rosetta@home] Sending scheduler request: To fetch work. 
Requesting 3081 seconds of work, reporting 0 completed tasks 23-Aug-2008 03:11:31 [rosetta@home] [task_debug] result abinitio_only62_A_2chf__4434_6914_0 checkpointed 23-Aug-2008 03:11:33 [rosetta@home] Scheduler request succeeded: got 1 new tasks 23-Aug-2008 03:11:33 [rosetta@home] [task_debug] result state=NEW for abinitio_only62_A_1ptq__4438_5437_0 from handle_scheduler_reply 23-Aug-2008 03:11:34 [rosetta@home] [task_debug] result state=FILES_DOWNLOADING for abinitio_only62_A_1ptq__4438_5437_0 from CS::update_results 23-Aug-2008 03:12:00 [rosetta@home] [task_debug] result abinitio_homfrag_71_A_2hboA_4443_1214_0 checkpointed 23-Aug-2008 03:12:11 [rosetta@home] [task_debug] Process for abinitio_only62_A_2chf__4434_6914_0 exited 23-Aug-2008 03:12:11 [rosetta@home] [task_debug] task_state=EXITED for abinitio_only62_A_2chf__4434_6914_0 from handle_exited_app 23-Aug-2008 03:12:11 [rosetta@home] [task_debug] result state=COMPUTE_ERROR for abinitio_only62_A_2chf__4434_6914_0 from CS::report_result_error 23-Aug-2008 03:12:11 [rosetta@home] [task_debug] exit status 193 23-Aug-2008 03:12:11 [rosetta@home] Computation for task abinitio_only62_A_2chf__4434_6914_0 finished 23-Aug-2008 03:12:11 [rosetta@home] Output file abinitio_only62_A_2chf__4434_6914_0_0 for task abinitio_only62_A_2chf__4434_6914_0 absent 23-Aug-2008 03:12:11 [rosetta@home] [task_debug] result state=COMPUTE_ERROR for abinitio_only62_A_2chf__4434_6914_0 from CS::app_finished 23-Aug-2008 03:12:11 [rosetta@home] Starting abinitio_only62_A_1pgx__4438_2667_0 23-Aug-2008 03:12:12 [---] [task_debug] ACTIVE_TASK::start(): forked process: pid 4030 23-Aug-2008 03:12:12 [rosetta@home] [task_debug] task_state=EXECUTING for abinitio_only62_A_1pgx__4438_2667_0 from start 23-Aug-2008 03:12:12 [rosetta@home] Starting task abinitio_only62_A_1pgx__4438_2667_0 using minirosetta version 132 23-Aug-2008 03:12:13 [rosetta@home] [task_debug] Process for abinitio_homfrag_71_A_2hboA_4443_1214_0 exited 23-Aug-2008 03:12:13 
[rosetta@home] [task_debug] task_state=EXITED for abinitio_homfrag_71_A_2hboA_4443_1214_0 from handle_exited_app 23-Aug-2008 03:12:13 [rosetta@home] [task_debug] result state=COMPUTE_ERROR for abinitio_homfrag_71_A_2hboA_4443_1214_0 from CS::report_result_error 23-Aug-2008 03:12:13 [rosetta@home] [task_debug] exit status 193 23-Aug-2008 03:12:13 [rosetta@home] Computation for task abinitio_homfrag_71_A_2hboA_4443_1214_0 finished 23-Aug-2008 03:12:13 [rosetta@home] Output file abinitio_homfrag_71_A_2hboA_4443_1214_0_0 for task abinitio_homfrag_71_A_2hboA_4443_1214_0 absent 23-Aug-2008 03:12:13 [rosetta@home] [task_debug] result state=COMPUTE_ERROR for abinitio_homfrag_71_A_2hboA_4443_1214_0 from CS::app_finished 23-Aug-2008 03:12:13 [rosetta@home] Starting abinitio_homfrag_71_A_2hl7A_4443_1633_0 23-Aug-2008 03:12:13 [---] [task_debug] ACTIVE_TASK::start(): forked process: pid 4042 23-Aug-2008 03:12:13 [rosetta@home] [task_debug] task_state=EXECUTING for abinitio_homfrag_71_A_2hl7A_4443_1633_0 from start 23-Aug-2008 03:12:13 [rosetta@home] Starting task abinitio_homfrag_71_A_2hl7A_4443_1633_0 using minirosetta version 132 23-Aug-2008 03:12:17 [rosetta@home] [task_debug] Process for abinitio_only62_A_1pgx__4438_2667_0 exited 23-Aug-2008 03:12:17 [rosetta@home] [task_debug] task_state=EXITED for abinitio_only62_A_1pgx__4438_2667_0 from handle_exited_app 23-Aug-2008 03:12:17 [rosetta@home] [task_debug] result state=COMPUTE_ERROR for abinitio_only62_A_1pgx__4438_2667_0 from CS::report_result_error 23-Aug-2008 03:12:17 [rosetta@home] [task_debug] exit status 193 23-Aug-2008 03:12:17 [rosetta@home] Computation for task abinitio_only62_A_1pgx__4438_2667_0 finished 23-Aug-2008 03:12:17 [rosetta@home] Output file abinitio_only62_A_1pgx__4438_2667_0_0 for task abinitio_only62_A_1pgx__4438_2667_0 absent 23-Aug-2008 03:12:17 [rosetta@home] [task_debug] result state=COMPUTE_ERROR for abinitio_only62_A_1pgx__4438_2667_0 from CS::app_finished 23-Aug-2008 03:12:17 [rosetta@home] 
Starting abinitio_only62_A_1cc8A_4438_3695_0 23-Aug-2008 03:12:18 [---] [task_debug] ACTIVE_TASK::start(): forked process: pid 4061 23-Aug-2008 03:12:18 [rosetta@home] [task_debug] task_state=EXECUTING for abinitio_only62_A_1cc8A_4438_3695_0 from start 23-Aug-2008 03:12:18 [rosetta@home] Starting task abinitio_only62_A_1cc8A_4438_3695_0 using minirosetta version 132 23-Aug-2008 03:12:21 [rosetta@home] [task_debug] Process for abinitio_homfrag_71_A_2hl7A_4443_1633_0 exited 23-Aug-2008 03:12:21 [rosetta@home] [task_debug] task_state=EXITED for abinitio_homfrag_71_A_2hl7A_4443_1633_0 from handle_exited_app 23-Aug-2008 03:12:21 [rosetta@home] [task_debug] result state=COMPUTE_ERROR for abinitio_homfrag_71_A_2hl7A_4443_1633_0 from CS::report_result_error 23-Aug-2008 03:12:21 [rosetta@home] [task_debug] exit status 193 I looked at my old Rosetta results (my own stats): with BOINC 5.10.45 and the 5.96 Rosetta client I had 4 errors in a month. After upgrading BOINC to version 6.2.X and the new minirosetta app, I see many more errors if ntp is on. With the ntp daemon stopped, everything works fine without errors. I tried running ntpdate manually (not the daemon, only a single sync with the time server): after the sync, two workunits failed, and then it worked again without errors. I have been running Rosetta for 3 years, and I do not know which part of my system causes the problem: kernel, BOINC manager, science app, or ntp. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
lusvladimir, thank you for all the details. One more question, have you run any other projects when the time change is negative? I mean, do tasks from other projects have a similar problem? Rosetta Moderator: Mod.Sense |
lusvladimir Send message Joined: 18 Oct 05 Posts: 12 Credit: 1,784,854 RAC: 0 |
lusvladimir, thank you for all the details. One more question, have you run any other projects when the time change is negative? I mean, do tasks from other projects have a similar problem? I crunch rosetta@home only, but as an experiment to help resolve this problem I will attach to another project for Linux platforms and report the results after 1-2 days. Thank you. |
lusvladimir Send message Joined: 18 Oct 05 Posts: 12 Credit: 1,784,854 RAC: 0 |
lusvladimir, thank you for all the details. One more question, have you run any other projects when the time change is negative? I mean, do tasks from other projects have a similar problem? Mod.Sense, thank you for the advice about negative time!!! I read more of the manual about time synchronization and was able to tune my system so that the time shift is very, very small (milliseconds over several hours) and always positive. The NTP daemon now does not need to synchronize the time often, and Rosetta workunits run without errors. I did not reproduce the error on another project (Einstein@Home), but too little time has passed. I will continue to monitor the state of the system, and if errors occur I will post how to reproduce them. |
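
For anyone who wants to check their own logs for the pattern lusvladimir describes (an ntpd "time reset" followed shortly by tasks ending with exit status 193), here is a minimal sketch in Python. It is only an illustration, not a project tool: the function names are made up for this example, the regular expressions are modelled on the log excerpts quoted in this thread and may not match other ntpd or BOINC versions, the year has to be supplied by hand because the syslog lines do not carry one, and the 5-minute window is an arbitrary assumption. A match shows that the two events were close in time; it does not prove that one caused the other.

```python
import re
from datetime import datetime, timedelta

# Patterns modelled on the excerpts quoted in this thread (assumption:
# other ntpd/BOINC versions may format these lines differently).
NTP_RESET = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ ntpd\[\d+\]: time reset ([+-]?\d+\.\d+) s"
)
BOINC_EXIT = re.compile(
    r"^(\d\d-\w{3}-\d{4} \d\d:\d\d:\d\d) \[[^\]]*\] \[task_debug\] exit status (\d+)"
)


def parse_ntp_resets(lines, year):
    """Return (timestamp, offset_seconds) for each ntpd 'time reset' line."""
    resets = []
    for line in lines:
        m = NTP_RESET.match(line)
        if m:
            # Syslog timestamps have no year, so the caller supplies it.
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            resets.append((when, float(m.group(2))))
    return resets


def parse_boinc_exits(lines):
    """Return (timestamp, exit_status) for each task_debug 'exit status' line."""
    exits = []
    for line in lines:
        m = BOINC_EXIT.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S")
            exits.append((when, int(m.group(2))))
    return exits


def failures_after_resets(resets, exits, window=timedelta(minutes=5), status=193):
    """Pair each clock reset with matching task exits inside the window after it."""
    hits = []
    for reset_time, offset in resets:
        for exit_time, code in exits:
            if code == status and reset_time <= exit_time <= reset_time + window:
                hits.append((reset_time, offset, exit_time))
    return hits


# Sample lines copied from the post above, one log entry per line.
NTP_SAMPLE = """\
Aug 23 03:09:12 alpha ntpd[13389]: time reset -0.175490 s
Aug 23 03:09:33 alpha ntpd[13389]: synchronized to 77.234.200.98, stratum 4
Aug 23 03:10:29 alpha ntpd[13389]: synchronized to 87.236.24.179, stratum 2
""".splitlines()

BOINC_SAMPLE = """\
23-Aug-2008 03:10:56 [rosetta@home] [task_debug] result abinitio_only62_A_1vcc__4438_3676_0 checkpointed
23-Aug-2008 03:12:11 [rosetta@home] [task_debug] exit status 193
23-Aug-2008 03:12:13 [rosetta@home] [task_debug] exit status 193
23-Aug-2008 03:12:17 [rosetta@home] [task_debug] exit status 193
23-Aug-2008 03:12:21 [rosetta@home] [task_debug] exit status 193
""".splitlines()

resets = parse_ntp_resets(NTP_SAMPLE, year=2008)
exits = parse_boinc_exits(BOINC_SAMPLE)
hits = failures_after_resets(resets, exits)

for reset_time, offset, exit_time in hits:
    delay = (exit_time - reset_time).total_seconds()
    print(f"reset {offset:+.6f} s at {reset_time}, exit status 193 {delay:.0f} s later")
```

To run it against real files, replace the two sample lists with the lines of your system log and of the BOINC stderr file. Widening or narrowing `window` changes how loosely the two events are paired, so it is worth trying a couple of values before drawing conclusions.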