Problems and Technical Issues with Rosetta@home

Author	Message
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 86772 - Posted: 30 Jun 2017, 13:57:47 UTC - in response to Message 86770. I have picked up a couple, but fortunately they fail after only 8 seconds. It has not been a big problem for me.

TJ Volunteer moderator Project developer Project scientist Joined: 22 Oct 10 Posts: 9 Credit: 216,670 RAC: 0	Message 86773 - Posted: 30 Jun 2017, 17:17:02 UTC - in response to Message 86769. Sorry about that. These are my jobs and while rushing I made a mistake. The CYB atoms were supposed to be stripped before the job was submitted. -TJ

zombie3 Joined: 13 May 17 Posts: 1 Credit: 362,588 RAC: 0	Message 86782 - Posted: 4 Jul 2017, 10:59:23 UTC Last modified: 4 Jul 2017, 11:05:15 UTC Hi, all my Rosetta mini jobs are failing with a computation error. Could you please check what's going on with my workunits? Outcome: Computation error; Client state: Compute error; Exit status: -185 (0xFFFFFF47) ERR_RESULT_START. Please note, this issue only started happening today (4 July 2017); it was fine before that. Thanks, Zombie3
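For readers wondering how the two forms of the exit status relate: -185 and 0xFFFFFF47 are the same value, once as a signed number and once as its unsigned 32-bit (two's complement) hex form. A minimal Python sketch of the conversion follows; the function names are made up for illustration and are not part of BOINC.

```python
def to_unsigned32(status: int) -> str:
    """Render a signed exit status as an unsigned 32-bit hex string."""
    return f"0x{status & 0xFFFFFFFF:08X}"


def to_signed32(hex_text: str) -> int:
    """Convert a 32-bit hex string back to the signed exit status."""
    value = int(hex_text, 16)
    # If the top bit is set, the value represents a negative number.
    return value - (1 << 32) if value & 0x80000000 else value


print(to_unsigned32(-185))        # 0xFFFFFF47
print(to_signed32("0xFFFFFF47"))  # -185
```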

JoeM Joined: 7 Nov 14 Posts: 1 Credit: 1,115,683 RAC: 0	Message 86809 - Posted: 14 Jul 2017, 3:19:59 UTC - in response to Message 86782. I too have been recently getting these.

supdood Joined: 3 Aug 15 Posts: 6 Credit: 190,389 RAC: 0	Message 86819 - Posted: 16 Jul 2017, 22:39:26 UTC - in response to Message 86809. I've had 18 of these in the past two days across two Android devices. All show approx. 30-45 sec of runtime and 15 sec of CPU time, then error out.

furukitsune Joined: 19 Mar 16 Posts: 9 Credit: 7,847,298 RAC: 0	Message 86821 - Posted: 17 Jul 2017, 21:52:17 UTC The old IP addresses are still active and giving a "servers are down, please try later" message. I had hardwired the IP addresses in the hosts file during the DNS/domain problem a while back, and did not know the addresses had changed. Could you disable the error messages for those who still have hardwired IP addresses? fk
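For anyone who may have left a similar override in place, here is a rough Python sketch that lists addresses pinned to a hostname in hosts-file text. The hostname boinc.bakerlab.org is taken from links elsewhere in this thread; the function name and the sample address 192.0.2.10 (a documentation-only placeholder, not a real server) are assumptions for illustration.

```python
def pinned_addresses(hosts_text: str, hostname: str) -> list[str]:
    """Return every address that the hosts-file text maps to `hostname`."""
    pinned = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address + names.
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2:
            continue
        names = [name.lower() for name in entry[1:]]
        if hostname.lower() in names:
            pinned.append(entry[0])
    return pinned


# Hypothetical hosts-file content; the address is a placeholder.
sample = """
127.0.0.1   localhost
192.0.2.10  boinc.bakerlab.org   # old override left in place
"""
print(pinned_addresses(sample, "boinc.bakerlab.org"))
```

To check a real machine, read the hosts file (typically /etc/hosts on Linux and macOS, or C:\Windows\System32\drivers\etc\hosts on Windows) and pass its contents in place of the sample. Any address it returns is an override that bypasses DNS.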

supdood Joined: 3 Aug 15 Posts: 6 Credit: 190,389 RAC: 0	Message 86825 - Posted: 18 Jul 2017, 13:16:21 UTC - in response to Message 86819. Update: now have 58 errors over 3 days.

Sandvika Joined: 7 Feb 15 Posts: 2 Credit: 17,333 RAC: 0	Message 86842 - Posted: 24 Jul 2017, 15:46:24 UTC Last modified: 24 Jul 2017, 15:52:11 UTC I have a work unit (https://boinc.bakerlab.org/workunit.php?wuid=838066564) that progressed to 42% on the client but then stopped increasing in %-progress. It has now reached 85 hours of run time and has about 65 hours left before its deadline. Is it stuck?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 86843 - Posted: 25 Jul 2017, 2:02:20 UTC - in response to Message 86842. It doesn't sound like your BOINC applications are getting any CPU time on your machine. Sometimes BOINC Manager has a problem where it shows a task as running, but never dispatches any CPU to it. I would suggest you exit the BOINC Manager (not just close the window, but exit). Then restart it and let it run for a few hours to see if it clears things up. Certainly abort the task if that doesn't seem to help. Rosetta Moderator: Mod.Sense

Sandvika Joined: 7 Feb 15 Posts: 2 Credit: 17,333 RAC: 0	Message 86845 - Posted: 25 Jul 2017, 10:05:33 UTC - in response to Message 86843. Thank you for your reply. I think you are probably right about dispatching problems, as I'd increased the resource share for the project to 10K to ensure the job was running all the time. Its state was "Running, High Priority"; it had reached 105 hours of elapsed time but under 2 hours of processing time. It hadn't occurred to me to look at the task properties because it's usually mundane information. On other machines where I had the same symptoms I restarted BOINC and all the work units seemed to resume from the last checkpoint percentage, but then all dropped back to zero and had evidently restarted. My impression (only an impression, I have not done any troubleshooting) is that this project is fine when the work units run from start to finish uninterrupted, but there seem to be issues with resuming in a multi-project environment. I'll update the system and BOINC and try again.
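The gap between elapsed time and processing time in the post above is the telltale sign here. As a rough illustration, the ratio can be computed like this in Python; the function names and the 0.5 threshold are arbitrary assumptions for the sketch, not BOINC settings.

```python
def cpu_efficiency(cpu_hours: float, elapsed_hours: float) -> float:
    """Fraction of wall-clock time the task actually spent on the CPU."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return cpu_hours / elapsed_hours


def looks_starved(cpu_hours: float, elapsed_hours: float,
                  threshold: float = 0.5) -> bool:
    """Flag a task whose CPU share is below an (assumed) threshold."""
    return cpu_efficiency(cpu_hours, elapsed_hours) < threshold


# Figures from the post: about 105 hours elapsed, under 2 hours of CPU.
print(f"{cpu_efficiency(2, 105):.1%}")  # 1.9%
print(looks_starved(2, 105))            # True
```

A task that reports "Running" but sits at a couple of percent CPU share is effectively not being dispatched, which matches the moderator's diagnosis.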

Sebastian M. Bobrecki Joined: 9 Oct 05 Posts: 4 Credit: 6,286,377 RAC: 0	Message 86847 - Posted: 25 Jul 2017, 11:29:45 UTC I see serious problems with 21AA_CSP_complex, 22AA_1st_CSP_complex and 22AA_2nd_CSP_complex tasks. They all exited with "Error while computing", with output like: <message> upload failure: <file_xfer_error> <file_name>21AA_CSP_complex_0151_SAVE_ALL_OUT_500602_91_0_r1107058418_0</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> I also see that some of them failed on other hosts, for example 838934585 and 838962637.

furukitsune Joined: 19 Mar 16 Posts: 9 Credit: 7,847,298 RAC: 0	Message 86851 - Posted: 25 Jul 2017, 14:48:27 UTC Same error on: Name: 22AA_1st_CSP_complex_0042_SAVE_ALL_OUT_500829_31_0; Workunit: https://boinc.bakerlab.org/result.php?resultid=929956055; <file_name>22AA_1st_CSP_complex_0042_SAVE_ALL_OUT_500829_31_0_r1023849146_0</file_name> <error_code>-161 (not found)</error_code> fk

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 86852 - Posted: 25 Jul 2017, 21:02:37 UTC - in response to Message 86847. Same here. Many, many errors like this. Worst of all is that they appear to happen at the end (lots of CPU Time wasted...). I'm going to reduce the run times to reduce the wasted hours.

Defender Joined: 22 Mar 08 Posts: 10 Credit: 13,517,861 RAC: 0	Message 86868 - Posted: 28 Jul 2017, 5:07:47 UTC I got a WU while it was running on a different host. When that host completed the computation, the server aborted my WU. What is the reason for this? It seems to be a waste of power. Right now there are 27 WUs "Aborted by server" in my account, see this.

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 86869 - Posted: 28 Jul 2017, 7:58:23 UTC - in response to Message 86868. They are probably bad work units that have been reported by other users. The server is doing you a favor by aborting them as soon as possible. Some of them have 0 run time.

Defender Joined: 22 Mar 08 Posts: 10 Credit: 13,517,861 RAC: 0	Message 86870 - Posted: 28 Jul 2017, 10:55:50 UTC - in response to Message 86869. Last modified: 28 Jul 2017, 10:58:25 UTC But why do I get sent WUs that are currently being crunched by other hosts? The task I linked above was sent to me while another host was working on it. After that host finished successfully the server aborted my copy; that can't be correct.

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 86871 - Posted: 28 Jul 2017, 11:17:37 UTC - in response to Message 86870. That may be another issue. They often send out two work units and hope to get them back under the deadline for comparison. If they both get back in time, no problem. But if one of them is delayed and gets too close to the deadline, they will send out a third copy (to you, for example). So you start working on it. But then the second copy is completed and sent back in time, so they cancel your copy. It saves you the time of having to complete it. I see it all the time and don't pay much attention to it. It might indicate a problem, but not necessarily.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 86873 - Posted: 28 Jul 2017, 14:55:56 UTC - in response to Message 86871. Last modified: 28 Jul 2017, 15:01:21 UTC @Defender, I agree. I've emailed DK pointing this out. The new server version has added functionality to do such things as requesting that clients cancel specific WUs. So this may be a shaking out of the new server version. @Jim1348, sending redundant tasks from a single WU is done in some other BOINC projects. But R@h has never used that approach in the past. They would rather get more models attempted than have two hosts run the same set of models because the hosts are not fully trusted. The only time we see more than one task from a single WU is if the first passes the deadline, or fails in some way. In that case, having a second host work it to see if it runs any better is very informative. Rosetta Moderator: Mod.Sense
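The policy described in the post above (one task per WU, with another copy only after a missed deadline or a failure) can be sketched in a few lines of Python. This is purely an illustration of the rule as stated in the thread. It is not the BOINC server code, and every name in it is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IN_PROGRESS = "in progress"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class Task:
    state: State
    past_deadline: bool = False


def needs_another_copy(tasks: list[Task]) -> bool:
    """Issue a new task only if no copy succeeded and every existing
    copy has either failed or missed its deadline."""
    if any(t.state is State.SUCCESS for t in tasks):
        return False
    return all(t.state is State.FAILED or t.past_deadline for t in tasks)


def copies_to_abort(tasks: list[Task]) -> list[Task]:
    """Once any copy has succeeded, unfinished copies are redundant."""
    if not any(t.state is State.SUCCESS for t in tasks):
        return []
    return [t for t in tasks if t.state is State.IN_PROGRESS]


# First copy is overdue, so a second copy would be issued.
first = Task(State.IN_PROGRESS, past_deadline=True)
print(needs_another_copy([first]))  # True

# The late copy then reports success, making the second copy redundant.
first.state = State.SUCCESS
second = Task(State.IN_PROGRESS)
print(copies_to_abort([first, second]) == [second])  # True
```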

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 86874 - Posted: 28 Jul 2017, 15:08:45 UTC - in response to Message 86873. Thanks. I shouldn't have assumed that Rosetta used redundancy. But I normally like server aborts. They mean that I am not wasting my time on something not needed.

Juha Joined: 28 Mar 16 Posts: 13 Credit: 705,034 RAC: 0	Message 86875 - Posted: 28 Jul 2017, 15:14:23 UTC - in response to Message 86870. If you look at the first copy carefully you'll see that it had missed its deadline: Created: 19 Jul 2017, 10:21:00 UTC; Sent: 19 Jul 2017, 11:42:16 UTC; Report deadline: 27 Jul 2017, 11:42:16 UTC; Received: 27 Jul 2017, 14:04:30 UTC. When the server noticed that the first copy had missed its deadline the server created another copy which was sent to you. This was done on the assumption that the first copy would never be completed. Then the first copy was returned and the server could abort your copy because it wasn't needed any more.
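To make the timing concrete, the timestamps quoted in the post above can be compared with Python's datetime module. A minimal sketch: the format string is an assumption based on how the dates appear on the task page, and month abbreviations such as "Jul" are parsed according to the current locale (English is assumed here).

```python
from datetime import datetime

# Format assumed from the timestamps as displayed in the thread.
FMT = "%d %b %Y, %H:%M:%S UTC"


def parse(stamp: str) -> datetime:
    """Parse a timestamp such as '27 Jul 2017, 11:42:16 UTC'."""
    return datetime.strptime(stamp, FMT)


sent = parse("19 Jul 2017, 11:42:16 UTC")
deadline = parse("27 Jul 2017, 11:42:16 UTC")
received = parse("27 Jul 2017, 14:04:30 UTC")

print(deadline - sent)      # 8 days, 0:00:00
print(received - deadline)  # 2:22:14
print(received > deadline)  # True
```

So the first copy had an eight-day window and came back 2 hours 22 minutes late, which is the gap in which the replacement copy was created and sent out.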