Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I have picked up a couple, but fortunately they fail after only 8 seconds. It has not been a big problem for me. |
TJ Volunteer moderator Project developer Project scientist Send message Joined: 22 Oct 10 Posts: 9 Credit: 216,670 RAC: 0 |
Sorry about that. These are my jobs and while rushing I made a mistake. The CYB atoms were supposed to be stripped before the job was submitted. -TJ |
zombie3 Send message Joined: 13 May 17 Posts: 1 Credit: 362,588 RAC: 0 |
Hi, all my Rosetta mini jobs are failing with a computation error. Could you please check what's going on with my workunits? Outcome: Computation error. Client state: Compute error. Exit status: -185 (0xFFFFFF47) ERR_RESULT_START. Please note that this issue only started today (4th July 2017); everything was fine before that. Thanks, Zombie3 |
JoeM Send message Joined: 7 Nov 14 Posts: 1 Credit: 1,115,683 RAC: 0 |
I too have been recently getting these. |
supdood Send message Joined: 3 Aug 15 Posts: 6 Credit: 190,389 RAC: 0 |
I've had 18 of these in the past two days across two android devices. All are approx. 30-45 secs runtime and 15 sec cpu time, then error out. |
furukitsune Send message Joined: 19 Mar 16 Posts: 9 Credit: 7,221,780 RAC: 3,057 |
The old IP addresses are still active and return a "servers are down, please try later" message. I had hardwired the IP addresses in the hosts file during the DNS/domain problem a while back, and did not know the addresses had changed. Could you disable the error messages for those who still have hardwired IP addresses? fk |
supdood Send message Joined: 3 Aug 15 Posts: 6 Credit: 190,389 RAC: 0 |
I've had 18 of these in the past two days across two android devices. All are approx. 30-45 secs runtime and 15 sec cpu time, then error out. Update: now have 58 errors over 3 days. |
Sandvika Send message Joined: 7 Feb 15 Posts: 2 Credit: 17,333 RAC: 0 |
I have a work unit (https://boinc.bakerlab.org/workunit.php?wuid=838066564) that progressed to 42% on the client but then stopped increasing in %-progress. It has now reached 85 hours of run time and has about 65 hours left before its deadline. Is it stuck? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It doesn't sound like your BOINC applications are getting any CPU time on your machine. Sometimes BOINC Manager has a problem where it shows a task as running, but never dispatches any CPU to it. I would suggest you exit the BOINC Manager (not just close the window, but exit). Then restart it and let it run for a few hours to see if it clears things up. Certainly abort the task if that doesn't seem to help. Rosetta Moderator: Mod.Sense |
Sandvika Send message Joined: 7 Feb 15 Posts: 2 Credit: 17,333 RAC: 0 |
Thank you for your reply. I think you are probably right about dispatching problems, as I'd increased the resource share for the project to 10K to ensure the job was running all the time. Its state was "Running, High Priority", and it had reached 105 hours of elapsed time but under 2 hours of processing time. It hadn't occurred to me to look at the task properties because the information there is usually mundane. On other machines where I had the same symptoms I restarted BOINC, and all the work units seemed to resume from the last checkpoint percentage, but then all dropped back to zero and had evidently restarted. My impression (only an impression, I have not investigated it) is that this project is fine when the work units run from start to finish uninterrupted, but there seem to be issues with resuming in a multi-project environment. I'll update the system and BOINC and try again. |
Sebastian M. Bobrecki Send message Joined: 9 Oct 05 Posts: 4 Credit: 6,286,377 RAC: 0 |
I see serious problems with 21AA_CSP_complex, 22AA_1st_CSP_complex and 22AA_2nd_CSP_complex tasks. They all exited with "Error while computing", for example: `<message> upload failure: <file_xfer_error> <file_name>21AA_CSP_complex_0151_SAVE_ALL_OUT_500602_91_0_r1107058418_0</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message>` I also see that some of them failed on other hosts, for example 838934585 and 838962637. |
furukitsune Send message Joined: 19 Mar 16 Posts: 9 Credit: 7,221,780 RAC: 3,057 |
Same error. Name: 22AA_1st_CSP_complex_0042_SAVE_ALL_OUT_500829_31_0. Workunit: https://boinc.bakerlab.org/result.php?resultid=929956055. `<file_name>22AA_1st_CSP_complex_0042_SAVE_ALL_OUT_500829_31_0_r1023849146_0</file_name> <error_code>-161 (not found)</error_code>` fk |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
I see serious problems with 21AA_CSP_complex, 22AA_1st_CSP_complex and 22AA_2nd_CSP_complex tasks. They all exited with "Error while computing", for example: `<message> upload failure: <file_xfer_error> <file_name>21AA_CSP_complex_0151_SAVE_ALL_OUT_500602_91_0_r1107058418_0</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message>` I also see that some of them failed on other hosts, for example 838934585 and 838962637. Same here. Many, many errors like this. Worst of all, they appear to happen at the end (lots of CPU time wasted...). I'm going to reduce the run times to reduce the wasted hours. |
Defender Send message Joined: 22 Mar 08 Posts: 10 Credit: 13,517,861 RAC: 338 |
I got a WU while it was running on a different host. When that host completed the computation, the server aborted my WU. What is the reason for this? It seems to be a waste of power. Right now there are 27 WUs "Aborted by server" in my account, see this. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I got a WU while it was running on a different host. When that host completed the computation, the server aborted my WU. What is the reason for this? It seems to be a waste of power. Right now there are 27 WUs "Aborted by server" in my account, see this. They are probably bad work units that have been reported by other users. The server is doing you a favor by aborting them as soon as possible. Some of them have 0 run time. |
Defender Send message Joined: 22 Mar 08 Posts: 10 Credit: 13,517,861 RAC: 338 |
But why do I get WUs sent that are currently crunched by other hosts? The task I linked above has been sent to me while another host was working on it. After he finished successfully the server aborted my copy, that can't be correct. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
But why do I get WUs sent that are currently crunched by other hosts? The task I linked above has been sent to me while another host was working on it. After he finished successfully the server aborted my copy, that can't be correct. That may be another issue. They often send out two work units and hope to get them back under the deadline for comparison. If they both get back in time, no problem. But if one of them is delayed and gets too close to the deadline, they will send out a third copy (to you, for example). So you start working on it. But then the second copy is completed and sent back in time, so they cancel your copy. It saves you the time of having to complete it. I see it all the time and don't pay much attention to it. It might indicate a problem, but not necessarily. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
But why do I get WUs sent that are currently crunched by other hosts? The task I linked above has been sent to me while another host was working on it. After he finished successfully the server aborted my copy, that can't be correct. @Defender, I agree. I've emailed DK pointing this out. The new server version has added functionality for such things as requesting that clients cancel specific WUs. So this may be a shaking out of the new server version. @Jim1348, sending redundant tasks from a single WU is done in some other BOINC projects. But R@h has never used that approach in the past. They would rather get more models attempted than have two hosts run the same set of models because the hosts are not fully trusted. The only time we see more than one task from a single WU is if the first passes the deadline, or fails in some way. In that case, having a second host work it to see if it runs any better is very informative. Rosetta Moderator: Mod.Sense |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Thanks. I shouldn't have assumed that Rosetta used redundancy. But I normally like server aborts. They mean that I am not wasting my time on something not needed. |
Juha Send message Joined: 28 Mar 16 Posts: 13 Credit: 705,034 RAC: 0 |
But why do I get WUs sent that are currently crunched by other hosts? The task I linked above has been sent to me while another host was working on it. After he finished successfully the server aborted my copy, that can't be correct. If you look at the first copy carefully you'll see that it had missed its deadline. Created: 19 Jul 2017, 10:21:00 UTC. Sent: 19 Jul 2017, 11:42:16 UTC. Report deadline: 27 Jul 2017, 11:42:16 UTC. Received: 27 Jul 2017, 14:04:30 UTC. When the server noticed that the first copy had missed its deadline, it created another copy, which was sent to you. This was done on the assumption that the first copy would never be completed. Then the first copy was returned and the server could abort your copy because it wasn't needed any more. |
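
The timeline in the last post can be checked with a little date arithmetic. The snippet below is only a sketch: the three timestamps are copied from that post, while the helper name `parse_utc` and the format string are assumptions made for illustration and are not part of BOINC or the Rosetta@home server. It assumes an English locale, since `%b` parses month abbreviations such as "Jul".

```python
from datetime import datetime, timezone

# Timestamp layout as shown on the task page quoted above (assumed format string).
FMT = "%d %b %Y, %H:%M:%S UTC"


def parse_utc(stamp: str) -> datetime:
    """Hypothetical helper: parse a task-page timestamp into an aware UTC datetime."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


sent = parse_utc("19 Jul 2017, 11:42:16 UTC")
deadline = parse_utc("27 Jul 2017, 11:42:16 UTC")
received = parse_utc("27 Jul 2017, 14:04:30 UTC")

allowed = deadline - sent           # time the first host was given
lateness = received - deadline      # how far past the deadline the result arrived
missed_deadline = received > deadline

print("allowed:", allowed)          # 8 days, 0:00:00
print("late by:", lateness)         # 2:22:14
print("missed deadline:", missed_deadline)
```

So the first copy came back a little over two hours after its report deadline, which fits the explanation that the second copy was created in that gap and then aborted once the late result arrived.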
©2024 University of Washington
https://www.bakerlab.org