Problems and Technical Issues with Rosetta@home

Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 86772 - Posted: 30 Jun 2017, 13:57:47 UTC - in response to Message 86770.  

I have picked up a couple, but fortunately they fail after only 8 seconds. It has not been a big problem for me.
ID: 86772 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
TJ
Volunteer moderator
Project developer
Project scientist

Joined: 22 Oct 10
Posts: 9
Credit: 216,670
RAC: 0
Message 86773 - Posted: 30 Jun 2017, 17:17:02 UTC - in response to Message 86769.  

Sorry about that. These are my jobs and while rushing I made a mistake. The CYB atoms were supposed to be stripped before the job was submitted.
-TJ
ID: 86773 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
zombie3

Joined: 13 May 17
Posts: 1
Credit: 362,588
RAC: 0
Message 86782 - Posted: 4 Jul 2017, 10:59:23 UTC
Last modified: 4 Jul 2017, 11:05:15 UTC

Hi, all my Rosetta mini jobs are failing with computation error. Could you please check what's going on with my workunits?

Outcome Computation error
Client state Compute error
Exit status -185 (0xFFFFFF47) ERR_RESULT_START

Please note, this issue only started happening today (4th July 2017); it was fine before that.

Thanks,

Zombie3
ID: 86782 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
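The two forms of the exit status quoted above, -185 and 0xFFFFFF47, are the same 32-bit value read as signed and as unsigned. A minimal sketch of that conversion follows; the helper names are made up for illustration and are not part of BOINC.

```python
def to_unsigned32(value: int) -> int:
    """Reinterpret a (possibly negative) integer as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


print(f"0x{to_unsigned32(-185):08X}")  # prints 0xFFFFFF47
print(to_signed32(0xFFFFFF47))         # prints -185
```

This only shows that the decimal and hexadecimal numbers in the report agree with each other; it says nothing about why the tasks fail to start.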
JoeM

Joined: 7 Nov 14
Posts: 1
Credit: 1,115,683
RAC: 0
Message 86809 - Posted: 14 Jul 2017, 3:19:59 UTC - in response to Message 86782.  

I too have been recently getting these.
ID: 86809 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
supdood

Joined: 3 Aug 15
Posts: 6
Credit: 190,389
RAC: 0
Message 86819 - Posted: 16 Jul 2017, 22:39:26 UTC - in response to Message 86809.  

I've had 18 of these in the past two days across two android devices. All are approx. 30-45 secs runtime and 15 sec cpu time, then error out.
ID: 86819 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
furukitsune

Joined: 19 Mar 16
Posts: 9
Credit: 7,221,346
RAC: 3,066
Message 86821 - Posted: 17 Jul 2017, 21:52:17 UTC

The old IP addresses are still active and return a "servers are down, please try later" message. I had hardwired the IP addresses in the hosts file during the DNS/domain problem a while back, and did not know the addresses had changed. Could you disable the error messages for those who still have hardwired IP addresses?

fk
ID: 86821 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
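For anyone who pinned the project servers in a hosts file as described above, a leftover entry can be found by scanning the file for the hostname. The sketch below works on the text of a hosts file. The function name is hypothetical, and the address in the sample is a documentation-range placeholder, not a real Rosetta@home server.

```python
def pinned_addresses(hosts_text: str, hostname: str) -> list[str]:
    """Return every address that a hosts file maps to the given hostname."""
    found = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into "address name [alias ...]".
        entry = line.split("#", 1)[0].split()
        if len(entry) >= 2 and hostname.lower() in (n.lower() for n in entry[1:]):
            found.append(entry[0])
    return found


# Sample content only; 192.0.2.10 is a placeholder from the documentation range.
sample = """\
127.0.0.1   localhost
# entry added during the DNS outage
192.0.2.10  boinc.bakerlab.org   # stale pin
"""

print(pinned_addresses(sample, "boinc.bakerlab.org"))  # prints ['192.0.2.10']
```

To check a real machine, read the hosts file (commonly /etc/hosts on Linux and macOS) and pass its contents in. An empty result means no entry is overriding DNS for that name.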
supdood

Joined: 3 Aug 15
Posts: 6
Credit: 190,389
RAC: 0
Message 86825 - Posted: 18 Jul 2017, 13:16:21 UTC - in response to Message 86819.  

I've had 18 of these in the past two days across two android devices. All are approx. 30-45 secs runtime and 15 sec cpu time, then error out.

Update: now have 58 errors over 3 days.
ID: 86825 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sandvika

Joined: 7 Feb 15
Posts: 2
Credit: 17,333
RAC: 0
Message 86842 - Posted: 24 Jul 2017, 15:46:24 UTC
Last modified: 24 Jul 2017, 15:52:11 UTC

I have a work unit (https://boinc.bakerlab.org/workunit.php?wuid=838066564) that progressed to 42% on the client but then stopped increasing in %-progress. It has now reached 85 hours of run time and has about 65 hours left before its deadline. Is it stuck?
ID: 86842 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 86843 - Posted: 25 Jul 2017, 2:02:20 UTC - in response to Message 86842.  

It doesn't sound like your BOINC applications are getting any CPU time on your machine. Sometimes BOINC Manager has a problem where it shows a task as running, but never dispatches any CPU to it. I would suggest you exit the BOINC Manager (not just close the window, but exit). Then restart it and let it run for a few hours to see if it clears things up. Certainly abort the task if that doesn't seem to help.
Rosetta Moderator: Mod.Sense
ID: 86843 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sandvika

Joined: 7 Feb 15
Posts: 2
Credit: 17,333
RAC: 0
Message 86845 - Posted: 25 Jul 2017, 10:05:33 UTC - in response to Message 86843.  

Thank you for your reply. I think you are probably right about dispatching problems, as I'd increased the resource share for the project to 10K to ensure the job was running all the time. Its state was "Running, High Priority", and it had reached 105 hours of elapsed time but under 2 hours of processing time. It hadn't occurred to me to look at the task properties because it's usually mundane information. On other machines where I had the same symptoms I restarted BOINC and all the work units seemed to resume from the last checkpoint percentage, but then all dropped back to zero and had evidently restarted. My impression (only an impression; I have not done any troubleshooting) is that this project is fine when the work units run from start to finish uninterrupted, but there seem to be issues with resume in a multi-project environment. I'll update the system and BOINC and try again.
ID: 86845 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
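The symptom in this exchange, a task shown as running that receives almost no CPU, shows up as a very low ratio of CPU time to elapsed time in the task properties. A rough sketch of that check is below. The function names and the 10% threshold are assumptions chosen for illustration, not anything BOINC defines.

```python
def cpu_efficiency(cpu_hours: float, elapsed_hours: float) -> float:
    """Fraction of wall-clock time the task actually spent on the CPU."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return cpu_hours / elapsed_hours


def looks_stalled(cpu_hours: float, elapsed_hours: float, threshold: float = 0.10) -> bool:
    """Flag a task whose CPU share falls below an arbitrary, assumed threshold."""
    return cpu_efficiency(cpu_hours, elapsed_hours) < threshold


# Figures from the post above: about 2 hours of CPU in 105 hours elapsed.
print(f"{cpu_efficiency(2, 105):.1%}")  # prints 1.9%
print(looks_stalled(2, 105))            # prints True
```

A task that is computing normally on a dedicated core would sit much closer to 100%, so a value this low supports the suggestion to exit and restart BOINC.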
Sebastian M. Bobrecki

Joined: 9 Oct 05
Posts: 4
Credit: 6,286,377
RAC: 0
Message 86847 - Posted: 25 Jul 2017, 11:29:45 UTC

I see serious problems with 21AA_CSP_complex, 22AA_1st_CSP_complex and 22AA_2nd_CSP_complex tasks. They all exited with "Error while computing" like:
<message>
upload failure: <file_xfer_error>
  <file_name>21AA_CSP_complex_0151_SAVE_ALL_OUT_500602_91_0_r1107058418_0</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
And I see that some of them also failed on other hosts, for example 838934585 and 838962637.
ID: 86847 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
furukitsune

Joined: 19 Mar 16
Posts: 9
Credit: 7,221,346
RAC: 3,066
Message 86851 - Posted: 25 Jul 2017, 14:48:27 UTC

Same error on
Name 22AA_1st_CSP_complex_0042_SAVE_ALL_OUT_500829_31_0
Workunit https://boinc.bakerlab.org/result.php?resultid=929956055

<file_name>22AA_1st_CSP_complex_0042_SAVE_ALL_OUT_500829_31_0_r1023849146_0</file_name>
<error_code>-161 (not found)</error_code>

fk
ID: 86851 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Chilean

Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 86852 - Posted: 25 Jul 2017, 21:02:37 UTC - in response to Message 86847.  

I see serious problems with 21AA_CSP_complex, 22AA_1st_CSP_complex and 22AA_2nd_CSP_complex tasks. They all exited with "Error while computing" like:
<message>
upload failure: <file_xfer_error>
  <file_name>21AA_CSP_complex_0151_SAVE_ALL_OUT_500602_91_0_r1107058418_0</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
And I see that some of them also failed on other hosts, for example 838934585 and 838962637.


Same here. Many, many errors like this. Worst of all is that they appear to happen at the end (lots of CPU Time wasted...).
I'm going to reduce the run times to reduce the wasted hours.
ID: 86852 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Defender

Joined: 22 Mar 08
Posts: 10
Credit: 13,517,861
RAC: 338
Message 86868 - Posted: 28 Jul 2017, 5:07:47 UTC

I got a WU while it was still running on a different host. When that host completed its computation, the server aborted my WU. What is the reason for this? It seems to be a waste of power. Right now there are 27 WUs "Aborted by server" in my account.
ID: 86868 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 86869 - Posted: 28 Jul 2017, 7:58:23 UTC - in response to Message 86868.  

I got a WU while it was still running on a different host. When that host completed its computation, the server aborted my WU. What is the reason for this? It seems to be a waste of power. Right now there are 27 WUs "Aborted by server" in my account.

They are probably bad work units that have been reported by other users. The server is doing you a favor by aborting them as soon as possible. Some of them have 0 run time.
ID: 86869 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Defender

Joined: 22 Mar 08
Posts: 10
Credit: 13,517,861
RAC: 338
Message 86870 - Posted: 28 Jul 2017, 10:55:50 UTC - in response to Message 86869.  
Last modified: 28 Jul 2017, 10:58:25 UTC

But why do I get WUs sent that are currently crunched by other hosts? The task I linked above has been sent to me while another host was working on it. After he finished successfully the server aborted my copy, that can't be correct.
ID: 86870 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 86871 - Posted: 28 Jul 2017, 11:17:37 UTC - in response to Message 86870.  

But why do I get WUs sent that are currently crunched by other hosts? The task I linked above has been sent to me while another host was working on it. After he finished successfully the server aborted my copy, that can't be correct.

That may be another issue. They often send out two work units and hope to get them back under the deadline for comparison. If they both get back in time, no problem. But if one of them is delayed and gets too close to the deadline, they will send out a third copy (to you, for example). So you start working on it. But then the second copy is completed and sent back in time, so they cancel your copy. It saves you the time of having to complete it.

I see it all the time and don't pay much attention to it. It might indicate a problem, but not necessarily.
ID: 86871 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 86873 - Posted: 28 Jul 2017, 14:55:56 UTC - in response to Message 86871.  
Last modified: 28 Jul 2017, 15:01:21 UTC

But why do I get WUs sent that are currently crunched by other hosts? The task I linked above has been sent to me while another host was working on it. After he finished successfully the server aborted my copy, that can't be correct.

That may be another issue. They often send out two work units and hope to get them back under the deadline for comparison. If they both get back in time, no problem. But if one of them is delayed and gets too close to the deadline, they will send out a third copy (to you, for example). So you start working on it. But then the second copy is completed and sent back in time, so they cancel your copy. It saves you the time of having to complete it.

I see it all the time and don't pay much attention to it. It might indicate a problem, but not necessarily.


@Defender, I agree. I've emailed DK pointing this out. The new server version has added functionality to do such things as requesting that clients cancel specific WUs. So this may be a shaking out of the new server version.

@Jim1348, sending redundant tasks from a single WU is done in some other BOINC projects. But R@h has never used that approach in the past. They would rather get more models attempted than have two hosts run the same set of models because the hosts are not fully trusted. The only time we see more than one task from a single WU is if the first passes the deadline, or fails in some way. In that case, having a second host work it to see if it runs any better is very informative.
Rosetta Moderator: Mod.Sense
ID: 86873 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 86874 - Posted: 28 Jul 2017, 15:08:45 UTC - in response to Message 86873.  

Thanks. I shouldn't have assumed that Rosetta used redundancy. But I normally like server aborts. They mean that I am not wasting my time on something not needed.
ID: 86874 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Juha

Joined: 28 Mar 16
Posts: 13
Credit: 705,034
RAC: 0
Message 86875 - Posted: 28 Jul 2017, 15:14:23 UTC - in response to Message 86870.  

But why do I get WUs sent that are currently crunched by other hosts? The task I linked above has been sent to me while another host was working on it. After he finished successfully the server aborted my copy, that can't be correct.


If you look at the first copy carefully you'll see that it had missed its deadline:

Created            19 Jul 2017, 10:21:00 UTC
Sent               19 Jul 2017, 11:42:16 UTC
Report deadline    27 Jul 2017, 11:42:16 UTC
Received           27 Jul 2017, 14:04:30 UTC


When the server noticed that the first copy had missed its deadline the server created another copy which was sent to you. This was done on the assumption that the first copy would never be completed. Then the first copy was returned and the server could abort your copy because it wasn't needed any more.
ID: 86875 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
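The timestamps quoted above can be checked with a few lines of date arithmetic. This is a small sketch using only those values. The format string is an assumption about how the site prints dates, and `%b` expects English month abbreviations, so it relies on an English or C locale.

```python
from datetime import datetime

# Assumed layout of the timestamps shown on the task page.
FORMAT = "%d %b %Y, %H:%M:%S UTC"


def parse(stamp: str) -> datetime:
    """Parse a timestamp such as '27 Jul 2017, 14:04:30 UTC'."""
    return datetime.strptime(stamp, FORMAT)


sent = parse("19 Jul 2017, 11:42:16 UTC")
deadline = parse("27 Jul 2017, 11:42:16 UTC")
received = parse("27 Jul 2017, 14:04:30 UTC")

print(deadline - sent)      # prints 8 days, 0:00:00
print(received - deadline)  # prints 2:22:14
print(received > deadline)  # prints True
```

So the first copy had an 8-day window and came back 2 hours 22 minutes after it closed. A replacement copy sent out during that gap would be cancelled once the original result arrived, which is the sequence described in the post above.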



©2024 University of Washington
https://www.bakerlab.org