Message boards : Number crunching : Problems with Minirosetta 1.80
Author | Message |
---|---|
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I've now had quite a lot of WUs run for 4 hours over my run time of 12 hours, and then get ended by the watchdog. They always report one decoy being made, although, in fact, no decoys seem to have been produced. They then have a file xfer error (-161), presumably because there was no output file. Here's yet another example: https://boinc.bakerlab.org/rosetta/result.php?resultid=262096625 Note that this ran over 16 hours on a Phenom II, yet produced no output. |
RC Joined: 27 Sep 05 Posts: 13 Credit: 262,048 RAC: 0 |
Another one that died after almost 13 hours (my runtime preference is 8 hours): https://boinc.bakerlab.org/rosetta/result.php?resultid=262397691 |
Wissi Joined: 19 Nov 08 Posts: 14 Credit: 485,807 RAC: 0 |
Since getting 1.80, almost every WU I get is planned for about 4 hours of work, but they run for at least 8 hours. So is there some miscalculation of how strong (or weak) my computer is? It's quite annoying to see "calculation error" on almost every WU because the runtime exceeds 8 hours; the last 3 used more than 10 hours each. What's going on here? Currently, I've got the following WU: real_core_1.5_low200_beta_low200_start_hb_t332_IGNORE_THE_REST_13273_142 Task ID: 261849792, Work unit 238985112 The original time estimate was about 4hrs 20min, but the task has now run for 5 hours, and there are still 4hrs 10min left. What I can see is that the time left INCREASES. The same applies to the newly started job: lb_dk_ksync_withtrim2_hb_t302_IGNORE_THE_REST_13365_670 Task ID: 262152215, Work unit 239248916 The time left goes up and up, but never down... |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Here's another sad story. real_core_3.5_low50_beta_low200_hb_t303__IGNORE_THE_REST_13576_83_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=239454464 This ran for 4hrs 34min and made no progress. At 1hr 49min: MODEL:0 STEP:46800. At 4hrs 34min: MODEL:0 STEP:46800. ABORTED. |
Rob Heilman [Echo Labs] Joined: 26 Apr 07 Posts: 20 Credit: 2,815,410 RAC: 0 |
I am getting a ton of compute errors. I also see some ridiculous disparities at times in Claimed/Awarded credit, e.g.: Task 262177679 (work unit 239266150), sent 28 Jun 2009 23:26:41 UTC, reported 30 Jun 2009 5:40:18 UTC, Over / Success / Done, CPU time 101,333.10 sec, claimed credit 224.36, granted credit 17.95. Task 262177658 (work unit 239266121), sent 28 Jun 2009 23:26:41 UTC, reported 30 Jun 2009 5:40:18 UTC, Over / Client error / Compute error, CPU time 101,330.80 sec, claimed credit 224.36, granted credit ---. Any ideas? |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Here's another real_core that was stuck. real_core_5.0_low50_beta_low200_hb_t332__IGNORE_THE_REST_13705_64_0. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=239491013 Hadn't moved in 2hrs 12min. Got to that step then didn't move. MODEL:0 STEP:48000 ABORTED. I think I have only had 1 of these that has run O.K. |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,208,737 RAC: 2,882 |
"I am getting a ton of compute errors. I also see some ridiculous disparities at time about Claimed/Awarded credit. i.e." You seem to be having two different kinds of errors: one is error code 161 and the other is something that doesn't list a code. I only looked at a few machines, but it is happening on all that I checked. Hmmm. Here is the Wiki link to the error codes for BOINC: http://www.boinc-wiki.info/Error_Code Do you ever reboot your machines? Have you updated them lately? I see you run Linux and I know they put out updates all the time; I usually wait until there are just under a hundred to do the updates. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I moved Rob and mikey's posts to this thread. Rob, several users are reporting tasks that stop progressing. This often means that some models complete in normal time and others take considerably longer. Since credit is issued on completed models, I believe that is the reason for the large disparities between some of your claimed and granted credit. Rosetta Moderator: Mod.Sense |
Rob Heilman [Echo Labs] Joined: 26 Apr 07 Posts: 20 Credit: 2,815,410 RAC: 0 |
Is there anything I can do on my end to help with the issue? It seems to have started right about when 1.80 came out. I have tried both decreasing my run time to 3 hours and increasing it to 24 hours. Right now I am at 12, on my way back to 8 hours. Whatever is going on, it is costing the project some serious computing power. If you look at my daily credit numbers you can see that, without any changes to my machines, software versions, etc., I am only completing 50-55% of what I was able to do on a daily basis over the last several weeks. My BOINCstats |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Rob, I believe the Project Team should already have the data they need to identify the specific types of tasks that are causing problems. So I really can't think of anything on your end to help. I for one have not been getting any of the tasks with names starting with "real_core", so I tend to believe there probably are not very many of them in the mix. So your machines should return to tasks that are running well soon. Rosetta Moderator: Mod.Sense |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"Why is it that so many people have so many problems?" You always have to keep in mind that this is the "problems with" thread. So, by design, most of the posts here will be about problems. Some of the 50 posts in this thread are not about specific problems in 1.80, but about general BOINC issues. I should probably be moving them elsewhere, but who has the time? With 85,000 active hosts, you will never get every event reported, but overall the big picture is still good. When you compare against the roughly 2 million tasks completed since the creation of this thread, the number of problems is quite modest. And it seems most highly correlated with some of the new task types that are being worked on. As I said, it seems these are fairly few in number, so this is the current rough ground being covered. Not everyone monitors their machines closely, and this is why it was key to make the changes Mike made earlier this year to collect and report more data, both for when things go wrong unexpectedly and to gather better information about things that are running well (which helps you readily identify any future variations as compared to that historical data). Rosetta Moderator: Mod.Sense |
alpha Joined: 4 Nov 06 Posts: 27 Credit: 1,550,107 RAC: 0 |
Two compute errors after 101,000 seconds (28 hrs) with a preference of 24 hours run time. Only one decoy in both cases: https://boinc.bakerlab.org/rosetta/result.php?resultid=261928706 https://boinc.bakerlab.org/rosetta/result.php?resultid=262283940 Also, two more with 101,000 seconds run time; these completed successfully but were granted ridiculously low credit, and again produced only one decoy: https://boinc.bakerlab.org/rosetta/result.php?resultid=262122318 https://boinc.bakerlab.org/rosetta/result.php?resultid=262236422 |
ByRad Joined: 12 Apr 08 Posts: 8 Credit: 15,865,146 RAC: 1,078 |
|
Seversen Joined: 21 Dec 07 Posts: 3 Credit: 57,599 RAC: 0 |
Why did this workunit get such low credit? real_core_1.5_low200_beta_low200_start_hb_t331__IGNORE_THE_REST_13032_83 Thanks. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Lord ByRad, my translation skills are minimal, but the status shown for the Rosetta task you highlighted has the acronym RAM in it, which I take to mean that the rest of the words translate to something like "waiting for memory". So the settings for BOINC Manager are not allowing it to use enough of the large memory your system has. There are several memory settings you can adjust to allow BOINC to use more memory. Also, since there is no Rosetta application in the task list, I take it you have it set to remove tasks from memory when not active. Your machine will do work more efficiently if you leave tasks in memory when suspended. Rosetta Moderator: Mod.Sense |
Oliver Joined: 11 Oct 07 Posts: 4 Credit: 525 RAC: 0 |
Hi folks, I checked the output of the real_core_xxx WUs and found that all of them produce good, valid results. So if you see RMSD=1 or similar oddities, that seems to be an error of the graphics rather than of the actual WU. In summary, the issues seem to be around the BOINC management side, not the internal quality of the results. We are now starting to address the problems mentioned in this thread with graphics, completion time and checkpointing/resuming. -Oliver |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Oliver, the RMSD of 1 we are seeing is in the graphs of results described in this thread. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4967 Not the graphics on the client machines. So, somewhere, you have data that reports those values in your databases used to make these graphs. Rosetta Moderator: Mod.Sense |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task 262972813 failed on Mac, Watchdog active. Hbond tripped: [2009- 7- 2 8:46:56:] ERROR: dis==0 in pairtermderiv! ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 334 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This one ran for 10 hrs on a 6hr pref. It did 1 model when the watchdog kicked in; I guess it was incomplete. https://boinc.bakerlab.org/rosetta/result.php?resultid=263029599 Sun 05 Jul 2009 10:20:27 EST|rosetta@home|Output file lb_cutback_all_multi_hb_t328__IGNORE_THE_REST_2CEXA_8_12958_5_1_0 for task lb_cutback_all_multi_hb_t328__IGNORE_THE_REST_2CEXA_8_12958_5_1 absent |
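
Several reports in this thread describe the same symptom: the MODEL and STEP counters stop changing while the task keeps running (for example, MODEL:0 STEP:46800 at both 1hr 49min and 4hrs 34min). As a minimal sketch, assuming you note those counters by hand from time to time, the check could be written as below. The `Sample` structure, the helper names, and the 2-hour threshold are hypothetical illustrations, not part of BOINC or Rosetta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One manual reading of a task's progress (hypothetical structure)."""
    elapsed_s: int  # run time when the reading was taken, in seconds
    model: int      # value shown after MODEL:
    step: int       # value shown after STEP:


def hms(hours: int, minutes: int = 0) -> int:
    """Convert hours and minutes to seconds."""
    return hours * 3600 + minutes * 60


def stalled_for(samples: list[Sample]) -> int:
    """Return how many seconds the latest (model, step) pair has been unchanged."""
    if not samples:
        return 0
    ordered = sorted(samples, key=lambda s: s.elapsed_s)
    last = ordered[-1]
    first_seen = last.elapsed_s
    # Walk backwards while the counters still match the latest reading.
    for s in reversed(ordered):
        if (s.model, s.step) != (last.model, last.step):
            break
        first_seen = s.elapsed_s
    return last.elapsed_s - first_seen


def is_stalled(samples: list[Sample], threshold_s: int = hms(2)) -> bool:
    """Flag a task whose counters have not moved for threshold_s seconds (assumed cutoff)."""
    return stalled_for(samples) >= threshold_s


# The two readings reported for the real_core_3.5 task above.
readings = [Sample(hms(1, 49), 0, 46800), Sample(hms(4, 34), 0, 46800)]
print(stalled_for(readings), is_stalled(readings))
```

With only two identical readings the function can say the counters were unchanged between them, not exactly when progress stopped, so more frequent readings give a tighter estimate.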
©2024 University of Washington
https://www.bakerlab.org