Message boards : Number crunching : Report problems with Rosetta version 5.34
Author | Message |
---|---|
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
This task, running on this host: ran for about five hours (pref = 24h) and stopped - no DONE box in output file, exit err 131. This host seems to be having more than its fair share of errors since v5.32 and 5.34. It has the same hardware as two other boxes which are not seeing anywhere near such a high error rate. This box has a heavy network load (it is a router internal to a LAN and masquerades about half a broadband bandwidth) so possibly the heartbeat error is caused by a peak load on the box's main mission? This box has been OK on Rosetta until v 5.32. It recently ran for two days on LHC WU with no problems - and LHC is about the most fussy project there is for declaring WU invalid! - then back to Rosetta v 5.34 and the errors start again. This is not a complaint, just painting the picture for you. Let me know if you'd like this box taken off Rosetta due to the error rates - it will stay on unless you say otherwise. River~~ |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Maybe they should increase the "no heartbeat" timeout from 30 seconds to ONE minute before it exits the daemon???? I'd think it likely your other TCP traffic was prohibiting BOINC from talking. Perhaps someone should ask Rom at his blog about this error message and possible problems with high network traffic computers??? |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
... I'd think it likely your other TCP traffic was prohibiting Boinc from talking. ... Yes Tony, it certainly looks like it. Only I'd have hoped that traffic from one net card to the other would be queued separately by Linux from internal 'ip' traffic from localhost to localhost, especially as the two kinds of traffic are handled by different tables within iptables. And then again, why was it not killing the older Rosetta versions, or LHC (which ran OK on that box while under similar network load)? As far as I know, all projects have the heartbeat check. So yes, congestion within the Linux network handling is the most plausible culprit, but I am not absolutely convinced yet. R~~ |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
If hardware tests come back saying the hardware is fine, connecting systems with high error rates to Ralph (or at least giving them some time on Ralph) will help track down the source of the higher-than-average errors showing up on some systems. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
River, I have BOINC switch projects every 2 hours, and when I looked at the messages the time it was showing was about right. It might be a problem with being left in memory when I restarted my PC in the morning, but I haven't had that problem before, so it's got me. |
frederick corse Send message Joined: 7 Oct 05 Posts: 10 Credit: 1,545,999 RAC: 0 |
Hello, I got an unrecoverable error on 1hz6A BOINC NATIVEJUMPS CLOSE CHAINBREAKS VARY ALL BOND ANGLES ALL BOND DISTANCES SAVE ALL OUT 1306 14672 0. Message: <file xfer error> <error code>161</error code>. It didn't clear the listing for it. |
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
Maybe they should increase the "no heartbeat" time out from 30 seconds to ONE minute before it exits the daemon???? I'd think it likely your other TCP traffic was prohibiting Boinc from talking. Perhaps someone should ask Rom at his blog about this error message and possible problems with high network traffic computers??? Make sure that there is an accept statement listing the loopback interface *and* make sure that the statement below is fairly high in the table... It allows established TCP connections through without checking the packets every time. -A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT Even if your CPU is overcommitted, you still should not have timeouts within 5 seconds... never mind 30... Looking for a team ??? Join BoincSynergy!! |
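[Editor's sketch] netwraith's advice comes down to rule order: the loopback ACCEPT rule and the ESTABLISHED,RELATED ACCEPT rule should both appear before any rule that rejects or drops traffic. The Python below is only an illustration of that check over rule lines written in `iptables-save` style. The function names and the sample rule list are hypothetical and are not taken from River's router; only the ESTABLISHED,RELATED line is quoted from the post.

```python
def has_pair(tokens, flag, value):
    """True if `flag` is immediately followed by `value` in the token list."""
    return any(a == flag and b == value for a, b in zip(tokens, tokens[1:]))


def is_loopback_accept(tokens):
    return has_pair(tokens, "-i", "lo") and has_pair(tokens, "-j", "ACCEPT")


def is_established_accept(tokens):
    state_ok = any(
        a == "--state" and "ESTABLISHED" in b.split(",")
        for a, b in zip(tokens, tokens[1:])
    )
    return state_ok and has_pair(tokens, "-j", "ACCEPT")


def is_blocking(tokens):
    return has_pair(tokens, "-j", "REJECT") or has_pair(tokens, "-j", "DROP")


def first_index(rules, predicate):
    """Index of the first rule matching `predicate`, or None."""
    for i, rule in enumerate(rules):
        if predicate(rule.split()):
            return i
    return None


def check_rule_order(rules):
    """Return a list of problems; an empty list means the order looks fine."""
    problems = []
    block = first_index(rules, is_blocking)
    checks = (
        ("loopback ACCEPT", is_loopback_accept),
        ("ESTABLISHED,RELATED ACCEPT", is_established_accept),
    )
    for label, predicate in checks:
        idx = first_index(rules, predicate)
        if idx is None:
            problems.append(f"missing {label} rule")
        elif block is not None and idx > block:
            problems.append(f"{label} rule comes after a blocking rule")
    return problems


# Hypothetical sample chain; only the second line is quoted from the post.
RULES = [
    "-A RH-Firewall-1-INPUT -i lo -j ACCEPT",
    "-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
    "-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited",
]

print(check_rule_order(RULES))
```

The check is deliberately coarse: it looks only at the position of the first matching rule and does not model chains, jumps, or default policies.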
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Seems I was wrong. The no heartbeat message has nothing to do with the manager. I asked a question on the Dev mail list and got this back: davea@ssl.berkeley.edu to me, boinc_dev, 7:13 pm (1 minute ago): The manager is not involved. Applications listen for "heartbeat" messages (sent via shared memory) from the core client. Normally it's sent once a second. If the application doesn't get one in 30 secs, it prints "no heartbeat" and quits -- David |
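[Editor's sketch] The rule David describes (a heartbeat roughly once a second, and the application quits after 30 seconds without one) can be pictured as a simple timeout check. This is not BOINC's code: the class name, its methods, and the injectable clock are assumptions made for illustration, and the real client delivers heartbeats through shared memory rather than a method call.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds, the figure given in the quoted reply


class HeartbeatMonitor:
    """Tracks when the last heartbeat arrived and reports a timeout."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat was just received."""
        self._last_beat = self._clock()

    def expired(self):
        """True once more than `timeout` seconds have passed without a beat."""
        return self._clock() - self._last_beat > self._timeout


monitor = HeartbeatMonitor()
monitor.beat()
if monitor.expired():
    print("no heartbeat")  # the application would quit at this point
```

Using a monotonic clock here is a design choice of the sketch: it keeps the check immune to wall-clock adjustments, which would otherwise look like a missed heartbeat.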
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
Seems I was wrong. The no heartbeat message has nothing to do with the manager. I asked a question on the Dev mail list and got this back: wow... why use shared memory for that... I mean that's what semaphores were for... I mean... shared memory has always been much slower than other methods... It's just damned convenient for shared data... (for which a heartbeat does *not* qualify) Looking for a team ??? Join BoincSynergy!! |
PUDDIN TAME Send message Joined: 3 Oct 06 Posts: 13 Credit: 53,998 RAC: 0 |
What is with some of the new WUs? I just finished running one that took 11 hours. It produced only 2 models. The first ran in about 1 hour. The second model took 10 hours! The only reason I didn't abort it was that the step counter was advancing verrry slowly. PUDDIN TAME |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Some of these workunits take a long time. The reason one of your models took significantly less time (1 hour vs. 10 hours) was because we have included a check -- if the model doesn't reach a low enough energy by a certain point, we prematurely exit, so that your client can have another shot from scratch. Both models are given equal credits -- so it's a kind of interesting lottery. If you pass the 1 hour-ish limit, your client will keep crunching. Although the model doesn't receive more credit for crunching more, that particular model will be a lot more scientifically valuable than if we stopped the search earlier. We are discussing ways to give more credit for models that required more computational power... but that won't happen soon. For now, I'm looking into ways to keep these workunits shorter and to keep the times per model more even! I just posted a note over in another [url=https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2495]thread[/url] to talk about these issues and get feedback. What is with some of the new WUs? I just finished running one that took 11 hours. It produced only 2 models. The first ran in about 1 hour. The second model took 10 hours! The only reason I didn't abort it was that the step counter was advancing verrry slowly. |
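[Editor's sketch] The early-exit check Rhiju describes can be pictured as a single test at a checkpoint: if the energy at that point is still above a cutoff, the model is abandoned, otherwise the search continues to the end. The function name, the step-based checkpoint, and the numbers below are all hypothetical. The post does not say how Rosetta defines the checkpoint or the cutoff.

```python
def run_model(energies, checkpoint_step, energy_cutoff):
    """Walk through per-step energies and apply one early-exit check.

    Returns (steps_run, finished). `finished` is False when the model was
    abandoned at the checkpoint because its energy was still too high.
    """
    steps = 0
    for steps, energy in enumerate(energies, start=1):
        if steps == checkpoint_step and energy > energy_cutoff:
            return steps, False  # give up so the client can start afresh
    return steps, True


# Made-up trajectories: the first misses the cutoff, the second passes it.
print(run_model([5.0, 4.0, 3.0, 2.0, 1.0], checkpoint_step=2, energy_cutoff=3.5))
print(run_model([5.0, 3.0, 2.0, 1.0, 0.0], checkpoint_step=2, energy_cutoff=3.5))
```

A model that passes the checkpoint runs every remaining step, which matches the uneven one-hour versus ten-hour pattern reported above.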
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
The "DANGER!" messages are actually OK; I'll make sure they don't show up in the next version of Rosetta. For those of you who are worried about how long it's taking for some of these workunits, please keep crunching. They're slow but the data coming back is pretty spectacular! I've canceled the send-out of these kinds of WUs, so you won't get any more until the next software update (maybe this week, or next week), which should process these sorts of workunits significantly faster. Rosetta 5.34 has a few new features to allow us to test more accurate energy functions and more interesting variations in the protein's bond geometry. Let us know if you see any problems -- especially if they are reproducible! |
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
Just a side note: Some of the "...VARY_ALL_BOND_DISTANCES..." and "...VARY_ALL_BOND_ANGLES..." jobs appear to lose the native structure on the screen saver. Whether this anomaly is a symptom of something causing these WUs to process slowly or not, I'm not sure. Anyone else noticing this? |
Faust Send message Joined: 7 Sep 06 Posts: 14 Credit: 49,559 RAC: 0 |
I just saw this: 27 Oct 2006 19:34:40 UTC 28 Oct 2006 9:47:52 UTC Over Client error Compute error 43,201.38 91.06 --- stderr out <core_client_version>5.4.11</core_client_version> <message> (0x80000003) - exit code -2147483645 (0x80000003) </message> <stderr_txt> # random seed: 1226466 # cpu_run_time_pref: 10800 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 43200.4 seconds. Greater than 4X preferred time: 10800 seconds ********************************************************************** GZIP SILENT FILE: .xx1hz6.out WARNING! attempt to gzip file .xx1hz6.out failed: file does not exist. Unhandled Exception Detected... - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x77F767CD Engaging BOINC Windows Runtime Debugger... Then there's a huge dump file: https://boinc.bakerlab.org/rosetta/result.php?resultid=44236969 It has also happened here. Faust. |
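[Editor's sketch] The watchdog message in the log states its own rule: the run is ended once CPU time exceeds four times the preferred run time. With a preference of 10800 seconds the limit is 4 × 10800 = 43200 seconds, and the reported 43200.4 seconds is just over it. The function below only restates that arithmetic; its name and signature are assumptions, not Rosetta's.

```python
WATCHDOG_MULTIPLIER = 4  # "Greater than 4X preferred time" in the log


def watchdog_should_end(cpu_seconds, preferred_seconds,
                        multiplier=WATCHDOG_MULTIPLIER):
    """True when CPU time has passed `multiplier` times the preferred time."""
    return cpu_seconds > multiplier * preferred_seconds


# Values taken from the stderr output quoted in the post.
print(watchdog_should_end(43200.4, 10800))  # → True
```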
RichardJ Send message Joined: 19 Mar 06 Posts: 8 Credit: 73,014 RAC: 0 |
Same thing for 38887432: Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 44451.8 seconds. Greater than 4X preferred time: 10800 seconds May also happen to 1ogw__BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_SAVE_ALL_OUT__1315_5448 minimum quorum 1 which has been running for over 2 hours, is stuck on 1% and has 10 hours still to run! |
dag Send message Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=44151269 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES_SAVE_ALL_OUT__1306_26796_0 sin value out of range [-1,+1] dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
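[Editor's sketch] The thread does not say what produced the "sin value out of range [-1,+1]" message. One common way such a value arises in geometry code is floating-point rounding pushing a computed sine slightly past ±1 before it is handed to an inverse trigonometric function. Assuming that is the kind of situation involved, a usual defensive pattern is to clamp values that are out of range only by a tiny tolerance and to reject anything further out. The helper name and tolerance below are hypothetical.

```python
import math


def safe_asin(x, tolerance=1e-9):
    """asin that forgives rounding noise just outside [-1, +1].

    Values beyond the tolerance are treated as genuine errors.
    """
    if abs(x) > 1.0 + tolerance:
        raise ValueError(f"sin value out of range [-1,+1]: {x!r}")
    return math.asin(max(-1.0, min(1.0, x)))


print(safe_asin(1.0 + 1e-12))  # clamped to asin(1.0)
```

Whether clamping is appropriate depends on why the value went out of range: it hides rounding noise but would also hide a real bug if the tolerance were set too generously.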
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Yes, I was noticing this on my own client. Seems to be occurring for other jobs too. I'll ask David Kim to look into it. Just a side note: |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
An update to those of you who were helping me out with this issue of super-long workunits. I'm testing a new app on ralph that has some tricks to accelerate the workunits without sacrificing too much in terms of finding low energies. Workunits appear to be at least three-fold faster! So in the next application update, we'll try those workunits again, and hopefully not see the same sorts of issues. The update may not occur for another week -- we also want to incorporate a cool new mode that models the fibrils that are correlated with Alzheimer's and other neuro-degenerative diseases, and that's going to take some optimization! Some of these workunits take a long time. The reason one of your models took significantly less time (1 hour vs. 10 hours) was because we have included a check -- if the model doesn't reach a low enough energy by a certain point, we prematurely exit, so that your client can have another shot from scratch. |
Soren Hedberg Send message Joined: 30 Oct 06 Posts: 25 Credit: 3,653 RAC: 0 |
I'm having a problem with R@H as well. My problem is that I set the program up to use 100% of the CPU (which, according to SpeedFan, it is doing), and I leave the program to do its work on a job that it says should take 4 hours CPU time. However, after about 30 minutes of working at full load, it says that it has only completed 1 minute 30 seconds worth of CPU Time. What's up with that? I have ZoneAlarm and AVG Antivirus working on my computer at the same time; is it a conflict with one of these programs? |
R.L. Casey Send message Joined: 7 Jun 06 Posts: 91 Credit: 2,728,885 RAC: 0 |
I'm having a problem with R@H as well. My problem is that I set the program up to use 100% of the CPU (which, according to SpeedFan, it is doing), and I leave the program to do its work on a job that it says should take 4 hours CPU time. However, after about 30 minutes of working at full load, it says that it has only completed 1 minute 30 seconds worth of CPU Time. What's up with that? I have ZoneAlarm and AVG Antivirus working on my computer at the same time; is it a conflict with one of these programs? Soren, welcome to Rosetta! I. (Assuming that you meant to say that after 30 minutes CPU Time, the task was showing only a bit more than 1% complete): Some time ago, longer tasks would appear to be "stuck" at one percent, so the project developers changed the Rosetta application to increase the percent complete from 1.000% by small amounts so that people would not become concerned that the task was "stuck". If this is the case, on the graphic display/screensaver you will see that the task is still working on Model 1. The percentage complete will be updated more realistically after the first model is completed. II. (If you actually *did* intend to say that the task has used only 1 minute 30 seconds of CPU time in a half hour, as measured by a watch or clock): You can use the Windows Task Manager under the "Processes" tab to check the Rosetta task to see if it's using CPU. Also, check the "General Preferences" and "Rosetta preferences" from your Rosetta Account web page. In particular, check the General preference for "Do work while Computer is in use". If this is set to "No", then Rosetta will suspend itself (stop working) any time you are working with the computer (typing, using the mouse), and for some time afterward. There are also other conditions that must be satisfied in order for Rosetta to run. These are to provide limitations on Rosetta, if necessary, so that it cannot interfere with other work by, say, slowing response times. 
However, I have Rosetta set to run all the time and have never seen any significant slow-down on other work I do. (Well, perhaps rendering video might be affected, but not more typical tasks like web browsing, word processing, or email.) Note: you can also use the BOINC Manager "Activity" tab to tell Rosetta to "Run Always". This overrides the "Preferences". There are many, many people here that really want to help you perform your best and have fun, too. Always feel free to post questions and comments! Again, welcome! Happy Rosetta crunching! |
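[Editor's sketch] The distinction behind case II is between elapsed (wall-clock) time and CPU time: a task that is suspended or waiting keeps accumulating the first but not the second. The snippet below shows the difference for an ordinary Python process using only the standard library. It illustrates the concept only and says nothing about how BOINC itself measures CPU time; the `measure` helper is a name made up for this example.

```python
import time


def measure(work):
    """Run `work()` and return (wall_seconds, cpu_seconds) it consumed."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Sleeping stands in for a suspended task: time passes, little CPU is used.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall {wall:.2f}s, cpu {cpu:.2f}s")
```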
©2025 University of Washington
https://www.bakerlab.org