Message boards : Number crunching : Problems with Rosetta version 5.93
Author | Message |
---|---|
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
resultid=135831728 CPU time 127601.71875 s (35.44 hours), claimed credit 501.989998586804, granted credit 20. Mod Sense. I'm pretty sure there's something wrong here. Has anyone else spotted the problem? It's not as if this issue wasn't posted early enough on Friday for someone at the project to comment on it. |
JEklund Joined: 24 Sep 06 Posts: 7 Credit: 105,447 RAC: 0 |
Quote: "resultid=135831728" Based on the info in the log, it seems that it was stuck and the watchdog killed it (and valued your work at 20 credits, which is not fair for 35 hours of work, IMHO). No clue what is wrong with that work unit, though. -- Lundi -- |
mhhall Joined: 28 Mar 06 Posts: 7 Credit: 10,193,127 RAC: 0 |
Quote: "Please post problems and/or bugs with rosetta 5.93. Thanks for your" My slower computer (ID #187636, an older Linspire Linux box) is set to accept jobs of approx. 14 hours. I have a job on the machine at this time which says it is 99.67% completed with 50:16:19 of CPU time. For the time being, I've suspended the job. Its name starts "2h4o_BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK" (work unit 123162090). I don't know if this is a Rosetta issue or a problem with this specific job. I know that I have another of the same name in my queue (135883853). Just wondering if someone else has seen a similar issue. Hope this helps!! |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
Quote: "resultid=135831728" Oh no, you did get 20. You should have got at least an extra 100 for all the effort you put into it. |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,870,251 RAC: 776 |
I've got one here: https://boinc.bakerlab.org/rosetta/result.php?resultid=135314464 The log says: "Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 58569.2 seconds. Greater than 4X preferred time: 14400 seconds". Claimed credit 211.010587329225, granted credit 80. |
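The watchdog message reads as a simple threshold: 4 × 14400 s = 57600 s (16 hours), and 58569.2 s of CPU time is past it, so the run was ended. A minimal sketch of that check, assuming the watchdog just compares total CPU time against four times the preferred run time as the log wording suggests; the function name `watchdog_should_end` is hypothetical and this is not Rosetta's actual code:

```python
# Hypothetical sketch (not Rosetta's actual code): the watchdog condition as
# worded in the log, i.e. end the run once CPU time exceeds 4x the preferred time.
def watchdog_should_end(cpu_seconds: float, preferred_seconds: float,
                        factor: float = 4.0) -> bool:
    """Return True if CPU time is greater than factor * preferred run time."""
    return cpu_seconds > factor * preferred_seconds


# Values from the result above: 14400 s preferred (4 h), 58569.2 s of CPU time.
threshold = 4.0 * 14400  # 57600.0 seconds, i.e. 16 hours
print(threshold)                               # 57600.0
print(watchdog_should_end(58569.2, 14400))     # True
```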
[AF>France>TDM>Centre]Jeannot Le Tazon Joined: 8 Dec 05 Posts: 6 Credit: 153,161 RAC: 0 |
I've aborted this one https://boinc.bakerlab.org/rosetta/result.php?resultid=135287253 after 11 h (prefs set to 12 h): 11 h of crunching, then a CPU benchmark, and then back to 10% complete. :( It seemed to do nothing useful after maybe 1 h and 1 decoy (Model 1, Step 27091, Accepted RMSD 9124, Accepted energy 6.65805). Nothing was displayed for "Searching" or "Accepted", and nothing moved on "RMSD" and "Accepted Energy" after that first decoy. |
Paul Joined: 29 Oct 05 Posts: 193 Credit: 66,668,865 RAC: 5,135 |
I started getting lots of computation errors today. I did make one change to the system, but it should not have caused this problem. Most of the time the CPU cranks on the WU for 50+ min. before the error. Is there a problem with some of the WUs in the 5.93 beta? I just installed the newest BOINC client (5.10.30), and I guess it could be at fault as well. Any insight is greatly appreciated. Thx! Paul |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Quote: "I started getting lots of computation errors today. I did make 1 change to the system but it should not have caused this problem. Most of the time the CPU cranks on the WU for 50+ min. before the error." Paul, do the group a favor and tell us which one of your many computers is having fits and which work units, as you have a lot of different computers and lots of work units in the queue. It's not the BOINC program that has the errors, but rather the project work units themselves. You will probably notice that you have errors on RAH but not on the other projects you are working on. If it were a BOINC program error, you would have errors on all your projects. |
PieBandit Joined: 17 Apr 07 Posts: 6 Credit: 228,220 RAC: 0 |
Several of my WUs are also failing with compute errors: Result IDs 136334535, 136319412, 136308989, 136258153, 135343580, 135260720 and 134993972. Since January 21st, I've had about a 50% success rate. |
Paul Joined: 29 Oct 05 Posts: 193 Credit: 66,668,865 RAC: 5,135 |
Quote: "paul - do the group a favor and tell us which one of your many computers is having fits and which work units as you have alot of different computers and lots of workunits in queue. Its not the BOINC program that has the errors, rather the project work units themselves. You probably notice that you have errors on RAH vs the other projects you are working on. If it was a BOINC program error you would have errors on all your projects." Greg: Thanks for the note. I do have lots of WUs checked out, and it takes a long time to find the issues. The computer is 591177, and it has more compute errors than successes. I will keep fighting with the hardware, but I think it is OK now. All of my temps are well within spec, and I don't have any other issues. I run 100% R@H, so I cannot compare these WUs to anything else. I did notice that none of my other systems have the same issues, so after a BIOS upgrade I think we may have some stability. Thx! Paul |
Conan Joined: 11 Oct 05 Posts: 151 Credit: 4,244,078 RAC: 43 |
The problems I was getting over at Ralph appear to have carried over to Rosetta. The WUs starting with "2h4o" were causing problems on Ralph, so I was surprised to see them over here on Rosetta. They have a habit of running well past your preference time (up to 21 hours with a preference time of 6 hours). They all seem to get to just over 97% completed with 9 minutes 59 seconds to go and then just sit there for hours. They show 100% completed but still "Waiting to run" in BOINC Manager. They often give computation errors after the extra-long run time (this was mainly on Ralph). If one does complete after the extra-long run time, it gives only a very poor amount of credit, because usually only 1 decoy has been produced in all that time. I have just aborted two of these WUs. WU 135437069 ran for over 3 1/2 hours and got to 100% but was still waiting to run in BM; after aborting, the results show zero (0) time taken on the job. WU 135437323 was already over an hour past my preference time of 6 hours and still grinding away with 9 minutes 59 seconds to go at 97% completed; it had been this way for quite some time. WU 135372094 completed after more than 21 hours, returning just 2.5 cr/h. If I see any more of this WU type, I will be aborting them. |
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
I'm seeing the same problems as Conan on a number of my servers. The trouble work units are 2h4o and 1zpy, and all require manual aborting. Restarting BOINC will just reset the amount of time already spent on them and start them again. The 2h4o units in particular tend to stay at 100% completed in the "Running" state, with no increase in the amount of CPU time spent. Looking at the stdout.txt/stderr.txt files shows that there was an attempt by the watchdog to shut down the client (and as far as I know, that has never worked properly for Rosetta on Linux). Team Helix |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
I aborted them all as well. Still waiting on my 480 missing credits too... I wonder when the staff get in to work? These have really got to be affecting the total rate of return (i.e. work done). |
FalconFly Joined: 11 Jan 08 Posts: 23 Credit: 2,163,056 RAC: 0 |
Same here; I had to abort the last 2h4o model. One of my faster hosts effectively stopped working, as the hourly rotation of the last 2h4o__BOINC_TWIST_RINGS work unit apparently reset its CPU time over and over while making zero progress. As a side effect, the Rosetta long-term debt of the affected clients rocketed up to -90000 s (lots of work but almost no progress made). |
MerePeer Joined: 6 Nov 05 Posts: 3 Credit: 1,787,446 RAC: 0 |
Same here. Same problem with 2h4o__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK* just hanging. Restarting BOINC results in the same problem 8 hours later. Linux box. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
I'm not sure what to think. Complaints about the 2h4o WUs started at least 5 days ago. I started a test on one of mine 5 days ago, which leaves 3 full business days and two weekend days for management to make a statement. I've seen or heard nothing. How often do they monitor these boards? Are they of any importance? I'm feeling a bit like any "beta" tests or other tests are really a waste of our man-hours and CPU seconds. Perhaps I'll be considered impatient... hmmmm... How long must one wait before one isn't considered so? I don't know. I know I've stopped ALL Rosetta work. It really isn't what I wanted, but I don't want to "Pi**" away my CPU time for nothing when it might be spent more wisely (i.e. if my machines are just going to use electricity without scientific benefit, what's the point of leaving them on?). tony I started at 200K and was shooting for 600K before stopping, but I guess 350K is OK, if that's what they want. (Well, it would stay at 350K, but I loaned out a machine before I knew the score, so I have to await its return before I remove it.) |
j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
Quote: "The problems I was getting over at Ralph appear to have carried over to Rosetta." Were you "really" surprised? |
Conan Joined: 11 Oct 05 Posts: 151 Credit: 4,244,078 RAC: 43 |
Quote: "The problems I was getting over at Ralph appear to have carried over to Rosetta." G'day j2satx. No, I guess I was not, considering there was no response over on Ralph either. A lot of time is wasted when these things run for over 21 hours and then often error out. It is a shame; I do like the project and its goals, and it was one of the best-monitored and most responsive projects for a good while. |
j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
Quote: "The problems I was getting over at Ralph appear to have carried over to Rosetta." I know... I started crunching Ralph again when it looked like they were making a change with the "minis", but it seems that was short-lived too. |
hedera Joined: 15 Jul 06 Posts: 76 Credit: 5,263,150 RAC: 7 |
The interesting thing about all this is that, after that one bad day a couple of weeks ago, I made a minor adjustment to the amount of memory (from 90% to 85% when the computer is not in use) and CPU (from 100% to 90%) allowed, and since then my WUs have been cranking happily away, finishing in the normal 2-4 hours of CPU time without overwhelming my Pentium IV. And no errors. Maybe I'm just lucky. --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
©2025 University of Washington
https://www.bakerlab.org