Help us solve the 1% bug!

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 1
Message 12417 - Posted: 21 Mar 2006, 11:35:07 UTC - in response to Message 12374.  

Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it.


OK David,

Have started some RALPH units.

And what's happening you ask???

The first two (I have a P4/HT) have both got "stuck" at 1%.

Checked the graphics - having re-installed BOINC as a single-user - and the time is increasing nicely, as it should, the pictures are real pretty and crunching seems to be taking place, but the 1% is not moving...!.

What do I do now?

Abort these 2 and see what happens with the next couple of WU's

Suspend them and see what happens with the next 2.

Give up?

regards,

Tim
ID: 12417
UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 1
Message 12418 - Posted: 21 Mar 2006, 11:42:18 UTC - in response to Message 12417.  

Have started some RALPH units.



Having just written the last msg, I thought what the heck !! Need to experiment to help you guys.

So, I went back to BOINC and sure enough, only one of the 2 WU's was still at 1% - the other one has jumped up to 2.34%. But it's got stuck again.

So, I suspended the 1% and allowed BOINC to switch to the next RALPH WU. Upon starting it immediately went to 1%....and stuck!

So, suspended that one and allowed a 4th WU to start. And that went straight to 1% and stuck. Same with 5th and now 6th.

Have now shut-down BOINC and going to "play" a bit with my "project prefs".

regards,

Tim

ID: 12418
UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 1
Message 12419 - Posted: 21 Mar 2006, 12:03:13 UTC - in response to Message 12418.  
Last modified: 21 Mar 2006, 12:14:58 UTC

Have now shut-down BOINC and going to "play" a bit with my "project prefs"


OK - changed my project prefs from default to max - 50, 50 and 4 days.

Also set my BOINC prefs to "pre-empted".

Have also set computer to "visible" if it helps.


Restarted BOINC.

RALPH WU's are the only ones I have working.

Immediately, when BOINC restarted, the very 1st WU reset the crunched time to zero, but still showing 1% progress.

Did a manual update of the project.

Still the same.

The 2nd WU is now on 2.35% (was 2.34%). But hasn't moved at all from there for the last 5 minutes.


In "desperation mode", I've tried to suspend/resume various WU's in the hope of either causing a "computation error" or at least getting a WU to move off from the 1%. So far, nothing has changed.....!



In both cases, the CPU time (for RALPH WU's) is continuing to increase - it's just the "Progress" that stays stuck - if it weren't for that, you'd think all was well!!

regards,

Tim


PS: System is:
CPU: Pentium 4, inc HT @ 3.06GHz (not overclocked)
Memory: 512Mb
OS: Windows XP + SP2
HDD: 24Gb free space
Graphics: Radeon 9500 Pro
BOINC: v5.2.13 (standard, not optimised)
All other projects crunch OK.

(edit) added BOINC version
ID: 12419
UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 1
Message 12420 - Posted: 21 Mar 2006, 12:12:16 UTC - in response to Message 12419.  
Last modified: 21 Mar 2006, 12:19:04 UTC

This is getting stranger.

After about 14 minutes total crunching time, the 1st WU:

(HB_BARCODE_30_1bk2__352_137_0 using rosetta_beta version 493)

has now changed to 0.178% progress (on the graphics screen) and is now stuck again.

After 34 minutes crunching time the 2nd WU

(HB_BARCODE_30_5croA_352_136_0 using rosetta_beta version 493)

is still at 2.35%.


Will let these carry on for an hour or so and report back then.

regards,

Tim

(edit) added WU Names
ID: 12420
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12427 - Posted: 21 Mar 2006, 15:16:58 UTC - in response to Message 12417.  

Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it.


OK David,

Have started some RALPH units.

And what's happening you ask???

The first two (I have a P4/HT) have both got "stuck" at 1%.

Checked the graphics - having re-installed BOINC as a single-user - and the time is increasing nicely, as it should, the pictures are real pretty and crunching seems to be taking place, but the 1% is not moving...!.

What do I do now?

Abort these 2 and see what happens with the next couple of WU's

Suspend them and see what happens with the next 2.

Give up?

regards,

Tim


is the protein still jumping around on the screen-if so, definitely let it continue!
ID: 12427
BadThad

Joined: 8 Nov 05
Posts: 30
Credit: 71,834,523
RAC: 0
Message 12430 - Posted: 21 Mar 2006, 15:23:07 UTC

Arrgggg.....looks like the 1% stuck wu's are back:

FA_RLXc9_1c9oA_359_372_0

1% after 19 hr 44 min.
ID: 12430
UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 1
Message 12441 - Posted: 21 Mar 2006, 16:37:07 UTC - in response to Message 12420.  
Last modified: 21 Mar 2006, 17:09:16 UTC

This is getting stranger.
After about 14 minutes total crunching time, the 1st WU:
(HB_BARCODE_30_1bk2__352_137_0 using rosetta_beta version 493)
has now changed to 0.178% progress (on the graphics screen) and is now stuck again.

After 34 minutes crunching time the 2nd WU
(HB_BARCODE_30_5croA_352_136_0 using rosetta_beta version 493)
is still at 2.35%.



OK - so the 1st WU is now at 4 hr 27 mins of CPU time and the Progress is now at 4.56%

Completion time was around 8 hr 30 m, but now reads: 12 hrs 24m !!!


The 2nd WU is now at 4 hr 47 mins and 4.75% with a completion time of 12 hrs 25m (was about 8 hr 30m)


In both cases, the graphics in the "Searching..." box *is* moving:

with both 1st WU and 2nd WU, the graphics seem to "settle down" for a bit (with the shapes in both boxes being "similar"). The bottom right numbers change slowly.


After a short while, in the "Searching..." box, the graphic then starts moving more rapidly. This corresponds to a faster rate of change of the numbers in the bottom right.


Will let them continue and see what happens over the next 24 hours...!

regards,

Tim
(edit) typo
ID: 12441
doc :)

Joined: 4 Oct 05
Posts: 47
Credit: 1,106,102
RAC: 0
Message 12450 - Posted: 21 Mar 2006, 18:52:12 UTC

timbo, above you wrote that you changed your prefs to 4 days. If that was the target CPU run time in your RALPH@home preferences, then the slow movement of the percentage and the increasing time to completion are perfectly normal, because each work unit will run for 4 days with that setting. (BOINC doesn't know about that project-specific option yet, so it can't include it in its prediction; it has to finish some units first to make the prediction more accurate, and it will be far off again if you change the target CPU time.)
As long as the graphics are still moving, even very slowly (when the stage says full atom relax), it's not stuck :)
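
To put rough numbers on the explanation above, here is a minimal sketch. It assumes progress is reported approximately as CPU time used divided by the target CPU run time; that linear relation is an assumption for illustration, not something confirmed in this thread, and the function names are hypothetical. With a 4 day target it lands in the same range as the 4.56% and 4.75% reported earlier at 4 hr 27 min and 4 hr 47 min of CPU time.

```python
def hms_to_seconds(hours, minutes=0, seconds=0):
    """Convert an h/m/s reading (as shown in the BOINC manager) to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def expected_progress_percent(cpu_seconds, target_cpu_seconds):
    """Estimate progress, ASSUMING it scales linearly with CPU time
    against the target CPU run time (illustrative assumption only)."""
    if target_cpu_seconds <= 0:
        raise ValueError("target CPU run time must be positive")
    return min(100.0, 100.0 * cpu_seconds / target_cpu_seconds)


# Target CPU run time of 4 days, as set in the project preferences above.
FOUR_DAYS = 4 * 24 * 3600

first_wu = expected_progress_percent(hms_to_seconds(4, 27), FOUR_DAYS)
second_wu = expected_progress_percent(hms_to_seconds(4, 47), FOUR_DAYS)

print(f"1st WU: {first_wu:.2f}%")
print(f"2nd WU: {second_wu:.2f}%")
```

The estimates come out slightly above the reported figures, so the real progress formula presumably differs in detail; the point is only that a few percent after several hours is what a 4 day target would produce.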
ID: 12450
Doug Worrall
Joined: 19 Sep 05
Posts: 60
Credit: 58,445
RAC: 0
Message 12454 - Posted: 21 Mar 2006, 20:24:19 UTC

Hello,
I feel embarrassed posting the only 1% stuck bug. It's 4.81_i6 "FA_RLXpt_h....."
yada. It had a problem downloading also. 3 attempts got "Timed out" {error}.
It's red anyway. LOL. Not too concerned about 1 w/u, but I will subscribe to this
thread, and if I am able to help out I will. Just do not have enough time to read
all these posts on this problem. Also, lots are running multiple boxes and they
are the ones needing help with this bug.
"Happy Crunching All"

Sincerely
Doug Sluger Worrall
ID: 12454
dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 12456 - Posted: 21 Mar 2006, 20:42:52 UTC
Last modified: 21 Mar 2006, 20:44:31 UTC

I'm having Many 1% bugs on FA_RLX jobs. I may have a good set of data points here as the failures are ~100% on one multi-processor Linux machine, but not on two other multi-processor Linux machines, and not on a single processor XP-SP2 machine.

The Linux machines are all 2.4.21-XXX Linux (slightly different patch levels) and all have four Intel Xeon processors but are clocked (no overclocking) at 2.8, 3.2, and 3.4. The slowest machine has the failures. They are all running the same BOINC client.

Call if you need to.

dag
719 590 3038
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
ID: 12456
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 12457 - Posted: 21 Mar 2006, 20:47:21 UTC - in response to Message 12456.  

Call if you need to.
dag
719 590 3038


Please see David Baker's plea, which I quote:

Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it.


Regards,
Bob P.
ID: 12457
UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 1
Message 12461 - Posted: 21 Mar 2006, 21:24:22 UTC - in response to Message 12450.  

timbo, above you wrote that you changed your prefs to 4 days. If that was the target CPU run time in your RALPH@home preferences, then the slow movement of the percentage and the increasing time to completion are perfectly normal, because each work unit will run for 4 days with that setting. (BOINC doesn't know about that project-specific option yet, so it can't include it in its prediction; it has to finish some units first to make the prediction more accurate, and it will be far off again if you change the target CPU time.)
As long as the graphics are still moving, even very slowly (when the stage says full atom relax), it's not stuck :)



OK - thanks for that info.

Had assumed that the option to change pref's meant that the PROJECT ran for 4 days straight - not the actual work unit itself. And besides, I would have thought that if you allowed the WU to have "direct control" over what BOINC is supposed to be doing, (for these 4 days), then that must impact other WU that you will be crunching for.

So, will BOINC get in a "tizz" if you work on 4 day long Rosetta WU's and you have other WU's from other projects waiting and getting close to or past their deadlines?

It's nice for the project to give users that amount of control, but I think it's a bit too much....!


BTW: Didn't the problem of these 1% WU's occur sometime around the time Rosetta allowed users to change these exact preferences...?

I've crunched quite a few Rosetta WU's and never really had a problem until recently.


regards,

Tim
ID: 12461
doc :)

Joined: 4 Oct 05
Posts: 47
Credit: 1,106,102
RAC: 0
Message 12469 - Posted: 21 Mar 2006, 22:18:43 UTC

the 1% stuck bug has been there long before the cpu target time option was introduced.
BOINC will switch between projects according to the "switch between applications every" setting in your general preferences (and your resource shares, of course).

and we are getting a little bit off-topic here :)
ID: 12469
MD_Willington

Joined: 8 Dec 05
Posts: 1
Credit: 47,751
RAC: 0
Message 12472 - Posted: 21 Mar 2006, 22:43:58 UTC - in response to Message 12441.  

This is getting stranger.
After about 14 minutes total crunching time, the 1st WU:
(HB_BARCODE_30_1bk2__352_137_0 using rosetta_beta version 493)
has now changed to 0.178% progress (on the graphics screen) and is now stuck again.

After 34 minutes crunching time the 2nd WU
(HB_BARCODE_30_5croA_352_136_0 using rosetta_beta version 493)
is still at 2.35%.



OK - so the 1st WU is now at 4 hr 27 mins of CPU time and the Progress is now at 4.56%

Completion time was around 8 hr 30 m, but now reads: 12 hrs 24m !!!


The 2nd WU is now at 4 hr 47 mins and 4.75% with a completion time of 12 hrs 25m (was about 8 hr 30m)


In both cases, the graphics in the "Searching..." box *is* moving:

with both 1st WU and 2nd WU, the graphics seem to "settle down" for a bit (with the shapes in both boxes being "similar"). The bottom right numbers change slowly.


After a short while, in the "Searching..." box, the graphic then starts moving more rapidly. This corresponds to a faster rate of change of the numbers in the bottom right.


Will let them continue and see what happens over the next 24 hours...!

regards,

Tim
(edit) typo



Same here.. @ ~ 75 hours, ??? should I ditch the WU or let it go for the long haul?

MD
ID: 12472
Rom Walton (BOINC)
Volunteer moderator
Project developer

Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 12478 - Posted: 22 Mar 2006, 1:42:23 UTC
Last modified: 22 Mar 2006, 1:42:41 UTC

A new version of Rosetta has been posted in the RALPH@Home project.

Release Notes

For those who are so inclined, please help us track down the issue by running RALPH@Home and if/when you find a workunit with the '1% bug' feel free to abort it and call it out in this thread.

Thanks in advance for any help you can provide.

----- Rom
ID: 12478
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12489 - Posted: 22 Mar 2006, 4:31:57 UTC - in response to Message 12472.  

This is getting stranger.
After about 14 minutes total crunching time, the 1st WU:
(HB_BARCODE_30_1bk2__352_137_0 using rosetta_beta version 493)
has now changed to 0.178% progress (on the graphics screen) and is now stuck again.

After 34 minutes crunching time the 2nd WU
(HB_BARCODE_30_5croA_352_136_0 using rosetta_beta version 493)
is still at 2.35%.



OK - so the 1st WU is now at 4 hr 27 mins of CPU time and the Progress is now at 4.56%

Completion time was around 8 hr 30 m, but now reads: 12 hrs 24m !!!


The 2nd WU is now at 4 hr 47 mins and 4.75% with a completion time of 12 hrs 25m (was about 8 hr 30m)


In both cases, the graphics in the "Searching..." box *is* moving:

with both 1st WU and 2nd WU, the graphics seem to "settle down" for a bit (with the shapes in both boxes being "similar"). The bottom right numbers change slowly.


After a short while, in the "Searching..." box, the graphic then starts moving more rapidly. This corresponds to a faster rate of change of the numbers in the bottom right.


Will let them continue and see what happens over the next 24 hours...!

regards,

Tim
(edit) typo



Same here.. @ ~ 75 hours, ??? should I ditch the WU or let it go for the long haul?

MD



as long as the graphics show movement, the calculation is proceeding, so best to stick with it..


ID: 12489
Stephen Miller

Joined: 18 Sep 05
Posts: 13
Credit: 16,294,215
RAC: 0
Message 12501 - Posted: 22 Mar 2006, 9:52:51 UTC - in response to Message 12489.  
Last modified: 22 Mar 2006, 10:08:29 UTC



as long as the graphics show movement, the calculation is proceeding, so best to stick with it..



I've got a stuck unit too.

FA_RLXpt_hom004_1ptq_361_27_0 is stuck at 8.63% at 48:41:25 CPU time in BOINC.

Per the instructions at the bottom of this thread, I launched:

rosetta_4.82_windows_intelx86.exe xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom004_ -frags_name_prefix hom004_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2484844

which ran for 19 minutes (started with 18 minutes = 37 minutes total) and stuck at 22.7%, Stage: Full atom relax, Model 2, step 255492. There is no graphic movement and no step changes.

CPU time is now 0 hr 48 min 0 sec.

Hope this helps.

I have a screen shot of the BOINC application if desired.

I am restarting BOINC to see if it will finish.

On this particular computer, Rosetta is the only project being processed.

update - after a reboot, BOINC is continuing to process the unit. It is currently at 20 minutes 27 secs and at model 3 step 67000+. It took only 10 minutes to get to this point.

Stephen
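
The "no graphic movement and no step changes" check described above can be sketched as a simple heuristic: treat a work unit as stuck when the CPU time keeps climbing but the step counter shown on the graphics screen has not changed. This is only an illustration under stated assumptions. The function name, the sample format and the 30 minute threshold are all hypothetical, and the samples would have to be noted by hand or gathered some other way; nothing here reads them from BOINC or Rosetta.

```python
def is_stalled(samples, min_cpu_seconds=1800):
    """Return True if the step counter has not changed while at least
    min_cpu_seconds of CPU time accumulated.

    samples: chronological list of (cpu_seconds, step) observations.
    The 1800 s default is an arbitrary illustrative threshold.
    """
    if len(samples) < 2:
        return False  # not enough observations to judge
    last_cpu, last_step = samples[-1]
    # Walk backwards from the second-newest observation.
    for cpu, step in reversed(samples[:-1]):
        if step != last_step:
            return False  # the step moved recently, so work is proceeding
        if last_cpu - cpu >= min_cpu_seconds:
            return True  # same step across the whole window
    return False


# Step frozen at 255492 while CPU time advances by more than 30 minutes.
frozen = [(0, 100), (600, 255492), (1200, 255492), (2500, 255492)]
# Step still advancing between observations.
moving = [(0, 100), (600, 200), (2500, 300)]

print(is_stalled(frozen))
print(is_stalled(moving))
```

A slow stage such as full atom relax can legitimately go a while between visible changes, so any threshold like this would need to be generous before aborting a unit on its say-so.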

ID: 12501
Mike

Joined: 21 Dec 05
Posts: 9
Credit: 35,252
RAC: 0
Message 12505 - Posted: 22 Mar 2006, 10:38:29 UTC

Hi. OK, I'm running Rosetta, SETI & Predictor. Since I turned off all screen savers, and keeping results in memory (hard disc), I've had no further problems.
PC runs 24/7. I just turn off the monitor when I'm away.
ID: 12505
bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 12510 - Posted: 22 Mar 2006, 13:17:26 UTC

Hi All,

I experienced the 1 percent bug, not at 1 percent but at 15 percent. It had been spinning its wheels at 15% for 15 hours before I realized it. Turned BOINC off then back on, and Rosetta went back to zero and started all over. Checked on BOINC 8 hours later and same thing, stuck at 15 percent, so I just aborted the whole unit.

https://boinc.bakerlab.org/rosetta/result.php?resultid=14382390

FA_RLXpt_hom006_1ptq__361_86_0
ID: 12510
Dorphas

Joined: 14 Feb 06
Posts: 2
Credit: 60,275
RAC: 0
Message 12511 - Posted: 22 Mar 2006, 14:05:30 UTC

this 1% bug, i think, is a big turnoff for a lot of people. especially the ones with "farms" and can not get to them daily. i had 3 computers at my 2nd job that got stuck for 6 days last week. i reset them saturday and now it looks like 2 of them are stuck on 1% again for the past 2 days. our team is even talking about moving on to something else because of the 1% bug and wasted cpu cycles. we really like rosetta as a whole but it seems to require a lot more monitoring than other projects. hope it is solved soon.
ID: 12511



©2025 University of Washington
https://www.bakerlab.org