BOINC v6.6.20 scheduler issues

Mod.Zilla
Volunteer moderator
Joined: 5 Sep 06
Posts: 423
Credit: 6
RAC: 0
Message 56152 - Posted: 1 Oct 2008, 15:58:55 UTC
Last modified: 23 Apr 2009, 22:03:13 UTC

New thread created and posts moved in as requested.
Rosetta Informational Moderator: Mod.Zilla
ID: 56152 · Rating: 0
TomaszPawel
Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 60760 - Posted: 21 Apr 2009, 7:50:12 UTC - in response to Message 60759.  
Last modified: 21 Apr 2009, 7:55:17 UTC

On one of my hosts I have a "nice" problem...

Video: http://www.youtube.com/watch?v=kfclLVJ7cyc

On this host: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=791178 I installed 6.6.20, and the problem started: Rosetta@home doesn't download any WUs.

On this machine I run GPUGRID and Rosetta@home at 50/50 (2000/2000).

GPUGRID gets new WUs, but Rosetta@home does not.

Even if I manually request a project update, I get: rosetta@home Scheduler request completed: got 0 new tasks

Any tips?

Soon I will run out of R@H WUs...
ID: 60760 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60761 - Posted: 21 Apr 2009, 8:13:53 UTC - in response to Message 60760.  

Not off the top of my head... but the video idea is interesting to say the least ...

What the heck, not that I expect that it will do any good, but let me post the link on the alpha list and see what gives.

My only suggestion off the top of my head would be to reset debts. You have to make a cc_config file and stop and restart the BOINC client, not just the manager. Shut down the client from the Advanced menu, then close the manager. Check Task Manager and make sure all the science applications are stopped.

With the config file in place restart BOINC and the debts should be reset and you *MAY* get work...
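The config file mentioned above can be sketched as follows. This is a minimal sketch, not an official recipe: it assumes the 6.x client reads a `<zero_debts>` option from `cc_config.xml` in the BOINC data directory, so check the tag name against the client configuration documentation for your version. The helper name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(zero_debts=True):
    """Return the text of a minimal cc_config.xml.

    Assumption: the client honours <options><zero_debts>1</zero_debts></options>
    and resets project debts at startup when it is set.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "zero_debts").text = "1" if zero_debts else "0"
    return ET.tostring(root, encoding="unicode")


config_text = build_cc_config()
print(config_text)
# Save this text as cc_config.xml in the BOINC data directory
# (the location differs per installation, so it is not written here).
```

If the option works the way it is assumed to here, leaving it set would reset the debts on every client start, so it is probably worth removing it or setting it back to 0 after the restart.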
ID: 60761 · Rating: 0
Buckeye74
Joined: 5 Jun 06
Posts: 1
Credit: 110,354
RAC: 0
Message 60763 - Posted: 21 Apr 2009, 11:04:28 UTC - in response to Message 60760.  

Suspend GPUGRID and your machine should download 5 days' worth of Rosetta work, then resume GPUGRID.
ID: 60763 · Rating: 0
TomaszPawel
Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 60765 - Posted: 21 Apr 2009, 15:00:33 UTC - in response to Message 60764.  

I tried suspending GPUGRID, but Rosetta did not download anything...

Hmm, I decided to wait and see what happened. There were 5 tasks running:
4 Rosetta and 1 GPUGRID. No more Rosetta WUs were waiting.

Then one task finished, 3 were still crunching (+1 GPUGRID), and then, suddenly:

2009-04-21 16:52:49 rosetta@home Computation for task 1dhn__BOINC_ABINITIO_IGNORE_THE_REST-MOO12--1dhn_-_10770_76_0 finished
2009-04-21 16:52:51 rosetta@home Started upload of 1dhn__BOINC_ABINITIO_IGNORE_THE_REST-MOO12--1dhn_-_10770_76_0_0
2009-04-21 16:52:53 rosetta@home Sending scheduler request: To fetch work.
2009-04-21 16:52:53 rosetta@home Requesting new tasks
2009-04-21 16:52:56 rosetta@home Finished upload of 1dhn__BOINC_ABINITIO_IGNORE_THE_REST-MOO12--1dhn_-_10770_76_0_0
2009-04-21 16:52:58 rosetta@home Scheduler request completed: got 8 new tasks

Only 8 new tasks were sent... lol.

It is a really serious bug in 6.6.20.
ID: 60765 · Rating: 0
TomaszPawel
Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 60773 - Posted: 22 Apr 2009, 6:04:17 UTC - in response to Message 60772.  
Last modified: 22 Apr 2009, 6:04:38 UTC

It seems that 6.6.20 on my host has some strange scheduler cycle...

It downloads only 8 WUs from Rosetta, crunches them, and when 5 of them are done it downloads another 8!!!

LOL

So for a few minutes one core of my quad is idle... until it finishes downloading WUs.
ID: 60773 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,208,737
RAC: 2,882
Message 60774 - Posted: 22 Apr 2009, 10:54:53 UTC - in response to Message 60773.  

Okay Tomasz... how long does BOINC think it will take you to complete a unit? How long is it really taking you to finish a unit? If BOINC thinks it will take 8 hours and you are really only taking 2 hours, you have found your problem: BOINC thinks you have too much work for your settings. You can either change the settings to have a larger cache or try to fiddle with the BOINC settings. I have not moved to the BOINC 6.6.? versions yet, so I can't help you on that part. But the cache settings are changeable either on the website for all your computers, or in BOINC itself for that one computer. BOINC is designed to fix itself; how long that will take is anyone's guess, but it could take anywhere from a day or so to a week or so, or longer. It kind of depends on the other projects running on that PC.
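To put rough numbers on the 8-hour versus 2-hour example, here is a back-of-the-envelope sketch. It is a deliberately simplified model, not the client's real work-fetch logic: it only asks how many whole tasks fit into a work buffer, and the 1-day buffer and 4 cores are assumed values chosen for illustration.

```python
def tasks_to_fill_buffer(buffer_days, n_cpus, runtime_hours):
    """Whole tasks that fit in the work buffer under a simplified model.

    The buffer is treated as buffer_days of wall time on each of n_cpus cores,
    and every task is assumed to take runtime_hours of CPU time.
    """
    buffer_cpu_seconds = buffer_days * 86400 * n_cpus
    return int(buffer_cpu_seconds // (runtime_hours * 3600))


# Assumed settings: 1-day buffer on a quad core.
believed_full = tasks_to_fill_buffer(buffer_days=1, n_cpus=4, runtime_hours=8)
really_full = tasks_to_fill_buffer(buffer_days=1, n_cpus=4, runtime_hours=2)
print(believed_full, really_full)
```

Under these assumptions the client would consider the buffer full at 12 tasks, while it would really take 48 two-hour tasks to cover the same period, so the host runs dry well before the client expects it to.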
ID: 60774 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60775 - Posted: 22 Apr 2009, 11:06:02 UTC

There are also issues with 6.6.20 and 6.6.23 with regard to some things... I am in the process of proving one bug, though I do not know the cause at all ... and we may be closing in on the logic flaw that has been driving us nuts for some time (though not related to the problem reported here).

@TomaszPawel

Can you tell me if the debt for Rosetta is just rising negatively? You can turn on the logging flag for work fetch debug or use the "Properties" button (with rosetta selected) on the Projects Tab. Look at it, wait some time, look at it again ... tell me if it is just going in one direction.
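The "is it going in one direction" check can be written down as a small helper once a few debt values have been noted from the Properties dialog or the log. This is only a sketch: the function name is hypothetical and the sample readings are made up for illustration, not taken from any host in this thread.

```python
def debt_trend(readings):
    """Classify a series of debt readings as 'falling', 'rising', or 'mixed'."""
    steps = [later - earlier for earlier, later in zip(readings, readings[1:])]
    if steps and all(step < 0 for step in steps):
        return "falling"
    if steps and all(step > 0 for step in steps):
        return "rising"
    return "mixed"


# Made-up readings taken some time apart, oldest first.
sample_readings = [-1200.0, -3400.0, -5900.0]
trend = debt_trend(sample_readings)
print(trend)
```

A result of "falling" over several readings would match the "rising negatively" pattern asked about above; "mixed" would suggest the debt is still moving in both directions.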

Wouldn't hurt to get the number for GPU Grid while there ...
ID: 60775 · Rating: 0
TomaszPawel
Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 60778 - Posted: 22 Apr 2009, 15:07:27 UTC - in response to Message 60775.  
Last modified: 22 Apr 2009, 15:49:31 UTC

Hi!

It looks like this:



and after some WUs:

ID: 60778 · Rating: 0
TomaszPawel
Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 60792 - Posted: 23 Apr 2009, 11:45:11 UTC - in response to Message 60778.  
Last modified: 23 Apr 2009, 11:46:06 UTC

After a longer period of time:



so... WTF still 8 ...
ID: 60792 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60798 - Posted: 23 Apr 2009, 21:30:15 UTC

@Mod.Sense,

Can we extract out Tomasz's issue to a new thread? It is important, but a sidetrack from RS theory.

@Tomasz

We are looking into your issues. Are you up for more experiments? If so, could you try 6.5.0 for me for a couple of days? The major downside is that the run times for GPU Grid are not as well reported on the client.

Other than that, this is the version I use as my standard on all my other systems.

We *ARE* talking about this problem, the problem is that we need data ...

As part of that I want you to fall back and run for a couple days to a week with the other version. If it runs well and as you expect we can go back to 6.6.20 and see if you work back into this situation.

In the meantime, as I learn things I will apprise you of what they are ... honest ... :)

I am sure someone will vouch for me being relatively decent about working the issues I am aware of ...
ID: 60798 · Rating: 0
TomaszPawel
Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 60802 - Posted: 23 Apr 2009, 22:34:57 UTC - in response to Message 60792.  

Still nothing...





Ok, but first I will try 6.6.15 - it works great on my other computer.

I read that version 6.6.20 has a new scheduler...

http://boinc.berkeley.edu/dev/forum_thread.php?id=2518&nowrap=true#24183

ID: 60802 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60803 - Posted: 24 Apr 2009, 1:07:05 UTC

I don't know in which 6.6.x version they started developing the new scheduler. The only version I have experience with that I personally know worked is 6.5.0 ... if the later one worked, fine, try that ... :)

I am going to try to look at code tonight or tomorrow.

This *IS* an important issue and I think it is affecting me, though in a different way (not sure why) and as such we need to get to the bottom of it.
ID: 60803 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60808 - Posted: 24 Apr 2009, 12:21:19 UTC
Last modified: 24 Apr 2009, 12:51:26 UTC

Posted this this morning:

Ok, I have a glimmer, not sure if I got it right ... but let me try to put my limited understanding down on paper and see if one of you chrome domes can straighten me out.

In the design intent (GpuWorkFetch) we have the following:

A project P is "debt eligible" for a resource R if:

• P is not backed off for R, and the backoff interval is not at the max.
• P is not suspended via GUI, and "no more tasks" is not set.

Debt is adjusted as follows:

• For each debt-eligible project P, the debt is increased by the amount it's owed (delta T times its resource share relative to other debt-eligible projects) minus the amount it got (the number of instance-seconds).
• An offset is added to debt-eligible projects so that the net change is zero. This prevents debt-eligible projects from drifting away from other projects.
• An offset is added so that the maximum debt across all projects is zero (this ensures that when a new project is attached, it starts out debt-free).
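The three adjustment rules above can be sketched in a few lines. This is a literal reading of the bullets, not the client's actual code: the function and argument names are made up, and the bullets do not say whether the "owed" amount is scaled by the number of instances, so this sketch uses delta T times the share fraction exactly as written.

```python
def update_debts(debts, shares, eligible, used, dt):
    """Apply one debt update for a single resource.

    debts:    project name -> current debt
    shares:   project name -> resource share
    eligible: set of debt-eligible project names
    used:     project name -> instance-seconds received during the interval
    dt:       interval length in seconds
    """
    new_debts = dict(debts)
    if eligible:
        total_share = sum(shares[p] for p in eligible)
        # Rule 1: amount owed minus amount received.
        deltas = {
            p: dt * shares[p] / total_share - used.get(p, 0.0) for p in eligible
        }
        # Rule 2: offset so the net change over eligible projects is zero.
        offset = -sum(deltas.values()) / len(eligible)
        for p in eligible:
            new_debts[p] += deltas[p] + offset
    # Rule 3: offset so the maximum debt across all projects is zero.
    top = max(new_debts.values())
    return {p: d - top for p, d in new_debts.items()}


# Assumed scenario: equal shares on a quad core, where GPU Grid stays
# debt-eligible for the CPU but never receives any CPU instance-seconds.
debts = {"rosetta": 0.0, "gpugrid": 0.0}
shares = {"rosetta": 2000, "gpugrid": 2000}
for _ in range(3):
    debts = update_debts(
        debts, shares, {"rosetta", "gpugrid"},
        used={"rosetta": 240.0, "gpugrid": 0.0}, dt=60,
    )
print(debts)
```

Under those assumptions the Rosetta CPU debt falls by the same amount on every 60-second step while GPU Grid stays pinned at zero, which is the one-directional drift asked about earlier in the thread.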

What I am seeing, and my friend on GPU Grid/Rosetta is seeing, is a slow but inexorable growth of debt that eventually "chokes" off one project or another. I THINK I can explain why we are seeing different effects. His is easier.

He is dual project, Rosetta and GPU Grid. His ability to get Rosetta work is choking off.

The problem is that his debt is growing on Rosetta because of GPU Grid's lack of CPU work. So, BOINC "thinks" that GPU Grid is "owed" CPU time and is vainly trying to get work from that project. Eventually, because RS is now biased by compute capability, the multiplier drives his debt into the dirt pretty fast, and soon he has trouble getting a queue of CPU work from Rosetta, because the client wants to get CPU work from GPU Grid to restore "balance".

I have the opposite problem for the same reason. But, mine is because I have 4 GPUs in an 8 core system so my bias is in the other direction ... eventually driving my GPU debt because I am accumulating GPU debt against all 30 other projects ...

My Q9300 sees less of this because the quad core is likely fairly balanced against the GTX280 card, so the debt driver is acting more slowly: the GPU is fast enough that the debts stay sort of in balance (best guess). Or, to put it another way, the 30 projects are building up GPU debt at about the same rate that GPU Grid is running up CPU debt in the other direction ... Sooner or later, though, I do hit walls there and have had to hit debt reset to get back in balance.

This may ALSO partly explain Richard's observation on nil calls to projects (which I also see) where the system is trying manfully to get work from a project that cannot supply it. In my case it is often a call to GPU Grid to get CPU work. Not going to happen.

Not sure how to cure this, in that for one thing I think there are at LEAST two problems buried in there, if not three.

In effect, we really, really need to track which projects supply CPU work, which GPU, and which both ... and by that I mean the ones that the participant has allowed. So, the debt for me for GPUs should only reflect activities on GPU Grid, my sole attached project with GPU work, and GPU Grid should never be accumulating CPU debt.
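One way to picture that suggestion is an eligibility test with an extra condition: a project only counts for a resource it actually supplies. The sketch below is hypothetical throughout. The field names (`supplies`, `backed_off`, and so on) are invented for illustration, the backoff rule is simplified to a plain flag per resource, and nothing here is taken from the client source.

```python
def debt_eligible(project, resource):
    """Hypothetical per-resource eligibility test.

    Combines the two eligibility rules from the design notes (simplified)
    with the proposed extra rule that the project must supply work for
    the resource in the first place.
    """
    return (
        resource in project["supplies"]            # proposed extra rule
        and resource not in project["backed_off"]  # simplified backoff rule
        and not project["suspended"]
        and not project["no_more_tasks"]
    )


# Assumed setup: Rosetta supplies CPU work only, GPU Grid GPU work only.
projects = {
    "rosetta": {"supplies": {"cpu"}, "backed_off": set(),
                "suspended": False, "no_more_tasks": False},
    "gpugrid": {"supplies": {"gpu"}, "backed_off": set(),
                "suspended": False, "no_more_tasks": False},
}
cpu_eligible = sorted(n for n, p in projects.items() if debt_eligible(p, "cpu"))
gpu_eligible = sorted(n for n, p in projects.items() if debt_eligible(p, "gpu"))
print(cpu_eligible, gpu_eligible)
```

With a filter like this, GPU Grid would drop out of the CPU debt calculation entirely, so it could neither accumulate CPU debt nor push Rosetta's CPU debt down.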
ID: 60808 · Rating: 0
TomaszPawel
Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 60812 - Posted: 24 Apr 2009, 21:50:20 UTC - in response to Message 60802.  



6.6.15 is also affected.

I will try to make a clean install of 6.6.20:

uninstall 6.6.15, then delete the BOINC files and folders....
ID: 60812 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60816 - Posted: 25 Apr 2009, 3:45:48 UTC

I would just roll back to 6.5.0 ... the only thing you lose is that the CPU time column shows real CPU time on the GPU tasks, so you don't get that nice "elapsed" display.

We are trying to get them to look at this ... honest ...

The problem is that there are at least 3 or 4 main show-stopper type issues that we are working on (this being one) and it is difficult to get them to focus ...
ID: 60816 · Rating: 0