Rosetta 4.0+

[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 42
Credit: 1,258,039
RAC: 0
Message 95287 - Posted: 24 Apr 2020, 9:05:06 UTC

@Grant + Sid: I don't really understand what the problem is with having tasks (whatever the number) cancelled by the server because the deadline is reached. Are they not sent back to other crunchers? The calculation will be done in the end, and no resource will actually be "wasted", correct? Or is it just about the "error count"? In the end it should only affect me, not the project...?

Regarding the Rosetta deadline, I had indeed not noticed it was so short. But my cache is not "Rosetta only": I've always been a multi-project BOINCer. It's true that it's an old habit from when the internet was not so stable and projects would often run short of tasks, so having a cache was always a pleasant idea.

But again: this was absolutely not the problem I faced with the mini tasks (see the whole history of my explanations above). And again, I "solved" it by blocking the mini tasks on that machine; that was enough for me and did no harm to the project's research.

Thanks.

Bryn Mawr
Joined: 26 Dec 18
Posts: 392
Credit: 12,101,016
RAC: 5,587
Message 95289 - Posted: 24 Apr 2020, 9:28:32 UTC - in response to Message 95287.  

@Grant + Sid: I don't really understand what the problem is with having tasks (whatever the number) cancelled by the server because the deadline is reached. Are they not sent back to other crunchers? The calculation will be done in the end, and no resource will actually be "wasted", correct? Or is it just about the "error count"? In the end it should only affect me, not the project...?

Regarding the Rosetta deadline, I had indeed not noticed it was so short. But my cache is not "Rosetta only": I've always been a multi-project BOINCer. It's true that it's an old habit from when the internet was not so stable and projects would often run short of tasks, so having a cache was always a pleasant idea.

But again: this was absolutely not the problem I faced with the mini tasks (see the whole history of my explanations above). And again, I "solved" it by blocking the mini tasks on that machine; that was enough for me and did no harm to the project's research.

Thanks.


Yes they are sent to other clients but three days late and if the researchers need the results pronto then that is a big problem.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,819,328
RAC: 22,701
Message 95291 - Posted: 24 Apr 2020, 9:46:13 UTC - in response to Message 95287.  
Last modified: 24 Apr 2020, 9:50:55 UTC

@Grant + Sid: I don't really understand what the problem is with having tasks (whatever the number) cancelled by the server because the deadline is reached. Are they not sent back to other crunchers? The calculation will be done in the end, and no resource will actually be "wasted", correct? Or is it just about the "error count"? In the end it should only affect me, not the project...?
The problem is that it actually takes considerable time & effort to produce the WUs in the first place, before they can be moved here to Rosetta for us to process. (That was why the project was out of work for a day or 2 a while back: the huge surge in new crunchers took them by surprise, and producing new work wasn't just a case of a few keystrokes - it took time.) And if a WU errors out, and then gets sent to another system that is having problems as well, it's a complete loss.
And even if it does get done by another system, it would have been nice if that system had been able to process some new work, not something that had to be resent because it timed out. And of course the result takes longer to get back than if it had been processed the first time around.

The Estimated completion times eventually get close, but never correct, so BOINC is always going to underestimate how long it takes to return work. Especially so as some Tasks run a lot longer than the Target time, which outweighs those that finish early. Yes, if there is an error, it goes out again to be checked. But having to do that because a system keeps missing deadlines really is a waste of resources.
If you're not going to process it, then why download it? Especially when you can easily stop that from happening?



Regarding the Rosetta deadline, I had indeed not noticed it was so short. But my cache is not "Rosetta only": I've always been a multi-project BOINCer. It's true that it's an old habit from when the internet was not so stable and projects would often run short of tasks, so having a cache was always a pleasant idea.
I'm used to the same thing, with Seti having regular & irregular, short & extended outages.
But Rosetta isn't Seti, so I don't need a 4 day cache.
I'm down to a 0.6 day cache now, and Rosetta is the only project I'm doing. If I did another project as well I wouldn't even have this much of a cache.



But again: this was absolutely not the problem I faced with the mini tasks (see the whole history of my explanations above). And again, I "solved" it by blocking the mini tasks on that machine; that was enough for me and did no harm to the project's research.
Yet when I checked out your system at the time you originally posted, before you implemented your fix, most of your errors weren't Rosetta Mini Tasks that had Computation errors, but Rosetta Tasks that had missed their deadlines.
The missed deadlines alone were producing more Errors than you were producing Valid work. And that does harm the project. As they say, "Even computation errors are useful", as they let them determine what is wrong. But missed deadlines aren't useful, just a waste of server time, bandwidth & other systems' time checking something that shouldn't require checking.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,819,328
RAC: 22,701
Message 95292 - Posted: 24 Apr 2020, 9:47:22 UTC - in response to Message 95289.  
Last modified: 24 Apr 2020, 9:48:50 UTC

Yes they are sent to other clients but three days late and if the researchers need the results pronto then that is a big problem.
if they don't end up with another such system and error out again.
Grant
Darwin NT

Bryn Mawr
Joined: 26 Dec 18
Posts: 392
Credit: 12,101,016
RAC: 5,587
Message 95295 - Posted: 24 Apr 2020, 10:28:36 UTC - in response to Message 95292.  

Yes they are sent to other clients but three days late and if the researchers need the results pronto then that is a big problem.
if they don't end up with another such system and error out again.


Ouch - I’d assumed that crunchers with long deadlines were a small minority but if some WUs are hitting multiple deadlines maybe not.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,819,328
RAC: 22,701
Message 95296 - Posted: 24 Apr 2020, 10:43:31 UTC - in response to Message 95295.  
Last modified: 24 Apr 2020, 10:44:37 UTC

Ouch - I’d assumed that crunchers with long deadlines
It's not so much the deadline that's the problem, as it is the combination of deadlines, large cache, multiple projects, and the Estimated completion times being less (sometimes a lot less) than what the actual Run time will be (and you've got the 10 hour watchdog timer for those units that run long...).
And if the posts here are anything to go by, there are quite a few of them about.
Grant
Darwin NT
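To make the combination above concrete, here is a rough back-of-the-envelope sketch in Python. It is not how BOINC actually schedules work: it assumes the client fills the cache using the Estimated run time but then works through it at the actual Run time, and it ignores other projects sharing the host (which would only stretch things further). The function names and the 6 h / 8 h figures are made up for illustration; the 3-day deadline and the 4-day and 0.6-day caches are values mentioned in this thread.

```python
def worst_case_turnaround_days(cache_days, estimated_hours, actual_hours):
    """Days until the last task fetched into a full cache is finished."""
    # The cache is filled according to the estimate but drained at the
    # actual run time, so the queue stretches by actual / estimated.
    queue_days = cache_days * (actual_hours / estimated_hours)
    # The last task still has to run once it reaches the front of the queue.
    return queue_days + actual_hours / 24.0


def misses_deadline(cache_days, estimated_hours, actual_hours, deadline_days=3.0):
    """True if the last queued task would be returned after the deadline."""
    turnaround = worst_case_turnaround_days(cache_days, estimated_hours, actual_hours)
    return turnaround > deadline_days


# Hypothetical host: tasks estimated at 6 h that really take 8 h.
for cache in (4.0, 0.6):
    days = worst_case_turnaround_days(cache, estimated_hours=6.0, actual_hours=8.0)
    late = misses_deadline(cache, estimated_hours=6.0, actual_hours=8.0)
    print(f"cache {cache} days: last task back after {days:.2f} days, late={late}")
```

Under these assumptions the large cache overruns a 3-day deadline even before the underestimate is taken into account, while the small cache leaves plenty of margin.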

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 51
Message 95301 - Posted: 24 Apr 2020, 12:42:24 UTC
Last modified: 24 Apr 2020, 12:43:51 UTC

The thread has raised a doubt in my mind. I have my preferred run time set to 12 hours. I know the work unit has a model: it generates a random number and runs the model to completion. It then looks to see how long that took and how much time is left under my preference, and if suitable, generates a new random number and runs the process again, and again, ad infinitum, until it decides there is not enough time to run it again; at that point, it ends the work unit. With the urgency of the current situation, I can see the possibility that the first run of the model had a critical result, but that it was not returned for hours whilst the work unit ran with different random start points. Should the preferred runtime be set down, at least temporarily?
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
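The loop described in this post can be sketched roughly as follows. This is only an illustration of the description above, not the actual Rosetta code: the stopping rule (stop when the time left is shorter than the slowest model so far), the `run_work_unit` and `toy_model` names, and the fixed 1.5 hours per model are all assumptions.

```python
import random


def run_work_unit(target_seconds, run_model, rng=None):
    """Run models from fresh random seeds until another one is unlikely to fit."""
    rng = rng or random.Random()
    elapsed = 0.0   # CPU time spent so far, in seconds
    slowest = 0.0   # longest single model seen so far
    results = []
    while True:
        seed = rng.getrandbits(32)        # new random starting point
        result, cost = run_model(seed)    # run one model to completion
        results.append(result)
        elapsed += cost
        slowest = max(slowest, cost)
        # Stop once the remaining time is shorter than the slowest model.
        if target_seconds - elapsed < slowest:
            break
    # All results are only handed back here, when the work unit ends.
    return results, elapsed


def toy_model(seed):
    """Stand-in for a real model: pretends every run costs 1.5 hours."""
    return seed % 1000, 1.5 * 3600


results, elapsed = run_work_unit(12 * 3600, toy_model, random.Random(1))
print(len(results), "models in", elapsed / 3600, "hours")
```

Note that in this sketch nothing is reported until the loop ends, which is exactly the delay the question is about.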

Sid Celery
Joined: 11 Feb 08
Posts: 2123
Credit: 41,204,457
RAC: 10,266
Message 95306 - Posted: 24 Apr 2020, 13:37:39 UTC - in response to Message 95287.  

@Grant + Sid: I don't really understand what the problem is with having tasks (whatever the number) cancelled by the server because the deadline is reached. Are they not sent back to other crunchers? The calculation will be done in the end, and no resource will actually be "wasted", correct? Or is it just about the "error count"? In the end it should only affect me, not the project...?

The recent post by bcov explained the reasoning. Up to 2-3yrs ago, the server software used meant it wasn't even possible for tasks to be aborted before running. Then they upgraded and aborting tasks was a rare event. The recent cancellation of running tasks is the first time I've ever seen that happen. But it was a decision from the project admins - no need for us to dwell on it.

Sid Celery
Joined: 11 Feb 08
Posts: 2123
Credit: 41,204,457
RAC: 10,266
Message 95307 - Posted: 24 Apr 2020, 13:44:51 UTC - in response to Message 95301.  

Should the preferred runtime be set down, at least temporarily?

Not if they can still meet the deadline. A whole batch of tasks is issued and results are returned sooner or later within the deadline. No-one's expecting them to be returned instantaneously.
8hr tasks as a default (containing multiple results) or 12hrs is fine. Even 24 & 36hrs, as long as they still meet the deadline.
Deadlines were cut from 8 days to 3 days. I think that addressed the concerns you're thinking about.

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 51
Message 95308 - Posted: 24 Apr 2020, 14:12:41 UTC - in response to Message 95307.  
Last modified: 24 Apr 2020, 14:23:33 UTC

Fair enough, I'll leave it alone. I'm not seeing anything likely to hit the deadline at the moment.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,819,328
RAC: 22,701
Message 95320 - Posted: 24 Apr 2020, 20:54:40 UTC - in response to Message 95301.  

I can see the possibility that the first run of the model had a critical result, but that it was not returned for hours whilst the work unit ran with different random start points. Should the preferred runtime be set down, at least temporarily?
Or it could be that last run.
I figure the 8 hour default was chosen by the project as a good compromise between as many models as possible, and very few models. 8 hours gives them a good selection of models to work with, but they do have the 10 hour Watchdog timer, so if more time is needed for a Task that is producing exceptionally good data, then that's what happens. And if it ends up running into a dead end, or producing too much data (the 500MB result file limitation), it will bail out early.
Better to run 36hr Target CPU time Tasks that are returned before the deadline than to run 2hr Target CPU time Tasks when most of them don't make the deadline. But doing that does require appropriate project settings that take into account the deadlines & the shorter than actual Estimated completion times.
Grant
Darwin NT

[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 42
Credit: 1,258,039
RAC: 0
Message 95355 - Posted: 25 Apr 2020, 16:13:41 UTC

OK I get your point, rosetta requires a short cache.

We'll see in the future whether I put this machine back on it.

[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 42
Credit: 1,258,039
RAC: 0
Message 95915 - Posted: 3 May 2020, 11:18:58 UTC

OK, so now I've decided to give it another go because of the Pentathlon, which chose (what a surprise) Rosetta as the main project.

On that same machine, my cache is now

work_buf_min_days = 0
work_buf_additional_days = 0.2

(I could verify it in the global_prefs.xml file, and the global_prefs_override.xml is empty)

Quite reasonable, isn't it ?

Also it is limited to 8 tasks (using app_config.xml) because of the reduced RAM of that machine.
And I am still blocking the mini task (using app_info.xml) because I don't feel like trying, and fighting, again.

And guess what, it has downloaded MORE THAN 1000 TASKS on the machine !!!!

Who is to blame ? not me ! Hundreds of tasks are going to be cancelled by the server within a few days...
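
For comparison, here is a rough estimate of how many tasks those settings should hold. It is a minimal Python sketch, assuming the buffer is simply the sum of the two settings and that each task runs for the 8-hour default mentioned earlier in the thread. The XML string is a hypothetical fragment in the shape of global_prefs.xml, and the helper names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like BOINC's global_prefs.xml, holding
# only the two cache settings quoted in the post.
PREFS_XML = """
<global_preferences>
    <work_buf_min_days>0</work_buf_min_days>
    <work_buf_additional_days>0.2</work_buf_additional_days>
</global_preferences>
"""


def cache_days(prefs_xml):
    """Total work buffer in days (minimum plus additional)."""
    root = ET.fromstring(prefs_xml)
    min_days = float(root.findtext("work_buf_min_days", default="0"))
    extra_days = float(root.findtext("work_buf_additional_days", default="0"))
    return min_days + extra_days


def expected_queued_tasks(days, concurrent_tasks, task_hours):
    """Rough number of tasks needed to keep the slots busy for `days`."""
    return days * 24.0 * concurrent_tasks / task_hours


days = cache_days(PREFS_XML)
# 8 concurrent tasks comes from the post; the 8-hour run time is an
# assumption for this host.
estimate = expected_queued_tasks(days, concurrent_tasks=8, task_hours=8.0)
print(f"cache: {days} days, roughly {estimate:.1f} tasks expected")
```

With these numbers the estimate comes out at around 5 tasks, nowhere near 1000.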

(I still think it should normally not be a big problem for the project itself, but apparently all of your scholarly demonstrations above tend to show the contrary, so I hope "everybody" is not going to be angry at me again here...)

Millenium
Joined: 20 Sep 05
Posts: 68
Credit: 184,283
RAC: 0
Message 95933 - Posted: 3 May 2020, 15:02:59 UTC

Lol, I have to say that is funny. I reattached to the project for the new address and it just downloaded like 20 tasks or so. BOINC says I have 0.3 and 0.5, so similar values for the work buffer.

Admin
Project administrator
Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 95968 - Posted: 4 May 2020, 0:03:20 UTC

Please let me know if this is still an issue. I updated the server scheduler to hopefully fix this cache size issue.

[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 42
Credit: 1,258,039
RAC: 0
Message 95993 - Posted: 4 May 2020, 7:18:35 UTC

Thank you !

Actually now there is nothing I can do but wait, I see the server has started to reclaim a few tasks and I suppose it is going to do it at a larger scale soon.

Since you are here: if you look at my posts from a few weeks ago, I had a problem with all the mini tasks on this host (I posted the kind of error I got then), so I was forced to block the mini tasks by using an app_info file that declares only the rosetta app. It is quite tedious, since I have to update the file and also download the application versions manually (mine are still 4.15 but I see I must now go to 4.20). But on the other hand I don't want to risk blocking several cores with unlimited wasted CPU cycles again with those mini tasks that this machine really doesn't like...

Do you have any idea where it may come from? A library version or something like that?

Thanks.

Admin
Project administrator
Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 95994 - Posted: 4 May 2020, 7:27:07 UTC - in response to Message 95993.  

The mini tasks should soon be gone forever since we have deprecated the app. There was however a batch that was submitted recently but I imagine most of those tasks have completed by now.

[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 42
Credit: 1,258,039
RAC: 0
Message 96034 - Posted: 4 May 2020, 15:39:37 UTC - in response to Message 95994.  

This is good news for me then; I'll be able to remove that app_info and go back to fully automated mode!

Thanks for the info.

[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 42
Credit: 1,258,039
RAC: 0
Message 96162 - Posted: 6 May 2020, 11:00:16 UTC

As expected it canceled hundreds of tasks.

The cache instructions seem to be followed, I don't have hundreds of tasks anymore.

I did remove the app_info and I'm now getting 4.20 tasks.

Obviously it killed all the tasks that were running when I restarted BOINC after removing that file, but I suspected this would happen anyway...

Thanks.

Sid Celery
Joined: 11 Feb 08
Posts: 2123
Credit: 41,204,457
RAC: 10,266
Message 96196 - Posted: 7 May 2020, 5:57:07 UTC - in response to Message 95994.  

The mini tasks should soon be gone forever since we have deprecated the app. There was however a batch that was submitted recently but I imagine most of those tasks have completed by now.

Now all returned



©2024 University of Washington
https://www.bakerlab.org