Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Sid Celery · Joined: 11 Feb 08 · Posts: 2335 · Credit: 44,217,916 · RAC: 27,821
However, when you later write...

Checking in on this (because I have nothing better to do): I did note the cache had reduced from 870 to 700-ish at the time of my previous post, but didn't know if that was just a random fluctuation. Now I can see runtime has been knocked back to 8hrs, the cache is down to 367, deadlines were only being missed by 7hrs rather than a day, and there was a delay of over a day in downloading fresh tasks, so deadlines will now start to be hit. There's also a further delay in downloading going on right now, so that (I speculate) in about a day's time the runtime can be increased to 24hrs again.

If I've got that right, and Tom hasn't come back to confirm it yet, that's all exactly the right thing to do. Good job
Sid Celery · Joined: 11 Feb 08 · Posts: 2335 · Credit: 44,217,916 · RAC: 27,821
Your cache seems to hold 870 tasks (including running tasks).

I know I'm obsessing over this, but I'm at a loose end, so why not...
- In-progress tasks are down to 286
- All the "Not started by deadline - canceled" and "Timed out - no response" error messages have disappeared
- Errored tasks are down by 200, with no new ones being added
- And task returns are already beating deadlines by as much as 1 day 10hrs, so there's no risk of missing out on credit and no resends to other users who'd later find them cancelled by the server

All the problem issues are solved, and with quite some headroom. With a 128-thread server I wouldn't reduce the cache size any further - some might already consider that number to be on the low side, especially when tasks ready to send are so hand-to-mouth. I'd also increase the task runtime from 8 to 12hrs, which I personally consider a sweeter spot: less problematic for BOINC scheduling than 24hr runtimes, fewer server hits than 8hrs, and better placed for all the other vagaries we have to contend with here.

It all looks neatly balanced atm, with that option to slightly increase runtime as well without recreating problems. IMO
Tom M · Joined: 20 Jun 17 · Posts: 127 · Credit: 28,009,619 · RAC: 103,855
It looks like I switched back to the 8 hour profile overnight. I will change the 22-24 hour profile to 12. Boincmgr is set to 0.1/0.01 right now.

Do we have any idea what the computation errors are triggered by? I would like to lower my computation errors if possible. I am getting them on both of my systems: the Ryzen 3700X and the Epyc CPU system. Thank you.

===edit===
Bumped the cache from 0.1 to 0.2. The profile is now set to 12 hours.

Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
Tom M · Joined: 20 Jun 17 · Posts: 127 · Credit: 28,009,619 · RAC: 103,855
Apparently everyone is dying on line 2798 of the Beta tasks:

"...ERROR: Error in simple_cycpep_predict app! The imported native pose has a different number of residues than the sequence provided...."

Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
Tom M · Joined: 20 Jun 17 · Posts: 127 · Credit: 28,009,619 · RAC: 103,855
And I have started my polling script again.

Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
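Tom doesn't say what his polling script actually does, so the following is only a guess at the general shape of such a thing: a minimal loop that uses boinccmd to count the tasks on hand and asks the client to contact the project when the queue looks thin. The project URL, the 60-task threshold, the host/password handling and the parsing of boinccmd's text output are all assumptions for illustration, not Tom's script.

```python
#!/usr/bin/env python3
"""Illustrative polling loop (NOT Tom's actual script): watch the local BOINC
client and nudge it to ask Rosetta@home for work when the queue looks thin."""
import subprocess
import time

PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"  # assumed project URL
MIN_TASKS = 60        # arbitrary threshold chosen for illustration
POLL_SECONDS = 600    # check every 10 minutes

def count_tasks() -> int:
    """Count task entries in `boinccmd --get_tasks` output.
    Parsing is approximate and depends on boinccmd's plain-text format."""
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    return sum(1 for line in out.splitlines() if line.strip().startswith("name:"))

while True:
    n = count_tasks()
    print(f"{time.strftime('%H:%M:%S')}  tasks on hand: {n}")
    if n < MIN_TASKS:
        # Ask the client to contact the project scheduler for more work.
        subprocess.run(["boinccmd", "--project", PROJECT_URL, "update"], check=True)
    time.sleep(POLL_SECONDS)
```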
Sid Celery · Joined: 11 Feb 08 · Posts: 2335 · Credit: 44,217,916 · RAC: 27,821
If I've got that right, and Tom hasn't come back to confirm it yet, that's all exactly the right thing to do. Good job

On the computation errors, this comes from the project, not from any of us. The last I heard, back in the days when someone at Rosetta was speaking to me, was that it was easier to let those tasks error out after a very few seconds than to try to extract them from the queue, which would take out a lot of good tasks as well as the bad. If that view still holds, it's something we're going to continue to suffer, unfortunately. Not ideal, but pragmatic.

On the cache size, I do agree with Grant's view that it should be kept low, BUT only if there's a constant supply of tasks for us. For some months now we <haven't> had a constant supply ready to send, and this is only made worse by all the tasks that error out. As such I can't agree with the cache being only 0.1 or 0.2 + 0.01. With the number of threads you have, the hand-to-mouth supply of tasks and the regular computation errors, I would aim for a cache size somewhere between 0.5 and 1.0 + 0.01. That strikes me as the right ballpark for safety & reliability within the deadline, but tweak it to your own view of each of those competing issues within those bounds. Any less and I can see you regularly having threads free without work. Supply isn't trustworthy enough and, with all the computation errors, you can't entirely rely even on what you do get. Having a 12hr runtime rather than 8hrs gives you that little bit more time to get good tasks through - that's one of its pluses IMO.

Fwiw, on my own machines I've now settled on a 12hr runtime with a cache of 0.4 + 0.1, which works pretty well with just 16 threads on my main PC (and 6 threads on another and 8 on my work PC), though I run 2 other low-priority projects as a backup against unforeseen eventualities while they're unattended.

Edit: I see you're down to just 149 tasks now, which will be your 128 threads and only 21 tasks waiting to start as others complete. This is way too tight. You're most likely asking for tasks already, but the project hasn't got them to send you. If you were asking for tasks with 0.5 days' worth left, rather than only 0.1 or 0.2, you'd stand a much better chance of getting some in time. Even 0.5 days may not be enough, tbh. You can only see how it goes. It's no good swinging from having too many tasks to complete by deadline all the way to not having enough to keep all your threads occupied. There's a balance to find somewhere between the two.

Edit 2: Go straight to 1.0 + 0.01 - even if Rosetta had them all to send you, it'd only be ~300 tasks including running ones, which is far from excessive on a 128-thread server. It'd still be nearly 600 fewer than you were stockpiling before.
Sid Celery · Joined: 11 Feb 08 · Posts: 2335 · Credit: 44,217,916 · RAC: 27,821
Edit: I see you're down to just 149 tasks now, which will be your 128 threads and only 21 tasks waiting to start as others complete. This is way too tight. You're most likely asking for tasks already, but the project hasn't got them to send you. If you were asking for tasks with 0.5 days' worth left, rather than only 0.1 or 0.2, you'd stand a much better chance of getting some in time. Even 0.5 days may not be enough, tbh.

I think I panicked. Your tasks dropped to 147 (down to 19 waiting to start) - I didn't know when it was going to stop going down. Then your 12hr runtime kicked in, and the cache increase from 0.1 to 0.2 as well, and tasks are already back up to 160 (32 waiting to start).

Still increase your cache, but I now think 0.5 + 0.01 will be enough - no need to go all the way to 1.0 + 0.01. This time tomorrow I expect your cache to be close to 240 12hr tasks, which should be comfortable from every perspective.
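For anyone wanting to sanity-check these task-count estimates, here is a rough back-of-the-envelope calculation of how many tasks a given work-buffer setting implies. The formula is my own simplification of BOINC's work fetch (it ignores scheduling details and assumes running tasks are, on average, half finished); the settings plugged in are the ones discussed above, and the results land in the same ballpark as the figures in these posts without pretending to be exact.

```python
def estimated_tasks(threads: int, runtime_hours: float, buffer_days: float) -> tuple[int, int]:
    """Very rough estimate of (waiting, total) tasks for a BOINC work buffer.

    Simplifying assumptions (not BOINC's exact work-fetch logic):
      * the client tries to hold ~threads * buffer_days CPU-days of remaining work
      * running tasks are on average ~50% complete
    """
    runtime_days = runtime_hours / 24.0
    target_cpu_days = threads * buffer_days                # remaining work wanted
    running_cpu_days = threads * runtime_days * 0.5        # credit for work in flight
    waiting = max(0, round((target_cpu_days - running_cpu_days) / runtime_days))
    return waiting, waiting + threads                      # (waiting, waiting + running)

# Tom's 128-thread Epyc with 12hr tasks at the buffer sizes discussed above:
for buf in (0.1 + 0.01, 0.5 + 0.01, 1.0 + 0.01):
    waiting, total = estimated_tasks(threads=128, runtime_hours=12, buffer_days=buf)
    print(f"buffer {buf:.2f} days -> ~{waiting} waiting, ~{total} tasks in total")
```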
Grant · Joined: 28 Mar 20 · Posts: 1839 · Credit: 18,534,891 · RAC: 0
As such I can't agree with the cache only being 0.1 or 0.2 + 0.01

Why? The idea of a cache is to keep your system busy if there is a lack of work or issues contacting the servers. If you run just one project and, like Rosetta, it's poorly (or not at all) managed, then having a cache will help keep your system busy when the project's having issues. But if you're running a single project with plenty of work and good admin support, keep a few hours' worth if you feel the need. And running multiple projects? No cache is really necessary or desirable, let alone multiple days' worth.

Rosetta isn't the only project he's participating in, so there's no need for a cache at all to keep his systems busy. 0.1 days and 0.01 days means your system will report work pretty much as soon as it's done, and will have some Tasks on hand ready to go as others finish, so the system isn't waiting to download work before starting on new work once a Task finishes - even if you get a few that might error out as soon as they start.

Grant
Darwin NT
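For reference, the "0.1/0.01" figures being discussed are BOINC's two work-buffer preferences ("Store at least X days of work" and "Store up to an additional Y days of work"). They are normally set in BOINC Manager, but can also be pushed from a script via a global_prefs_override.xml file. The sketch below assumes the standard BOINC tag names and that boinccmd is on the path; the data-directory location is an assumption and varies by platform.

```python
from pathlib import Path
import subprocess

# Assumed BOINC data directory - adjust for your platform
# (e.g. /var/lib/boinc-client on many Linux distros, C:\ProgramData\BOINC on Windows).
BOINC_DATA_DIR = Path("/var/lib/boinc-client")

def set_work_buffer(min_days: float, extra_days: float) -> None:
    """Write a minimal global_prefs_override.xml and tell the client to re-read it."""
    override = (
        "<global_preferences>\n"
        f"  <work_buf_min_days>{min_days}</work_buf_min_days>\n"
        f"  <work_buf_additional_days>{extra_days}</work_buf_additional_days>\n"
        "</global_preferences>\n"
    )
    (BOINC_DATA_DIR / "global_prefs_override.xml").write_text(override)
    subprocess.run(["boinccmd", "--read_global_prefs_override"], check=True)

if __name__ == "__main__":
    # e.g. the small buffer Grant recommends:
    set_work_buffer(0.1, 0.01)
```

Note this overwrites any existing override file, so it's only a sketch of the mechanism, not a drop-in tool.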
Grant · Joined: 28 Mar 20 · Posts: 1839 · Credit: 18,534,891 · RAC: 0
Do we have any idea what the computation errors are triggered by? I would like to lower my computation errors if possible. I am getting them on both of my systems: the Ryzen 3700X and the Epyc CPU system.

It's been an issue that has been reported for years. No action has been taken by the project to improve either the BOINC science application (to better handle the error) or the applications they use to create work. So every so often you can get a batch of work with a few Tasks here & there that error out, or a batch where a huge percentage of them just error out.

And there's another common error that's been around even longer, which can strike at any time - right from the start of processing all the way until just before the task is ready to report. 8 (or more) hours of work down the toilet, just like that.

Grant
Darwin NT
angel · Joined: 14 Jun 22 · Posts: 2 · Credit: 105,815 · RAC: 19
Hello, for a few weeks Rosetta has not been running; I get the message "feeder is not running". I installed Rosetta again and I still have the problem. Any idea of a solution? Thank you / angel (France)
Grant · Joined: 28 Mar 20 · Posts: 1839 · Credit: 18,534,891 · RAC: 0
Hello, for a few weeks Rosetta has not been running; I get the message "feeder is not running". I installed Rosetta again and I still have the problem.

From a previous post by Greg_BE:

It's an IPv6 address error. A server went crazy, so we use this workaround to solve that:

Grant
Darwin NT
Sid Celery · Joined: 11 Feb 08 · Posts: 2335 · Credit: 44,217,916 · RAC: 27,821
As such I can't agree with the cache only being 0.1 or 0.2 + 0.01
Why?

For the specific reasons detailed prior to the words "as such". That's what "as such" means. (Lol)

Last I looked last night, Tom's cache had increased from 147 (128 running + 19 waiting) to 190 (128 running + 62 waiting). This morning it was 126 - all running, with 2 threads idle. Exactly as I feared and anticipated, because 0.1 + 0.01 didn't provide a sufficient buffer to cover for the project's failure to supply. Which is something we've known about for months now.

As I said in what I wrote prior to "as such", I completely agree with you if we have reliability of supply from the project, but we've all known, every single day for literally months, that we don't. I don't know what kind of extra hint there needs to be. 0.1 + 0.01 didn't survive 1 day. We don't know how long 0.2 + 0.01 will survive, but the project's reliability doesn't make me think it'd be much more than a week (I'm guessing obvs, but it's a guess borne of experience).

I'm speculating that 0.5 + 0.01 will be sufficient to cover a continuation of what we've seen this year, while not hoarding an excess of tasks and not risking a failure to meet deadline. If that turns out not to be the case, it'll be for a reason we can't predict right now and can revisit if it arises.

There is zero risk of 0.5 + 0.01 being too large a cache, even with a 12hr runtime. One day's worth of tasks (cache plus runtime) with a 3-day deadline ensures a speedy return of tasks and <no chance whatsoever> of missing deadline. The <only> risk is that the cache is too small due to the project's inability to supply, and threads go unutilised. At 0.1, supply reliability makes that risk high (almost guaranteed). At 0.2 it's likely, but at an unknown frequency. At 0.5, I speculate most irregularity of supply is covered, unless queued tasks (front page) drop to zero for an extended period, in which case no cache size will be enough.

Your preference is what 'should' happen. Mine is what we know actually happened over the last ~3 months.
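To put a number on the "no chance whatsoever of missing deadline" claim: in the worst case a freshly downloaded task has to wait behind roughly the whole buffer before it starts, then run for its full runtime. The wait-then-run model below is my own approximation, using the settings discussed above.

```python
# Worst-case turnaround for a newly downloaded task, as a crude approximation:
# it waits behind ~buffer_days of queued work, then runs for its full runtime.
buffer_days = 0.5 + 0.01      # the suggested cache
runtime_days = 12 / 24        # 12hr target runtime
deadline_days = 3             # Rosetta's usual deadline

worst_case = buffer_days + runtime_days
print(f"worst-case turnaround ~{worst_case:.2f} days "
      f"vs a {deadline_days}-day deadline "
      f"({deadline_days - worst_case:.2f} days of headroom)")
```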
Sid Celery · Joined: 11 Feb 08 · Posts: 2335 · Credit: 44,217,916 · RAC: 27,821
And there's another common error that's been around even longer, which can strike at any time - right from the start of processing all the way until just before the task is ready to report. 8 (or more) hours of work down the toilet, just like that.

The other day I got 3 consecutive "Validation error"s. No error in the task, but failed validation. 3 x 12hrs down the pan. Highly annoyed at the time. The overnight job that used to run daily to credit tasks that only had a validation error can't come back soon enough <sigh>
Tom M · Joined: 20 Jun 17 · Posts: 127 · Credit: 28,009,619 · RAC: 103,855
I just bumped the cache up to 0.5 days. And set the CPU limit (a Pandora parameter) to 300. The polling script is still running.

I do have several projects set to "0" resources, so I should have other tasks to process if Rosetta were to burp again. Right now the Epyc system is running 100% Rosetta, with a couple of GPU tasks from Einstein@Home.

I am hoping that yesterday's Free-DC result is a reliable signal of good things to come.

Thank you for your discussion and guidance! Respectfully,

Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
Sid Celery · Joined: 11 Feb 08 · Posts: 2335 · Credit: 44,217,916 · RAC: 27,821
I just bumped the cache up to 0.5 days. And set the CPU limit (a Pandora parameter) to 300.

I noticed. Your cache shot right up to 300 around the time you posted (570 lower than when this exercise started). It looks bang on to me, with your 12hr runtimes coming through. If it ever goes wrong from there, it's my fault. I don't expect it to, short of the project itself having a major extended issue. Good stuff.