Message boards : Number crunching : March 2022 - WU error rates
Author | Message |
---|---|
jay Joined: 12 Jan 08 Posts: 20 Credit: 195,801 RAC: 0 |
Greetings, I am running the non-vbox WUs and getting errors. I compared my errored results with my wingman's. For example, on one WU a Windows volunteer's task errored out in 20 seconds. My Linux task ran for 35939 seconds and got the error "Too many errors (may have bug) Too many total results" on validation. See https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1315068895 What gives? Is this a time of testing the WU? If so, can they be tested by Rosetta before releasing? For me it is a matter of electricity, heat, and non-productive work. Another WU, https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1315163591 has already errored out for my Windows wingman, while my Linux box is still crunching. (It has run for about 6 hours with 3.5 hours remaining. I am concerned about a similar failure.) I looked on the forums for recent errors but did not see any. Is anyone else having problems, or having no errors? THANKS, Jay |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,552,383 RAC: 6,167 |
Is this a time of testing the WU? Yes. If so, can they be tested by Rosetta before releasing? No, Ralph@Home is largely unused. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1677 Credit: 17,750,280 RAC: 22,926 |
What gives? Due to the nature of Rosetta work, Tasks that error out can still give useful results, which is why in most cases you will still get Credit for a Task that produces an error. Unfortunately there has been little work (if any) on coding the application so that such Tasks just end early (as they should) instead of crashing out with an error. Only Tasks that are truly in error (i.e. not producing useful data) should actually error out. And the applications should be fixed so that one version (e.g. Windows) doesn't produce errors when the other (e.g. Linux) runs cleanly (the reverse has also occurred in the past). But since Rosetta 4.20 has pretty much been abandoned apart from the occasional small batch every so often, there is no such effort. Not surprising as the new type of Rosetta Tasks -Python- have plenty of significant issues of their own of which there has been no updated application to address them at all. If they aren't going to fix their current application, there's no way on Earth they're going to fix the old deprecated ones. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
What gives? Due to the nature of Rosetta work, Tasks that error out can still give useful results, which is why in most cases you will still get Credit for a Task that produces an error. Taking a quick look, they're both "preetham" tasks. I just had one here and it barely reached 2 seconds of CPU time before crashing. Interesting to read that they last much longer on Linux than Windows. That seems to be a pattern in recent times. Ugh... |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,552,383 RAC: 6,167 |
But since Rosetta 4.20 has pretty much been abandoned apart from the occasional small batch every so often, there is no such effort. Not surprising as the new type of Rosetta Tasks -Python- have plenty of significant issues of their own of which there has been no updated application to address them at all. I continue to write messages (about, for example, multi-attach disks on Virtualbox) on Twitter to "stimulate" admins. Up to now, without results. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,552,383 RAC: 6,167 |
Interesting to read that they last much longer on Linux than Windows. That seems to be a pattern in recent times. Ugh... They write the native code on Linux and afterwards compile it for other platforms. So they probably don't pay attention to this part of the coding (which is as important as writing the code itself). |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I continue to write messages (about, for example, multi-attach disks on Virtualbox) on Twitter to "stimulate" admins. I could live with their other problems, but not the high write rates to the SSD. Even with a huge (32 GB) write cache, and running only six work units on 50% of the cores of a Ryzen 3600 (Ubuntu 20.04.4), I was seeing writes to disk of over 800 GB/day. It is probably because of how they handle the .VDI files; computezrmle tells them how to do it. I get the impression that this researcher has never developed a program for BOINC before, and isn't interested in learning how to do it now. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I continue to write messages (about, for example, multi-attach disks on Virtualbox) on Twitter to "stimulate" admins. You're just stating the obvious. As long as they get something around a 95% clean result (perhaps as low as 90) then they are happy. We have discussed this to the end of the world and beyond. They don't care about the PC side as long as they get a pretty good result. They do not monitor Twitter as far as I know, and they are never here anymore. They are not open to suggestions. Their way works, so why change it or update it? RALPH, they should shut that off. They never use it. That's the summary of it all. You get what you get. If it works well on Linux, great, then they have a result from Linux. If it works on Windows, even better; if it doesn't, oh well. The only thing that is important to them is their neural network system. The PCs are a nice addition. Very much the way BOINC TACC works. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
You're just stating the obvious. I have been discussing it for far longer. And you managed to miss the point about the writes. Maybe you don't monitor them? |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
You're just stating the obvious. I probably don't write as much data to my drive as you do. So far on my oldest drive I have written around 65 TB and it is still in good health. According to Samsung's information I am just approaching the middle age of the drive. Wasn't it you who talked about a cache program that would reduce the writes? But again, it's another topic that has been discussed and ignored by the team, so why holler on about it? They obviously don't care. I've said it a lot already and others say the same. The team does NOT care about PC users. They only care about their neural network. We get the scraps or the wild ideas in whatever form they come in. They do not change anything. You get what you get, good or bad, large or small. If you burn out your drive due to the writes, that's of no concern to them. There will always be someone to take your place. We can make suggestions and complain all we want, but they are NOT interested. That is very clear here on the message boards, on Twitter, and from the one person who can get through to them: they just acknowledge the email and do nothing. As long as they get the data by whatever means necessary they are happy. If machines fail or people quit, that does not matter. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,552,383 RAC: 6,167 |
They obviously don't care. I'm starting to think you're right. I love this project, I've supported it for YEARS, but I'm starting to get a little tired |
xroule Joined: 9 Feb 15 Posts: 4 Credit: 58,747,245 RAC: 10,749 |
With 3371 WUs and 3118 errors in 12 hours, I can't wait for WCG to reopen. For now, this is the only project for me. What a waste of resources! 9 PM, do you know what your PC is doing?? |
keputnam Joined: 18 Sep 05 Posts: 24 Credit: 2,088,785 RAC: 0 |
"As long as they get something around a 95% clean result (perhaps as low as 90) then they are happy." Really? In the last 2 1/2 days I have had 140 "Error while computing" results. My wingmen have all had the same results. They are NOT getting any results at all, and are awarding NO credit for these jobs. |
Jean-David Beyer Joined: 2 Nov 05 Posts: 188 Credit: 6,403,244 RAC: 5,439 |
As long as they get something around a 95% clean result (perhaps as low as 90) then they are happy. Well, they sure are not getting 95% clean results from me and my "wingmen." We are getting 100% failure rates. My wingmen and I run different hardware and different operating systems. Some Linux, some Windows. It does not matter: they all fail. I disabled getting new work units last evening, and when I noticed more units added to the list today, I got a bunch more. They all failed immediately. Over 300 failures in this batch just for me. So I disabled getting new work units again. |
spiralfeel Joined: 25 Apr 20 Posts: 1 Credit: 235,796 RAC: 0 |
With 3371 WUs and 3118 errors in 12 hours, I can't wait for WCG to reopen. For now, this is the only project for me. What a waste of resources! You should consider TN-Grid http://gene.disi.unitn.it/test/ and SiDock@home https://www.sidock.si/sidock/ |
©2024 University of Washington
https://www.bakerlab.org