Message boards : Number crunching : Errors galore!! Multiple machines
Author | Message |
---|---|
Dougga Joined: 27 Nov 06 Posts: 28 Credit: 5,248,050 RAC: 0 |
I'm getting errors from all sorts of machines. I get "too many restarts" errors on a machine with the "keep in memory" option selected. This is an AMD64 machine. Here's an error I'm seeing on another machine, an Intel Core 2 Quad: stderr out <core_client_version>5.10.45</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> Graphics are disabled due to configuration... # cpu_run_time_pref: 10800 # random seed: 3358807 ERROR:: Exit from: fullatom_energy.cc line: 1958 </stderr_txt> ]]> It seems things are suddenly unstable. People are suggesting my machines are showing bad memory but I don't really buy this. Are others seeing issues with Rosetta? |
rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
I'm getting errors from all sorts of machines. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Could you post a few links to the tasks that are producing errors? They need to know which Rosetta application you are using and which specific task or set of tasks is failing. Based on the one line that says ERROR:: Exit from: fullatom_energy.cc line: 1958, I would guess that there is a problem in the program itself, not with your machine. I had a whole rash of tasks that failed on disk space errors, but the next batch was just fine. I'm getting errors from all sorts of machines. |
Ingleside Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
It seems things are suddenly unstable. People are suggesting my machines are showing bad memory but I don't really buy this. Well, you didn't say if you've tried any of the suggestions made in your last thread... It doesn't need to be bad memory; it can be a bad CPU, or something else... The AMD is possibly a problem with the OS or its drivers, or possibly access rights. The quad... a very quick look shows a couple of WUs crashing within 1 minute; these are likely bad WUs. But there are also around 25 other crashes... A very quick look through the top-computer list, checking 3 Linux systems from the top 60, showed some 1-minute crashes, but among the longer-running tasks there were only 4 crashes across the 3 computers... I've no idea about the "Validation" errors, and I've not counted them; possibly this is a Rosetta server-side problem... So maybe you're just unlucky, but with 20x the error rate of other Linux computers, I would still guess it's a computer problem... BTW, it doesn't need to be anything hardware-related; the Linux distribution you're using, or the libraries installed, can be the reason for the errors, while the other Linux computers use other distributions/libraries and don't get the errors... "I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
take a look at this error msg i found in one of his tasks: https://boinc.bakerlab.org/rosetta/result.php?resultid=168818917 8741.23 stderr out <core_client_version>5.10.45</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> Graphics are disabled due to configuration... # cpu_run_time_pref: 10800 # random seed: 3343687 ABORT: bad to aa_rotno_to_packedrotno aa,rot1/2/3/4: ILE 8 0 2 0 0 chi no 1 nchi 2 aav 1 is_chi_proton_rotamer(aa,aav,i) 0 ERROR:: Exit from: rotamer_functions.cc line: 1465 He has one or two others like this as well. Later he gets a validation error after successfully completing the task. Of course, given that some of these are CASP8, that could be a cause. They are running on Rosetta 5.96. The quad machine had 5 errors in 24 hours, of which 4 were program errors and 1 was a validate error. One of the dual cores has validate errors, which is a RAH issue, not his computer. Another random sample of work shows a mini that crashed on 2 systems immediately. I would call it a string of bad luck, not a hardware issue. |
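When comparing many failed tasks, it can help to tally where each one exited. The following is a minimal sketch, not part of any BOINC or Rosetta tooling: it assumes the exit line always has the form seen in the stderr excerpts quoted in this thread, and the function names `exit_locations` and `tally_exits` are made up for illustration.

```python
import re
from collections import Counter

# Assumed format, taken from the stderr excerpts quoted in this thread:
#   ERROR:: Exit from: <source file> line: <number>
EXIT_RE = re.compile(r"ERROR:: Exit from: (\S+) line: (\d+)")


def exit_locations(stderr_text):
    """Return every (source_file, line_number) pair found in one stderr dump."""
    return [(m.group(1), int(m.group(2))) for m in EXIT_RE.finditer(stderr_text)]


def tally_exits(stderr_texts):
    """Count how often each exit location appears across several stderr dumps."""
    counts = Counter()
    for text in stderr_texts:
        counts.update(exit_locations(text))
    return counts


# Fragments copied from posts in this thread.
samples = [
    "# random seed: 3358807 ERROR:: Exit from: fullatom_energy.cc line: 1958",
    "is_chi_proton_rotamer(aa,aav,i) 0 ERROR:: Exit from: rotamer_functions.cc line: 1465",
    "ERROR: NANs occured in hbonding! ERROR:: Exit from: "
    "src/core/scoring/hbonds/hbonds_geom.cc line: 763",
]

for (source_file, line_no), count in tally_exits(samples).items():
    print(f"{source_file}:{line_no} seen {count} time(s)")
```

If the same file and line show up on several unrelated hosts, that points toward the work unit or application; exit locations scattered across many files on a single host are harder to explain that way.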
Ingleside Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
take a look at this error msg i found in one of his tasks: The "Validate errors" are a Rosetta problem, and the WUs crashing after a couple of seconds on 2 different computers are obviously buggy. The problem, in my opinion (apart from the crappy keyboard - not my computer), is all the WUs his computer is erroring out on while someone else manages to finish them correctly... Examples: 154098541, which gives "ERROR:: Exit from: fullatom_energy.cc line: 2030"; 153933379, same error; 1538933379, with "ERROR:: Exit from: refold.cc line: 338"; 153835521, with "ERROR: NANs occured in hbonding! ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763"; 153204401, with a long string of "sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range". And the list goes on, with both the Mini and Beta applications. Now, only a couple are paired with another Linux computer, so it is possible that there are 2 buggy Linux applications. For the few where he is paired with Linux, the other computer's shorter run time or slower speed could mean his crash happened further into the WU than the other computer ever got, so that isn't a good indication either way. Still, his 4-core having a much higher error rate than 3 of the 8-core Linux computers in the top 60 looks suspicious to me, so taking a little closer look at his computer shouldn't be a big problem. After all, at least a couple of the checks, like "Overclocked or not" or "Oops, the CPU is running at 100 Celsius", are easily checked (and answered)... Now, running Gromacs, Prime95 and memory tests, on the other hand, is much more time-consuming... BTW, one method to test whether it's a bad Rosetta application or not: download a ton of work, disable network, exit BOINC, back up BOINC, and restart BOINC. If one or more of the WUs gives an error, re-run the same WU from the backup. If the backup copy crashes at the same spot (for example, the 1st crashed after 2h and the backup after 2h1m), it's most likely a bad WU or application. 
If, on the other hand, the backup copy finishes without crashing, or one copy crashed after 1 hour while the other crashed after 2.5 hours, it looks more like a hardware problem than a WU/application problem... If there aren't any errors, this method will only lose the minute or so taken to make a backup copy. And even if there are errors, re-running a couple of WUs (optimal is to check 4 errors at once) will only take a couple of hours, not the 24h+ that using another program will. BTW, in case stop/restart from checkpoint has any influence, let the WUs run from start to finish... "I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
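The decision rule in the backup-and-rerun test above can be written down as a small sketch. This is only an illustration of the reasoning in the post, not a BOINC feature: the function name `classify_rerun` and the 5-minute tolerance for "the same spot" are assumptions chosen for the example.

```python
def classify_rerun(first_crash_min, rerun_crash_min, tolerance_min=5.0):
    """Compare the crash time of a task with the crash time of its rerun from backup.

    Times are minutes of run time at the moment of the crash.
    rerun_crash_min is None when the rerun finished without crashing.
    tolerance_min is an assumed threshold for "crashed at the same spot".
    """
    if rerun_crash_min is None:
        # The same work unit completed on the second attempt.
        return "hardware suspected"
    if abs(first_crash_min - rerun_crash_min) <= tolerance_min:
        # Both copies failed at (nearly) the same point in the run.
        return "bad work unit or application suspected"
    # Both copies failed, but at clearly different points.
    return "hardware suspected"


# The cases mentioned in the post:
print(classify_rerun(120, 121))   # crashed after 2h and after 2h1m
print(classify_rerun(60, 150))    # crashed after 1h and after 2.5h
print(classify_rerun(120, None))  # rerun finished without crashing
```

A single comparison is weak evidence either way, which is presumably why the post suggests checking about four failed work units at once.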
©2025 University of Washington
https://www.bakerlab.org