Message boards : Number crunching : WU crash after some hours
Author | Message |
---|---|
marsinph Joined: 13 Apr 18 Posts: 10 Credit: 372,225 RAC: 0 |
Hello, after about two hours: "computation error". Reason: out of memory! Looking at stderr, the peak is about 1.5 GB. Considering I run 8 WUs and have 16 GB of RAM, who can explain? See host https://boinc.bakerlab.org/rosetta/results.php?hostid=3660644 WUs: https://boinc.bakerlab.org/rosetta/result.php?resultid=1086121226 https://boinc.bakerlab.org/rosetta/result.php?resultid=1086121227 https://boinc.bakerlab.org/rosetta/result.php?resultid=1086121228 https://boinc.bakerlab.org/rosetta/result.php?resultid=1086121238 ....... Best regards |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
That would normally be a reasonable match between active WUs and memory. The BOINC Manager allows you to configure how much memory BOINC is allowed to use while the machine is idle and while it is in use. What are your settings there? Rosetta Moderator: Mod.Sense |
marsinph Joined: 13 Apr 18 Posts: 10 Credit: 372,225 RAC: 0 |
Hello Mod.Sense. As admin you can see everything. You have my host ID and WU IDs! So you can look at the stderr.txt and the reason for the crash. You can see that the host has 16 GB of RAM, but the WU says, at around 9 GB: not enough RAM! Strangest of all, I have another host (host 3676676, with only 8 GB of RAM) and it has no problem! To be clear and fully understandable: with 16 GB of RAM, it crashes because there is not enough RAM; with 8 GB of RAM, all is OK. I want to understand! https://www.boincstats.com/stats/-1/team/detail/111/projectList |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,603,651 RAC: 8,734 |
"I want to understand !" There is nothing to understand. "Not enough memory" is only a recurring excuse. I had this problem even when I had 60% of RAM in use. It's a software problem, not a hardware problem. But, as I said in the other thread, nobody is interested in solving problems... |
Trotador Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 1 |
It is simple to understand: some (many) of the current robetta_08_07_xxx WUs use 2 GB of RAM or more. If you are "lucky" enough to process 8 of them simultaneously, you will be over your 16 GB, and even if BOINC is configured to use no more than 95% of your available memory, you will experience slowdown and most probably a WU crash, a host crash, or both. It is a well-known problem here: investigators do not always set the WU memory requirements correctly; they are human after all. It has just happened to me, both on a host with 64 GB and 72 threads and on another with 124 GB and 112 threads. So I've set BOINC to use 50% of the CPUs while the storm goes away :). |
marsinph Joined: 13 Apr 18 Posts: 10 Credit: 372,225 RAC: 0 |
"It is simple to understand, some (many) of the current robetta_08_07_xxx wu use 2 GB RAM or above. if your are "lucky" enough to process simultaneously 8 of them you will be over your 16 GB and even if boinc is configured to use no more than 95% of your available memory, you will experience slowdown and most probably wu crash or host crash or both. It is quite known problem here, investigators not always fix correctly the wu memory requirements, they are also human after all." Hello Trotador. Before posting, I tested with only one WU and no other project. I repeat: only one WU, with 16 GB of RAM. But it crashed with the error "not enough memory"! Stderr.txt reports a peak memory use of 2.5 GB (2500 MB). I have 16 GB, but the exit code says: not enough memory! |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It sounds like the bottom line here is that there are presently some work units that are using excessive memory. These are reported in several other threads. So this would mean your low-memory system did not happen to get any of these, or perhaps they are flagged to only be sent to systems with large memory. This sometimes happens when the Project Team is working on new protocols. And it is generally resolved within a few days. Rosetta Moderator: Mod.Sense |
marsinph Joined: 13 Apr 18 Posts: 10 Credit: 372,225 RAC: 0 |
"It sounds like the bottom line here is that there are presently some work units that are using excessive memory. These are reported in several other threads. So this would mean your low-memory system did not happen to get any of these, or perhaps they are flagged to only be sent to systems with large memory." Hello Mod.Sense. Instead of stopping after the first line, if you had read the next line: it is SIXTEEN gigabytes available, not two gigabytes. If you had taken the time to analyze my hosts ... So please read everything carefully! On two of my hosts with 16 GB, all WUs crash with the error "not enough RAM". Stupid as it sounds, and because I want to understand, I removed 8 GB of RAM from those hosts. Who would downgrade their RAM? But now, with only 8 GB, no problem! I added 8 GB again (crash again, and not with the same RAM modules). Where is the problem? I am a hardware scientist, not a software developer. But once again, I want to understand. A bottleneck on the CPU northbridge was possible due to CAS and latency timings, but that is not it. Your WU says, on 16 GB of RAM: not enough, with only two WUs running! But with only 8 GB, all is OK with 8 (eight) simultaneous WUs!? Waiting for an explanation and a solution. Best regards |
marsinph Joined: 13 Apr 18 Posts: 10 Credit: 372,225 RAC: 0 |
"It sounds like the bottom line here is that there are presently some work units that are using excessive memory. These are reported in several other threads. So this would mean your low-memory system did not happen to get any of these, or perhaps they are flagged to only be sent to systems with large memory." Mod.Sense, oops, additional info to my previous message. As team founder (a wide team, in the Berkeley sense), I centralize all the problems and questions of my team on all projects. That is perhaps why some users consider me the "black sheep of BOINC" ... It does not hurt me. I (and our team) want to help research progress. So: one of my users, with 64 GB (sixty-four) of RAM and 8 cores, reports the same "not enough memory" problem! I think 64 GB of RAM is enough, but it also crashes! Sorry if I repeat myself: simultaneous WUs with only 8 GB of RAM are OK on my host. More RAM, ... If it crashed at the beginning, it would not be so bad. But it is always after some hours! Or perhaps it is a series of WUs with problems. Please analyze host 3391074. You have access. I got 35 new WUs on 11aug19:08 UTC. I stopped all other projects to give full resources. We shall see. Only PrimeGrid runs on the GPU (RAM use 300 MB, CPU 0.09). So, five minutes after starting 8 Rosetta WUs: RAM 70% free (of 16 GB). Normally no problems. Estimated running time: 7 hours. But I never look at it; it is a Windows estimate. I will stop writing. I await action and an explanation. Best regards |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
As I said, the problem seems to be some work units that are consuming more memory than was expected when they were created. Or, at least there are cases where they may consume a lot of memory. As I understand it, you are saying that a machine where BOINC Manager is set to be able to use 16GB of memory cannot complete a specific WU, even when it runs alone. That would be a problem that will need to be fixed in the method the software uses to compute the work unit. Thank you for fielding the questions of your team. I know English is not easy for many people. When the question is coming from someone else, it is helpful if you can link to the actual host or work unit that is having the problem. Rosetta Moderator: Mod.Sense |
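
The memory arithmetic discussed in this thread (RAM installed, the share BOINC is allowed to use, and the peak memory of each work unit) can be sketched in a few lines of Python. This is only an illustrative calculation, not how BOINC or Rosetta schedule work: the function name `max_concurrent_tasks` and its parameters are hypothetical, and the 95% limit and 2 GB per work unit are simply the figures quoted in the posts above.

```python
def max_concurrent_tasks(total_ram_gb, allowed_fraction, per_task_gb):
    """Estimate how many work units fit in the RAM BOINC may use.

    total_ram_gb     -- RAM installed in the host, in GB
    allowed_fraction -- share of RAM BOINC is allowed to use (0.95 = 95%)
    per_task_gb      -- assumed peak memory of one work unit, in GB
    All three values are assumptions supplied by the caller.
    """
    usable_gb = total_ram_gb * allowed_fraction
    # Floor division: a task that only partly fits does not count.
    return int(usable_gb // per_task_gb)


if __name__ == "__main__":
    # Figures quoted in the thread: 16 GB host, 95% limit, 2 GB per WU.
    print(max_concurrent_tasks(16, 0.95, 2.0))   # 7, one fewer than 8 cores
    # 64 GB host with 72 threads, same assumptions.
    print(max_concurrent_tasks(64, 0.95, 2.0))   # 30, far fewer than 72
```

Under these assumptions a 16 GB host cannot hold eight 2 GB work units at once, which matches Trotador's point. It does not explain a single work unit failing on a 16 GB host, which is the case Mod.Sense says would need a fix in the software.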
©2024 University of Washington
https://www.bakerlab.org