WU crash after some hours

Message boards : Number crunching : WU crash after some hours


marsinph

Joined: 13 Apr 18
Posts: 10
Credit: 372,225
RAC: 0
Message 90974 - Posted: 5 Aug 2019, 12:40:45 UTC

Hello,
After about two hours, the work units fail with "computation error".
Reason: out of memory!
According to stderr, the peak memory use is about 1.5 GB.
Given that I run 8 WUs and have 16 GB of RAM, who can explain this?
See host https://boinc.bakerlab.org/rosetta/results.php?hostid=3660644
WU :
https://boinc.bakerlab.org/rosetta/result.php?resultid=1086121226
https://boinc.bakerlab.org/rosetta/result.php?resultid=1086121227
https://boinc.bakerlab.org/rosetta/result.php?resultid=1086121228
https://boinc.bakerlab.org/rosetta/result.php?resultid=1086121238
.......




Best regards
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 91006 - Posted: 8 Aug 2019, 18:15:26 UTC - in response to Message 90974.  

That would normally be a reasonable match between active WUs and memory. The BOINC Manager allows you to configure values for how much memory BOINC is allowed to use while the machine is idle and while in use. What are your settings there?
Rosetta Moderator: Mod.Sense
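
To make the question about memory settings concrete, here is a minimal sketch of how a percentage-based limit translates into a budget in gigabytes. The function name and the 50% and 90% values are assumptions chosen for illustration; they are not read from any host in this thread, and the real limits are whatever is set in the BOINC Manager computing preferences.

```python
def memory_budget_gb(total_ram_gb: float, allowed_pct: float) -> float:
    """Return the RAM (in GB) that a percentage limit allows."""
    if not 0 <= allowed_pct <= 100:
        raise ValueError("allowed_pct must be between 0 and 100")
    return total_ram_gb * allowed_pct / 100.0


# Hypothetical settings for a 16 GB host (illustrative values only).
in_use_budget = memory_budget_gb(16, 50)  # limit while the computer is in use
idle_budget = memory_budget_gb(16, 90)    # limit while the computer is idle
print(f"in use: {in_use_budget:.1f} GB, idle: {idle_budget:.1f} GB")
```

With a low "in use" percentage, a host with plenty of installed RAM can still leave BOINC with a small budget, which is why the configured values matter as much as the installed memory.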
marsinph

Joined: 13 Apr 18
Posts: 10
Credit: 372,225
RAC: 0
Message 91007 - Posted: 8 Aug 2019, 21:04:16 UTC - in response to Message 91006.  

Hello Mod.Sense,
As an admin you can see everything. You have my host ID and WU IDs!
So you can look at stderr.txt and the reason for the crash.
You can see that the host has 16 GB of RAM,
but the WU reports, at around 9 GB, that there is not enough RAM!

Strangest of all, I have another host (host 3676676, with only 8 GB of RAM) and it has no problem!

To be clear and fully understandable:
16 GB RAM: crash because of not enough RAM
8 GB RAM: all OK

I want to understand!

https://www.boincstats.com/stats/-1/team/detail/111/projectList
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 91008 - Posted: 9 Aug 2019, 5:42:36 UTC - in response to Message 91007.  

I want to understand !

There is nothing to understand.
The "not enough memory" error is only a recurring excuse.
I had this problem even when I had 60% of RAM in use.
It's a software problem, not a hardware problem.
But, as I said in the other thread, nobody is interested in solving problems.....
Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 1
Message 91009 - Posted: 9 Aug 2019, 10:17:40 UTC

It is simple to understand: some (many) of the current robetta_08_07_xxx WUs use 2 GB of RAM or more. If you are "lucky" enough to process 8 of them simultaneously, you will be over your 16 GB, and even if BOINC is configured to use no more than 95% of your available memory, you will experience slowdowns and most probably WU crashes, host crashes, or both. It is quite a well-known problem here: the researchers do not always set the WU memory requirements correctly. They are also human, after all.

It has just happened to me, both on a host with 64 GB but 72 threads and on another one with 124 GB and 112 threads. So I've set BOINC to use 50% of the CPUs until the storm passes :).
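
The arithmetic above can be sketched as follows. This is a rough illustration only: the 2 GB per work unit and the 95% limit come from the post, while the function name and the assumption that every task uses the same amount of memory are simplifications, not a description of how BOINC actually schedules work.

```python
def max_concurrent_tasks(total_ram_gb: float, allowed_pct: float,
                         per_task_gb: float) -> int:
    """Number of equal-sized tasks that fit in the allowed share of RAM."""
    budget_gb = total_ram_gb * allowed_pct / 100.0
    return int(budget_gb // per_task_gb)


# 16 GB host, BOINC limited to 95% of RAM, each work unit needing about 2 GB.
fit = max_concurrent_tasks(16, 95, 2.0)
print(fit)  # 7: an eighth 2 GB task would exceed the 15.2 GB budget
```

Under the same assumptions, a 64 GB host fits 30 such tasks, far fewer than its 72 threads, which is consistent with reducing the share of CPUs in use.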
marsinph

Joined: 13 Apr 18
Posts: 10
Credit: 372,225
RAC: 0
Message 91011 - Posted: 9 Aug 2019, 15:47:19 UTC - in response to Message 91009.  

Hello Trotador,

Before posting, I tested with only one WU and no other projects running.
I repeat: only one WU, with 16 GB of RAM. But it crashed with the error "not enough memory"!
stderr.txt reports a peak memory use of 2.5 GB (2500 MB). I have 16 GB, but the exit code says: not enough memory!
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 91014 - Posted: 9 Aug 2019, 20:03:29 UTC

It sounds like the bottom line here is that there are presently some work units that are using excessive memory. These are reported in several other threads. So this would mean your low-memory system did not happen to get any of these, or perhaps they are flagged to only be sent to systems with large memory.

This sometimes happens when the Project Team is working on new protocols. And it is generally resolved within a few days.
Rosetta Moderator: Mod.Sense
marsinph

Joined: 13 Apr 18
Posts: 10
Credit: 372,225
RAC: 0
Message 91019 - Posted: 11 Aug 2019, 18:54:12 UTC - in response to Message 91014.  

Hello Mod.Sense,
Instead of stopping after the first line, if you had read the next line, you would have seen that SIXTEEN gigabytes are available.
Not two gigabytes.
If you had taken the time to analyze my hosts ...

So please read everything carefully!
On two of my hosts with 16 GB, all WUs crash with the error "not enough RAM".

Stupid as it sounds, and because I want to understand, I removed 8 GB of RAM from those hosts.
Who would downgrade their RAM???

But now, with only 8 GB, no problem!
I added 8 GB again (it crashed again, and with different RAM modules).

Where is the problem??? I am a hardware scientist, not a software developer.
But once again, I want to understand.
A bottleneck at the CPU northbridge was possible, due to CAS and latency timings.
But that is not it.

Your WU says that 16 GB of RAM is not enough, with only two WUs running!
But with only 8 GB, all is OK with 8 (eight) simultaneous WUs!?!?

Waiting for an explanation and a solution,
Best regards
marsinph

Joined: 13 Apr 18
Posts: 10
Credit: 372,225
RAC: 0
Message 91020 - Posted: 11 Aug 2019, 19:27:46 UTC - in response to Message 91014.  

Mod.Sense,

Oops,
additional info to my previous message.
As a team founder (a wide team, in the Berkeley sense), I centralize all the problems and questions of my team, across all projects.
That is perhaps why some users consider me the "black sheep of BOINC" ...
It does not hurt me. I (and our team) want to help research progress.

So:

One of my users, with 64 GB (sixty-four) of RAM and 8 cores, reports the same "not enough memory" problem!

I think 64 GB of RAM is enough. But it also crashes!
Sorry if I repeat myself: simultaneous WUs with only 8 GB of RAM are OK on my host.
With more RAM, ...

If it crashed at the beginning, it would not be so bad. But it is always after some hours!

Or perhaps it is a series of WUs with problems.
Please analyze host 3391074. You have access.
I got 35 new WUs on 11 Aug at 19:08 UTC. I stopped all other projects to give them full resources. We shall see.
Only PrimeGrid is running, on the GPU (RAM use 300 MB), CPU 0.09.
So five minutes after starting 8 Rosetta WUs: 70% of RAM free (out of 16 GB).
Normally no problems.
Estimated running time: 7 hours. But I never look at it. It is a Windows estimate.

I will stop writing. I await action and an explanation.

Best regards
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 91022 - Posted: 12 Aug 2019, 13:17:31 UTC

As I said, the problem seems to be some work units that are consuming more memory than was expected when they were created. Or, at least there are cases where they may consume a lot of memory. As I understand it, you are saying that a machine where BOINC Manager is set to be able to use 16GB of memory cannot complete a specific WU, even when it runs alone. That would be a problem that will need to be fixed in the method the software uses to compute the work unit.

Thank you for fielding the questions of your team. I know English is not easy for many people. When the question is coming from someone else, it is helpful if you can link to the actual host or work unit that is having the problem.
Rosetta Moderator: Mod.Sense




©2024 University of Washington
https://www.bakerlab.org