Linux Hung Machine

Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92324 - Posted: 26 Mar 2020, 10:24:58 UTC

I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle it to get it back. The power cycle obviously clears the issue, but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help.

Thanks,
Greg

Bryn Mawr
Joined: 26 Dec 18
Posts: 393
Credit: 12,113,928
RAC: 4,486
Message 92327 - Posted: 26 Mar 2020, 12:46:20 UTC - in response to Message 92324.  

Not something that I’ve experienced, but the evidence should still exist in /var/log/...
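One way to follow up on the log suggestion is to scan the kernel messages saved from the previous boot for common hang signatures. This is a minimal sketch, assuming the messages were actually written to disk before the hang (for example the saved output of `journalctl -k -b -1`, or `/var/log/kern.log` on Ubuntu-based systems such as Mint). The pattern list is illustrative rather than exhaustive, and `find_hang_clues` and `scan_file` are hypothetical helper names.

```python
import re

# Substrings the Linux kernel commonly logs around memory exhaustion,
# stalled tasks, and hardware faults. Illustrative, not exhaustive.
HANG_PATTERNS = [
    r"out of memory",
    r"oom-killer",
    r"soft lockup",
    r"blocked for more than \d+ seconds",
    r"hardware error",
]
_HANG_RE = re.compile("|".join(HANG_PATTERNS), re.IGNORECASE)


def find_hang_clues(lines):
    """Return the log lines that match any of the patterns above."""
    return [line.rstrip("\n") for line in lines if _HANG_RE.search(line)]


def scan_file(path):
    """Scan one saved log file and return its suspicious lines."""
    with open(path, errors="replace") as log:
        return find_hang_clues(log)
```

Calling `scan_file("/var/log/kern.log")` after a reboot would list any matching lines. An empty result does not rule anything out, since a hard hang can stop the machine before the last messages reach the disk.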
Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92329 - Posted: 26 Mar 2020, 13:05:32 UTC - in response to Message 92327.  

I can check when I get home. Usually when I am forced to power cycle, the jobs all error out, which I suspect may mask the true issue.

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 345
Message 92333 - Posted: 26 Mar 2020, 13:46:59 UTC - in response to Message 92324.  

Possible memory/swap issues? Maybe the machine is starting to use a good amount of swap space? How much memory does the machine have?
Charlie
-Charlie
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 345
Message 92334 - Posted: 26 Mar 2020, 13:49:44 UTC - in response to Message 92333.  
Last modified: 26 Mar 2020, 13:55:28 UTC

Excuse the bogus signature. Back crunching after being away for a few years. Fixed it in my profile. Now to go fix it in my forum signature.

<edit>Fixed</edit>
Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92336 - Posted: 26 Mar 2020, 14:00:04 UTC - in response to Message 92333.  

The machine has 128 GB of RAM and nothing other than BOINC is running when it hangs. I did reduce the swap file limit from the default 75% to 50% this morning, though. I will not know if the machine has hung until I get home from work. No other project is having issues with the current BOINC settings, though.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92337 - Posted: 26 Mar 2020, 14:01:09 UTC

I see your Linux machine shows that it has 64 processors and 128GB of memory, and is running:
Linux LinuxMint
Linux Mint 19.3 Tricia [5.3.0-42-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]

Is the machine running a mix of BOINC projects? Is it running other types of work as well?

With that many tasks running, it would be possible that one got to a point that it was using excessive memory. But I believe the BOINC core monitors that and insulates the rest of the system by making the task wait for memory or ending it.

Just looking at a few of the failed tasks, their peak memory was about 1.2 GB.

Hang conditions are always difficult. Have you seen this happen a few times?

Is BOINC allowed to use most of that memory (CPU preferences)? What about the disk? Is BOINC allowed to use plenty of disk space? (say 2GB per task)

I can only suggest using the settings to run on less than 100 percent of your CPUs and see if this helps.
Rosetta Moderator: Mod.Sense
Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92338 - Posted: 26 Mar 2020, 14:14:20 UTC - in response to Message 92337.  
Last modified: 26 Mar 2020, 14:16:28 UTC

It is running a mix of projects, but currently the machine is only loaded with Seti (GPU) and Rosetta (CPU). The hang occurs at least once a day, and sometimes again within minutes of rebooting. The machine has 128GB RAM. I currently have BOINC set to use 50% swap and up to 50GB of HD space; the HD itself is 2TB, so space should be no issue. I currently have Computation set at 85% for 90% of the time. In the past, when I ran Seti only, I set 90% and 100% time with no issues.

You mentioned 2GB per job. Should I increase the HD allowance beyond the 50GB already set?
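The disk question can be checked with quick arithmetic. This is a rough sketch, assuming the 85% figure is the share of the 64 logical processors BOINC may use, that every CPU slot holds a Rosetta task, and that each task needs the moderator's "say 2GB" of disk. The helper name `disk_needed_gb` is made up for illustration.

```python
def disk_needed_gb(concurrent_tasks, gb_per_task):
    """Rough disk budget if every running task uses gb_per_task."""
    return concurrent_tasks * gb_per_task


threads = 64          # logical processors reported for this host
cpu_fraction = 0.85   # assumed meaning of "Computation set at 85%"
gb_per_task = 2       # the moderator's "say 2GB per task" figure
disk_limit_gb = 50    # BOINC disk allowance quoted in the post

tasks = int(threads * cpu_fraction)
needed = disk_needed_gb(tasks, gb_per_task)
print(f"{tasks} tasks x {gb_per_task} GB = {needed} GB needed, "
      f"limit is {disk_limit_gb} GB")
```

Under those assumptions the estimate comes to 108 GB for 54 tasks, which is more than twice the 50GB allowance, so raising the limit looks worthwhile.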

dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,658,896
RAC: 10,801
Message 92344 - Posted: 26 Mar 2020, 15:13:54 UTC - in response to Message 92338.  

Are you sure it's not overheating? Rosetta might push the FPU or RAM harder than other projects. Can the machine handle a stress test like P95?

D
Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92346 - Posted: 26 Mar 2020, 15:19:27 UTC - in response to Message 92344.  

Unsure. It is liquid cooled, but I am not sure how to test this theory. I can back off the percentages.
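One way to test the overheating theory is to watch the temperature sensors while Rosetta is running. This is a minimal sketch, assuming the kernel exposes sensors through the standard hwmon sysfs interface, where `temp*_input` files hold millidegrees Celsius. Which chips and sensors appear depends on the loaded drivers, and `read_hwmon_temps` is a hypothetical helper name.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/sensor": degrees_celsius} for readable inputs."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else "unknown"
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
            temps[f"{chip.name}:{chip_name}/{sensor.name}"] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    for label, celsius in read_hwmon_temps().items():
        print(f"{label}: {celsius:.1f} C")
```

Running this every few seconds under load, and noting the last values before a hang, would show whether temperatures are climbing toward the CPU's limit.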

Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92385 - Posted: 27 Mar 2020, 10:55:42 UTC
Last modified: 27 Mar 2020, 10:56:47 UTC

I have not had the issue since yesterday. I was not a good engineer and changed numerous things all at once:

I doubled the HD space available to 100GB
I decreased CPU utilization to 75% from 100%
I decreased CPU count to 75% from 90%
I reduced the file swap from 75% to 50%
I suspended all Seti jobs from running on GPUs (do not think this matters though)

If I still do not see any issues by tomorrow I will start to increase CPU %s as I have been running between 90-100% on other projects. I am leaving quite a bit of computation power on the table only running at 75%.

Thanks everyone for your suggestions!

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 92428 - Posted: 27 Mar 2020, 23:52:31 UTC - in response to Message 92385.  

As a generalisation (because I'm no expert on this), if things go well (or even if they don't), increase CPU utilisation back to 100%. Having it lower, e.g. 75%, turns out to mean it runs at 100% for 75% of the time and 0% for 25% of the time, which isn't what you might expect. All that switching on and off can't help, so 100% utilisation might even remove a problem.

If that works, look to increasing CPU count next. The other 3 changes look reasonable and are better choices already.
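The "100% for 75% of the time" behaviour described above is a duty cycle: full load for part of each period, then a pause. A minimal sketch of the idea, assuming a hypothetical 1-second period; the period and the `duty_cycle` helper are illustrative and not taken from BOINC itself.

```python
def duty_cycle(cpu_time_percent, period_s=1.0):
    """Split one throttling period into (run_seconds, pause_seconds)."""
    if not 0 <= cpu_time_percent <= 100:
        raise ValueError("cpu_time_percent must be between 0 and 100")
    run = period_s * cpu_time_percent / 100.0
    return run, period_s - run


for percent in (75, 100):
    run, pause = duty_cycle(percent)
    print(f"{percent}%: run {run:.2f}s at full load, pause {pause:.2f}s")
```

At 100% the pause is zero, so the load stays constant instead of switching on and off every period.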
Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92432 - Posted: 28 Mar 2020, 0:33:29 UTC

Well crap, even with those changes my computer just hung.

Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92436 - Posted: 28 Mar 2020, 6:08:44 UTC

This issue is clearly a Rosetta one. I can run the current settings on other projects with no issues. If I remove Rosetta jobs, the computer does not hang at all.

dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,658,896
RAC: 10,801
Message 92446 - Posted: 28 Mar 2020, 12:41:51 UTC

Does sound strange. I would try moving the BOINC data directory to another drive - can you do that?
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92456 - Posted: 28 Mar 2020, 14:13:17 UTC

With 64 processor cores, how many threads is BOINC trying to run? Is it a hyperthreaded CPU? That would cause BOINC to attempt 128 active tasks, which would then make the 128GB of memory rather tight. (Actually, from a quick search, it looks like there are 32 physical cores, hyperthreaded to 64 active threads.)

I would suggest bumping CPU utilization back to 100% as Sid Celery suggests (we've seen odd issues with <100% in the past). And dial back the CPU count %. Maybe start at 50% and work your way up.

Have you run any stress tests on the machine? CPU or memory tests? Sometimes R@h ends up being the first stress test a machine has seen.

Also, have you checked for any updates to your Linux version?

I'm not seeing others reporting hangs like this. So, what else could be unique about your machine? (besides that it is such a BEAST of a machine! :)
Rosetta Moderator: Mod.Sense
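The memory arithmetic behind the "rather tight" remark can be checked quickly. This is a rough sketch, assuming every slot runs a Rosetta task and using the roughly 1.2 GB peak seen on the failed tasks earlier in the thread; real tasks vary, and `ram_per_task_gb` is a hypothetical helper name.

```python
def ram_per_task_gb(total_ram_gb, concurrent_tasks):
    """Average RAM available to each task if all run at once."""
    return total_ram_gb / concurrent_tasks


total_ram_gb = 128
peak_task_gb = 1.2   # peak seen on the failed tasks, per the earlier post

for tasks in (32, 64, 128):
    share = ram_per_task_gb(total_ram_gb, tasks)
    peak_total = tasks * peak_task_gb
    print(f"{tasks} tasks: {share:.1f} GB each, "
          f"{peak_total:.1f} GB if all peak together")
```

With 64 tasks each one gets 2 GB on average, which leaves headroom over a 1.2 GB peak. With 128 tasks the share drops to 1 GB each, below that peak, which is why the thread count matters here.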
Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92480 - Posted: 28 Mar 2020, 18:07:57 UTC - in response to Message 92456.  

You are correct, it has 32 cores that are dual threaded, so I can run 64 CPU jobs at the same time. I have bumped CPU utilization to 100% and reduced CPU count to 50%. I am running the most recent Linux Mint. I thought about doing a reinstall but have not gone that far yet. I don't seem to have issues with other BOINC projects hanging, so I am not sure why Rosetta would be any different. I have not done any memory/stress tests.

Buckeye4lf
Joined: 29 Aug 08
Posts: 43
Credit: 8,545,119
RAC: 2,077
Message 92502 - Posted: 29 Mar 2020, 4:59:39 UTC

Machine has not hung in last 24 hours. I backed off the number of CPU jobs to 70% instead of the 90% I had been running on other projects. Maybe I was just on the edge of unstable before and Rosetta was the project where I was getting issues. It seems to be more stable now, just less throughput. Has Rosetta ever considered GPU jobs?

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1684
Credit: 17,933,837
RAC: 22,604
Message 92506 - Posted: 29 Mar 2020, 5:54:46 UTC - in response to Message 92480.  

I have bumped cpu utilization to 100% and reduced cpu count to 50%. I am running most recent linux mint, thought about doing a reinstall but have not gone that far yet. I dont seem to have issues with other boinc projects hanging....not sure why rosetta would be any different. I have not done any memory/stress tests.
The main difference between projects would appear to be memory usage. Even running all the others, you're not likely to be using much RAM. Running Rosetta, with all those threads & cores, along with other projects will result in RAM being used that probably doesn't normally get touched.
Hence the system lockups.

I'd suggest a thorough memtest session (you may or may not have a copy of memtest86+ with your distro).
The other option is swapping RAM modules. Check exactly how much RAM is being used, then limit the number of Rosetta jobs so that RAM in use (with other projects running) is just under the limit for only 2 modules on the motherboard. Make sure you have them in the appropriate slots to maintain at least dual channel operation.
Pull all other modules. Let the system run and see if there are any issues (how long does it usually take for a problem to occur?). If there are no problems, pull those modules and add others that have been removed. Run again. Do it till you get a failure. If there is no failure, add more modules and bump up the number of Rosetta jobs to near the memory limit. See how it goes. Repeat.
Or run that intensive memtest session (although it could take most of a day for that amount of RAM).
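For the "check exactly how much RAM is being used" step, the kernel's own counters in `/proc/meminfo` are a simple source. A minimal sketch, assuming a kernel recent enough to report `MemAvailable` (the 5.3 kernel mentioned earlier does); the function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: value}, sizes in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_summary(info):
    """Return (used RAM, used swap) in GiB from parsed meminfo fields."""
    kb_per_gib = 1024 * 1024
    used_ram = (info["MemTotal"] - info["MemAvailable"]) / kb_per_gib
    used_swap = (info["SwapTotal"] - info["SwapFree"]) / kb_per_gib
    return used_ram, used_swap


def read_memory_summary(path="/proc/meminfo"):
    """Read the live counters and summarise them."""
    with open(path) as f:
        return memory_summary(parse_meminfo(f.read()))
```

Calling `read_memory_summary()` with a known number of Rosetta jobs running gives a per-job figure to work from when deciding how many jobs fit in two modules.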


I notice you have several Win10 systems. If all are DDR4 and the same size, pull modules from the Win10 systems to put in the Threadripper, and put the Threadripper modules in the Win10 systems. Even if they don't error out, Win10 comes with its own memory tester, so you could use those systems to do the memory tests.
Just keep track of which modules are where...
Grant
Darwin NT
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92541 - Posted: 29 Mar 2020, 16:09:12 UTC - in response to Message 92502.  

Has Rosetta ever considered GPU jobs?


Yes, it ends up becoming a rather contentious discussion. One of the developers did offer some insight last week. There are several older threads elsewhere about the topic as well.

The bottom line is that GPUs are fantastic at doing lots of things, but many GPU enthusiasts do not understand that they are not general purpose processors, or how much coding effort is required to get from one platform to the other. Many people have tried to follow up with me to further persuade me of the merits of GPU. Rest assured it does no good. My personal, limited understanding of GPUs, and of the coding effort required to migrate, does not affect the project at all. I am not on the Development Team, I am just an at-home moderator. The perspective, directly from a developer, is expressed well in the thread linked above.
Rosetta Moderator: Mod.Sense

©2024 University of Washington
https://www.bakerlab.org