Posts by Buckeye4lf

1) Message boards : Number crunching : Computation errors (Message 93170)
Posted 3 Apr 2020 by Buckeye4lf
Post:
It shows you are on BOINC 7.17.0 (the current recommended version is 7.4.22).

It also indicates that the task did not complete the first model within the 1hr preferred runtime plus 4hr watchdog timeout. The task was created in a way that causes it not to go out to another host for validation. So the one error was "too many", and the WU (which, sometimes, could be more than just the task that went to you) was ended. And then I guess as the watchdog went to end the task, it found no output file.

In a nutshell, you hit a long running model against the smallest possible runtime, and it was ended for you.



Okay, thanks. I just wanted to make sure it was not a hardware issue, so I can prevent problems in the future. This round was a lot of wasted computation time.
2) Message boards : Number crunching : Computation errors (Message 93136)
Posted 3 Apr 2020 by Buckeye4lf
Post:
I just had a whole batch of jobs error out. All of them had the error "Too many total results".

name rb_04_01_20095_19938_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_03_06_904919_9
application Rosetta
created 2 Apr 2020, 0:50:45 UTC
minimum quorum 1
initial replication 1
max # of error/total/success tasks 1, 1, 1
errors Too many total results


Stderr output
<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.12_x86_64-pc-linux-gnu @rb_04_01_20095_19938_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -psipred_ss2 t000_.spider3_ss2 -kill_hairpins t000_.nobuformat.spider3_ss2 -jumps:pairing_file t000_.fasta.bbcontacts.jumps -abinitio::use_filters false -skip_convergence_check -jumps:overlap_chainbreak -seq_sep_stages 1 1 1 -ramp_chainbreaks -sep_switch_accelerate 0.8 -jumps:random_sheets 7 2 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_01_20095_19938_ab_t000__robetta.zip -frag3 rb_04_01_20095_19938_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_01_20095_19938_ab_t000__robetta.200.6mers.index.gz -fragB rb_04_01_20095_19938_ab_t000__robetta.200.3mers.index.gz -nstruct 10000 -cpu_run_time 57600 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3752306
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 18587.7s, 14400s + 3600s[2020- 4- 2 16:44:30:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 18587.7 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
16:44:30 (10035): called boinc_finish(0)

</stderr_txt>
]]>

What does this indicate? I do not want to spend all my time running jobs only to have them error out at the end.
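For anyone reading the stderr above, here is a minimal sketch that pulls the numbers out of the `BOINC:: CPU time:` line and compares them. It assumes the two figures after the comma (14400s and 3600s in this log) add up to the cutoff at which the task is ended, which is how the reply in message 93170 describes it. The function name `parse_cpu_line` is made up for this example and is not part of BOINC or Rosetta.

```python
import re

# Matches e.g. "BOINC:: CPU time: 18587.7s, 14400s + 3600s"
CPU_LINE = re.compile(r"CPU time:\s*([\d.]+)s,\s*([\d.]+)s\s*\+\s*([\d.]+)s")

def parse_cpu_line(line):
    """Return (cpu_seconds, limit_seconds, overrun_seconds), or None if no match."""
    m = CPU_LINE.search(line)
    if m is None:
        return None
    cpu, first, second = (float(g) for g in m.groups())
    limit = first + second  # assumption: the two terms sum to the cutoff
    return cpu, limit, cpu - limit

# The line as it appears in the stderr output above.
line = "BOINC:: CPU time: 18587.7s, 14400s + 3600s[2020- 4- 2 16:44:30:] :: BOINC"
print(parse_cpu_line(line))
```

On the line from this task, the CPU time comes out roughly 588 seconds past the 18000 second sum, which is consistent with the task being ended by the watchdog rather than by a hardware fault.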
3) Message boards : Number crunching : Linux Hung Machine (Message 92611)
Posted 30 Mar 2020 by Buckeye4lf
Post:
I have had no further issues with the machine freezing after reducing the number of simultaneous jobs. I never thought 128GB of RAM would not be enough, but apparently it is not if I want to run 64 jobs at once. I will have to get some more RAM the next time I am out. Thanks for everyone's help.
4) Message boards : Number crunching : Linux Hung Machine (Message 92502)
Posted 29 Mar 2020 by Buckeye4lf
Post:
The machine has not hung in the last 24 hours. I backed off the number of CPU jobs to 70% instead of the 90% I had been running on other projects. Maybe I was just on the edge of unstable before, and Rosetta was the project where I was getting issues. It seems to be more stable now, just with less throughput. Has Rosetta ever considered GPU jobs?
5) Message boards : Number crunching : Linux Hung Machine (Message 92480)
Posted 28 Mar 2020 by Buckeye4lf
Post:
With 64 processor cores, how many threads is BOINC trying to run? Is it a hyperthreaded CPU? This would cause BOINC to attempt 128 active tasks, which would then make the 128GB of memory rather tight. (actually a quick search, it looks like there are 32 physical cores, hyperthreaded to 64 active threads).

I would suggest bumping CPU utilization back to 100% as dcdc suggests (we've seen odd issues with <100% in the past). And dial back the CPU count %. Maybe start at 50% and work your way up.

Have you run any stress tests on the machine? CPU or memory tests? Sometimes R@h ends up being the first stress test a machine has seen.

Also, have you checked for any updates to your Linux version?

I'm not seeing others reporting hangs like this. So, what else could be unique about your machine? (besides that it is such a BEAST of a machine! :)


You are correct, it has 32 cores that are dual threaded, so I can run 64 CPU jobs at the same time. I have bumped CPU utilization to 100% and reduced CPU count to 50%. I am running the most recent Linux Mint; I thought about doing a reinstall but have not gone that far yet. I don't seem to have issues with other BOINC projects hanging, so I am not sure why Rosetta would be any different. I have not done any memory or stress tests.
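The thread and memory arithmetic in this exchange can be sketched in a few lines. This is only a rough estimate under stated assumptions: one task per hardware thread, the CPU count percentage rounded down to whole tasks, and all 128GB available to tasks (the OS and other programs are ignored). The helper name `ram_per_task_gb` is hypothetical.

```python
def ram_per_task_gb(total_ram_gb, threads, cpu_pct):
    """Return (concurrent_tasks, GB of RAM per task) for a CPU count percentage."""
    # Assumption: the percentage is rounded down, with at least one task.
    tasks = max(1, threads * cpu_pct // 100)
    return tasks, total_ram_gb / tasks

# 64 threads and 128 GB are the figures from this thread.
for pct in (50, 70, 90, 100):
    tasks, gb = ram_per_task_gb(128, 64, pct)
    print(f"{pct:3d}% -> {tasks:2d} tasks, {gb:.2f} GB each")
```

At 50% this works out to 32 tasks with 4 GB each, and at 100% to 64 tasks with 2 GB each, so the per-task headroom halves between the two settings.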
6) Message boards : Number crunching : Linux Hung Machine (Message 92436)
Posted 28 Mar 2020 by Buckeye4lf
Post:
This issue is clearly a Rosetta one. I can run the current settings on other projects with no issues. If I remove the Rosetta jobs, the computer does not hang at all.
7) Message boards : Number crunching : Linux Hung Machine (Message 92432)
Posted 28 Mar 2020 by Buckeye4lf
Post:
Well crap, even with those changes I just hung my computer...
8) Message boards : Number crunching : Linux Hung Machine (Message 92385)
Posted 27 Mar 2020 by Buckeye4lf
Post:
I have not had the issue since yesterday. I was not a good engineer and changed numerous things all at once:

I doubled the HD space available to 100GB
I decreased CPU utilization to 75% from 100%
I decreased CPU count to 75% from 90%
I reduced the file swap from 75% to 50%
I suspended all Seti jobs from running on GPUs (do not think this matters though)

If I still do not see any issues by tomorrow I will start to increase CPU %s as I have been running between 90-100% on other projects. I am leaving quite a bit of computation power on the table only running at 75%.

Thanks everyone for your suggestions!
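The four preference changes listed above could also be written to a local override file instead of being set through the manager. The sketch below builds such a file with Python's standard library. The element names (`max_ncpus_pct`, `cpu_usage_limit`, `disk_max_used_gb`, `vm_max_used_pct`) and the `global_prefs_override.xml` file name are given as I recall them from the BOINC client documentation and should be checked against it before use; the `build_override` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_override(max_ncpus_pct, cpu_usage_limit, disk_max_used_gb, vm_max_used_pct):
    """Return the text of a preferences override document (element names assumed)."""
    root = ET.Element("global_preferences")
    settings = [
        ("max_ncpus_pct", max_ncpus_pct),        # CPU count, percent of processors
        ("cpu_usage_limit", cpu_usage_limit),    # CPU utilization, percent of time
        ("disk_max_used_gb", disk_max_used_gb),  # disk space BOINC may use
        ("vm_max_used_pct", vm_max_used_pct),    # swap space BOINC may use
    ]
    for tag, value in settings:
        ET.SubElement(root, tag).text = f"{float(value):.6f}"
    return ET.tostring(root, encoding="unicode")

# Values from the post above: 75% CPU count, 75% utilization, 100GB disk, 50% swap.
print(build_override(75, 75, 100, 50))
```

Changing one value at a time and regenerating the file would make it easier to tell which setting actually stops the hangs.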
9) Message boards : Number crunching : Linux Hung Machine (Message 92346)
Posted 26 Mar 2020 by Buckeye4lf
Post:
Are you sure it's not overheating? Rosetta might push the FPU or RAM harder than other projects. Can the machine handle a stress test like P95?

D


Unsure. It is liquid cooled, but I am not sure how to test this theory. I can back off the percentages.
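One low-effort way to look into the overheating theory is to read the temperatures the kernel already exposes while BOINC is running at full load. The sketch below reads the Linux sysfs thermal zones, which report millidegrees Celsius. Which zone corresponds to the CPU depends on the hardware, and some systems expose no zones at all, so treat this as a starting point rather than a proper stress test. The function name is made up for this example.

```python
from pathlib import Path

def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone has no readable temperature; skip it
        temps[zone.name] = millideg / 1000.0
    return temps

if __name__ == "__main__":
    for name, deg in read_thermal_zones().items():
        print(f"{name}: {deg:.1f} C")
```

Running it every few seconds under load and watching whether the values keep climbing before a hang would give some evidence for or against a cooling problem.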
10) Message boards : Number crunching : Linux Hung Machine (Message 92338)
Posted 26 Mar 2020 by Buckeye4lf
Post:
I see your linux machine shows that it has 64 processors, and 128GB of memory, and is running:
Linux LinuxMint
Linux Mint 19.3 Tricia [5.3.0-42-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]

Is the machine running a mix of BOINC projects? Is it running other types of work as well?

With that many tasks running, it would be possible that one got to a point that it was using excessive memory. But I believe the BOINC core monitors that and insulates the rest of the system by making the task wait for memory or ending it.

Just looking at a few of the failed tasks, their peak memory was about 1.2 GB.

Hang conditions are always difficult. Have you seen this happen a few times?

Is BOINC allowed to use most of that memory (CPU preferences)? What about the disk? Is BOINC allowed to use plenty of disk space? (say 2GB per task)

I can only suggest using the settings to run on less than 100 percent of your CPUs and see if this helps.



It is running a mix of projects, but currently the machine is only loaded with Seti (GPU) and Rosetta (CPU). The hanging occurs at least once a day and sometimes again within minutes of rebooting. The machine has 128GB RAM. I currently have BOINC set to use 50% swap and up to 50GB of HD space; the HD itself is 2TB, so space should be no issue. I currently have Computation set at 85% for 90% of the time. In the past when I ran Seti only, I set 90% and 100% time with no issues.

You mentioned 2GB per job. Should I increase the HD space beyond the 50GB already set?
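The disk question can be answered with quick arithmetic. The sketch below takes the "say 2GB per task" figure from the quoted reply as a rough rule of thumb, and assumes the 85% setting refers to the CPU count and is rounded down to whole tasks on a 64-thread machine. Both are assumptions, and `disk_needed_gb` is a hypothetical helper, not a BOINC function.

```python
import math

def disk_needed_gb(threads, cpu_pct, gb_per_task=2.0):
    """Return (concurrent_tasks, estimated GB of disk needed)."""
    tasks = max(1, math.floor(threads * cpu_pct / 100))
    return tasks, tasks * gb_per_task

tasks, needed = disk_needed_gb(64, 85)
limit_gb = 50  # the disk limit mentioned in the post above
print(f"{tasks} tasks need about {needed:.0f} GB; fits in {limit_gb} GB: {needed <= limit_gb}")
```

Under those assumptions, 54 tasks would want about 108 GB, which is more than double the 50GB limit, so raising the limit looks reasonable on a 2TB drive.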
11) Message boards : Number crunching : Linux Hung Machine (Message 92336)
Posted 26 Mar 2020 by Buckeye4lf
Post:
I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help.

Thanks,
Greg


Possible memory/swap issues? Maybe the machine is starting to use a good amount of swap space? How much memory does the machine have?
Charlie


The machine has 128 GB of RAM and nothing other than BOINC is running when it hangs. I did reduce the default file swap from 75% to 50% this morning, though. I will not know if the machine has hung until I get home from work. No other project is having issues with the current BOINC settings, though.
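To check the swap theory while away from the machine, memory and swap usage can be logged from `/proc/meminfo`, which is a standard Linux interface. The sketch below parses the file and reports how much of the swap is in use. The function names are made up for this example, and it only measures; it does not change any BOINC setting.

```python
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into {field: value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are plain counts.
    """
    info = {}
    with open(path) as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[0].isdigit():
                info[key.strip()] = int(parts[0])
    return info

def swap_used_fraction(info):
    """Return the fraction of swap in use, or 0.0 if no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total

if __name__ == "__main__":
    info = read_meminfo()
    print(f"MemAvailable: {info.get('MemAvailable', 0) / 1024**2:.1f} GB")
    print(f"Swap used: {swap_used_fraction(info):.0%}")
```

Appending this output to a file every minute or so would leave a record on disk of the last readings before a hang, which survives the power cycle.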
12) Message boards : Number crunching : Linux Hung Machine (Message 92329)
Posted 26 Mar 2020 by Buckeye4lf
Post:
I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help.

Thanks,
Greg


Not something that I’ve experienced but the evidence should still exist in /var/log/...


I can check when I get home. Usually when I am forced to power cycle, the jobs all error out, which I suspect may mask the true issue.
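When checking the logs after a forced reboot, one thing worth searching for is whether the kernel ran out of memory before the hang. The sketch below scans a log file for lines mentioning "out of memory" or "oom-killer", which are phrases the Linux kernel uses when it kills a process for memory. Which file to scan is an assumption: on Ubuntu-based systems such as Linux Mint the kernel messages usually land in files like `/var/log/syslog` or `/var/log/kern.log`, and reading them may need elevated permissions. The function name is hypothetical.

```python
import sys

def find_oom_lines(path, needles=("out of memory", "oom-killer")):
    """Return [(line_number, text)] for log lines mentioning any needle."""
    hits = []
    with open(path, errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            lowered = line.lower()
            if any(needle in lowered for needle in needles):
                hits.append((lineno, line.rstrip("\n")))
    return hits

if __name__ == "__main__":
    # Usage: pass one or more log file paths on the command line.
    for path in sys.argv[1:]:
        for lineno, text in find_oom_lines(path):
            print(f"{path}:{lineno}: {text}")
```

An empty result does not rule out a memory problem, since a hard hang can happen before anything is written to disk, but a hit just before the reboot time would be strong evidence.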
13) Message boards : Number crunching : Linux Hung Machine (Message 92324)
Posted 26 Mar 2020 by Buckeye4lf
Post:
I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help.

Thanks,
Greg






©2024 University of Washington
https://www.bakerlab.org