Posts by LarryMajor

1) Message boards : Number crunching : File transfers. (Message 91663)
Posted 8 Feb 2020 by LarryMajor
Post:
I'm having the same problem with two machines. It happens occasionally, but it's been bad the past 24 hours.
2) Message boards : Number crunching : Unrealistic expectations (Message 91041)
Posted 18 Aug 2019 by LarryMajor
Post:
Boinc will adjust its work requests based on the history of a given host, but it takes some time. In your case, these jobs were in the original request and, lacking any history, it took a guess.

Now that completed jobs are being reported, it's showing a turnaround time of four days, which it should use in future calculations. To make sure things are working as they should, look at the messages from the last time it contacted the server (or do a manual update); it should say "not requesting jobs, job queue full" or something to that effect. If that looks correct, I'd just let it cycle through the next few server contacts where it does request jobs and see whether the number it downloads is reasonable.

Settings such as target CPU time, number of processors and stuff like that will affect Boinc's calculations, so if you want to change any of these, do that first before letting it settle out.

There are certain circumstances that require adjusting things like the job-queue setting, but in the majority of cases Boinc will (eventually) do a decent job of managing the job queue.
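If you'd rather do the same check from a Linux command line, boinccmd (it ships with the boinc-client package; run it from the BOINC data directory) can show the event log and force a manual update:

boinccmd --get_messages | tail -n 40
boinccmd --project https://boinc.bakerlab.org/rosetta/ update

The first shows the recent messages, including the work-fetch decision, and the second forces a scheduler contact right away.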
3) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 90966)
Posted 2 Aug 2019 by LarryMajor
Post:
Yeah, I got a bunch of these on different machines, and they all fail when they are resent to someone else.
It's the work units, not your computer.
4) Message boards : Number crunching : More WUs simultaneously (Message 90314)
Posted 7 Feb 2019 by LarryMajor
Post:
Just raising (or lowering) the CPU count to change the number of concurrent jobs won't change the runtime; each job will follow the project's target runtime.
If you change the project target runtime, then yes, it will affect all WUs including those you have already downloaded.
5) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 89930)
Posted 25 Nov 2018 by LarryMajor
Post:
Much earlier, my i5-2500 received approximately 800-850 credits for each completed WU.
What can I do?


I'd do nothing for a few days. It appears to have been the recent WUs/scoring that caused a big drop. Mine started to look more typical in the past 24 hours.
6) Message boards : Number crunching : Work Units less than 100% CPU Utilization (Message 89786)
Posted 28 Oct 2018 by LarryMajor
Post:
I have run Rosy for a long time. These are dedicated crunchers set for 100% CPU Utilization. I have never seen this many tasks bounce around on utilization. I wonder if this is a larger issue with some of the new work units.


Paul, are you seeing this on your Intel boxes, or just on the Opterons?

What happens on my Opterons, but not on the FX boxes, is that every two or three days a job’s PID will enter a sleep state and stay there. A normal client stop and restart clears it up and the job finishes normally. The workaround is quick enough that I never looked for a cause.
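If anyone wants to check for the same thing, something like

ps -eo pid,stat,time,comm | grep -i rosetta

shows each task's state; a task sitting in the S (sleep) state whose TIME column stops advancing is the symptom I described. On Debian-style systems,

sudo systemctl restart boinc-client

does the same stop/restart as the manager.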

Also, after seeing your post, I watched things run for a while and noticed that some of the recent jobs are using a LOT of system overhead. I use htop, which splits out job and system overhead per CPU (core), but if your utility doesn't, it will look exactly the way you described. I have to think this is specific to a WU, since I never noticed it before.
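If your monitoring tool doesn't make that split, pidstat from the sysstat package gives the same per-process breakdown:

pidstat -u -p <pid> 5

That reports %usr and %system separately every five seconds, so a WU burning system time shows up right away (<pid> is the task's process ID).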
7) Message boards : Number crunching : New WUs failing (Message 89704)
Posted 7 Oct 2018 by LarryMajor
Post:
Apart from that, any comments as to why all of a sudden so many WUs fail?


Log on to your Rosetta account and click the "view" link next to TASKS. On the next screen, you can view tasks by completion state.

Most of your errors were caused on Oct 3 because the WUs did not complete before the deadline. There are a few things that can cause this: resetting the project, not processing jobs for a period of time, and so on.
You probably want to keep an eye on the deadlines of the jobs you have queued up, in case your account settings need adjusting, but that doesn't appear to be the case here.

When you look at the reason for the errors, you can often tell whether there was just something wrong with the WU (you had some of these) or whether it was a local problem.

Hope this helps.
8) Message boards : Number crunching : Error while computing - AMD Opteron (Message 88865)
Posted 12 May 2018 by LarryMajor
Post:

Yours seem to error with a signal 11. One of the standard moderator suggestions is to RESET the project and download clean copies of all the Rosetta files. You probably have already tried that. In a number of cases that has seemed to heal the problem.

Does "dmesg" show any boinc related errors?


Yeah, did the reset and dmesg is clean.
One thing I did just realize is that the FX box is running Linux WUs under FreeBSD. The Opteron runs Debian Linux.

I'm tempted to build a BSD system disk for the Opteron this weekend, just to see what happens.
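For anyone running the same check, filtering the kernel log with something like

dmesg | grep -i segfault

is usually enough to tell whether the Rosetta binaries are actually crashing at the OS level; a signal 11 from a task normally leaves a "segfault at ..." line there.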
9) Message boards : Number crunching : Error while computing - AMD Opteron (Message 88863)
Posted 11 May 2018 by LarryMajor
Post:
My Opteron box with the same problem has glibc 2.24 and nearly 4G of memory per core.
The FX box has about 2G per core and has posted none of these errors.

It started suddenly, and a few hundred jobs failed before I noticed. I switched the machine over to WCG (World Community Grid), where it runs and verifies with no errors.

I tried letting some more WUs in yesterday and 45 out of about 160 failed at about one minute.
10) Message boards : Number crunching : Rosetta 4.0+ (Message 88348)
Posted 23 Feb 2018 by LarryMajor
Post:
It has nothing to do with segfaults. I have the "fixed" version, which avoids those. It is something specific to Rosetta, as it does not have unusual errors on any of the other projects I do. I don't really know what other AMD chips are affected, except that as a class they have a higher error rate than the Intels here.

I see your point. Looking at another Opteron machine, I found the same job with the same error, except his blew at a little over 12 hours, which is likely because his target time is 8 hours.

Some of these really do work when they take well over 12 hours, so I bumped my target time to 16 hours.

If they fail then, it appears that I have some decisions to make.

Thanks for your help.
11) Message boards : Number crunching : Rosetta 4.0+ (Message 88345)
Posted 22 Feb 2018 by LarryMajor
Post:
I got a high percentage of errors with my Ryzen 1700, and noticed that a lot of other people with AMD chips have a high error rate too.

Yeah, I saw where some Ryzens were posting segfaults, but the output from them (at least the ones I saw) had different errors from what I'm getting.
The Opterons are 61xx and don't do threads, so that rules out any threading issues.

I have an AMD FX machine on this project, and it hasn't had any problems, but that may be sheer chance since it does only a fraction of the work that the Opteron box does.
12) Message boards : Number crunching : Rosetta 4.0+ (Message 88344)
Posted 22 Feb 2018 by LarryMajor
Post:
The errors are 4.x.

One thing I noticed is that the successful one created only 1 decoy. I have the target CPU time set at 12 hours and they fail at 16 hours.
Is there a chance that the watchdog is terminating these jobs as an overrun if they go 4 hours over the target time?
13) Message boards : Number crunching : Rosetta 4.0+ (Message 88337)
Posted 22 Feb 2018 by LarryMajor
Post:
I started getting errors about a week ago. The common points are that the jobs are all PF*_bnd_aivan_SAVE_ALL_OUT*, and that I only get errors on the machine with AMD Opterons. Some WUs with this name run successfully, and the ones that fail all run four hours past the target CPU time before erroring out.
The error, in part, is “WARNING! cannot get file size for default.out.gz: could not open file” and “Output exists: default.out.gz Size: -1.” The Exit Status is 11.

About half the jobs fail when re-sent to other machines, but when I looked at one that finished successfully on another machine, I see the same errors in both outputs:
Failed:
https://boinc.bakerlab.org/result.php?resultid=974837214
Completed:
https://boinc.bakerlab.org/result.php?resultid=975103716

After seeing the same errors, but an Exit Status 0 on the re-send, I’m really confused about where the problem lies, and will appreciate any help you guys can give me.
14) Message boards : Number crunching : WUs estimated time way off to elapsed time (Message 87106)
Posted 19 Aug 2017 by LarryMajor
Post:
apt-get install package_name=version
is the syntax for it. I'm running Debian, and apt only wants to use version 7.6.33 of boinc, boinc-client, and boinc-manager after I upgraded to kernel 4.9.0, if that's any help.

When you find the versions and packages that work for you -
apt-mark hold package_name
should keep the system from upgrading them.
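For the BOINC packages specifically, that works out to something like:

apt-cache policy boinc-client
apt-get install boinc-client=<version> boinc-manager=<version>
apt-mark hold boinc-client boinc-manager

where <version> is whatever exact version string apt-cache policy reports for the 7.6.33 build.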

My machine is very similar to yours and it's been on 7.6.33 for close to a month with no problems, so if there's anything that I can check on it that might help, let me know.
15) Message boards : Number crunching : Legacy CPU Performance (Message 87044)
Posted 11 Aug 2017 by LarryMajor
Post:
It appears as if 500 per core is about right for a 6100. (I have a 6128 and that's around where it runs.)

The 6272 has an all-core boost of 2.4 GHz, which is only 0.1 GHz faster than your 6176, so the big difference would be in having 64 cores.

The 6378 has an all-core boost of 2.7 GHz, and I've seen tests (rumors?) that they run 5-10 percent faster than the previous generation. That suggests a per-core figure of around 625, but that IS just extrapolation.

I, too, have been watching the price of used processors, and for the small price difference, I'm planning my upgrade to one of the 6300 processors.
16) Message boards : Number crunching : Larger Memory Models (Message 86974)
Posted 4 Aug 2017 by LarryMajor
Post:
As others have mentioned, some of us have enough memory to run 2 GB jobs and will gladly do it. If it works better for you guys, it might be worth designing a way for us to request such jobs.
17) Message boards : Number crunching : Stuck on uploading is a new problem? (Message 81496)
Posted 18 Apr 2017 by LarryMajor
Post:
My machine's backlog cleared out today.
A big THANK YOU to all the hard working folks that resolved this.

P.S. Some of us are dying to know what caused such a strange problem.
18) Message boards : Number crunching : Stuck on uploading is a new problem? (Message 81481)
Posted 17 Apr 2017 by LarryMajor
Post:
I'm getting 1 WU that hangs for every 40-50 that process normally (under Linux).

Just a thought - have you guys taken a WU known to have the problem, and tried running it on another host to see if anything looks unusual?
19) Message boards : Number crunching : DNS Problems and Late Work Units (Message 81280)
Posted 10 Mar 2017 by LarryMajor
Post:
A static list:

128.95.160.140 boinc.bakerlab.org
128.95.160.141 ralph.bakerlab.org
128.95.160.142 srv1.bakerlab.org
128.95.160.143 srv2.bakerlab.org
128.95.160.144 srv3.bakerlab.org
128.95.160.145 srv4.bakerlab.org
128.95.160.146 srv5.bakerlab.org

This covers all the names and IPs I needed at least, for both Rosetta & Ralph.


Thank you! I've been poking at this for a couple days and your list had the one server I missed.
Just reported a couple dozen completed WUs right under the deadline.
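In case anyone else needs the same workaround: appending each line of that list to /etc/hosts (as root) is all it takes, e.g.

echo "128.95.160.140 boinc.bakerlab.org" >> /etc/hosts

repeated for each entry, and then removing them once DNS behaves again. The addresses were current as of this thread and may have changed since.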