Error while computing - AMD Opteron

Paul
Joined: 29 Oct 05
Posts: 193
Credit: 65,736,681
RAC: 3,079
Message 88855 - Posted: 11 May 2018, 12:35:36 UTC

All:

I have many work units failing with "Error while computing" after about 1 min of run time. All of my AMD Opteron cores are chewing through these WUs. All of them are rb_05_10_167_247__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_600454_

I think I have 500 failed work units and growing.

examples:
https://boinc.bakerlab.org/workunit.php?wuid=898124245
https://boinc.bakerlab.org/workunit.php?wuid=898124145
https://boinc.bakerlab.org/workunit.php?wuid=898124156
Thx!

Paul

mmonnin
Joined: 2 Jun 16
Posts: 54
Credit: 20,058,207
RAC: 31,720
Message 88858 - Posted: 11 May 2018, 17:03:05 UTC
Last modified: 11 May 2018, 17:03:14 UTC

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,038,193
RAC: 17,131
Message 88859 - Posted: 11 May 2018, 17:24:00 UTC - in response to Message 88855.  

All:

I have many work units failing with "Error while computing" after about 1 min of run time. All of my AMD Opteron cores are chewing through these WUs. All of them are rb_05_10_167_247__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_600454_

I think I have 500 failed work units and growing.

examples:
https://boinc.bakerlab.org/workunit.php?wuid=898124245
https://boinc.bakerlab.org/workunit.php?wuid=898124145
https://boinc.bakerlab.org/workunit.php?wuid=898124156


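They seem to be failing with signal 11. On Linux that is SIGSEGV, a segmentation fault: the program touched memory it was not allowed to access. That can be a bug in the application, but it can also be a side effect of a resource problem such as running out of memory or disk.
Note that signal 11 should not be confused with errno 11, which is a different thing:
EAGAIN 11 Resource temporarily unavailable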


You have plenty of memory, but some of the Rosetta WU seem to get into a state where they take many GB of memory.
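
For anyone who wants to double-check what a signal number means, here is a minimal shell sketch. It only relies on the standard `kill -l` builtin. The exit status 139 is just an illustrative value (128 + 11), not something taken from the failed tasks above.

```shell
# Translate a signal number into its name (prints SEGV on Linux).
kill -l 11

# Shells report a process killed by signal N with exit status 128 + N,
# so a status of 139 corresponds to signal 11.
status=139                      # illustrative value, not from a real task
kill -l "$((status - 128))"
```

If a task log shows an exit status instead of a signal number, subtracting 128 is usually enough to map it back.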

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,038,193
RAC: 17,131
Message 88860 - Posted: 11 May 2018, 17:42:36 UTC - in response to Message 88858.  

Same thing reported here:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=88854#88854


These are aborting, but for a different reason. The command line is followed by an error message that seems to point to a problem with the locale settings.

rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.
SIGABRT: abort called


My Fedora 27 machine is running fine with

cat /etc/locale.conf
LANG="en_US.UTF-8"


locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
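
If you want to compare your own settings with the output above, here is a small sketch. The helper name `effective_lc_time` is made up for this example. It just applies the usual precedence (LC_ALL, then LC_TIME, then LANG) to show which value ends up governing LC_TIME, the category named in the failed assertion.

```shell
# Hypothetical helper: print the locale that governs LC_TIME.
# Precedence: LC_ALL overrides LC_TIME, which overrides LANG;
# with none of them set, the C locale applies.
effective_lc_time() {
  echo "${LC_ALL:-${LC_TIME:-${LANG:-C}}}"
}

effective_lc_time

# List the installed locales so you can confirm that value actually exists.
locale -a 2>/dev/null || echo "locale command not available"
```

This only reports the settings; it does not tell you whether the Rosetta binary can load them.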

mmonnin
Joined: 2 Jun 16
Posts: 54
Credit: 20,058,207
RAC: 31,720
Message 88861 - Posted: 11 May 2018, 17:52:56 UTC

Ah yeah, I saw 193 and thought it was the same as I've seen it on several posts.

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,038,193
RAC: 17,131
Message 88862 - Posted: 11 May 2018, 17:57:58 UTC - in response to Message 88860.  

After some more digging, it appears that there may be an incompatibility between glibc version 2.27 and applications statically linked against glibc version 2.26.

You can check the version of glibc with the ldd command.
My glibc is still the 2.26 version which should work.

ldd --version
ldd (GNU libc) 2.26

I think you may have problems with Rosetta if you have glibc 2.27.

This feature will cause existing statically compiled applications
to fail to load locales and fall back to the builtin C/POSIX locales.
See notes below for other changes affecting compatibility.

https://lists.gnu.org/archive/html/info-gnu/2018-02/msg00000.html
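
To make that check a bit more mechanical, here is a rough sketch. `version_ge` is a helper invented for this example, and the 2.27 threshold is only the suspicion described above, not a confirmed cutoff. It assumes `ldd --version` prints the glibc version as the last word of its first line, as in the output shown above.

```shell
# Hypothetical helper: succeed if dotted version $1 >= $2
# (compares major and minor numerically, so 2.9 < 2.27).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
    exit !(x[2] + 0 >= y[2] + 0)
  }'
}

# Last word of the first line of `ldd --version`, e.g. "2.26".
glibc_version=$(ldd --version 2>/dev/null | awk 'NR == 1 { print $NF }')

if version_ge "${glibc_version:-0.0}" 2.27; then
  echo "glibc ${glibc_version}: 2.27 or newer, may hit the locale assertion"
else
  echo "glibc ${glibc_version:-unknown}: older than 2.27"
fi
```

A plain string comparison would get this wrong for versions like 2.9, which is why the helper compares the numbers field by field.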

LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 88863 - Posted: 11 May 2018, 23:27:28 UTC

My Opteron box with the same problem has glibc 2.24 and nearly 4G of memory per core.
The FX box has about 2G per core and has posted none of these errors.

It started suddenly and a few hundred jobs failed before I noticed. I switched the machine over to WCG, where it runs and verifies with no errors.

I tried letting some more WUs in yesterday and 45 out of about 160 failed at about one minute.

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,038,193
RAC: 17,131
Message 88864 - Posted: 11 May 2018, 23:54:08 UTC - in response to Message 88863.  

My Opteron box with the same problem has glibc 2.24 and nearly 4G of memory per core.
The FX box has about 2G per core and has posted none of these errors.

It started suddenly and a few hundred jobs failed before I noticed. I switched the machine over to WCG, where it runs and verifies with no errors.

I tried letting some more WUs in yesterday and 45 out of about 160 failed at about one minute.



The WUs that error out with "Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.
SIGABRT: abort called" are the ones that look like a glibc 2.27 problem.

Yours seem to error with a signal 11. One of the standard moderator suggestions is to RESET the project and download clean copies of all the Rosetta files. You probably have already tried that. In a number of cases that has seemed to heal the problem.

Does "dmesg" show any boinc related errors?

LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 88865 - Posted: 12 May 2018, 0:20:34 UTC - in response to Message 88864.  


Yours seem to error with a signal 11. One of the standard moderator suggestions is to RESET the project and download clean copies of all the Rosetta files. You probably have already tried that. In a number of cases that has seemed to heal the problem.

Does "dmesg" show any boinc related errors?


Yeah, did the reset and dmesg is clean.
One thing I just realized is that the FX box is running the Linux WUs under FreeBSD. The Opteron is on Debian Linux.

I'm tempted to build a BSD system disk for the Opteron this weekend, just to see what happens.

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,038,193
RAC: 17,131
Message 88866 - Posted: 12 May 2018, 3:40:12 UTC - in response to Message 88865.  


Yours seem to error with a signal 11. One of the standard moderator suggestions is to RESET the project and download clean copies of all the Rosetta files. You probably have already tried that. In a number of cases that has seemed to heal the problem.

Does "dmesg" show any boinc related errors?


Yeah, did the reset and dmesg is clean.
One thing I just realized is that the FX box is running the Linux WUs under FreeBSD. The Opteron is on Debian Linux.

I'm tempted to build a BSD system disk for the Opteron this weekend, just to see what happens.


If you like your install, you can also use a vbox environment to test changes without reinstalling everything.

Paul
Joined: 29 Oct 05
Posts: 193
Credit: 65,736,681
RAC: 3,079
Message 88870 - Posted: 12 May 2018, 11:59:34 UTC - in response to Message 88865.  
Last modified: 12 May 2018, 12:12:11 UTC

I am running Ubuntu 16.04 LTS. I can try a project reset. How do I look at dmesg?

ldd version 2.23

64GB RAM
4 AMD Opteron 6176 Processors with 12 Cores each
250GB SSD

100% dedicated to Rosetta. Everything else runs fine including other Rosetta WUs

Problem started all at once. I did not reset the project as she is running 48 active WUs. Hate to waste all that progress.
Thx!

Paul

ChristianVirtual
Joined: 29 Apr 17
Posts: 5
Credit: 1,684,275
RAC: 0
Message 88872 - Posted: 12 May 2018, 12:16:31 UTC

I also have quite a bit of trouble with WUs that run for 24 hours and then fail, on a Ryzen with Ubuntu,
like this one: https://boinc.bakerlab.org/result.php?resultid=996310562

"Too many total results"

ChristianVirtual
Joined: 29 Apr 17
Posts: 5
Credit: 1,684,275
RAC: 0
Message 88873 - Posted: 12 May 2018, 12:18:44 UTC

another strange one

https://boinc.bakerlab.org/workunit.php?wuid=898555819

Why did the server cancel those?

(Sorry, I probably should have started a new thread.)

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,038,193
RAC: 17,131
Message 88875 - Posted: 12 May 2018, 15:40:14 UTC - in response to Message 88870.  

I am running Ubuntu 16.04 LTS. I can try a project reset. How do I look at dmesg?

ldd version 2.23

64GB RAM
4 AMD Opteron 6176 Processors with 12 Cores each
250GB SSD

100% dedicated to Rosetta. Everything else runs fine including other Rosetta WUs

Problem started all at once. I did not reset the project as she is running 48 active WUs. Hate to waste all that progress.


Hi Paul,
Linux has a very good set of online manual pages. You can type "man dmesg" and it will tell you the exact syntax. I just run "dmesg" and Linux displays a long list of kernel information and error messages. You can grep for boinc messages, ignoring case. Ubuntu 16.04 is stable and I expect you will find nothing except possibly "disk out of space" messages.

dmesg | grep -i boinc
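
A slightly wider filter can also catch crashes and out-of-memory kills that never mention boinc by name. This is only a sketch: `filter_kernel_log` is a made-up helper, the pattern list is my guess at useful keywords, and on some systems dmesg needs root (sudo).

```shell
# Hypothetical helper: keep kernel log lines that mention BOINC, Rosetta,
# a segfault, or the out-of-memory killer (case-insensitive).
filter_kernel_log() {
  grep -iE 'boinc|rosetta|segfault|out of memory|oom-killer'
}

# Feed it the kernel log; print a note if nothing matches
# or dmesg is not readable without root.
dmesg 2>/dev/null | filter_kernel_log || echo "no matching kernel messages"
```

An empty result does not prove the machine is healthy, only that the kernel did not log anything matching those words.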


IMO, your system is fine, but the drive is a little small. I suspect that the Ubuntu system partition where the boinc directory resides is getting full.
Ubuntu chops the SSD into sections ... user, system, ....
Boinc, on my Fedora, is put in the system section, which is typically not expected to grow as much as the user part, so Linux allocates it a smaller part of the disk.
Fedora puts the boinc directory at /var/lib/boinc
Boinc projects are placed in /var/lib/boinc/projects/
Each WU is given a directory under /var/lib/boinc/slots/ and a Rosetta WU uses about 0.5GB of space.
In theory, you have enough space on the SSD drive if Ubuntu gave enough to the system.
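
As a back-of-the-envelope check, here is a short sketch of that estimate. The 48 slots come from the 48 active WUs mentioned in the quoted post and the 512 MB per slot is the rough 0.5GB figure above. The BOINC_DIR default is the Fedora location and is an assumption for any other distro, so adjust it to wherever your install keeps its data.

```shell
# Rough estimate: active slots times space per slot.
slots=48            # active WUs reported in the quoted post
mb_per_slot=512     # "about 0.5GB" per Rosetta slot
echo "$(( slots * mb_per_slot / 1024 )) GB needed for slot directories"

# Actual usage per slot directory. The path is an assumption (Fedora layout).
BOINC_DIR=${BOINC_DIR:-/var/lib/boinc}
du -sh "$BOINC_DIR"/slots/* 2>/dev/null || echo "no slot directories found under $BOINC_DIR"
```

That works out to about 24 GB for the slots alone, before the project files and the rest of the system are counted.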

I would look to see if one of the disk partitions is filling up. I don't like to see partitions getting near full ... more than 80% or so.
df -h
will display the disk resources in human format.
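
To apply that 80% rule of thumb automatically, something like the following should work. It is a sketch, not a polished tool: `flag_full` is a name made up here, the default of 80 is just the figure mentioned above, and it uses `df -P` (the POSIX one-line-per-filesystem format) rather than `df -h` so the columns are easy to parse. Mount points containing spaces are not handled.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every
# filesystem whose use% is at or above the threshold (default 80).
flag_full() {
  awk -v limit="${1:-80}" 'NR > 1 {
    use = $5; sub(/%/, "", use)          # "83%" -> "83"
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

df -P | flag_full 80
```

No output means nothing is at or above the threshold.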

NOTE: my boinc directory is large for my machine, since I have multiple projects connected AND I was doing some other work there.

df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,038,193
RAC: 17,131
Message 88876 - Posted: 12 May 2018, 16:11:52 UTC - in response to Message 88873.  

another strange one

https://boinc.bakerlab.org/workunit.php?wuid=898555819

Why did the server cancel those?

(Sorry, I probably should have started a new thread.)


I suspect that the "Too many total results" message on a stable machine comes from the Rosetta side. Sometimes a researcher will post WUs with some problems and the admins will do a "stop all his jobs" on their side to keep from wasting resources. I think they manually add credit for compute hours that were donated. Moderators will correct me if I'm wrong.

I expect that the problem will naturally drain and the machine will run smoothly. I would just track it until the WUs are running properly again.

Usuario1_S
Joined: 24 Mar 14
Posts: 92
Credit: 3,059,705
RAC: 0
Message 88965 - Posted: 21 May 2018, 15:59:31 UTC

Every few days I get a computing error on my AMD FX-8370. Maybe it is a compiler thing that won't use the Intel instruction set, or that expects something only Intel CPUs provide? Why would I get an error? My apps never crash. This is Win 8.1 64-bit, fully updated, including drivers.

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,038,193
RAC: 17,131
Message 88967 - Posted: 21 May 2018, 19:34:51 UTC - in response to Message 88965.  

Every few days I get a computing error on my AMD FX-8370. Maybe it is a compiler thing that won't use the Intel instruction set, or that expects something only Intel CPUs provide? Why would I get an error? My apps never crash. This is Win 8.1 64-bit, fully updated, including drivers.


I think your machine is OK. It looks like a Rosetta researcher submitted some bad jobs.

I looked at the ERROR and the INVALID WUs that you got. 1 ERROR and 11 INVALID.

The Error was an "OUT OF MEMORY" error. I have periodically seen Rosetta WUs using ~12GB on my machine, which has 32GB of memory. IMO, that is a Rosetta memory leak or similar problem. It was not a problem with your machine.

Every one of the INVALID WU's had similar names of the form: "number_JW-16052018-ffdtest_number_globalDocking_6_SAVE_ALL_OUT_652671_5_0"
like ...
1b2be7b0a57c1dfd6afe5dac5bbef86f_JW-16052018-ffdtest_18_05_12_00_37_localDocking_8_SAVE_ALL_OUT_652692_22_0
0f88b190960d44e73870d8f8bc8deae2_JW-16052018-ffdtest_18_05_12_04_49_globalDocking_6_SAVE_ALL_OUT_652671_5_0
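
A quick way to see whether failures cluster in one batch is to group the WU names by their batch tag. In this sketch the tag is assumed to be the second underscore-separated field (JW-16052018-ffdtest in the two names above). `count_batches` is a made-up helper and the naming convention is my inference, not something documented by the project.

```shell
# Hypothetical helper: read one WU name per line on stdin and print
# "<count> <batch tag>" for each distinct tag (field 2 when split on "_").
count_batches() {
  awk -F_ 'NF >= 2 { n[$2]++ } END { for (t in n) print n[t], t }' | sort -k2
}

# The two INVALID names quoted above land in the same batch.
count_batches <<'EOF'
1b2be7b0a57c1dfd6afe5dac5bbef86f_JW-16052018-ffdtest_18_05_12_00_37_localDocking_8_SAVE_ALL_OUT_652692_22_0
0f88b190960d44e73870d8f8bc8deae2_JW-16052018-ffdtest_18_05_12_04_49_globalDocking_6_SAVE_ALL_OUT_652671_5_0
EOF
```

Pasting the full list of failed task names from the results page into the here-document gives a count per batch.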

I checked Rosetta results on one of my Intel machines and found the same INVALID error on a job from the same batch, so the whole group seems to be bad. The job was named:
"0c36992d889c7c78029a6405354254d6_JW-16052018-ffdtest_18_05_12_01_19_globalDocking_6_SAVE_ALL_OUT_652669_35_0".

Usuario1_S
Joined: 24 Mar 14
Posts: 92
Credit: 3,059,705
RAC: 0
Message 88994 - Posted: 25 May 2018, 4:55:11 UTC - in response to Message 88967.  
Last modified: 25 May 2018, 4:55:32 UTC

Thank you for the info, mate. Yes, I think you're right, that group is bad. That's a relief. Sorry for the late response.



©2024 University of Washington
https://www.bakerlab.org