Error while computing

Author	Message
Paul Send message Joined: 29 Oct 05 Posts: 193 Credit: 66,343,970 RAC: 8,285	Message 88855 - Posted: 11 May 2018, 12:35:36 UTC All: Many of my work units are failing with Error while computing after about 1 minute of run time. All of my AMD Opteron cores are chewing through these WUs. All of the failed units are named rb_05_10_167_247__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_600454_ and I think I have 500 failed work units and growing. Examples: https://boinc.bakerlab.org/workunit.php?wuid=898124245 https://boinc.bakerlab.org/workunit.php?wuid=898124145 https://boinc.bakerlab.org/workunit.php?wuid=898124156 Thx! Paul ID: 88855 · Rating: 0 · rate: / Reply Quote

mmonnin Send message Joined: 2 Jun 16 Posts: 58 Credit: 23,650,470 RAC: 42,190	Message 88858 - Posted: 11 May 2018, 17:03:05 UTC Last modified: 11 May 2018, 17:03:14 UTC Same thing reported here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=88854#88854 ID: 88858 · Rating: 0 · rate: / Reply Quote

rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,021,655 RAC: 6,749	Message 88859 - Posted: 11 May 2018, 17:24:00 UTC - in response to Message 88855.
Quoting Paul (Message 88855): All: Many of my work units are failing with Error while computing after about 1 minute of run time. All of my AMD Opteron cores are chewing through these WUs. All of the failed units are named rb_05_10_167_247__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_600454_ and I think I have 500 failed work units and growing. Examples: https://boinc.bakerlab.org/workunit.php?wuid=898124245 https://boinc.bakerlab.org/workunit.php?wuid=898124145 https://boinc.bakerlab.org/workunit.php?wuid=898124156
They seem to be failing with signal 11. On Linux that is SIGSEGV, a segmentation fault, which means the process made an invalid memory access. It should not be confused with errno 11 (EAGAIN, "Resource temporarily unavailable"), which belongs to a separate numbering. You have plenty of memory, but some of the Rosetta WUs seem to get into a state where they take many GB of memory.
ID: 88859 · Rating: 0 · rate: / Reply Quote
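
For anyone who wants to check what a signal number means on their own machine, the shell can translate it. This is only a minimal sketch: `signal_name` is a made-up helper name, and the mapping of 11 to SEGV is assumed to hold on the Linux systems discussed in this thread.

```shell
# signal_name is a hypothetical helper: print the name of a signal number.
signal_name() {
    kill -l "$1"
}

signal_name 11   # on Linux this prints SEGV (segmentation fault)
```

Signal numbers and errno values are separate numbering schemes, so the same number can mean different things in each.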

rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,021,655 RAC: 6,749	Message 88860 - Posted: 11 May 2018, 17:42:36 UTC - in response to Message 88858.
Quoting mmonnin (Message 88858): Same thing reported here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=88854#88854
These are aborting, but for a different reason. The command line is followed by an error message that seems to point to a problem with the locale settings:
    rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.
    SIGABRT: abort called
My Fedora 27 machine is running fine with:
    cat /etc/locale.conf
    LANG="en_US.UTF-8"
    locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
ID: 88860 · Rating: 0 · rate: / Reply Quote

mmonnin Send message Joined: 2 Jun 16 Posts: 58 Credit: 23,650,470 RAC: 42,190	Message 88861 - Posted: 11 May 2018, 17:52:56 UTC Ah yeah, I saw 193 and thought it was the same as I've seen it on several posts. ID: 88861 · Rating: 0 · rate: / Reply Quote

rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,021,655 RAC: 6,749	Message 88862 - Posted: 11 May 2018, 17:57:58 UTC - in response to Message 88860.
After some more digging, it appears that there may be an incompatibility between glibc version 2.27 and applications statically linked with glibc version 2.26. You can check the version of glibc with the ldd command. My glibc is still version 2.26, which should work.
    ldd --version
    ldd (GNU libc) 2.26
I think you may have problems with Rosetta if you have glibc 2.27. From the notes linked below: "This feature will cause existing statically compiled applications to fail to load locales and fall back to the builtin C/POSIX locales. See notes below for other changes affecting compatibility." https://lists.gnu.org/archive/html/info-gnu/2018-02/msg00000.html
ID: 88862 · Rating: 0 · rate: / Reply Quote
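
The version check described above can be scripted. The following is a rough sketch, not anything shipped with BOINC or Rosetta: `glibc_version` and `version_at_least` are hypothetical helper names, `getconf GNU_LIBC_VERSION` is assumed to be available (glibc systems), and `sort -V` is assumed to be supported (GNU coreutils).

```shell
# Hypothetical helpers, not part of BOINC or Rosetta.

# Print the glibc version number, e.g. "2.26" (empty on non-glibc systems).
glibc_version() {
    getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}'
}

# Succeed if version $1 is greater than or equal to version $2.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if version_at_least "$(glibc_version)" 2.27; then
    echo "glibc is 2.27 or newer: statically linked apps may hit the locale problem"
else
    echo "glibc is older than 2.27 (or not detected)"
fi
```

The comparison uses version sort rather than a numeric test so that a value such as 2.9 is not treated as newer than 2.27.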

LarryMajor Send message Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0	Message 88863 - Posted: 11 May 2018, 23:27:28 UTC My Opteron box with the same problem has glibc 2.24 and nearly 4G of memory per core. The FX box has about 2G per core and has posted none of these errors. It started suddenly and a few hundred jobs failed before I noticed. I switched the machine over to WGC where it runs and verifies with no errors. I tried letting some more WUs in yesterday and 45 out of about 160 failed at about one minute. ID: 88863 · Rating: 0 · rate: / Reply Quote

rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,021,655 RAC: 6,749	Message 88864 - Posted: 11 May 2018, 23:54:08 UTC - in response to Message 88863. My Opteron box with the same problem has glibc 2.24 and nearly 4G of memory per core. The FX box has about 2G per core and has posted none of these errors. It started suddenly and a few hundred jobs failed before I noticed. I switched the machine over to WGC where it runs and verifies with no errors. I tried letting some more WUs in yesterday and 45 out of about 160 failed at about one minute. The WUs that error out with "Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed. SIGABRT: abort called" are the ones that look like a glibc 2.27 problem. Yours seem to error with a signal 11. One of the standard moderator suggestions is to RESET the project and download clean copies of all the Rosetta files. You probably have already tried that. In a number of cases that has seemed to heal the problem. Does "dmesg" show any boinc related errors? ID: 88864 · Rating: 0 · rate: / Reply Quote
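
A possible way to scan the kernel log for relevant entries is sketched below. `filter_kernel_log` is a hypothetical helper name, and the search terms are an assumption about which messages are worth looking at (the kernel normally logs segfaults and out-of-memory kills with those words).

```shell
# filter_kernel_log is a hypothetical helper: keep kernel log lines that
# mention BOINC, Rosetta, a segfault, or the out-of-memory killer.
filter_kernel_log() {
    grep -iE 'boinc|rosetta|segfault|out of memory'
}

# dmesg may need root on systems that restrict kernel log access.
dmesg 2>/dev/null | filter_kernel_log || echo "no matching kernel messages"
```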

LarryMajor Send message Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0	Message 88865 - Posted: 12 May 2018, 0:20:34 UTC - in response to Message 88864. Yours seem to error with a signal 11. One of the standard moderator suggestions is to RESET the project and download clean copies of all the Rosetta files. You probably have already tried that. In a number of cases that has seemed to heal the problem. Does "dmesg" show any boinc related errors? Yeah, did the reset and dmesg is clean. One thing I did just realize, is that the FX box is running Linux WUs under FREEBSD. The Opteron is Debian Linux. I'm tempted to build a BSD system disk for the Opteron this weekend, just to see what happens. ID: 88865 · Rating: 0 · rate: / Reply Quote

rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,021,655 RAC: 6,749	Message 88866 - Posted: 12 May 2018, 3:40:12 UTC - in response to Message 88865. Yours seem to error with a signal 11. One of the standard moderator suggestions is to RESET the project and download clean copies of all the Rosetta files. You probably have already tried that. In a number of cases that has seemed to heal the problem. Does "dmesg" show any boinc related errors? Yeah, did the reset and dmesg is clean. One thing I did just realize, is that the FX box is running Linux WUs under FREEBSD. The Opteron is Debian Linux. I'm tempted to build a BSD system disk for the Opteron this weekend, just to see what happens. If you like your install, you can also use a vbox environment to test changes without reinstalling everything. ID: 88866 · Rating: 0 · rate: / Reply Quote

Paul Send message Joined: 29 Oct 05 Posts: 193 Credit: 66,343,970 RAC: 8,285	Message 88870 - Posted: 12 May 2018, 11:59:34 UTC - in response to Message 88865. Last modified: 12 May 2018, 12:12:11 UTC I am running Ubuntu 16.04 LTS. I can try a project reset. How do I look at dmesg? System: ldd version 2.23; 64GB RAM; 4 AMD Opteron 6176 processors with 12 cores each; 250GB SSD; 100% dedicated to Rosetta. Everything else runs fine, including other Rosetta WUs. The problem started all at once. I have not reset the project yet, as she is running 48 active WUs. Hate to waste all that progress. Thx! Paul ID: 88870 · Rating: 0 · rate: / Reply Quote

ChristianVirtual Send message Joined: 29 Apr 17 Posts: 5 Credit: 1,684,275 RAC: 0	Message 88872 - Posted: 12 May 2018, 12:16:31 UTC I have also quite some trouble with WU, 24 hours and fail ... on Ryzen with Ubuntu like this https://boinc.bakerlab.org/result.php?resultid=996310562 "Too many total results" ID: 88872 · Rating: 0 · rate: / Reply Quote

ChristianVirtual Send message Joined: 29 Apr 17 Posts: 5 Credit: 1,684,275 RAC: 0	Message 88873 - Posted: 12 May 2018, 12:18:44 UTC another strange one https://boinc.bakerlab.org/workunit.php?wuid=898555819 why the server cancelled those ? (sorry, might should have made a new thread) ID: 88873 · Rating: 0 · rate: / Reply Quote

rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,021,655 RAC: 6,749	Message 88875 - Posted: 12 May 2018, 15:40:14 UTC - in response to Message 88870.
Quoting Paul (Message 88870): I am running Ubuntu 16.04 LTS. I can try a project reset. How do I look at dmesg? System: ldd version 2.23; 64GB RAM; 4 AMD Opteron 6176 processors with 12 cores each; 250GB SSD; 100% dedicated to Rosetta. Everything else runs fine, including other Rosetta WUs. The problem started all at once. I have not reset the project yet, as she is running 48 active WUs. Hate to waste all that progress.
Hi Paul, Linux has a very good set of online manual pages. You can type "man dmesg" and it will tell you the exact syntax. I just run "dmesg" and Linux displays a long list of all the information and error messages. You might just grep case-insensitively for boinc messages. Ubuntu 16.04 is stable and I expect you will find nothing except possibly "disk out of space" messages.
    dmesg | grep -i boinc
My guess is that your system is fine, but the drive is a little small. I suspect that the Ubuntu system partition where the boinc directory resides is getting full. Ubuntu chops the SSD into sections ... user, system, .... Boinc, on my Fedora, is put in the system section, which is typically not expected to grow as much as the user part, so Linux allocates it a smaller part of the disk. Fedora puts the boinc directory at /var/lib/boinc. Boinc projects are placed in /var/lib/boinc/projects/. Each WU is given a /var/lib/boinc/slots/ directory and Rosetta uses about 0.5GB of space. In theory, you have enough space on the SSD if Ubuntu gave enough to the system. I would look to see if the other disk partitions are filling up. I don't like to see partitions getting near full ... more than 80% or so. "df -h" will display the disk resources in human-readable format. NOTE: my boinc directory is large for my machine, since I have multiple projects connected AND I was doing some other work there.
    df -h
    Filesystem      Size  Used Avail Use% Mounted on
    devtmpfs         16G     0   16G   0% /dev
    tmpfs            16G
ID: 88875 · Rating: 0 · rate: / Reply Quote
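
The "more than 80% or so" rule of thumb above can be checked automatically. This is just a sketch under a few assumptions: `flag_full` is a hypothetical helper name, it expects the POSIX `df -P` column layout (usage in column 5, mount point in column 6), and it will misread mount points that contain spaces.

```shell
# flag_full is a hypothetical helper: read `df -P` output on stdin and print
# every mount point whose usage is above the given percentage (default 80).
flag_full() {
    awk -v limit="${1:-80}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 > limit + 0) print $6 " is " $5 " full"
    }'
}

df -P | flag_full 80
```

No output means no filesystem is above the limit. `df -P` is used instead of `df -h` here because its fixed one-line-per-filesystem format is easier to parse.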

rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,021,655 RAC: 6,749	Message 88876 - Posted: 12 May 2018, 16:11:52 UTC - in response to Message 88873. another strange one https://boinc.bakerlab.org/workunit.php?wuid=898555819 why the server cancelled those ? (sorry, might should have made a new thread) I suspect that the "Too many total results" message on a stable machine is from the Rosetta side. Sometimes a researcher will post WU with some problems and the admins will do a "stop all his jobs" on their side to keep from wasting resources. I think they manually add credit for compute hours that were donated. Moderators will correct me if wrong. I expect that the problem will naturally drain and the machine will run smoothly. I would just track it until the WUs are running properly again. ID: 88876 · Rating: 0 · rate: / Reply Quote

Usuario1_S Send message Joined: 24 Mar 14 Posts: 92 Credit: 3,059,705 RAC: 0	Message 88965 - Posted: 21 May 2018, 15:59:31 UTC Every few days I get a Computing error on my AMD FX-8370. Maybe it is a compiler thing that won't use the Intel instruction set or expects something only Intel CPUs will give? Why would I get an error? My apps never really crash. Win 8.1 64-bit, fully updated, including drivers. ID: 88965 · Rating: 0 · rate: / Reply Quote

rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,021,655 RAC: 6,749	Message 88967 - Posted: 21 May 2018, 19:34:51 UTC - in response to Message 88965.
Quoting Usuario1_S (Message 88965): Every few days I get a Computing error on my AMD FX-8370. Maybe it is a compiler thing that won't use the Intel instruction set or expects something only Intel CPUs will give? Why would I get an error? My apps never really crash. Win 8.1 64-bit, fully updated, including drivers.
I think your machine is OK. It looks like a Rosetta researcher submitted some bad jobs. I looked at the ERROR and INVALID WUs that you got: 1 ERROR and 11 INVALID. The ERROR was an "OUT OF MEMORY" error. I have periodically seen Rosetta WUs using ~12GB of memory on my machine with 32GB of memory. IMO, that is a Rosetta memory leak or similar problem, not a problem with your machine. Every one of the INVALID WUs had a similar name of the form "number_JW-16052018-ffdtest_number_globalDocking_6_SAVE_ALL_OUT_652671_5_0", like:
    1b2be7b0a57c1dfd6afe5dac5bbef86f_JW-16052018-ffdtest_18_05_12_00_37_localDocking_8_SAVE_ALL_OUT_652692_22_0
    0f88b190960d44e73870d8f8bc8deae2_JW-16052018-ffdtest_18_05_12_04_49_globalDocking_6_SAVE_ALL_OUT_652671_5_0
I checked the Rosetta results on one of my Intel machines and found the same INVALID error on a job named "0c36992d889c7c78029a6405354254d6_JW-16052018-ffdtest_18_05_12_01_19_globalDocking_6_SAVE_ALL_OUT_652669_35_0". Seems like the group is bad.
ID: 88967 · Rating: 0 · rate: / Reply Quote
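
One way to confirm that failed WUs share a batch is to pull the batch label out of each name and count the results. This is a sketch only: `batch_tag` is a hypothetical helper name, and it assumes the names follow the pattern seen above (a hex hash, an underscore, then the batch label).

```shell
# batch_tag is a hypothetical helper: strip the leading hex hash and keep the
# next underscore-delimited field, which looks like the batch label.
batch_tag() {
    sed -E 's/^[0-9a-f]+_([^_]+)_.*/\1/'
}

# Count how many of the listed WU names fall into each batch.
printf '%s\n' \
    '1b2be7b0a57c1dfd6afe5dac5bbef86f_JW-16052018-ffdtest_18_05_12_00_37_localDocking_8_SAVE_ALL_OUT_652692_22_0' \
    '0f88b190960d44e73870d8f8bc8deae2_JW-16052018-ffdtest_18_05_12_04_49_globalDocking_6_SAVE_ALL_OUT_652671_5_0' |
    batch_tag | sort | uniq -c
```

If nearly all failures land in one batch while other batches validate normally, that points at the jobs rather than the machine.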

Usuario1_S Send message Joined: 24 Mar 14 Posts: 92 Credit: 3,059,705 RAC: 0	Message 88994 - Posted: 25 May 2018, 4:55:11 UTC - in response to Message 88967. Last modified: 25 May 2018, 4:55:32 UTC Thank you for the info, mate. Yes, I think you're right, that group is bad. A relief. Sorry for the late response. ID: 88994 · Rating: 0 · rate: / Reply Quote

Error while computing - AMD Opteron