Minirosetta 3.73-3.78

Tero
Joined: 22 Jul 17
Posts: 1
Credit: 149,809
RAC: 184
Message 87554 - Posted: 22 Oct 2017, 13:03:22 UTC

It seems that version 3.78 broke compatibility with the Linux client. After the Minirosetta 3.78 update, tasks started to fail with "computation error". The latest version of the "regular" Rosetta works fine. I run CentOS Linux 7.3 with client 7.6.22. The error seems to be in how the new version handles files:

ERROR: in::file::zip minirosetta_database.zip does not exist!
ERROR:: Exit from: src/apps/public/boinc/minirosetta.cc line: 195

(Example workunit 853521226)

There is a database zip file, but its name is minirosetta_database_d0bf94b.zip. If I copy it to minirosetta_database.zip, I get file errors like "ERROR: ERROR: Option file open failed for: 'flags_rb_10_11_78082_120670__t000__0_C1_robetta'" (workunit 854223185). That file was present in the project folder.
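For anyone hitting the same "does not exist" error, it can help to list which database archives are actually in the project folder before copying anything. The following is only a minimal sketch, not code from Rosetta or BOINC: the function name `find_database_zips` is made up for illustration, and the directory you pass in is an assumption that differs between installations.

```python
from pathlib import Path


def find_database_zips(project_dir, expected="minirosetta_database.zip"):
    """Return (expected_present, candidates) for a BOINC project directory.

    `candidates` lists files named like minirosetta_database_<suffix>.zip,
    such as the minirosetta_database_d0bf94b.zip mentioned above.
    """
    root = Path(project_dir)
    expected_present = (root / expected).is_file()
    # The underscore before the wildcard keeps the unsuffixed name out of the list.
    candidates = sorted(p.name for p in root.glob("minirosetta_database_*.zip"))
    return expected_present, candidates


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass your own project directory as the first argument.
    present, found = find_database_zips(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("expected archive present:", present)
    print("suffixed archives found:", found)
```

This only reports what is on disk. It does not explain why the application asks for the unsuffixed name, and it does not address the later "Option file open failed" error.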
ID: 87554 · Rating: 0
planetclown
Joined: 27 Jan 12
Posts: 3
Credit: 1,468,901
RAC: 1,756
Message 87776 - Posted: 30 Nov 2017, 11:59:01 UTC
Last modified: 30 Nov 2017, 12:14:12 UTC

Hello, I'm occasionally seeing two different errors on the following apps:

    Rosetta Mini v3.78 x86_64-pc-linux-gnu
    Rosetta Mini v3.78 i686-pc-linux-gnu


I've seen it on Lubuntu and Linux Mint (both Ubuntu 16.04/Xenial) along with BOINC 7.6.31. Link to computer.

The first error is glibc detected with free(): invalid pointer

BOINC:: Worker startup. 
Starting watchdog...
Watchdog active.
*** glibc detected *** ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu: free(): invalid pointer: 0x13867fb8 ***
======= Backtrace: =========
[0xdf36941]
[0xdf3a45b]
[0xede768c]
[0xdeffb51]
[0x81630ad]
[0xd45eb92]
[0xd45ebcb]
[0xd465336]
[0xd46ca67]
[0xd46feef]
[0xd474232]
[0xd400a01]
[0xd40c69a]
[0xc9ac83d]
[0xc9ad47f]
[0xca8b53f]
[0xb08de97]
[0xb265920]
[0xb2a83b6]
[0xb29f4d2]
[0x8aaae73]
[0x8aae71d]
[0x8ab361b]
[0x8a925f9]
[0x8a65a47]
[0xb371855]
[0xb3743be]
[0xb434b13]
[0xb43119d]
[0x8a6fa23]
[0x8056303]
[0xdf0cfd8]
[0x8048131]
======= Memory map: ========
08048000-0ede4000 r-xp 00000000 08:05 1183736                            /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu
0ede4000-0edec000 rw-p 06d9c000 08:05 1183736                            /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu
0edec000-0f115000 rw-p 00000000 00:00 0 
10d45000-17e18000 rw-p 00000000 00:00 0                                  [heap]
ebd2d000-f2cd4000 rw-p 00000000 00:00 0 
f305c000-f3d64000 rw-p 00000000 00:00 0 
f4200000-f4221000 rw-p 00000000 00:00 0 
f4221000-f4300000 ---p 00000000 00:00 0 
f517e000-f517f000 ---p 00000000 00:00 0 
f517f000-f5e8f000 rw-p 00000000 00:00 0 
f5e8f000-f7667000 rw-s 00000000 08:05 1581177                            /var/lib/boinc-client/slots/11/boinc_minirosetta_11
f7667000-f7668000 ---p 00000000 00:00 0 
f7668000-f766b000 rw-p 00000000 00:00 0 
f766b000-f766d000 rw-s 00000000 08:05 1581173                            /var/lib/boinc-client/slots/11/boinc_mmap_file
f766d000-f776a000 rw-p 00000000 00:00 0 
f776a000-f776c000 r--p 00000000 00:00 0                                  [vvar]
f776c000-f776e000 r-xp 00000000 00:00 0                                  [vdso]
ffc6c000-ffc8e000 rw-p 00000000 00:00 0                                  [stack]

</stderr_txt>
]]>


The second error is SIGSEGV: segmentation violation
BOINC:: Worker startup. 
Starting watchdog...
Watchdog active.
SIGSEGV: segmentation violation
Stack trace (4 frames):
[0xde75dcf]
[0xf77ceca0]
[0xdf36358]
[0xeffb51ff]

Exiting...

</stderr_txt>
]]>


I haven't seen any errors while running Rosetta v4.06 app or other BOINC projects. Any help would be appreciated. Thank you!
ID: 87776 · Rating: 0
floyd
Joined: 26 Jun 14
Posts: 7
Credit: 2,639,248
RAC: 10,651
Message 87800 - Posted: 3 Dec 2017, 12:10:36 UTC - in response to Message 87776.  

Hello, I'm occasionally seeing two different errors on the following apps:

    Rosetta Mini v3.78 x86_64-pc-linux-gnu
    Rosetta Mini v3.78 i686-pc-linux-gnu


I've seen it on Lubuntu and Linux Mint (both Ubuntu 16.04/Xenial) along with BOINC 7.6.31. Link to computer.

Sorry to say that but your crappy Ryzen is the problem. It would be good if we had the choice to run only Rosetta tasks and not Rosetta Mini. Come on project staff, it can't be that hard to do. Every project I know allows you to choose your applications, it's probably already in the standard server code. In the meantime Ryzen users could reduce their run time to lose less time per crash, or switch to other projects.
ID: 87800 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 632
Credit: 1,820,049
RAC: 2,680
Message 87802 - Posted: 3 Dec 2017, 16:09:12 UTC - in response to Message 87800.  

Sorry to say that but your crappy Ryzen is the problem


Ryzen is crappy? Are you a troll?
ID: 87802 · Rating: 0
floyd
Joined: 26 Jun 14
Posts: 7
Credit: 2,639,248
RAC: 10,651
Message 87803 - Posted: 3 Dec 2017, 19:30:15 UTC - in response to Message 87802.  

Ryzen is crappy? Are you a troll?
Yes. No. You don't seem to own a Ryzen. I do.

Let me give some brief information about my current computers. One Ryzen 7 1700, right now showing 516 valid tasks and 60 errors. And one FX-8320E, 67 valid and 1 error. I can assure you the Ryzen behaves exactly as planetclown describes. Either the application crashes outright with a segmentation fault, or the C library kills it because it detected an invalid pointer, this way preventing a possible segfault. If you think about it there must also be cases where an invalid pointer goes unnoticed but doesn't cause a segfault. The result could be anything. I wouldn't rely on a Ryzen for something important, let's hope this project's validator is good. If you dig through the project's host list you'll find more Ryzens showing these symptoms, the most obvious running Linux, but also some Windows hosts with a high number of access violations that could be related.
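To put the task counts quoted above side by side, here is a small sketch that turns them into error rates. It uses only the numbers from this post. The helper name `error_rate` is made up for illustration, and a rate from a single host says nothing about the cause.

```python
def error_rate(valid, errors):
    """Fraction of finished tasks that ended in an error."""
    total = valid + errors
    return errors / total if total else 0.0


# (valid, errors) task counts as quoted in the post above.
hosts = {"Ryzen 7 1700": (516, 60), "FX-8320E": (67, 1)}

for name, (valid, errors) in hosts.items():
    print(f"{name}: {error_rate(valid, errors):.1%} of tasks errored")
```

That works out to roughly 10.4% for the Ryzen against about 1.5% for the FX-8320E, although the FX sample is small.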

Also as planetclown describes, the errors don't seem to happen with the new Rosetta application or at other projects, so you could be tempted to dismiss this as an application error in Rosetta Mini. But there's at least one other example of spontaneous segfaults on Ryzens. Search for "kill_ryzen" or "marginality error" and you'll find many reports on Ryzens segfaulting in a particular use case: massive parallel compiler runs on Linux. An extreme scenario, but not unrealistic, and there's no excuse for simply crashing. People there claim you're safe if you don't do that kind of thing, but without arguments, and Rosetta proves them wrong.

So there are at least two completely unrelated cases of several Ryzens segfaulting out of the blue and no valid reason to assume that's all. In other words, those things can unpredictably crash for unknown reasons and if they don't crash you still can't trust the results. Crap.
ID: 87803 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 632
Credit: 1,820,049
RAC: 2,680
Message 87805 - Posted: 4 Dec 2017, 11:16:18 UTC - in response to Message 87803.  

Either the application crashes outright with a segmentation fault, or the C library kills it because it detected an invalid pointer, this way preventing a possible segfault. If you think about it there must also be cases where an invalid pointer goes unnoticed but doesn't cause a segfault.

If you have an invalid pointer in your software, it's your problem, not a CPU problem.

Search for "kill_ryzen" or "marginality error" and you'll find many reports on Ryzens segfaulting in a particular use case: massive parallel compiler runs on Linux. An extreme scenario, but not unrealistic, and there's no excuse for simply crashing.

Problem solved months ago, with free replacements of early Ryzens and with a BIOS update (AGESA 1.0.0.6b).
ID: 87805 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 99
Credit: 3,655,197
RAC: 7,035
Message 87806 - Posted: 4 Dec 2017, 12:11:26 UTC - in response to Message 87805.  

Problem solved months ago, with free replacements of early Ryzens and with a BIOS update (AGESA 1.0.0.6b).

I purchased a Ryzen 1700 made in week 33 of 2017, so it is a fixed version. It is on an ASRock Fatal1ty X370 Gaming X motherboard with the AGESA 1.0.0.6b BIOS, and with 32 GB of Patriot DDR4 memory (15-15-15-36).

The CPU is not overclocked, and runs Ubuntu 17.10. I just started running Rosetta on 15 cores, with the other core supporting a GTX 970 on Folding. Previously, it had been running WCG for about a month with no errors, but that is too easy.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3299745

In addition to errors, I am interested in the output. These are the 24-hour work units, and I was averaging about 800 points each on an i7-3770 (7 cores, with one reserved for a GPU, also on Ubuntu) for those that ran the full 24 hours.

We will see how it goes.
ID: 87806 · Rating: 0
mmonnin
Joined: 2 Jun 16
Posts: 19
Credit: 999,653
RAC: 19,061
Message 87808 - Posted: 4 Dec 2017, 14:05:01 UTC

You can RMA segfault Zen chips.

http://www.extremetech.com/computing/254750-amd-replaces-ryzen-cpus-users-affected-rare-linux-bug
ID: 87808 · Rating: 0
floyd
Joined: 26 Jun 14
Posts: 7
Credit: 2,639,248
RAC: 10,651
Message 87813 - Posted: 4 Dec 2017, 17:20:50 UTC - in response to Message 87805.  

If you have an invalid pointer in your software, it's your problem, not a CPU problem.
I'm missing the word "because" in that sentence.

Problem solved months ago
I'm not aware of an official statement saying the problem's been identified, let alone solved. Care to give me a pointer? *giggles*

with free replacements of early Ryzens
That's not a solution, it's an emergency measure. And of course I expect it to be free. Good thing this option exists though. But in this RMA process they'll ask you to run tests and document them with photos. Believe it or not, I have no means to take photos, so no RMA for me.

and with a BIOS update (AGESA 1.0.0.6b).
AGESA 1.0.0.6b doesn't solve this. Is it even supposed to?
ID: 87813 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 632
Credit: 1,820,049
RAC: 2,680
Message 87814 - Posted: 4 Dec 2017, 17:26:01 UTC - in response to Message 87813.  
Last modified: 4 Dec 2017, 17:34:03 UTC

That's not a solution, it's an emergency measure. And of course I expect it to be free. Good thing this option exists though. But in this RMA process they'll ask you to run tests and document them with photos. Believe it or not, I have no means to take photos, so no RMA for me.


There is a radical solution: switch to Windows 10. Problem goes away :-P
Or wait for 4.06 to become the default application.
ID: 87814 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 632
Credit: 1,820,049
RAC: 2,680
Message 87815 - Posted: 4 Dec 2017, 17:40:30 UTC - in response to Message 87813.  

Problem solved months ago
I'm not aware of an official statement saying the problem's been identified, let alone solved. Care to give me a pointer? *giggles*

The new "RMA Ryzens" don't have this problem, so they found it and resolved it...
ID: 87815 · Rating: 0
floyd
Joined: 26 Jun 14
Posts: 7
Credit: 2,639,248
RAC: 10,651
Message 87831 - Posted: 5 Dec 2017, 18:29:16 UTC - in response to Message 87815.  

The new "RMA Ryzens" don't have this problem, so they found it and resolved it...
I can't agree with that conclusion. The fact that you get a "good" processor (i.e. one that passes this particular test) back only shows that those things exist. It does not prove that current processors in general are good, nor that anything has changed at all.
ID: 87831 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 99
Credit: 3,655,197
RAC: 7,035
Message 87833 - Posted: 6 Dec 2017, 3:26:21 UTC - in response to Message 87806.  
Last modified: 6 Dec 2017, 3:32:32 UTC

We will see how it goes.

I have gotten rather poor performance with the Ryzen 1700, somewhat less output per core than an i7-3770, and three errors. But I have now disabled SMT in the BIOS. There were some problems with that early on with Ryzen, and maybe Rosetta does not work well with it on AMD. So I am now running Rosetta on 7 full cores, with one core reserved for the GPU. I will run it for about two or three more days to see.
ID: 87833 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 670
Credit: 5,251,377
RAC: 7,708
Message 87863 - Posted: 8 Dec 2017, 20:11:15 UTC - in response to Message 87813.  

If you have an invalid pointer in your software, it's your problem, not a CPU problem.
I'm missing the word "because" in that sentence.

I just saw a similar problem but under Windows 10 and on an Intel CPU.

7H2LD3_51C703_fold_and_dock_SAVE_ALL_OUT_538615_1685
http://boinc.bakerlab.org/workunit.php?wuid=864346673

Rosetta Mini 3.78

64-bit Windows 10
Intel i7-5950X, 32 GB, SSD

Perhaps someone could check if it's the same problem, but under conditions much less likely to have the problem become visible.
ID: 87863 · Rating: 0
floyd
Joined: 26 Jun 14
Posts: 7
Credit: 2,639,248
RAC: 10,651
Message 87870 - Posted: 9 Dec 2017, 14:14:09 UTC - in response to Message 87863.  

I just saw a similar problem but under Windows 10 and on an Intel CPU.

7H2LD3_51C703_fold_and_dock_SAVE_ALL_OUT_538615_1685
http://boinc.bakerlab.org/workunit.php?wuid=864346673
There are many possible causes for an access violation. Your task list doesn't show any other errors, and you're unlikely ever to find out what happened in this single incident. If it doesn't happen repeatedly, just ignore it.
ID: 87870 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 99
Credit: 3,655,197
RAC: 7,035
Message 87874 - Posted: 9 Dec 2017, 20:17:14 UTC - in response to Message 87833.  
Last modified: 9 Dec 2017, 21:06:40 UTC

I have gotten rather poor performance with the Ryzen 1700, somewhat less output per core than an i7-3770, and three errors. But I have now disabled SMT in the BIOS. There were some problems with that early on with Ryzen, and maybe Rosetta does not work well with it on AMD. So I am now running Rosetta on 7 full cores, with one core reserved for the GPU. I will run it for about two or three more days to see.

After disabling SMT in the BIOS on my Ryzen 1700 machine (Ubuntu 17.10), I have obtained the following results, which are slightly complicated:
https://boinc.bakerlab.org/results.php?hostid=3299745

Good News: No more errors, with 31 work units being completed successfully. This compares with 3 errors out of 21 work units when SMT was enabled.
Bad News: The output, as measured by the credits, is still quite low even on full Ryzen cores (running 7 cores, with the other one dedicated to a GPU) when you are running only Rosetta (but see below).

And the credits are all over the place. Just considering the Rosetta mini 3.78 that ran the full 24 hours, they range from 178 to 815 (except for the last, at 1160 points), and averaged 337 points. That seems to be about the same (per core) as with SMT enabled and running Rosetta on 15 cores, so enabling SMT should at least increase the total output, even with errors.

However, in neither case is the Ryzen as good as the i7-3770 (with hyperthreading). On that machine I get no errors on 3.78, and credits average around 800 points per work unit running with 7 cores. I see no advantage to Ryzen thus far as compared to Ivy Bridge if you run only Rosetta.

But the Ryzen 1700 does much better on WCG (running mainly MCM and MIP, with a few of the others). There I get no errors, and twice the output of the i7-3770. So there is something wrong with how the Rosetta AMD app runs on Ryzen. I hope they can fix it, as I will probably be converting most of my machines to AMD eventually.

And, in another twist, the last of the Rosettas did quite well at 1160 points. That was because as I was finishing the Rosettas, I allowed the WCG work units to run. Therefore, when most of the cores were running WCG, the last Rosetta got very good points (though the very last of the 3.78 got stuck and I had to abort it).

Moral: Until they fix Rosetta to run properly on Ryzen, it would be best to mix Rosetta with something else on the majority of the cores (WCG works). You will probably need to experiment to find out what works best though.

=====================================================================================================
Work units that ran the full 24 hours (3.78 only) run with SMT disabled (running on 7 full cores):

Returned 9 Dec:
1160.19
 815.46
 187.98 	
 186.55 	

Returned 8 Dec:
 815.21
 178.20
 184.03
 182.54 	
 747.96
 796.89

Returned 7 Dec:
 184.87
 187.17 	
 186.49
 182.87 
 183.08
 183.50 	
 181.75

Ave: 337 points (excluding the last work unit at 1160 points).

NOTE: very little difference in credits per core with SMT enabled (but twice the number of cores).
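As a quick check of the figures listed above, the average can be reproduced in a few lines. This is just a sketch using the credit values from this post. The 500-point threshold used to split the results into two groups is an arbitrary choice for illustration, not something the project defines.

```python
from statistics import mean

# Credits for the 24-hour Rosetta mini 3.78 work units, in the order listed above.
credits = [
    1160.19, 815.46, 187.98, 186.55,                         # returned 9 Dec
    815.21, 178.20, 184.03, 182.54, 747.96, 796.89,          # returned 8 Dec
    184.87, 187.17, 186.49, 182.87, 183.08, 183.50, 181.75,  # returned 7 Dec
]

# Leave out the first entry (1160.19), which ran alongside WCG work units.
rosetta_only = credits[1:]
average = mean(rosetta_only)

# Arbitrary split at 500 points to show how the results cluster.
low = [c for c in rosetta_only if c < 500]
high = [c for c in rosetta_only if c >= 500]

print(f"average: {average:.0f} points over {len(rosetta_only)} work units")
print(f"{len(low)} work units below 500 points, {len(high)} at or above")
```

The average comes out at 337 points, matching the figure above. The split shows the results are not spread evenly: 12 of the 16 work units sit near 180 points and 4 sit between about 750 and 815.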


Addendum: I don't know how 4.06 Rosetta runs on Ryzen, except that the points are lower as compared to 3.78 Rosetta mini. But how it runs on an Intel chip is another matter.
ID: 87874 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 632
Credit: 1,820,049
RAC: 2,680
Message 87879 - Posted: 10 Dec 2017, 19:16:38 UTC - in response to Message 87874.  

Error after 5 hours.... 958310977
-529697949 (0xE06D7363) Unknown error code
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x740B08B2

Engaging BOINC Windows Runtime Debugger...

ID: 87879 · Rating: 0



©2017 University of Washington
http://www.bakerlab.org