invalid results; 24 hours wasted

Message boards : Number crunching : invalid results; 24 hours wasted

ChristianVirtual

Joined: 29 Apr 17
Posts: 5
Credit: 1,684,275
RAC: 0
Message 88879 - Posted: 13 May 2018, 7:05:24 UTC

It's really frustrating to spend 24 hours of CPU time only to have a WU invalidated:

https://boinc.bakerlab.org/workunit.php?wuid=897665706
https://boinc.bakerlab.org/workunit.php?wuid=897666513

The machine has enough RAM and storage, so that should not be the limit.
Ryzen 1700X, Ubuntu 17.10


<core_client_version>7.11.0</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_05_08_164_241__t000__1_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_05_08_164_241__t000__1_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3291164
Starting watchdog...
Watchdog active.
======================================================
DONE :: 329 starting structures 86145 cpu seconds
This process generated 329 decoys from 329 attempts
======================================================
BOINC :: WS_max 4.30068e+08

BOINC :: Watchdog shutting down...
12:01:32 (3632): called boinc_finish(0)

</stderr_txt>
]]>
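
For anyone comparing their own logs, here is a minimal sketch that pulls the key numbers out of a stderr dump like the one above. It assumes the line formats shown in this post stay the same; `summarize` and `STDERR_EXCERPT` are hypothetical names used only for illustration, and the command line is abridged to the flags the script reads.

```python
import re

# Lines copied from the stderr output above. The command line is abridged:
# only the flags read by this script are kept.
STDERR_EXCERPT = """\
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 600
DONE :: 329 starting structures 86145 cpu seconds
This process generated 329 decoys from 329 attempts
12:01:32 (3632): called boinc_finish(0)
"""


def summarize(stderr_text: str) -> dict:
    """Extract the run-time value from the command line and the reported totals."""
    target = re.search(r"-cpu_run_time (\d+)", stderr_text)
    done = re.search(r"DONE :: (\d+) starting structures (\d+) cpu seconds", stderr_text)
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", stderr_text)
    finish = re.search(r"called boinc_finish\((-?\d+)\)", stderr_text)

    cpu_seconds = int(done.group(2))
    n_decoys = int(decoys.group(1))
    return {
        "target_seconds": int(target.group(1)),
        "cpu_seconds": cpu_seconds,
        "cpu_hours": cpu_seconds / 3600,
        "decoys": n_decoys,
        "attempts": int(decoys.group(2)),
        "seconds_per_decoy": cpu_seconds / n_decoys,
        "exit_status": int(finish.group(1)),
    }


summary = summarize(STDERR_EXCERPT)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For this task it works out to about 23.9 CPU hours, against a `-cpu_run_time` value of 28800 seconds (8 hours) on the command line, with an exit status of 0. So the log itself shows a normal finish rather than a crash, which suggests the result was marked invalid on the server side.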

What can one do?
ID: 88879 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 316
Credit: 10,689,437
RAC: 15,111
Message 88881 - Posted: 13 May 2018, 12:10:09 UTC - in response to Message 88879.  

You are suffering the same fate with your Ryzen 1700X that I encountered and reported earlier with my Ryzen 1700 (Lubuntu 17.10): low and inconsistent output, as indicated by the credits, along with a higher than normal error rate.
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87833#87833
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874

And I am hardly alone. Others have reported similar problems with Ryzen. In fact, it seems to be an AMD problem in general insofar as I can tell, affecting most of their CPUs. So it may be that the Rosetta app is not compiled with an AMD optimized compiler, for example.

But Ryzen works great on WCG and all of the other projects I have tried it on, which is a lot of them, including LHC which uses VirtualBox. So I use it for WCG, which is what I built it for originally anyway.

If you want a good Rosetta machine, use Intel. And the later the Intel chip, the better. Having tried it on Ivy Bridge and Haswell, I have now found that Coffee Lake (i7-8700) gives the most consistent output (Ubuntu 18.04), though I am still in the early testing phase.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3399951&offset=0&show_names=0&state=4&appid=

My results with Ivy Bridge and Haswell may be of some interest, though the results were somewhat inconclusive. But the i7-8700 makes them irrelevant now for me.
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=12544

Use your excellent Ryzen 1700X elsewhere. Maybe Rosetta will see the light and fix their stuff someday, though I don't know that they have even looked into the problem, or even consider it a problem yet.
ID: 88881 · Rating: 0
ChristianVirtual

Joined: 29 Apr 17
Posts: 5
Credit: 1,684,275
RAC: 0
Message 88892 - Posted: 13 May 2018, 19:21:46 UTC

I think you are right; I ran an i7-8700 and a 3930 over the past few days and they had fewer problems.
I also agree that other projects like WCG have far fewer issues with Ryzen.

Too bad for Rosetta ...
ID: 88892 · Rating: 0
mmonnin

Joined: 2 Jun 16
Posts: 41
Credit: 6,328,169
RAC: 4,001
Message 88893 - Posted: 14 May 2018, 0:00:12 UTC

Quite a few teams are wrapping up the Pentathlon, a 3-day team event where Rosetta is the project.

Errors with the Rosetta app are why I select 1-hour tasks here. If a task is running for 6 hours, it'll probably error out anyway. I can always add more clients for more tasks if needed.
ID: 88893 · Rating: 0
John P. Myers

Joined: 13 Apr 10
Posts: 5
Credit: 346,265
RAC: 0
Message 88894 - Posted: 14 May 2018, 1:25:09 UTC

The issue may not be with Ryzen itself but with the number of threads it has. It seems anything with 16 or more threads was getting crazy high error rates, including Opterons and Xeons. I took my Xeon rig off of Rosetta for this exact reason about 2 hours after the project was announced due to the errors.
ID: 88894 · Rating: 0
rjs5

Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 88897 - Posted: 14 May 2018, 4:08:44 UTC - in response to Message 88894.  

John P. Myers wrote:
The issue may not be with Ryzen itself but with the number of threads it has. It seems anything with 16 or more threads was getting crazy high error rates, including Opterons and Xeons. I took my Xeon rig off of Rosetta for this exact reason about 2 hours after the project was announced due to the errors.


I was running 12 on my Linux machine and "perf top" showed that the 4.07 application's hottest code was looping in a "LOCKED" spin loop. I disassembled the binary so I could poke around. I isolated the top 5 or so hottest sections of code and they were all locked spin loops OR accessing memory following a function return. I did not see much floating point computation.
ID: 88897 · Rating: 0
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 954
Credit: 3,638,870
RAC: 1,756
Message 88904 - Posted: 14 May 2018, 15:40:25 UTC - in response to Message 88897.  
Last modified: 14 May 2018, 15:40:37 UTC

rjs5 wrote:
I was running 12 on my Linux machine and "perf top" showed that the 4.07 application's hottest code was looping in a "LOCKED" spin loop. I disassembled the binary so I could poke around. I isolated the top 5 or so hottest sections of code and they were all locked spin loops OR accessing memory following a function return. I did not see much floating point computation.


From the last Boinc PCM, David wrote:
Rosetta@home
1 developer/programmer/tester, 2 systems administrators/engineers
David Kim, Luki Goldschmidt, Patrick Vecchiato


Not a lot of people to optimize/debug the code :-(
ID: 88904 · Rating: 0
shanen

Joined: 16 Apr 14
Posts: 187
Credit: 11,742,357
RAC: 5,770
Message 88968 - Posted: 21 May 2018, 20:17:08 UTC

Sounds similar to what I'm seeing. Unfortunately at this point I don't care that much, but maybe the laid-back attitude is okay. Anyway, here are the properties of one of the sick tasks:

Application: Rosetta Mini 3.78
Name: nRoCM_new_01_P04805_group0_7_congq_SAVE_ALL_OUT_IGNORE_THE_REST_609269_3
State: Running
Received: Sat 19 May 2018 05:12:15 AM JST
Report deadline: Sun 27 May 2018 05:12:14 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 00:44:04
CPU time since checkpoint: 00:44:04
Elapsed time: 13:13:50
Estimated time remaining: ---
Fraction done: 6.107%
Virtual memory size: 451.04 MB
Working set size: 308.38 MB
Directory: slots/0
Process ID: 7829
Progress rate: 0.360% per hour
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
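
To put numbers on how sick this task is, here is a rough sketch of the arithmetic using only the task properties above. The helper `hms_to_seconds` is a hypothetical name, and the projection simply assumes the reported progress rate stays constant, which it may not.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string, as shown in the BOINC task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the task properties above.
cpu_seconds = hms_to_seconds("00:44:04")      # CPU time
elapsed_seconds = hms_to_seconds("13:13:50")  # Elapsed (wall-clock) time
fraction_done_pct = 6.107                     # Fraction done
progress_rate_pct_per_hour = 0.360            # Progress rate

# Share of wall-clock time in which the task actually got CPU time.
cpu_share = cpu_seconds / elapsed_seconds

# Hours still needed if the reported progress rate does not change (an assumption).
hours_left = (100.0 - fraction_done_pct) / progress_rate_pct_per_hour

print(f"CPU share of elapsed time: {cpu_share:.1%}")
print(f"Projected hours remaining: {hours_left:.0f} ({hours_left / 24:.1f} days)")
```

The task received CPU time for only about 5.6% of the 13+ hours it was nominally running. At the reported 0.360% per hour it would need roughly 261 more hours, close to 11 days, which is well past the 27 May report deadline.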
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 88968 · Rating: 0
ChristianVirtual

Joined: 29 Apr 17
Posts: 5
Credit: 1,684,275
RAC: 0
Message 89331 - Posted: 22 Jul 2018, 17:10:44 UTC

name rb_07_16_508_732__t000__0_C3_SAVE_ALL_OUT_IGNORE_THE_REST_682151_12503
application Rosetta
created 17 Jul 2018, 13:17:32 UTC
canonical result 1016016653
granted credit 238.10
minimum quorum 1
initial replication 1
max # of error/total/success tasks 1, 1, 1
errors Too many total results


Got another one, this time with "Too many total results". What does that mean? Is the server handing out more copies and dumping those of us who are still contributing CPU cycles?
ID: 89331 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 316
Credit: 10,689,437
RAC: 15,111
Message 89332 - Posted: 22 Jul 2018, 17:29:57 UTC
Last modified: 22 Jul 2018, 17:38:29 UTC

I am having a stroke of good luck on both Ubuntu 16.04 (i7-3770) and Win7 64-bit (i7-4771), so I don't think a high error rate is the project's fault, at least on Intel chips.
I pick up about one error a day on my Ryzen 1700, but never a long runner thus far (though I don't use it much).
https://boinc.bakerlab.org/rosetta/results.php?hostid=3421421
https://boinc.bakerlab.org/rosetta/results.php?hostid=3118747

But if you want to use Ubuntu 18.04, you have to do the fix that rjs5/juha proposed.
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88954#88954
ID: 89332 · Rating: 0


©2020 University of Washington
http://www.bakerlab.org