Posts by mmonnin

21) Message boards : Number crunching : Rosetta 4.0+ (Message 89191)
Posted 29 Jun 2018 by mmonnin
Post:
Yeah...just abort Rosetta app tasks. Mini works fine. Project admins could fix it if they allow an app selection in user preferences.

For some reason my 2700x works fine but not my 1950x. Both are on Ubuntu 18.04.
22) Message boards : Number crunching : For the betterment of BOINC (Message 89185)
Posted 29 Jun 2018 by mmonnin
Post:
In projects like SETI where the wingman can be either a CPU or GPU, your own credit can vary. It's not really credit for work done any more when the same work is worth 100 points with one wingman and 50 with a wingman on a different type of processor, even though the same work was done. I still prefer a fixed credit for a given task size/length. Let the completion time and # of tasks per day determine which processor gets more credit overall. Much easier to compare performance as well.
23) Message boards : Number crunching : invalid results; 24 hours wasted (Message 88893)
Posted 14 May 2018 by mmonnin
Post:
Quite a few teams are ending a 3 day team event where Rosetta is the project, the Pentathlon.

Errors with the Rosetta app are why I select 1 hr tasks here. If it's running for 6 hours then it'll prob error out anyway. I can always add more clients for more tasks if needed.
24) Message boards : Number crunching : Error while computing - AMD Opteron (Message 88861)
Posted 11 May 2018 by mmonnin
Post:
Ah yeah, I saw 193 and thought it was the same as I've seen it on several posts.
25) Message boards : Number crunching : Error while computing - AMD Opteron (Message 88858)
Posted 11 May 2018 by mmonnin
Post:
Same thing reported here:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=88854#88854
26) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 88856)
Posted 11 May 2018 by mmonnin
Post:
Most of mine do as well. Same OS. There are a few that do complete but over half have this error.
27) Message boards : Number crunching : Output versus work unit size (Message 88840)
Posted 9 May 2018 by mmonnin
Post:
It'd be great if we could at least select one or both apps in preferences. I'd assume that would limit the models to an extent.

Something like gpugrid, with "Long runs" and "Short runs" wus?



There are currently two apps, Rosetta and Rosetta Mini, but no way in the preferences to select one or the other. Off the top of my head I don't recall another project that doesn't allow selecting between its apps. In the recent past, the Rosetta app has had a much higher chance of running for much longer than set in preferences and returning results. Could I not select them? No. Rosetta also had comp errors at like 5 sec on one computer with error 193. Mini was fine.

This project already allows for multiple length options.
28) Message boards : Number crunching : Output versus work unit size (Message 88831)
Posted 8 May 2018 by mmonnin
Post:
Yes, R@h is memory intensive. Any memory intensive application is potentially going to be labelled as not playing well with others. It is just how memory contention works in a system. So I don't see a specific problem with your scenario. But wanted to assure you that the developers do look at memory usage and attempt to improve the algorithms used to dial back the use of memory where possible. Also wanted to point out that you said in prior posts that R@h doesn't play well with others, which always sounds like a skirmish for resources and people often invent logic that says it is the application being aggressive, when in fact such things are controlled by the operating system. But I wanted to point out that your last post essentially now boils down to you saying that R@h doesn't play well with itself either. So, at least there is no bias on what is being impacted. As you say, L2 cache contention is going to crop up with any memory intensive application. The larger the L2 cache, the faster any memory intensive application will run.

One approach to optimizing the work on a machine is to get a mixture of work with lower memory requirements. I often suggest people attach to World Community Grid. Their projects have humanitarian and medical implications, and typically have much lower memory requirements. You can define your preference for the mixture of work using the "resource share" for each project. So, for example, for a resource share of 70% R@h and 30% WCG, you could set up R@h with a resource share of 700 and WCG with a resource share of 300. On an 8 core system, that would typically result in at least two WCG tasks running alongside 6 R@h tasks. This mix is often enough to make full use of the cores that you just suggested leaving idle.
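
As a rough illustration of the arithmetic in the resource share example above, here is a minimal Python sketch that turns shares into an expected average core split. The helper name and the simple proportional model are assumptions for illustration only; the real BOINC client balances resource share over time and does not pin a fixed number of cores to each project.

```python
def expected_core_split(shares, cores):
    """Proportional average cores per project (hypothetical helper, not a BOINC API)."""
    total = sum(shares.values())
    return {name: cores * share / total for name, share in shares.items()}


# Shares and core count taken from the example in the post above.
split = expected_core_split({"R@h": 700, "WCG": 300}, cores=8)
for name, n in split.items():
    print(f"{name}: {n:.1f} cores on average")
```

Under this assumption the 8 cores average out to about 5.6 for R@h and 2.4 for WCG, which is in line with "at least two WCG tasks running alongside 6 R@h tasks".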




I guess you are addressing me. I really don't know what the developers pay attention to. I just make my conclusions based on empirical observations. IMO, PrimeGrid is probably the project with the biggest optimization problems. They have over-tuned the code. R@H has done some simple things, but they have overlooked issues that are typically not understood by developers. What they have done is fine with me. Their design decisions determine the power cost, network traffic, disk sizes needed, .... of the machines. IMO, they can make some changes to use those resources more efficiently.

Including all the models in one binary is a design decision and it puts extra pressure on the TLB and networks.
Basing a design on a library of small functions (BOOST) causes a page of code to be read into memory so the program can execute 1 function. Loading the rest of that page is overhead, takes memory and puts pressure on the TLB.
Compiling the code with options like "-O3 -funroll-loops -finline-functions" unwinds the loops (makes code footprint larger) and inlining code puts a copy of the code in multiple places that take up multiple locations in memory, cache, ...

If a cruncher gets WU using all the same model, the machine will use memory most efficiently and ... run faster and get more credits.
If a cruncher gets WU needing 8 different models for an 8-CPU machine, the machine will run slower because the WU do not share CODE or DATA as effectively. The cruncher will be penalized for R@H's less efficient use of the caches.
As WU complete and drain, the kind of WU that the machine gets next will affect the running WU.
A WU in the first case would give them more credits than in the second case, just because of the R@H interaction.


If most of the WU use just one model, then the problem is low. If there is a lot of variation, the impact will be larger.

Again, I think what R@H is doing is fine and have zero problems with their decisions.


It'd be great if we could at least select one or both apps in preferences. I'd assume that would limit the models to an extent.

Is it that hard to take an existing functioning app as a baseline and put any new models/code into a separate app instead of piling it all into one app? Some projects have many apps that do different things. PrimeGrid has different algorithms (right term?) to find primes set up as different kinds of apps. Maybe then we wouldn't have like a gig download for two apps plus the task files.
29) Message boards : Number crunching : Rosetta 4.0+ (Message 88830)
Posted 8 May 2018 by mmonnin
Post:
Just had over 200 of these fail. So many in such quick succession that my BoincTasks client lost connection while they were all having computation errors. It finally came back with 4 running. Rosetta only, Mini seems OK.
1950x, 32gb RAM, 500gb M.2 running 18.04. A 2700x also on 18.04 is running both apps just fine.
30) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 88319)
Posted 19 Feb 2018 by mmonnin
Post:
Are you sure that 1 hour is even an allowed value for CPU time? I haven't checked lately, but 3 hours used to be the lowest allowed value.

3hrs used to be the default, but 1hr was (and still is) the minimum allowed.

I agree the 1hr option should be removed. And with so many multi-core processors out there, the minimum should probably be 3hrs. 2hrs is also a current option.


Because tasks are running for 6 hours and then having computation errors, it's better to have the 1 hour task. Then once it reaches 2-3 hours it's known to be bad and can be manually aborted instead of wasting a full 6 hours.
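
The rule of thumb above (target 1 hour, treat 2-3 hours as bad) could be sketched in Python roughly like this. The function name, the task names, the elapsed times and the 2x factor are all made up for illustration; nothing here talks to a real BOINC client, and aborting would still be done by hand.

```python
def is_overrunning(elapsed_hours, target_hours, factor=2.0):
    """True once a task has run at least `factor` times its target run time."""
    return elapsed_hours >= factor * target_hours


# Hypothetical elapsed times in hours for tasks with a 1 hour target.
tasks = {"task_a": 0.9, "task_b": 2.5, "task_c": 1.1}
to_abort = [name for name, elapsed in tasks.items()
            if is_overrunning(elapsed, target_hours=1.0)]
print(to_abort)
```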
31) Message boards : Number crunching : Rosetta 4.0+ (Message 88083)
Posted 17 Jan 2018 by mmonnin
Post:
Rosetta 4.06 needs to die. So many just run and run until they abort themselves.
32) Message boards : Number crunching : no new tasks (Message 87985)
Posted 2 Jan 2018 by mmonnin
Post:
I had not received any for quite a few hours today but just got some a minute ago.
33) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 87871)
Posted 9 Dec 2017 by mmonnin
Post:
Now I'm just aborting tasks that aren't staying on track with the others. Lots of wasted hours now.
34) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 87865)
Posted 9 Dec 2017 by mmonnin
Post:
several Rosetta 4.06 units failing with this error:

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.06_x86_64-pc-linux-gnu @9res_cis_hydrophobic_nmethyl_c.103.1_1_0001.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2224394
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43533.6s, 14400s + 28800s[2017-12- 6 12:16:35:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43533.6 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
12:16:35 (64706): called boinc_finish(0)
pure virtual method called
terminate called without an active exception

</stderr_txt>
]]>


I have had several of these as well. They take much longer and eventually get a comp error.
https://boinc.bakerlab.org/workunit.php?wuid=864077368
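
For anyone reading the log above, this small Python sketch pulls the numbers out of the "CPU time" line and compares them. It assumes the 28800s is the `-cpu_run_time` from the command line and that the 14400s is an extra allowance added on top before the watchdog steps in; that reading, and the names `grace` and `target`, are my assumptions and are not confirmed anywhere in the log.

```python
import re

# Line copied from the stderr output quoted above.
LINE = "BOINC:: CPU time: 43533.6s, 14400s + 28800s[2017-12- 6 12:16:35:] :: BOINC"


def parse_cpu_line(line):
    """Return (used, grace, target) in seconds, or None if the line does not match."""
    m = re.search(r"CPU time: ([\d.]+)s, (\d+)s \+ (\d+)s", line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), int(m.group(3))


used, grace, target = parse_cpu_line(LINE)
limit = grace + target           # assumed limit: extra allowance + requested run time
overrun = used - limit
print(f"limit {limit}s, used {used}s, over by {overrun:.1f}s")
```

With the values in the log, the task used a few minutes more CPU time than 14400s + 28800s, which fits the "take much longer and eventually get a comp error" pattern.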
35) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 87816)
Posted 4 Dec 2017 by mmonnin
Post:
40 valid tasks and 37 in progress. Seems fine.
36) Message boards : Number crunching : Minirosetta 3.73-3.78 (Message 87808)
Posted 4 Dec 2017 by mmonnin
Post:
You can RMA segfault Zen chips.

http://www.extremetech.com/computing/254750-amd-replaces-ryzen-cpus-users-affected-rare-linux-bug
37) Message boards : Number crunching : Only get 2 workunits actuallly (Message 87794)
Posted 2 Dec 2017 by mmonnin
Post:
Hi there,

Since yesterday I only get 2 workunits instead of the usual 7. I did not change the settings on my PC; I only changed the settings on the homepage from 8-hour units to 4-hour units, since I don't run the PC that often right now and don't want to have them sent in too late. Anyway, is there a shortage or something like that?

Edit: found that thread actually: http://boinc.bakerlab.org/rosetta/forum_thread.php?id=12333

Seems it's a shortage, so nvm and Merry Christmas!


For reference, every BOINC project has a server status page that will show how many tasks are available to send. At Rosetta it is at the top under the Computing menu.
38) Message boards : Number crunching : rosetta, minirosetta_android, and rosetta_android version 4.0+ (Message 87785)
Posted 2 Dec 2017 by mmonnin
Post:
And again only 10k tasks for Android and nothing else.
39) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 87771)
Posted 29 Nov 2017 by mmonnin
Post:
I see about 10k Mini tasks now. Another 10k for Android.
40) Message boards : Cafe Rosetta : Kings Distributed Systems - Alpha Registration (Message 87762)
Posted 28 Nov 2017 by mmonnin
Post:
aL9N3Y4rh304bJwDhW0vMieneX2KrrM2





©2022 University of Washington
https://www.bakerlab.org