Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,324,299
RAC: 16,484
Message 93042 - Posted: 2 Apr 2020, 9:24:03 UTC - in response to Message 93039.  

I'm saying it doesn't look productive because the decoys are taking approximately 4 to 6 times longer to process.
Whereas the half dozen or so Tasks i've processed so far with the new application actually got more work done in 8 hours than the previous applications did in the same time.
Early days yet, with not much work to actually see how things are going.
Grant
Darwin NT
ID: 93042 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,324,299
RAC: 16,484
Message 93044 - Posted: 2 Apr 2020, 9:46:17 UTC
Last modified: 2 Apr 2020, 9:56:40 UTC

Definitely an issue with Rosetta Mini target run times since the rollout of the new applications.

Finally managed to pick up some new work: a whole 2 Tasks.
1 Rosetta Mini and 1 Rosetta 4.12. Rosetta 4.12 is on track for the target CPU time (8hrs). Rosetta Mini is on track for double the target CPU time (16hrs).
This has only started since the new applications were released (as far as I can tell; did they also do a fix at the same time for the Rosetta Mini Tasks that were paying next to nothing?)


Edit-
Finally finished a few of these longer running Rosetta Minis and i've decided this isn't really a problem at all. While the Tasks take twice as long to process, they pay out 4 times more Credit than they usually do.
I can live with that.
Grant
Darwin NT
ID: 93044 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 93045 - Posted: 2 Apr 2020, 10:10:04 UTC - in response to Message 93039.  

The tasks in progress figure is incorrect. I reset the project twice this week due to multiple downloads failing, so they aren't really there, as discussed in a different thread.

I remember now - apologies. That's around 2400 'ghost' tasks showing that aren't there for you to run.
They'll be part of the 900k+ tasks shown as in progress that aren't running down in spite of no/little work for days.

I'm saying it doesn't look productive because the decoys are taking approximately 4 to 6 times longer to process. If you watch the graphics, it gets to a certain number of steps and then almost stops, taking 30-60 minutes for each additional step.

Half last night before I went to bed stopped at step 24600, then took 30 mins to do step 24601 etc.

Something else I missed. I don't look at the graphics to see how things are running. I'm not sure it reflects anything too much about the task - just a show. Models, yes, but not steps. Maybe I'm wrong.

Ignore me.
ID: 93045 · Rating: 0

strongboes
Joined: 3 Mar 20
Posts: 27
Credit: 5,394,270
RAC: 0
Message 93046 - Posted: 2 Apr 2020, 10:22:20 UTC - in response to Message 93045.  
Last modified: 2 Apr 2020, 10:28:39 UTC

The first of the new tasks has just finished. It took 4 hours to run the 1 decoy for me; these were definitely running under an hour previously. If you have your runtime set to 4 hours you won't really notice the difference in time, but I'm more concerned with the actual work being done by the program. If points are an accurate indication, then with 4.07 I was running at an average of 300pts per hour per core. This just-finished task has returned 300 points in 4 hours, which ties in with my thinking that they are not running efficiently.

Is there a mod reading who can make a comment?

edit, there are 60 of these now finishing so plenty to look at https://boinc.bakerlab.org/rosetta/result.php?resultid=1138591491
ID: 93046 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 93047 - Posted: 2 Apr 2020, 10:27:53 UTC - in response to Message 93041.  

I can give a little further info also: my CPU is currently 99% utilised. BOINC is running 60 cores, 2 are running GPUs for folding, and 2 are spare for overhead. Normally, when BOINC is running with all cores loaded, the clock speed is approx 3.2GHz and it will pull as many watts as I let it (doubling the power with an overclock only gets me to 3.55GHz). At the moment it's pulling 15% less power, and the clock speed is up at 4.2GHz for all cores. If each core was being run hard it would be impossible for it to run at this speed. This is the speed it normally runs at with, say, 3 or 4 cores loaded.

Imo 4.12 is not making use of the CPU properly. It's taking 4-6 times longer to complete a decoy, which ties in with the fact that my CPU is running at a very high clock speed, which indicates the cores are doing very little work.

That's very interesting. I also overclock on an FX8370 Piledriver, but prevent any throttling so my base 4.3GHz is running at 4.768GHz, which is surprisingly stable.
But I know 8 cores is no comparison to 60-64 cores. I pay attention to people with Ryzens because that's where I'm likely to go next once I've run this one into the ground.
ID: 93047 · Rating: 0

strongboes
Joined: 3 Mar 20
Posts: 27
Credit: 5,394,270
RAC: 0
Message 93048 - Posted: 2 Apr 2020, 10:31:12 UTC - in response to Message 93047.  

This is the 3990x so 64 cores, 128 threads, I've turned off smt so only running the 64 so as to give more l3 cache per core which allows the tasks to progress very rapidly. It's a fantastic chip.
ID: 93048 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 93050 - Posted: 2 Apr 2020, 10:41:48 UTC - in response to Message 93048.  

This is the 3990x so 64 cores, 128 threads, I've turned off smt so only running the 64 so as to give more l3 cache per core which allows the tasks to progress very rapidly. It's a fantastic chip.

Noted, ta. But also aware you've gone for top spec.
I'm more likely to look at 3700/3800 for the price/performance tbh. Cost is an issue. But I'll reassess at the point I'm ready to buy. Not even sure I could afford your RAM, let alone your MB/CPU!
ID: 93050 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 93051 - Posted: 2 Apr 2020, 10:48:41 UTC - in response to Message 93046.  

the first of the new tasks has just finished, took 4 hours to run the 1 decoy for me, these were definitely running under an hour previously. If you have your runtime to 4 hours you wont really notice the difference in time, but i'm more concerned with the actual work being done by the program. If points are an accurate indication then with 4.07 I was running at an average of 300pts per hour per core, this just finished task has returned 300 points in 4 hours, which ties in with my thinking they are not running efficiently.

Is there a mod reading who can make a comment?

edit, there are 60 of these now finishing so plenty to look at https://boinc.bakerlab.org/rosetta/result.php?resultid=1138591491

Are you sure? Looks more like 75/core/hr in the past to me. Sometimes 50.
Also, new versions take a little while to get their scoring sorted out iirc. Looks like it started at 150/4hrs and has risen to nearer 300 now. But this isn't my strong suit.

Anyway, I only chimed in because I'd be happy with 8 or 16 WUs atm. 11 now here on my 8-core but still nothing for my 2 4-core machines. 60 would be a dream.
ID: 93051 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,324,299
RAC: 16,484
Message 93052 - Posted: 2 Apr 2020, 10:49:10 UTC - in response to Message 93044.  

Edit-
Finally finished a few of these longer running Rosetta Minis and i've decided this isn't really a problem at all. While the Tasks take twice as long to process, they pay out 4 times more Credit than they usually do.
I can live with that.
Well, it was nice while it lasted.
Gone from 4 times as much down to 2 times as much- so back on par with Tasks that run for normal Target times.
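
The comparison here is really about Credit per hour rather than Credit per Task. A rough sketch of that arithmetic, using only the multiples mentioned in this post (the function name is made up for illustration):

```python
def relative_credit_rate(credit_multiple, runtime_multiple):
    """Credit per hour relative to a normal Task (1.0 means on par)."""
    return credit_multiple / runtime_multiple

# Twice the runtime for 4 times the Credit: double the usual hourly rate.
early = relative_credit_rate(credit_multiple=4, runtime_multiple=2)

# Twice the runtime for 2 times the Credit: back on par with normal Tasks.
later = relative_credit_rate(credit_multiple=2, runtime_multiple=2)

print(early, later)
```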
Grant
Darwin NT
ID: 93052 · Rating: 0

nastasache
Joined: 24 Feb 07
Posts: 16
Credit: 171,383
RAC: 0
Message 93053 - Posted: 2 Apr 2020, 11:00:46 UTC - in response to Message 92687.  

Thanks a lot, Robert

I changed all to use 99% of RAM (was 90% by default and 50% for the other). And 1% of swap.
There seem to be no out-of-memory errors for now, but memory usage stays as before.

For 12 tasks, the total memory usage is about 6GB. It looks like R@H is using less memory per task than the maximum available to a 32-bit app.

Here is a task with max mem usage:

Application: Rosetta 4.12
Name: 4dy3ga3h_jhr_design1_COVID-19_SAVE_ALL_OUT_903392_1
State: Running
Received: 2020-04-01 21:33:01
Report deadline: 2020-04-09 21:33:00
Estimated computation size: 80,000 GFLOPs
CPU time: 08:11:40
CPU time since checkpoint: 00:04:37
Elapsed time: 15:34:17
Estimated time remaining: 2d 05:56:33
Fraction done: 22.400%
Virtual memory size: 1.12 GB
Working set size: 1.14 GB
Directory: slots/2
Process ID: 14460
Progress rate: 2.520% per hour
Executable: rosetta_4.12_windows_intelx86.exe


Btw, a task takes about 2-3 days to finish, against an initial estimate of 4 hours; is that normal?
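
For what it's worth, the "Estimated time remaining" in the listing is close to what a plain linear extrapolation from "Elapsed time" and "Fraction done" gives (about 54 hours). The sketch below only illustrates that arithmetic; it is an assumption that the displayed figure is derived this way, not a statement of how the BOINC client computes it, and the function names are made up.

```python
def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def estimate_remaining_seconds(elapsed_seconds, fraction_done):
    """Linear extrapolation: assume the rest runs at the average rate so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds * (1 - fraction_done) / fraction_done

elapsed = hms_to_seconds("15:34:17")                    # Elapsed time
remaining = estimate_remaining_seconds(elapsed, 0.224)  # Fraction done 22.400%

# Displayed estimate, 2d 05:56:33, converted to seconds for comparison.
displayed = 2 * 86400 + hms_to_seconds("05:56:33")

print(f"extrapolated: {remaining / 3600:.1f} h, displayed: {displayed / 3600:.1f} h")
```

If the fraction done is not advancing evenly over the run, an extrapolation like this can be well off, so the estimate is best treated as a rough guide.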

Iulian
ID: 93053 · Rating: 0

strongboes
Joined: 3 Mar 20
Posts: 27
Credit: 5,394,270
RAC: 0
Message 93054 - Posted: 2 Apr 2020, 11:10:49 UTC - in response to Message 93051.  

See below. There are no 4.07 tasks left showing (there were 9000 yesterday, only 400 today). The mini was taking around an hour, but it gives an idea. The 4.07 tasks were averaging a 40 min runtime, with a rate of 1 credit for 11.5 secs of runtime on average. 3600/11.5 = 313

The last 4.12 is running at 1 credit for 59.95 seconds of runtime. 4.7* slower

https://boinc.bakerlab.org/rosetta/results.php?hostid=3800945&offset=340&show_names=0&state=4&appid=
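
A minimal sketch of the rate arithmetic in this post, using only the two seconds-per-credit figures quoted here (the function name is made up). With exactly those two figures the ratio comes out nearer 5.2 than 4.7, so the 4.7 may have been taken from a slightly different sample.

```python
def credits_per_hour(seconds_per_credit):
    """Convert 'seconds of runtime per credit' into credits per hour."""
    return 3600 / seconds_per_credit

rate_407 = credits_per_hour(11.5)    # 4.07 average quoted in this post
rate_412 = credits_per_hour(59.95)   # most recent 4.12 task quoted in this post

# How many times slower 4.12 earns credit than 4.07 on these figures.
slowdown = rate_407 / rate_412

print(f"4.07: {rate_407:.0f}/hr  4.12: {rate_412:.0f}/hr  ratio: {slowdown:.1f}x")
```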
ID: 93054 · Rating: 0

JoshuaScholar
Joined: 26 Mar 20
Posts: 18
Credit: 232,183
RAC: 0
Message 93058 - Posted: 2 Apr 2020, 12:03:18 UTC
Last modified: 2 Apr 2020, 12:08:04 UTC

I know this affects so few people that it won't matter much but:
I have an older 2-socket Xeon system (Sandy Bridge era E5-2690s).

Let me tell you what DOESN'T work properly with the Windows client on my Windows 10 Pro setup:
1) NUMA.
Having two sockets, the most common way to run Windows is with each processor preferentially accessing the memory that's attached directly to it. This is called NUMA, and it's slightly faster.
But with NUMA enabled, the client picks the proper number of threads as if it's going to use both sockets, but then it runs all of the threads on only ONE of the sockets.

2) Hyperthreading with NUMA off. [NUMA off is called "uniform memory access", by the way.] With NUMA off and Hyperthreading enabled, the client creates the right number of threads for using both sockets BUT it allocates both threads to the SAME hyperthread in each core. So each core has one empty hyperthread and one hyperthread shared by two threads.

So on this old 2-socket Xeon system running Windows 10 Pro, the only efficient way to run the BOINC client is to turn off NUMA and also turn off hyperthreading.

Then it works properly.

On a machine this old, on a highly parallel workload, turning off hyperthreading is about a 20% throughput hit. On a newer processor it would be a greater hit.

I'm not sure if there's any real hit to turning off NUMA, but it isn't a big one.
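
To make the two placement problems above concrete, here is a toy model only. It assumes 2 sockets x 8 cores x 2 hyperthreads, that logical CPU n sits on core n // 2, and that logical CPUs 0-15 are the first socket. None of that is read from Windows or from the BOINC client; the numbering and the function name are purely illustrative.

```python
from collections import Counter

THREADS_PER_CORE = 2  # hyperthreading on

def placement_summary(logical_cpu_per_thread, total_logical_cpus):
    """Summarise how worker threads land on logical CPUs (hyperthreads).

    Assumes logical CPU n belongs to core n // THREADS_PER_CORE, which is
    an illustrative numbering, not something queried from the OS.
    """
    load = Counter(logical_cpu_per_thread)
    return {
        "idle_hyperthreads": total_logical_cpus - len(load),
        "shared_hyperthreads": sum(1 for n in load.values() if n > 1),
        "cores_used": len({cpu // THREADS_PER_CORE for cpu in load}),
    }

total = 2 * 8 * THREADS_PER_CORE  # 32 logical CPUs in this toy layout

# What you would want: one worker thread per hyperthread.
ideal = placement_summary(list(range(total)), total)

# Case 2 above: both threads of a core end up on the same hyperthread.
same_hyperthread = placement_summary([(t // 2) * 2 for t in range(total)], total)

# Case 1 above: all 32 threads packed onto the first socket's 16 logical CPUs.
one_socket = placement_summary([t % 16 for t in range(total)], total)

for name, result in [("ideal", ideal), ("same hyperthread", same_hyperthread),
                     ("one socket", one_socket)]:
    print(name, result)
```

In this model both problem cases leave half the hyperthreads idle, and the one-socket case also leaves half the physical cores unused.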

Josh Scholar
ID: 93058 · Rating: 0

nastasache
Joined: 24 Feb 07
Posts: 16
Credit: 171,383
RAC: 0
Message 93059 - Posted: 2 Apr 2020, 12:06:36 UTC

Hi especially @Grant (SSSF)

Where am I going wrong?
I need twice as long to finish the tasks and get only 50% of the GFLOPS on a similar i7-8700K CPU.

Compare:
- https://boinc.bakerlab.org/rosetta/host_app_versions.php?hostid=3933928
- https://boinc.bakerlab.org/rosetta/host_app_versions.php?hostid=3914491

Thanks in advance.
ID: 93059 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 93061 - Posted: 2 Apr 2020, 12:21:07 UTC - in response to Message 93039.  

strongboes,

[snip]

I'm saying it doesn't look productive because the decoys are taking approximately 4 to 6 times longer to process. If you watch the graphics, it gets to a certain number of steps and then almost stops, taking 30-60 minutes for each additional step.

Half last night before I went to bed stopped at step 24600, then took 30 mins to do step 24601 etc.

So that's what I mean, it is taking 4-6 times longer to process the same work, so it appears.

The latest batch which are rb 04 01 20235 19963 ab t000 robetta cstwt... Are currently on 2 hours 49, 56% on first decoy. Looks like 5hrs to run. 4.07 was running very similar tasks under an hour.

You are assuming that each decoy does an equal amount of work, and that each step does an equal amount of work. I don't expect that to be true.

Generally, the first decoy is only for checking that your computer works correctly and is the same every time. The second decoy starts the useful work.
ID: 93061 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 93063 - Posted: 2 Apr 2020, 12:34:41 UTC

One thing to watch for when using CPUs with especially high numbers of cores - the bandwidth from the CPU to the memory may not be adequate to run all of the cores very well. This could leave each core in use waiting for access to memory most of the time.

If so, it can be useful to reduce the number of cores BOINC is allowed to use and see if that speeds up the work enough to more than compensate for fewer cores in use.
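
A rough way to judge that trade-off is to compare whole-machine output before and after the change. Since these tasks run to a target CPU time, a per-core rate such as credit (or decoys) per core-hour is probably a better yardstick than tasks completed. The numbers below are hypothetical, for illustration only, and the function name is made up.

```python
def total_credit_per_hour(cores_in_use, credit_per_core_hour):
    """Whole-machine output when every core in use earns at the same rate."""
    return cores_in_use * credit_per_core_hour

# Hypothetical measurements, not taken from any real host.
all_cores = total_credit_per_hour(cores_in_use=64, credit_per_core_hour=75)
fewer_cores = total_credit_per_hour(cores_in_use=48, credit_per_core_hour=110)

# Fewer cores only win if the per-core rate rises by more than the core count falls.
break_even_rate = 75 * 64 / 48  # per-core rate needed at 48 cores just to match 64

print(all_cores, fewer_cores, break_even_rate)
```

With these made-up figures the 48-core setup comes out ahead, but only because its per-core rate (110) is above the break-even rate of 100.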
ID: 93063 · Rating: 0

JoshuaScholar
Joined: 26 Mar 20
Posts: 18
Credit: 232,183
RAC: 0
Message 93066 - Posted: 2 Apr 2020, 12:42:22 UTC - in response to Message 93063.  

That might be because of the bugs I noticed.
Make sure that every thread is really allocated its own hyperthread, because BOINC doesn't leave it up to the OS.
ID: 93066 · Rating: 0

strongboes
Joined: 3 Mar 20
Posts: 27
Credit: 5,394,270
RAC: 0
Message 93071 - Posted: 2 Apr 2020, 12:48:24 UTC - in response to Message 93063.  

One thing to watch for when using CPUs with especially high numbers of cores - the bandwidth from the CPU to the memory may not be adequate to run all of the cores very well. This could leave each core in use waiting for access to memory most of the time.

If so, it can be useful to reduce the number of cores BOINC is allowed to use and see if that speeds up the work enough to more than compensate for fewer cores in use.


If you read previous posts you will see that I'm not hyperthreading and have a large L3 cache and RAM. I tried running just 10 cores also. It isn't that: they run roughly 4 times slower than 4.07 if they start with rb. It will be obvious soon enough.
ID: 93071 · Rating: 0

JoshuaScholar
Joined: 26 Mar 20
Posts: 18
Credit: 232,183
RAC: 0
Message 93072 - Posted: 2 Apr 2020, 12:51:00 UTC - in response to Message 93071.  
Last modified: 2 Apr 2020, 13:10:19 UTC

Oh you're right.
I just looked at my task list.
Time per WU has jumped from 8 hours to 16 hours!
The cores are running cooler than with the last version too, which suggests a bottleneck.
Note 2: I just noticed that the most recent few are fast again.
Maybe there was just a run of WUs for a harder problem.
ID: 93072 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 93074 - Posted: 2 Apr 2020, 13:25:00 UTC

A typical cause here for harder problems is larger proteins.
ID: 93074 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 93077 - Posted: 2 Apr 2020, 14:25:26 UTC - in response to Message 93054.  

see below, there are no 4.07 tasks left showing, there was 9000 yesterday only 400 today, the mini was taking around an hour but gives an idea. the 4.07 were averaging a 40 min runtime, with a rate of 1 credit for 11.5 secs of runtime on average. 3600/11.5 = 313

The last 4.12 is running at 1 credit for 59.95 seconds of runtime. 4.7* slower

https://boinc.bakerlab.org/rosetta/results.php?hostid=3800945&offset=340&show_names=0&state=4&appid=

I didn't look back that far earlier. What I notice now is that starting today, 2-Apr, the scoring for mini-Rosetta has plunged to 75/hr, down from 300/hr, and 4.12 tasks are at 300/4hr, so 75/hr too.

It looks like something has happened to <all> scoring from today - a step change down - but consistent between the two on validation. Very odd.
ID: 93077 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org