Problems and Technical Issues with Rosetta@home

Sid Celery
Joined: 11 Feb 08
Posts: 2141
Credit: 41,525,460
RAC: 10,413
Message 93051 - Posted: 2 Apr 2020, 10:48:41 UTC - in response to Message 93046.  

The first of the new tasks has just finished; it took 4 hours to run the 1 decoy for me, and these were definitely running under an hour previously. If you have your runtime set to 4 hours you won't really notice the difference in time, but I'm more concerned with the actual work being done by the program. If points are an accurate indication, then with 4.07 I was running at an average of 300 pts per hour per core; this just-finished task has returned 300 points in 4 hours, which ties in with my thinking that they are not running efficiently.

Is there a mod reading who can make a comment?

edit, there are 60 of these now finishing so plenty to look at https://boinc.bakerlab.org/rosetta/result.php?resultid=1138591491

Are you sure? Looks more like 75/core/hr in the past to me. Sometimes 50
Also, new versions take a little while to get their scoring sorted out iirc. Looks like it started at 150/4hrs and risen to nearer 300 now. But this isn't my strong suit.

Anyway, I only chimed in because I'd be happy with 8 or 16 WUs atm. 11 now here on my 8-core but still nothing for my 2 4-core machines. 60 would be a dream
ID: 93051
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1725
Credit: 18,382,444
RAC: 19,446
Message 93052 - Posted: 2 Apr 2020, 10:49:10 UTC - in response to Message 93044.  

Edit-
Finally finished a few of these longer running Rosetta Minis and I've decided this isn't really a problem at all. While the Tasks take twice as long to process, they pay out 4 times more Credit than they usually do.
I can live with that.
Well, it was nice while it lasted.
Gone from 4 times as much down to 2 times as much- so back on par with Tasks that run for normal Target times.
Grant
Darwin NT
ID: 93052
nastasache
Joined: 24 Feb 07
Posts: 16
Credit: 171,383
RAC: 0
Message 93053 - Posted: 2 Apr 2020, 11:00:46 UTC - in response to Message 92687.  

Thanks a lot, Robert

I changed all of them to use 99% of RAM (it was 90% by default and 50% for the other setting), and 1% of swap.
So far there seem to be no out-of-memory errors, but memory usage stays the same as before.

For 12 tasks, the total memory usage is about 6 GB. It looks like R@H is using less memory per task than the maximum available to a 32-bit app.

Here is a task with max mem usage:

Application: Rosetta 4.12
Name: 4dy3ga3h_jhr_design1_COVID-19_SAVE_ALL_OUT_903392_1
State: Running
Received: 2020-04-01 21:33:01
Report deadline: 2020-04-09 21:33:00
Estimated computation size: 80,000 GFLOPs
CPU time: 08:11:40
CPU time since checkpoint: 00:04:37
Elapsed time: 15:34:17
Estimated time remaining: 2d 05:56:33
Fraction done: 22.400%
Virtual memory size: 1.12 GB
Working set size: 1.14 GB
Directory: slots/2
Process ID: 14460
Progress rate: 2.520% per hour
Executable: rosetta_4.12_windows_intelx86.exe


Btw, a task takes about 2-3 days to finish, against an initial estimate of 4 hours; is that normal?
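As a rough cross-check of the figures in the task listing above, here is a minimal Python sketch that extrapolates the remaining time from the elapsed time and the fraction done. It assumes a simple linear extrapolation, which may not be exactly how the BOINC client computes its estimate, and the function names are made up for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def remaining_seconds(elapsed_s: float, fraction_done: float) -> float:
    """Linear extrapolation: assume the rest runs at the same average pace."""
    total_s = elapsed_s / fraction_done
    return total_s - elapsed_s


elapsed = to_seconds("15:34:17")               # "Elapsed time" from the listing
remaining = remaining_seconds(elapsed, 0.224)  # "Fraction done" of 22.400%

days, rest = divmod(int(remaining), 86400)
print(f"about {days}d {rest / 3600:.1f}h remaining")
```

With these inputs the extrapolation lands within a few seconds of the 2d 05:56:33 shown in the listing, so the long estimate appears to follow from the slow progress so far rather than from the initial 4-hour figure.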

Iulian
ID: 93053
strongboes
Joined: 3 Mar 20
Posts: 27
Credit: 5,394,270
RAC: 0
Message 93054 - Posted: 2 Apr 2020, 11:10:49 UTC - in response to Message 93051.  

See below. There are no 4.07 tasks left showing; there were 9000 yesterday and only 400 today. The mini was taking around an hour, but it gives an idea. The 4.07 tasks were averaging a 40 min runtime, at a rate of 1 credit per 11.5 secs of runtime on average: 3600/11.5 = 313 credits per hour.

The last 4.12 is running at 1 credit for 59.95 seconds of runtime. 4.7* slower

https://boinc.bakerlab.org/rosetta/results.php?hostid=3800945&offset=340&show_names=0&state=4&appid=
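The arithmetic in this post can be written out as a small Python sketch. It uses only the two seconds-per-credit figures quoted above; the function name is hypothetical, and credit is taken at face value as a proxy for work done, which other posters in the thread question.

```python
def credits_per_hour(seconds_per_credit: float) -> float:
    """Convert 'seconds of runtime per credit' into credits per hour."""
    return 3600 / seconds_per_credit


rate_407 = credits_per_hour(11.5)    # 4.07 average quoted in the post
rate_412 = credits_per_hour(59.95)   # latest 4.12 task quoted in the post
slowdown = rate_407 / rate_412

print(f"4.07: {rate_407:.0f} credits/hour")
print(f"4.12: {rate_412:.0f} credits/hour")
print(f"ratio: {slowdown:.1f}x")
```

On these two figures alone the ratio comes out nearer 5.2 than 4.7, so the 4.7 presumably came from slightly different averages; either way the gap is large.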
ID: 93054
JoshuaScholar
Joined: 26 Mar 20
Posts: 18
Credit: 232,183
RAC: 0
Message 93058 - Posted: 2 Apr 2020, 12:03:18 UTC
Last modified: 2 Apr 2020, 12:08:04 UTC

I know this affects so few people that it won't matter much but:
I have an older 2 socket Xeon system (Sandy Bridge era e5-2690s).

Let me tell you what DOESN'T work properly with the Windows client on my Windows 10 pro setup:
1) NUMA.
Having two sockets, the most common way to run Windows is with each processor preferentially accessing the memory that's attached directly to it. This is called NUMA, and it's slightly faster.
But with NUMA enabled, the client picks the proper number of threads as if it's going to use both sockets, but then it runs all of the threads on only ONE of the sockets.

2) Hyperthreading with NUMA off. [NUMA off is called "uniform memory access", by the way.] With NUMA off and Hyperthreading enabled, the client creates the right number of threads for using both sockets BUT it allocates both threads to the SAME hyperthread in each core. So each core has one empty hyperthread and one hyperthread shared by two threads.

So on this old 2 socket Xeon system running Windows 10 pro, the only efficient way to run the BOINC client is to turn off NUMA and also turn off hyperthreading.

Then it works properly.

On a machine this old, on a highly parallel workload, turning off hyperthreading is about a 20% throughput hit. On a newer processor it would be a greater hit.

I'm not sure if there's any real hit to turning off NUMA, but it isn't a big one.

Josh Scholar
ID: 93058
nastasache
Joined: 24 Feb 07
Posts: 16
Credit: 171,383
RAC: 0
Message 93059 - Posted: 2 Apr 2020, 12:06:36 UTC

Hi especially @Grant (SSSF)

Where am I going wrong?
I need 2x more time to finish the tasks and get only 50% of the GFLOPS on a similar i7-8700K CPU.

Compare:
- https://boinc.bakerlab.org/rosetta/host_app_versions.php?hostid=3933928
- https://boinc.bakerlab.org/rosetta/host_app_versions.php?hostid=3914491

Thanks in advance.
ID: 93059
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 2,014
Message 93061 - Posted: 2 Apr 2020, 12:21:07 UTC - in response to Message 93039.  

strongboes,

[snip]

I'm saying it doesn't look productive because the decoys are taking approximately 4 to 6 times longer to process. If you watch the graphics, it gets to a certain number of steps and then almost stops, taking 30-60 minutes for each additional step.

Half last night before I went to bed stopped at step 24600, then took 30 mins to do step 24601 etc.

So that's what I mean, it is taking 4-6 times longer to process the same work, so it appears.

The latest batch which are rb 04 01 20235 19963 ab t000 robetta cstwt... Are currently on 2 hours 49, 56% on first decoy. Looks like 5hrs to run. 4.07 was running very similar tasks under an hour.

You are assuming that each decoy does an equal amount of work, and that each step does an equal amount of work. I don't expect that to be true.

Generally, the first decoy is only for checking that your computer works correctly and is the same every time. The second decoy starts the useful work.
ID: 93061
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 2,014
Message 93063 - Posted: 2 Apr 2020, 12:34:41 UTC

One thing to watch for when using CPUs with especially high numbers of cores - the bandwidth from the CPU to the memory may not be adequate to run all of the cores very well. This could leave each core in use waiting for access to memory most of the time.

If so, it can be useful to reduce the number of cores BOINC is allowed to use and see if that speeds up the work enough to more than compensate for fewer cores in use.
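One way to run that comparison is sketched below in Python. The measurements are entirely hypothetical placeholders (you would substitute your own averages), the function name is made up, and credit per core-hour is only a stand-in for useful work.

```python
def total_rate(cores_in_use: int, rate_per_core: float) -> float:
    """Whole-machine throughput = cores in use x average rate per core."""
    return cores_in_use * rate_per_core


# Hypothetical measurements: {cores BOINC may use: credit per core-hour}
measurements = {16: 40.0, 12: 58.0, 8: 75.0}

throughput = {
    cores: total_rate(cores, rate) for cores, rate in measurements.items()
}
best = max(throughput, key=throughput.get)

for cores, rate in sorted(throughput.items(), reverse=True):
    print(f"{cores:2d} cores: {rate:.0f} credit/hour in total")
print(f"best setting in this made-up data: {best} cores")
```

Fewer cores only pays off when the per-core gain outweighs the cores given up, which is why the comparison has to be made on the whole-machine total rather than on per-task speed.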
ID: 93063
JoshuaScholar
Joined: 26 Mar 20
Posts: 18
Credit: 232,183
RAC: 0
Message 93066 - Posted: 2 Apr 2020, 12:42:22 UTC - in response to Message 93063.  

That might be because of the bugs I noticed.
Make sure that every thread is really allocated in its own hyperthread, because BOINC doesn't leave it up to the OS.
ID: 93066
strongboes
Joined: 3 Mar 20
Posts: 27
Credit: 5,394,270
RAC: 0
Message 93071 - Posted: 2 Apr 2020, 12:48:24 UTC - in response to Message 93063.  

One thing to watch for when using CPUs with especially high numbers of cores - the bandwidth from the CPU to the memory may not be adequate to run all of the cores very well. This could leave each core in use waiting for access to memory most of the time.

If so, it can be useful to reduce the number of cores BOINC is allowed to use and see if that speeds up the work enough to more than compensate for fewer cores in use.


If you read previous posts you will see that I'm not hyper-threading and have a large L3 cache and RAM; I also tried running just 10 cores. It isn't that: they run roughly 4 times slower than 4.07 if they start with rb. It will be obvious soon enough.
ID: 93071
JoshuaScholar
Joined: 26 Mar 20
Posts: 18
Credit: 232,183
RAC: 0
Message 93072 - Posted: 2 Apr 2020, 12:51:00 UTC - in response to Message 93071.  
Last modified: 2 Apr 2020, 13:10:19 UTC

Oh you're right.
I just looked at my task list.
Time per WU has jumped from 8 hours to 16 hours!
The cores are running cooler than the last version too, suggests a bottleneck.
Note 2, I just noticed that the most recent few are fast again.
Maybe there was just a run of WU for a harder problem.
ID: 93072
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 2,014
Message 93074 - Posted: 2 Apr 2020, 13:25:00 UTC

A typical cause here for harder problems is larger proteins.
ID: 93074
Sid Celery
Joined: 11 Feb 08
Posts: 2141
Credit: 41,525,460
RAC: 10,413
Message 93077 - Posted: 2 Apr 2020, 14:25:26 UTC - in response to Message 93054.  

See below. There are no 4.07 tasks left showing; there were 9000 yesterday and only 400 today. The mini was taking around an hour, but it gives an idea. The 4.07 tasks were averaging a 40 min runtime, at a rate of 1 credit per 11.5 secs of runtime on average: 3600/11.5 = 313 credits per hour.

The last 4.12 is running at 1 credit for 59.95 seconds of runtime. 4.7* slower

https://boinc.bakerlab.org/rosetta/results.php?hostid=3800945&offset=340&show_names=0&state=4&appid=

I didn't look back that far earlier. What I notice now is that starting today, 2-Apr, the scoring for mini-Rosetta has plunged to 75/hr, down from 300/hr and 4.12 are 300/4hr - 75/hr too

It looks like something has happened to <all> scoring from today - a step change down - but consistent between the two on validation. Very odd.
ID: 93077
Sid Celery
Joined: 11 Feb 08
Posts: 2141
Credit: 41,525,460
RAC: 10,413
Message 93079 - Posted: 2 Apr 2020, 14:40:50 UTC - in response to Message 93077.  

See below. There are no 4.07 tasks left showing; there were 9000 yesterday and only 400 today. The mini was taking around an hour, but it gives an idea. The 4.07 tasks were averaging a 40 min runtime, at a rate of 1 credit per 11.5 secs of runtime on average: 3600/11.5 = 313 credits per hour.

The last 4.12 is running at 1 credit for 59.95 seconds of runtime. 4.7* slower

https://boinc.bakerlab.org/rosetta/results.php?hostid=3800945&offset=340&show_names=0&state=4&appid=

I didn't look back that far earlier. What I notice now is that starting today, 2-Apr, the scoring for mini-Rosetta has plunged to 75/hr, down from 300/hr and 4.12 are 300/4hr - 75/hr too

It looks like something has happened to <all> scoring from today - a step change down - but consistent between the two on validation. Very odd.

Oh, you're not going to like this...
I've just checked my own PC to see how my dribble of tasks have performed on a mere FX8370
1 Apr - Mini & 4.12 tasks around 45/hr, 280-340/8hr task. Better than I usually get tbh
2 Apr - Mini only (4.12 not reported yet) 110-120/hr, 890-950/8hr task. Lol

Nothing I can say to that...
ID: 93079
entity
Joined: 8 May 18
Posts: 19
Credit: 6,122,942
RAC: 5,437
Message 93080 - Posted: 2 Apr 2020, 15:11:15 UTC - in response to Message 93072.  
Last modified: 2 Apr 2020, 15:13:23 UTC

Oh you're right.
I just looked at my task list.
Time per WU has jumped from 8 hours to 16 hours!
The cores are running cooler than the last version too, suggests a bottleneck.
Note 2, I just noticed that the most recent few are fast again.
Maybe there was just a run of WU for a harder problem.

This is a known problem in Rosetta that the developers have acknowledged but probably haven't fixed yet. They indicated that it would take a major rewrite of the code. L3 cache tends to become over-utilized and the CPU waits for data to make the trip from main memory, hence the CPU runs cooler (more waiting). There was a post by a developer in another project that suggested limiting the number of tasks run concurrently. They indicated that each task uses about 4MB of L3 cache. Concerning the run time, I noticed that the run parameters include something like cpu_seconds=57500. That is about 16 hours. They are ignoring the Target CPU runtime setting.
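The two numbers in this post can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 4 MB per task figure is the one relayed above, the L3 sizes in the loop are hypothetical examples, and the helper name is made up. Real cache behaviour is more complicated than a simple division.

```python
cpu_seconds = 57500                  # value quoted in the post ("something like")
hours = cpu_seconds / 3600
print(f"{cpu_seconds} s is about {hours:.2f} hours")


def tasks_fitting_in_l3(l3_cache_mb: float, mb_per_task: float = 4.0) -> int:
    """How many tasks fit if each needs mb_per_task of L3 (rough estimate)."""
    return int(l3_cache_mb // mb_per_task)


for l3_mb in (8, 16, 32):            # hypothetical L3 cache sizes in MB
    print(f"{l3_mb} MB L3: roughly {tasks_fitting_in_l3(l3_mb)} tasks")
```

If a CPU has more cores than this estimate allows tasks, that would be consistent with the suggestion to limit the number of tasks run concurrently.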
ID: 93080
Stephen "Heretic"
Joined: 2 Apr 20
Posts: 21
Credit: 11,028
RAC: 0
Message 93081 - Posted: 2 Apr 2020, 15:27:06 UTC - in response to Message 93040.  

Hello, I have just joined this project but it seems there is no work to do at the moment. Is this a common state of affairs or have I struck a bad moment to join??
Work being done has increased by 500% over the last 2 and a bit weeks, so there's not much work available as demand is far exceeding supply.
More work is meant to be coming, but apparently it takes quite a while to prepare it for release, so it will take a while before work production comes close to matching the present demand.


. . I'm guessing fellow refugees from S@H ... oh well, I'll just have to be patient ...

Stephen

:(
ID: 93081
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93083 - Posted: 2 Apr 2020, 15:37:56 UTC

I've tried to summarize the new work unit runtimes in a new thread, please post concerns about "performance" of new v4.12, or estimated time to completion over there.
Rosetta Moderator: Mod.Sense
ID: 93083
BetelgeuseFive
Joined: 10 Aug 10
Posts: 4
Credit: 1,443,980
RAC: 382
Message 93084 - Posted: 2 Apr 2020, 16:23:02 UTC

I'm having a problem with 4.12 on Linux (CentOS 7). Found out my computer was doing nothing while there were plenty of tasks "Ready to start".
First rebooted the system, but this did not change anything.
Enabled cpu_sched_debug in the event log and messages indicated it was trying to start v4.12 tasks, but nothing actually started.
Suspended the v4.12 tasks and other v4.08 tasks started immediately without any problems.

Any clues ?

Thanks,

Tom
ID: 93084
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93086 - Posted: 2 Apr 2020, 16:42:05 UTC - in response to Message 93084.  
Last modified: 2 Apr 2020, 16:51:56 UTC

How much memory have you allowed BOINC to use, when active? when idle?
Rosetta Moderator: Mod.Sense
ID: 93086
BetelgeuseFive
Joined: 10 Aug 10
Posts: 4
Credit: 1,443,980
RAC: 382
Message 93087 - Posted: 2 Apr 2020, 17:00:28 UTC - in response to Message 93086.  

How much memory have you allowed BOINC to use, when active? when idle?


System has 6 GB configured (running inside VM).
Just checked settings, it has:

When in use, use at most 50%
When not in use, use at most 90%

That should have been plenty to start at least one task.
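For reference, the limits those settings imply can be worked out as below. This is a minimal Python sketch with a made-up function name; it ignores whatever the OS and other processes already use inside the VM, and it does not say how much memory a 4.12 task actually requests.

```python
def boinc_memory_budget_gb(total_ram_gb: float, percent_allowed: float) -> float:
    """Upper bound on the RAM BOINC may use under a percentage limit."""
    return total_ram_gb * percent_allowed / 100


total_ram_gb = 6.0                                   # RAM configured for the VM
in_use = boinc_memory_budget_gb(total_ram_gb, 50)    # "When in use" limit
idle = boinc_memory_budget_gb(total_ram_gb, 90)      # "When not in use" limit

print(f"in use: {in_use:.1f} GB, idle: {idle:.1f} GB")
```

For comparison, a running 4.12 task reported earlier in this thread showed a working set of about 1.14 GB, though the amount a task needs before it is allowed to start may be different.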
ID: 93087



©2024 University of Washington
https://www.bakerlab.org