300+ TeraFLOPS sustained!

Message boards : Number crunching : 300+ TeraFLOPS sustained!



Profile Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,645,785
RAC: 84
Message 79760 - Posted: 17 Mar 2016, 13:47:32 UTC
Last modified: 17 Mar 2016, 13:48:26 UTC

Looks like a big boost in CE participation has pushed Rosetta@Home well over the 300 TeraFLOP mark.

Wondering if this has anyone at Baker lab thinking up new experiments that may be more viable now than in the past, or whether this little boost is still orders of magnitude away from being a game changer?

I know things aren't that simplistic, and real progress likely comes from evolution of the algorithms behind the models, but I'm sure there are thresholds where new things become possible. Maybe it's not at 320 TeraFLOP/s though; maybe it's at 300 ExaFLOP/s.
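
For a rough sense of scale, here is a back-of-envelope sketch of the gap between those two figures. The function name is made up for illustration, and the only inputs are the 320 TeraFLOP/s and 300 ExaFLOP/s numbers mentioned above.

```python
import math

TERA = 1e12  # FLOP/s in one TeraFLOP/s
EXA = 1e18   # FLOP/s in one ExaFLOP/s

def orders_of_magnitude(current_flops, target_flops):
    """Return how many powers of ten separate two throughput figures."""
    return math.log10(target_flops / current_flops)

gap = orders_of_magnitude(320 * TERA, 300 * EXA)
print(f"320 TFLOP/s -> 300 EFLOP/s is about {gap:.1f} orders of magnitude")
```

So that second threshold would sit roughly six orders of magnitude above the current figure.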

Still interesting to ponder over. Progress for the win!
ID: 79760 · Rating: 0
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1914
Credit: 8,839,333
RAC: 9,528
Message 79764 - Posted: 20 Mar 2016, 21:30:54 UTC - in response to Message 79760.  
Last modified: 20 Mar 2016, 21:31:42 UTC

Looks like a big boost in CE participation has pushed Rosetta@Home well over the 300 TeraFLOP mark.
Wondering if this has anyone at Baker lab thinking up new experiments that may be more viable now than in the past, or whether this little boost is still orders of magnitude away from being a game changer?


After all the discussions about GPU/CPU optimization etc., I think they are not so interested in additional computational power.
ID: 79764 · Rating: 0
ssoxcub@yahoo.com
Joined: 8 Jan 12
Posts: 17
Credit: 503,947
RAC: 0
Message 79766 - Posted: 21 Mar 2016, 6:59:44 UTC

I think they should constantly improve the code, as Folding@home does. From personal experience, an Nvidia 760 gets about 80,000, while after they improved the AMD code, an R9 390 pulls down 300,000 points a day. Not sure if they could ever use an AMD processor because of its math deficits. But another thought is, you can get an older CPU that would hold its own, say 8 years old, which is an extremely long time, but an 8-year-old GPU would be outclassed 100x or even 1000x.
ID: 79766 · Rating: 0
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 79832 - Posted: 2 Apr 2016, 12:30:30 UTC

It would be fun, and an eye-popper, if Rosetta@home reaches the petaflops benchmark. Let's keep it up :)

ID: 79832 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79833 - Posted: 2 Apr 2016, 14:55:58 UTC - in response to Message 79766.  

But another thought is, you can get an older CPU that would hold its own, say 8 years old, which is an extremely long time, but an 8-year-old GPU would be outclassed 100x or even 1000x.

Nvidia keeps improving CUDA, and supposedly making it easier to use. Maybe by the time Volta comes out, it would be worthwhile for Baker Labs to hire some smart grad student to look into it.
ID: 79833 · Rating: 0
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1914
Credit: 8,839,333
RAC: 9,528
Message 79834 - Posted: 2 Apr 2016, 17:17:47 UTC - in response to Message 79833.  

Nvidia keeps improving CUDA, and supposedly making it easier to use. Maybe by the time Volta comes out, it would be worthwhile for Baker Labs to hire some smart grad student to look into it.


On the other side of the moon, Khronos Group, AMD, Altera, Intel and others keep improving OpenCL and supposedly making it easier to use. Maybe by the time Vega comes out, it would be worthwhile for Baker Labs to hire some smart grad student to look into it.

:-)
ID: 79834 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79836 - Posted: 2 Apr 2016, 20:10:01 UTC - in response to Message 79834.  
Last modified: 2 Apr 2016, 20:12:25 UTC

On the other side of the moon, Khronos Group, AMD, Altera, Intel and others keep improving OpenCL and supposedly making it easier to use. Maybe by the time Vega comes out, it would be worthwhile for Baker Labs to hire some smart grad student to look into it.

I am happy to go either way, assuming AMD is still in business.
ID: 79836 · Rating: 0
Profile Emigdio Lopez Laburu

Joined: 25 Feb 06
Posts: 61
Credit: 40,240,061
RAC: 0
Message 79839 - Posted: 4 Apr 2016, 13:38:42 UTC - in response to Message 79764.  

Looks like a big boost in CE participation has pushed Rosetta@Home well over the 300 TeraFLOP mark.
Wondering if this has anyone at Baker lab thinking up new experiments that may be more viable now than in the past, or whether this little boost is still orders of magnitude away from being a game changer?


After all the discussions about GPU/CPU optimization etc., I think they are not so interested in additional computational power.


Why do you say that?
ID: 79839 · Rating: 0
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1914
Credit: 8,839,333
RAC: 9,528
Message 79841 - Posted: 4 Apr 2016, 18:04:00 UTC - in response to Message 79839.  
Last modified: 4 Apr 2016, 18:04:54 UTC

After all the discussions about GPU/CPU optimization etc., I think they are not so interested in additional computational power.


Why do you say that?


Despite some very interesting preliminary tests, it seems that they have abandoned the optimization effort.
Please read the discussions here and on Ralph's forum (here, for example):
- Only one admin participates (dekim).
- This admin does not work very hard on optimizations (he has other things to do).
- He says that optimizations are not so important; "precision" of the simulation is more important than speed.
- The optimizations are left to one volunteer (rjs5), who works on the code when he has free time.

So, I'm not so optimistic.
ID: 79841 · Rating: 0
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 22,331,254
RAC: 10,742
Message 79842 - Posted: 4 Apr 2016, 20:51:17 UTC - in response to Message 79841.  

After all the discussions about GPU/CPU optimization etc., I think they are not so interested in additional computational power.


Why do you say that?


Despite some very interesting preliminary tests, it seems that they have abandoned the optimization effort.
Please read the discussions here and on Ralph's forum (here, for example):
- Only one admin participates (dekim).
- This admin does not work very hard on optimizations (he has other things to do).
- He says that optimizations are not so important; "precision" of the simulation is more important than speed.
- The optimizations are left to one volunteer (rjs5), who works on the code when he has free time.

So, I'm not so optimistic.


Be more optimistic ... and as patient as you can. 8-) Not all is bad.

I have thought about updating status several times, but I thought that it might be more appropriate for those on the project (dekim) to disclose plans/status. He can delete this message if I am off base ... since I did not ask.

There is another lab student working on incorporating my findings into their production environment. They are busy but I have been feeding them measurements and configuration files.

To summarize, I built 50+ binaries with selected option combinations and expected (as I had said before) about 20% improvement. I generally measured a 20% to 40% improvement and dekim said they had confirmed those numbers internally. I also said that it would require the compiler to auto-vectorize the code to go faster than 20%. The original source code, I think, was written in Fortran, and translated to C++. Ugh!

Dekim indicated that they have built and deployed a test binary based on my suggestions on Ralph. I don't know which one he is talking about but v3.73 was released about the right time.

He also indicated they have introduced an optimized binary into their local production clusters ... whatever that is. They are seeing more than 2x-4x improvement on one of their design protocols executing on that cluster. I will be interested in learning why the dramatic impact.

They are being careful, because this involves changing compilers and options. They are also in the middle of a big change ... notice the size of the database increased from 180 MB to 270 MB ... 8-)


I have set my Rosetta preferences to run 24 hour jobs so I can see when (if) a better binary is introduced. The easiest way to detect a changed binary is to run with the longer CPU target times and watch for tasks that finish before the target 86,400-second CPU time; they stick out.







ID: 79842 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79843 - Posted: 4 Apr 2016, 21:19:05 UTC - in response to Message 79842.  

I have set my Rosetta preferences to run 24 hour jobs so I can see when (if) a better binary is introduced. The easiest way to detect a changed binary is to run with the longer CPU target times and watch for tasks that finish before the target 86,400-second CPU time; they stick out.

I always run 24 hours on six cores of my i7-4790 (Win7 64-bit), and have seen several short work units since 2 April, when I started working on 3.73.
24 hour tasks

You have done something very right it seems.


ID: 79843 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 79844 - Posted: 4 Apr 2016, 21:51:18 UTC

@rjs5,
Thanks for the update. Wanted to point out that 24hr work units, running more efficiently will simply produce more models in as close to 24hrs as they can. So, you won't notice them completing 20-40% sooner. Each time a new model is begun, a check is made to estimate whether it will complete before the runtime preference set in the user's settings. I believe the estimate is just based on time taken to complete prior models on the same task. So if model 23 completes after 23.5hrs of CPU, then the task is ended and returned. If model 23 completes after 22.5hrs, then a 24th model begins.
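
A minimal sketch of that check, assuming the estimate is simply the average CPU time per model completed so far on the same task. The function and parameter names are hypothetical and are not taken from the Rosetta source.

```python
def should_start_next_model(cpu_hours_used, models_done, target_hours=24.0):
    """Decide whether another model is expected to fit before the target runtime.

    Assumption: the next model is predicted to take the average CPU time
    of the models already completed on this task.
    """
    if models_done == 0:
        # Nothing to base an estimate on yet, so always attempt the first model.
        return True
    avg_hours_per_model = cpu_hours_used / models_done
    return cpu_hours_used + avg_hours_per_model <= target_hours

# The two cases from the post above:
print(should_start_next_model(23.5, 23))  # model 23 done at 23.5 h: task ends
print(should_start_next_model(22.5, 23))  # model 23 done at 22.5 h: model 24 begins
```

Under this rule a faster binary does not shorten the task; it lowers the average time per model, so more models fit inside the same 24 hours.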
Rosetta Moderator: Mod.Sense
ID: 79844 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79845 - Posted: 4 Apr 2016, 23:11:59 UTC - in response to Message 79844.  
Last modified: 4 Apr 2016, 23:14:34 UTC

Thanks for the update. Wanted to point out that 24hr work units, running more efficiently will simply produce more models in as close to 24hrs as they can.

I think I see what you are saying. You put as many apples of various sizes in the box without overflowing. However, I have seen several tasks that run under 10,000 seconds on the above (and three other) machines in only two days. I think that is very rare, and after checking it is only on the 3.73 tasks.

Also, if they are that short, you would think there would be plenty of room to fit another model in. So it seems that something is making the run times shorter than before, and preventing another model from being run. Maybe there is a limit on the total number of models?
ID: 79845 · Rating: 0
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,870,184
RAC: 777
Message 79847 - Posted: 5 Apr 2016, 10:27:36 UTC - in response to Message 79845.  

Thanks for the update. Wanted to point out that 24hr work units, running more efficiently will simply produce more models in as close to 24hrs as they can.

I think I see what you are saying. You put as many apples of various sizes in the box without overflowing. However, I have seen several tasks that run under 10,000 seconds on the above (and three other) machines in only two days. I think that is very rare, and after checking it is only on the 3.73 tasks.

Also, if they are that short, you would think there would be plenty of room to fit another model in. So it seems that something is making the run times shorter than before, and preventing another model from being run. Maybe there is a limit on the total number of models?

It depends on the type of tasks. I'll just copy and paste what I wrote earlier and perhaps Mod.sense or DEK can correct and/or add detail as necessary:
If memory serves, the 99 model limit was enacted when some tasks created output files too large to be uploaded. The limit only applies to a particular type of task. Others use the preferred CPU time plus 4 hours method to determine when to end things. When a model is completed, the task calculates whether it has time left to complete another model. If the answer is no, the task wraps things up despite there appearing (to the cruncher) to be hours left. If the answer is yes, the task will begin another model. Not all models are equal, however, even within the same task, so some will take longer than predicted. To ensure that otherwise good models aren't cut short just before completing (and to increase the odds that the task will complete at least one model), the task will continue past the preferred CPU time. At some point, though, you gotta cut your losses, so at preferred CPU time plus 4 hours the watchdog cuts bait and the task goes home. (I'm curious about the average overtime; my totally uninformed guess is that it's less than an hour.)
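
Roughly, the two stopping rules described above could be sketched like this. It is only an illustration of my understanding: every name is hypothetical, and the 99-model cap and the 4-hour grace period are the figures recalled above, not values checked against the Rosetta source.

```python
MAX_MODELS = 99             # cap that applies only to some task types (as recalled above)
WATCHDOG_GRACE_HOURS = 4.0  # "preferred CPU time plus 4" (as recalled above)

def should_begin_model(cpu_hours_used, models_done, predicted_model_hours,
                       preferred_hours, capped_task=False):
    """Checked each time a model completes: is there room for another one?"""
    if capped_task and models_done >= MAX_MODELS:
        # Capped task types stop here, however much CPU time is left.
        return False
    return cpu_hours_used + predicted_model_hours <= preferred_hours

def watchdog_should_end_task(cpu_hours_used, preferred_hours):
    """Checked while a model is running: has the task overrun too far?"""
    return cpu_hours_used >= preferred_hours + WATCHDOG_GRACE_HOURS

# A capped task that reaches 99 models after 2.5 hours stops early...
print(should_begin_model(2.5, 99, 0.025, 24.0, capped_task=True))   # False
# ...while an uncapped task in the same position keeps going.
print(should_begin_model(2.5, 99, 0.025, 24.0, capped_task=False))  # True
```

If that reading is right, a capped task can finish with many hours apparently left on the clock, which may be one source of the very short run times.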

There are other types of tasks in which filters are employed to cut off models early. If the model passes the filter it will continue working on that one task to the end. This results in dramatically disparate counts, with one task generating hundreds of models while another task from the same batch only generating one, two, five, etc. Recently on ralph a filter was used to remove models resulting in a file transfer error upon upload. The stderr out listed 13 models from 2 attempts but since the models had been erased the file meant to contain them didn't exist. I'm guessing, based on DEK's post, which I may well have misinterpreted, that the server, possibly as part of a validation check, automatically gives the file transfer error (client error, compute error) when this particular file isn't part of the upload.

All these different strategies result, from the cruncher's point of view, in varied behavior which we struggle to interpret. Is it a problem with my computer or a problem with Rosetta? Is it a problem at all? BOINC is complicated enough for the computer savvy, much more so for the majority of crunchers who just want to maximize their participation in Rosetta and end up massively tangled up in the BOINC settings. The variety of legitimate behaviors exhibited by Rosetta tasks trips up the volunteers trying to help them become untangled. From the researcher's point of view everything may look fine, working as expected, and any issues a lone cruncher is having are most likely due to their particular setup. And they probably are, but the lack of information leaves the volunteers flailing.

I have long wished for a reference, a database of tasks, in which the tasks are divided into broad categories of strategies employed (as above, with some info on how they "look" to the crunchers) and what, in the most basic way, is being asked (how does this particular protein fold, how do these two proteins interact, can we create a new protein to do x, etc.).


Best,
Snags
ID: 79847 · Rating: 0
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 22,331,254
RAC: 10,742
Message 79848 - Posted: 5 Apr 2016, 13:09:18 UTC - in response to Message 79845.  

Thanks for the update. Wanted to point out that 24hr work units, running more efficiently will simply produce more models in as close to 24hrs as they can.

I think I see what you are saying. You put as many apples of various sizes in the box without overflowing. However, I have seen several tasks that run under 10,000 seconds on the above (and three other) machines in only two days. I think that is very rare, and after checking it is only on the 3.73 tasks.

Also, if they are that short, you would think there would be plenty of room to fit another model in. So it seems that something is making the run times shorter than before, and preventing another model from being run. Maybe there is a limit on the total number of models?


The 3.73 jobs hit my machines about 10am 3/31.

My Skylake 6700K Win10 machine is only taking 86,400 seconds on tasks that run multiple structures. It looks like there are several ways of running jobs, which is possibly the source of the confusion.

ALL "24 hour" jobs that finished early.

CPU time (sec) -- Task ID
9,231 -- 806473699
9,474 -- 806473717
10,616 -- 802461224
10,736 -- 806473700
11,629 -- 802461073
19,048 -- 802461280
19,727 -- 802461293
25,353 -- 802461165
28,458 -- 802461288
31,028 -- 806739396
31,109 -- 806739333
31,152 -- 806739395
32,629 -- 806739285
32,775 -- 806739281
32,788 -- 806739332
32,897 -- 806739284
74,202 -- 802461278
86,645 -- 802461299 <<< multiple structures tj_3_15_dimer_X_ZC16v1_DHR54_l3_h22_l3_v11_0_v1b_fragments_abinitio_SAVE_ALL_OUT_339362_541_0
86,825 -- 802461222 <<< multiple structures tj_3_15_dimer_X_ZC16v1_DHR54_l3_h22_l3_v11_0_v1
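
To pick the early finishers out of a list like the one above without eyeballing it, something like this sketch works. The 5% tolerance is an arbitrary assumption, the function name is made up, and the times are the CPU seconds listed above.

```python
TARGET_SECONDS = 86_400  # 24-hour CPU time preference

# CPU times (seconds) copied from the list above.
cpu_times = [
    9231, 9474, 10616, 10736, 11629, 19048, 19727, 25353, 28458, 31028,
    31109, 31152, 32629, 32775, 32788, 32897, 74202, 86645, 86825,
]

def finished_early(cpu_seconds, target=TARGET_SECONDS, tolerance=0.05):
    """True if the task ended more than `tolerance` short of the target time."""
    return cpu_seconds < target * (1 - tolerance)

early = [t for t in cpu_times if finished_early(t)]
full_length = [t for t in cpu_times if not finished_early(t)]

print(f"{len(early)} of {len(cpu_times)} tasks ended well short of the target")
print(f"tasks that ran to the target: {full_length}")
```

On this list only the two multiple-structure tasks ran to the full target time.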




My Haswell Extreme Win10 machine did not get any of the "tj" jobs and no job took 24 hours.

CPU time (sec) -- Task ID

26,147 -- 806166140
26,527 -- 806783950
27,408 -- 806166136
28,310 -- 806166173
28,716 -- 806166175
28,779 -- 806166153
28,946 -- 806166142
29,498 -- 806166158
29,656 -- 806166144
29,826 -- 806166155
30,031 -- 806166182
30,319 -- 806166135
31,056 -- 806166156
31,949 -- 806166137
32,495 -- 806166183
33,339 -- 806166141
33,441 -- 806166181
33,823 -- 806166154
34,171 -- 806166174
35,218 -- 806166143
37,461 -- 806166157
39,680 -- 806784007
41,060 -- 806784004
41,733 -- 806784024
42,119 -- 806784008
42,459 -- 806783991
42,714 -- 806784020
43,282 -- 806784012
45,064 -- 806783066
46,650 -- 806783178
46,709 -- 806783780
47,901 -- 806783261
48,229 -- 806783240
48,300 -- 806783221
48,708 -- 806783141
49,017 -- 806783220
49,049 -- 806783231
51,220 -- 806783233
51,305 -- 806783258
54,612 -- 806784017


ID: 79848 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79849 - Posted: 5 Apr 2016, 14:55:31 UTC - in response to Message 79848.  

My Skylake 6700K Win10 machine is only taking 86,400 seconds on tasks that run multiple structures. It looks like there are several ways of running jobs, which is possibly the source of the confusion.

Can you reach a conclusion yet? Is it clear that there are gains, or is that yet to be sorted out?

ID: 79849 · Rating: 0
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 22,331,254
RAC: 10,742
Message 79850 - Posted: 5 Apr 2016, 16:12:13 UTC - in response to Message 79849.  

My Skylake 6700K Win10 machine is only taking 86,400 seconds on tasks that run multiple structures. It looks like there are several ways of running jobs, which is possibly the source of the confusion.

Can you reach a conclusion yet? Is it clear that there are gains, or is that yet to be sorted out?


Looks good. There are gains ... it's the "how much" that is harder to determine.

Performance is always a "work in progress". That is why you have to be careful in "optimizing" something. Everyone who follows assumes the optimizations still work. Rosetta is a moving target and the run time statistics are very difficult to extract on this side of the server. The data ages out too quickly and the task name/information is buried another level deep.

In very round numbers, I think there is generally 20%-50% in compiler and option "low hanging fruit". Timing of the deployed binary is out of my control and vision.
Properly written code will see a 2x-4x improvement.
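
To put those round numbers in terms of models per task, here is a small sketch. It assumes an "X% improvement" means X% more throughput, and it uses a hypothetical baseline of one CPU hour per model; neither assumption comes from measured data.

```python
import math

def models_in_target(speedup, base_model_hours=1.0, target_hours=24.0):
    """Whole models that fit in the target CPU time at a given speedup.

    base_model_hours is a made-up baseline cost per model, for illustration only.
    """
    return math.floor(target_hours * speedup / base_model_hours)

for speedup in (1.0, 1.2, 1.5, 2.0, 4.0):
    print(f"{speedup:.1f}x -> {models_in_target(speedup)} models in 24 CPU hours")
```

With a fixed 24-hour target the gain shows up as more models per task, not as shorter tasks.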







ID: 79850 · Rating: 0
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1914
Credit: 8,839,333
RAC: 9,528
Message 79857 - Posted: 8 Apr 2016, 14:59:06 UTC - in response to Message 79842.  


Be more optimistic ... and as patient as you can. 8-)

I've been here since 2005, so I'm patient :-)

To summarize, I built 50+ binaries with selected option combinations and expected (as I had said before) about 20% improvement. I generally measured a 20% to 40% improvement and dekim said they had confirmed those numbers internally.

Not bad!

The original source code, I think, was written in Fortran, and translated to C++. Ugh!

Yep, I think there are still some traces of Fortran.

Dekim indicated that they have built and deployed a test binary based on my suggestions on Ralph. I don't know which one he is talking about but v3.73 was released about the right time.
He also indicated they have introduced an optimized binary into their local production clusters ... whatever that is. They are seeing more than 2x-4x improvement on one of their design protocols executing on that cluster. I will be interested in learning why the dramatic impact.

Only Dekim can answer....


ID: 79857 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79858 - Posted: 8 Apr 2016, 15:30:34 UTC

While we are on the subject, I am presently on Win7 64-bit. But I could go to Linux Mint 18 when it comes out. Is there an advantage?
ID: 79858 · Rating: 0
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 22,331,254
RAC: 10,742
Message 79860 - Posted: 8 Apr 2016, 16:39:42 UTC - in response to Message 79858.  

While we are on the subject, I am presently on Win7 64-bit. But I could go to Linux Mint 18 when it comes out. Is there an advantage?


8-} hit POST instead of PREVIEW ...


If you are curious, you might install a VM and then install Mint on it. You can compare the performance of the 32-bit Windows binary with the 64-bit Linux version. The last time I tried this, the VM was about 10% faster.
ID: 79860 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org