Will there be a 64-bit client in the near future?

Message boards : Number crunching : Will there be a 64-bit client in the near future?



Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80837 - Posted: 11 Nov 2016, 15:46:49 UTC

The folks over at Folding@home / GROMACS seem to put a little more effort into it...

14:37:02:WU00:FS00:FahCore 0xa7 started
14:37:03:WU00:FS00:0xa7:*********************** Log Started 2016-11-11T14:37:02Z ***********************
14:37:03:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
14:37:03:WU00:FS00:0xa7:       Type: 0xa7
14:37:03:WU00:FS00:0xa7:       Core: Gromacs
14:37:03:WU00:FS00:0xa7:    Website: http://folding.stanford.edu/
14:37:03:WU00:FS00:0xa7:  Copyright: (c) 2009-2016 Stanford University
14:37:03:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:37:03:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 704 -lifeline 3376 -checkpoint 15 -np 8
14:37:03:WU00:FS00:0xa7:     Config: <none>
14:37:03:WU00:FS00:0xa7:************************************ Build *************************************
14:37:03:WU00:FS00:0xa7:    Version: 0.0.11
14:37:03:WU00:FS00:0xa7:       Date: Sep 20 2016
14:37:03:WU00:FS00:0xa7:       Time: 06:40:11
14:37:03:WU00:FS00:0xa7: Repository: Git
14:37:03:WU00:FS00:0xa7:   Revision: 957bd90e68d95ddcf1594dc15ff6c64cc4555146
14:37:03:WU00:FS00:0xa7:     Branch: master
14:37:03:WU00:FS00:0xa7:   Compiler: GNU 4.8.5
14:37:03:WU00:FS00:0xa7:    Options: -std=gnu++98 -O3 -funroll-loops -ffast-math -mfpmath=sse
14:37:03:WU00:FS00:0xa7:             -fno-unsafe-math-optimizations -msse2
14:37:03:WU00:FS00:0xa7:   Platform: linux2 4.6.0-1-amd64
14:37:03:WU00:FS00:0xa7:       Bits: 64
14:37:03:WU00:FS00:0xa7:       Mode: Release
14:37:03:WU00:FS00:0xa7:       SIMD: avx_256
ID: 80837
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80838 - Posted: 12 Nov 2016, 5:02:17 UTC - in response to Message 80836.  
Last modified: 12 Nov 2016, 5:17:35 UTC


There is little if any code that would currently benefit from 64-bit integers. I originally thought that the larger number of registers available in 64-bit mode would help, but the increased code and data size of 64-bit code did more damage to the caches than the register spills/fills needed in 32-bit mode (caused by having fewer registers). That is what I measured when I actually recompiled the code as both 32-bit and 64-bit.


Rosetta spends a large chunk of its time computing "relationships" between 2 points in 3 dimensions (using floating-point math).
Rosetta makes an X-dimension 64-bit floating-point calculation.
Rosetta makes a Y-dimension 64-bit floating-point calculation.
Rosetta makes a Z-dimension 64-bit floating-point calculation.

You can change the TYPE DEFINITION of that "point" description to just add a 4th "dummy" dimension, which allows the compiler to do a SIMD vector load of all 4 dimensions, perform the operation on all 4 dimensions, and then do a SIMD vector store. The compiler will change the 3 sequential SCALAR operations on 3 dimensions into a SINGLE PARALLEL operation on 4 dimensions.

If you add the 4th dimension in the TYPEDEF, you do not need to make ANY other source-code changes for the compiler to automatically generate the low-level code that performs the parallel LOAD-OPERATE-STORE. VERY low-hanging fruit.

The Rosetta developers said they were already "familiar" with this technique when I pointed it out last year. It WOULD be their first, easy step to take if "low" performance were a problem for them.

32-bit versus 64-bit integer code really makes no difference unless the Rosetta code undergoes major changes.


agreed, and as observed earlier, just as in the case of the Whetstone benchmark, 32-bit code actually turned out *faster* than 64-bit code. that could imply, for instance, that those running the 32-bit BOINC client actually get more BOINC points (credits), since BOINC awards points based on the Whetstone benchmark. imho a very *crude* approach, but i'd guess the point is to be 'comparable' across different BOINC projects.

32-bit code and its data possibly use considerably *less memory* than 64-bit. that makes it more likely that the code & data running in 32-bit mode fit completely in the cpu cache. this can make a world of difference if, say, the Whetstone benchmark & all its data run completely within the cpu cache, never hitting dram.

my thought is that 64-bit code used so much more memory that the cpu cache was thrashed, making it necessary to move data to/from memory for all that computation, which hurt the Whetstone benchmark.

it also seems the Whetstone benchmark is impossible or infeasible to parallelize with features such as SSE/AVX/AVX2, as most of its operations are memory based, and in addition the code / formulas seem deliberately organized to thwart parallelism. apparently for x86 or x86_64 cpus, all the magic of floating-point performance lies primarily in the *FPU*, and in that respect 32-bit or 64-bit code does not matter, as the *FPU* itself determines *all the floating-point calculation performance*. intel cpus, especially the recent ones, apparently have a *very fast, high-performance FPU* that accounts for their fast double-precision floating-point performance. it is comparable to, or even outperforms, lower- and mid-range consumer gpus, and delivers better double-precision performance per watt than those gpus.

for the Whetstone benchmark, the (intel) cpu seems to handle double-precision floating-point calculations so rapidly, taking hardly any clock cycles, that it spends most of its time moving data between memory and the cpu registers / cache. i'd guess the memory movements could take many more clock cycles (say 10?) than the double-precision floating-point computation itself (say 1). that seems to be reflected in how the Whetstone benchmark gets more gflops in 32-bit than in 64-bit mode.
ID: 80838
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80839 - Posted: 12 Nov 2016, 17:16:47 UTC
Last modified: 12 Nov 2016, 17:39:28 UTC

here is a very interesting set of slides on *AVX/AVX2*, from CERN, the HPC (high-performance computing) people who deal with *physics*:

Haswell Conundrum: AVX or not AVX?
https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf
in 2014
Conclusions

– Free lunch is over

» In 2 years the computational power of Intel workstations has increased
by 30% max (including core count and freq-boost)

» For servers even less

– Power management affects individual components:

» Achieving maximal throughput requires to make choices among features
to activate

– Memory wall is higher than ever

» HSW improves on instruction caching though..

– Wide SIMD vectors are effective only for highly specialized code

– Little support for this new brave world in generic high level
languages and libraries



Summary

– Haswell is a great new Architecture:

» Not because of AVX

– Long SIMD vectors are worth only for intensive vectorized code

» Are not GPUs then a better option?

– Power Management cannot be ignored while assessing
computational efficiency

– On modern architecture, extrapolation based on synthetic
benchmarks is mission impossible


they are on BOINC too & you can run their simulations:
http://atlasathome.cern.ch/


that *special scenario* is apparently things like the Linpack benchmark, which depends heavily on the DGEMM subroutine (double-precision general matrix multiplication), e.g. multiplying very *big/large* *square matrices*, say 10,000 x 10,000

https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493/

once the math scenario falls outside this DGEMM multiply-very-big-square-matrices use case, all that vector / parallel cpu and even those extreme-speed gpu (*petaflops*) hardware is simply *useless*. e.g. if you are trying to solve 2x2 matrices a billion times and the result of the next iteration depends on the previous one, it would be just as slow as if you simply did it in loops with no SSE/AVX/AVX2 lol

in short, everything from SSE/AVX up to those super-high-end vectorized extreme-performance gpus is only good if the whole world is simply DGEMM. too bad DGEMM covers very few true real-world scenarios lol
ID: 80839
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1866
Credit: 8,186,159
RAC: 6,319
Message 81074 - Posted: 22 Jan 2017, 20:02:14 UTC
Last modified: 22 Jan 2017, 20:02:43 UTC

10 years ago a volunteer asked for 64-bit and did an experiment with Rosetta.
Now, in 2017, I think it's time to start testing a 64-bit app (and, maybe, optimizations).
ID: 81074
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 81084 - Posted: 24 Jan 2017, 17:56:43 UTC
Last modified: 24 Jan 2017, 18:30:39 UTC

imho microprocessors have reached the point of zero marginal improvement at 64 bits, based on the concept of Amdahl's law
https://en.wikipedia.org/wiki/Amdahl's_law
a microprocessor is only as fast as its slowest bottleneck that cannot be made faster. intel (and possibly amd) x86 microprocessors, with all their advances, have very fast FPUs (even when the core code runs at 32 bits, the x87 FPU computes at 80 bits internally; SSE/AVX/AVXn use their own separate vector registers)
https://en.wikipedia.org/wiki/X87
the hardware optimizations (including superscalar instruction-level parallelism) have perhaps reached the point where a simple double-precision floating-point computation takes a single clock cycle and can go no faster; everything else in the chain of instruction and data processing is simply memory-limited. for some things you either vectorize and process them in parallel, or, if that cannot be done because the next step depends on the output of the previous step, this is as fast as it ever gets

(the early 64-bit AMD64 processors were faster in 64-bit than 32-bit mode because the recent hardware optimizations (e.g. in skylake, kabylake) that fetch data, say 64-bit doubles, directly in hardware, completely bypassing the old 32-bit logic, had not been designed/invented back then. today's hardware prefetches instructions, uses the cache efficiently, and performs speculative execution, overcoming much of the 32-bit vs 64-bit distinction)

and with transistors reaching 14nm (10nm next), it's probably becoming impossible to make those transistors much smaller (the quantum uncertainty principle comes into play & it may no longer be possible to keep a transistor working as a transistor at smaller sizes) - the end of moore's law

this is an oversimplification, but it brings across the point that zero marginal improvement has been reached. hardware and software optimizations and transistor sizes have been pushed to limits where there can be no further improvement regardless of whether it is 32 or 64 bits or more

i think this is true even for ARM microprocessors, where 64-bit microprocessors show only a marginal (or even no) improvement over 32-bit ARM microprocessors
http://www.roylongbottom.org.uk/linpack%20results.htm#anchorAndroid
 Raspberry Pi 2
 gcc 4.8                       DP        SP
 CPU          MHz     Linux    MFLOPS    MFLOPS
 ARM V7A     1000     3.18.5    169       176

 Raspberry Pi 3                                
 gcc 4.8                       DP        SP
 CPU          MHz     Linux    MFLOPS    MFLOPS
 ARM v8-A53  1200     4.1.19    180       194  


if we take the 32-bit Raspberry Pi 2 and overclock it to 1.2 GHz,
then assuming 169 * 1.2 = 202.8 MFLOPS,
that implies the 32-bit ARM could actually exceed the 64-bit one in the double-precision Linpack benchmark clock-for-clock
ID: 81084
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 81086 - Posted: 24 Jan 2017, 21:09:12 UTC

the main advantage of going 64-bit is then more to do with memory than with processing speed: it can easily address beyond the 4GB boundary. the downside is that a 64-bit app consumes more memory than the same 32-bit app
ID: 81086
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1866
Credit: 8,186,159
RAC: 6,319
Message 81087 - Posted: 25 Jan 2017, 10:17:02 UTC - in response to Message 81086.  

the main advantage of going 64-bit is then more to do with memory than with processing speed: it can easily address beyond the 4GB boundary. the downside is that a 64-bit app consumes more memory than the same 32-bit app


This is my idea. I'm thinking of RAM, to crunch bigger simulations (with, for example, the possibility to select a "big app" in the user's profile).
As for performance, the eternal wait for SSEx, AVX, FMA... :-)

ID: 81087




©2024 University of Washington
https://www.bakerlab.org