Message boards : Number crunching : Will there be a 64-bit client in the near future?
Author | Message |
---|---|
Christian L. Joined: 12 Aug 07 Posts: 3 Credit: 203,369 RAC: 0 |
The title says it all. |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
So may I take it that none of the ones on the Applications page will do? Can you explain your problem a bit more? |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
"So may i take it that none of them on the Applications page will do ??" Those are wrappers: they run as if the CPU were 32-bit and don't take advantage of its 64-bit capabilities. And to the OP: it seems the work required to make a 64-bit version outweighs the speed (TFLOPS) increase, if any, of a 64-bit build over the 32-bit one. |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
A ha, a native 64-bit client. There are some projects that have them, and as you say, if there is little to gain from using 64-bit, they will have more important things on their list. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
"A ha," The gain is small for Rosetta specifically. There are a bunch of threads explaining the 64-bit client question in more detail, but the bottom line is that they would rather work on other things than develop 64-bit clients. It's far more advantageous to develop a GPU client instead anyway. |
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
Well-written code shouldn't need much work, if any, to recompile with an x86_64 compiler. Just compile for multiple architectures (x86 and x86_64) from the same source. True, there are no 64-bit-specific enhancements, but you'll still get the 64-bit CPU feature set that the compiler can use. (It will probably help Windows clients more than Linux clients.) |
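To illustrate the "same source, two targets" point in the post above, here is a minimal C sketch. It is not taken from Rosetta; the function names `pointer_bits` and `is_x86_64_build` are made up for this example. It only shows how a single source file can report which target it was built for, using the `__x86_64__` (gcc/clang) and `_M_X64` (MSVC) predefined macros.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: width of a pointer, in bits, for whatever target
   this file was compiled for (32 with gcc -m32, 64 for a normal x86_64 build). */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* Hypothetical helper: 1 for a 64-bit x86 build, 0 otherwise.
   __x86_64__ is predefined by gcc and clang, _M_X64 by MSVC. */
int is_x86_64_build(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return 1;
#else
    return 0;
#endif
}
```

Building the file twice, once with `gcc -m32` and once without, should give two binaries from identical source that differ only in what these helpers report.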
ghost Joined: 16 Oct 16 Posts: 2 Credit: 6,402,194 RAC: 0 |
Just discovered this thread via a search. When I run Rosetta, the name shown in Task Manager is `minirosetta_3.73_windows_x86_64.exe (32 bits)`, which is kind of weird... |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2013 Credit: 9,821,437 RAC: 2,516 |
6 years ago..... |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
"When I run Rosetta, the name shows on task manager is `minirosetta_3.73_windows_x86_64.exe (32 bits)`, which is a kind of wired..." It's a 32-bit application sent as 64-bit to keep the 64-bit BOINC client happy. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
My guess would be that the code was compiled with a commercial compiler and linked against various optimised commercial libraries that are 32-bit and for which source code is not available. That in itself would limit the ability to go 64-bit, as 64-bit code cannot be linked against 32-bit libraries (it wouldn't make sense to do that anyway; it would possibly be just as 'slow', given that those libraries may be doing quite a bit of the math). However, on Intel x86_64 / AMD64 platforms, Win32 code runs just fine. There are lots of 32-bit apps around, and among other things they are possibly (much) less memory-hungry than 64-bit apps. The unfortunate downside is that 64-bit code tends to do double-precision math faster (possibly by 20-30%) than 32-bit code. The notion is that 64-bit code could execute various 64-bit FP instructions in, say, 1 clock cycle (or fewer clock cycles), while 32-bit code doing 64-bit (double-precision) FP might need perhaps double that. Just my 2c. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2013 Credit: 9,821,437 RAC: 2,516 |
"my guess would be that the codes could have been compiled using a commercial compiler linked against various optimised commercial libraries which is in 32 bits, in which source codes are not available. This in itself would limit the ability to go 64bit as 64 bits codes cannot be linked against 32 bit libraries" This may be an answer, but... are they using COMMERCIAL libraries? I thought they had THEIR OWN libraries and that they are using open compilers (like gcc). |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Actually, I support the notion that R@h should go 64-bit, especially on the Windows platform, as the binaries today, even as of 3.73, are still 32-bit (if I'm right about it). Aside from trying other 'esoteric' optimizations such as SSE/AVX/AVX2/FMA or even GPU, just going 64-bit would most likely bring immediate gains on Windows x86_64, especially for double-precision floating-point math. Some of the data processing that takes several 32-bit steps/instructions may be done in fewer steps, say in a single instruction, as double-precision variables fit nicely in 64 bits. This could accelerate the math in the most deeply nested loops, which do the bulk of the computation. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Got very curious about 32-bit vs 64-bit double-precision math, so I decided to do some tests. I got the plain old code for linpack and whetstone from here:
http://www.netlib.org/benchmark/
http://www.netlib.org/benchmark/linpackc.new
http://www.netlib.org/benchmark/whetstone.c

Compile them and run:
`gcc -o linpack32 -m32 -O2 linpack.c -lm`
`gcc -o linpack64 -O2 linpack.c -lm`

`./linpack32`
Enter array size (q to quit) [200]: 1000
Memory required: 7824K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
16 0.78 96.24% 0.58% 3.18% 3566560.131
32 1.56 96.22% 0.59% 3.19% 3559853.802
64 3.13 96.20% 0.63% 3.17% 3535126.088
128 6.24 96.22% 0.59% 3.19% 3549975.930
256 12.49 96.22% 0.59% 3.19% 3550935.104

`./linpack64`
Enter array size (q to quit) [200]: 1000
Memory required: 7824K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
16 0.76 95.14% 0.56% 4.30% 3704148.619
32 1.51 95.10% 0.57% 4.32% 3702957.305
64 3.04 95.12% 0.57% 4.32% 3694656.050
128 6.08 95.13% 0.56% 4.31% 3688275.421
256 12.21 95.08% 0.60% 4.32% 3674843.370

That works out to a gain of about 4% for linpack on a 1000x1000 matrix.

For whetstone, I made some changes to store the results in static variables to prevent GCC from optimizing the code away.
`gcc -o whetstone32 -m32 -O2 whetstone.c -lm`
`gcc -o whetstone64 -O2 whetstone.c -lm`

`./whetstone32 -c 100000`
Loops: 100000, Iterations: 1, Duration: 2 sec. C Converted Double Precision Whetstones: 5000.0 MIPS
Loops: 100000, Iterations: 1, Duration: 2 sec. C Converted Double Precision Whetstones: 5000.0 MIPS
Loops: 100000, Iterations: 1, Duration: 2 sec. C Converted Double Precision Whetstones: 5000.0 MIPS
Loops: 100000, Iterations: 1, Duration: 1 sec. C Converted Double Precision Whetstones: 10000.0 MIPS

`./whetstone64 -c 100000`
Loops: 100000, Iterations: 1, Duration: 3 sec. C Converted Double Precision Whetstones: 3333.3 MIPS
Loops: 100000, Iterations: 1, Duration: 3 sec. C Converted Double Precision Whetstones: 3333.3 MIPS
Loops: 100000, Iterations: 1, Duration: 2 sec. C Converted Double Precision Whetstones: 5000.0 MIPS
Loops: 100000, Iterations: 1, Duration: 3 sec. C Converted Double Precision Whetstones: 3333.3 MIPS

The results *defy intuition*: 32-bit is *faster* lol. The reason could be this: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/393951
"On Intel processors there are the following floating point instruction sets: FPU (8087 emulation), SSE and AVX. All three have access to an internal, very fast, internal floating point processor (engine). The FPU supports 4-byte, 8-byte, and 10-byte floating point formats as single elements (scalars). The SSE and AVX support 4-byte, 8-byte floating point formats as scalars (single variable) or small vectors (2 or more elements). Ignoring the multiple element formats in SSE and AVX, the latency of a floating point multiply is on the order of 4 clock cycles (this will extend for memory references). Throughput can be as little as 1-2 clock cycles."

Storing data from CPU registers to memory 'killed' the whetstone benchmark: it makes the benchmark dependent on the bottleneck of moving data to memory, so it no longer measures just floating-point calculation speed. It is stalled by memory moves. But without the code that stores to variables (i.e. RAM), GCC is too 'smart' and would *optimise away (remove)* the calculations simply because the results are *not used anywhere*.

Hence the double-precision speedup or slowdown between 32-bit and 64-bit could depend on the processor and manufacturer, e.g. AMD vs Intel. I'd guess the differences may even be noticeable between different CPU releases / generations.

This has *a lot* of implications considering that BOINC awards points based on this very *legacy* whetstone benchmark. It implies the whetstone benchmark measures CPU-memory bottlenecks rather than computational prowess. As in, the GFLOPS figure is not how fast the CPU calculates; rather it is how fast those numbers can be moved between CPU registers and memory lol |
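The trade-off in the post above (store every result and measure memory traffic, or store nothing and let the compiler delete the work) is often handled by keeping the running result in a local variable and writing it to a `volatile` sink just once at the end. The sketch below shows that general technique only; it is not the change the poster made to whetstone.c, and `run_kernel` and `benchmark_sink` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Written once per benchmark run. Because it is volatile, the compiler must
   perform the store, so the loop that feeds it cannot be removed as dead code. */
volatile double benchmark_sink;

/* Toy kernel (an assumption for illustration): sums k * 0.5 for k = 1..n.
   The accumulator is a local, so it can stay in a register for the whole loop. */
double run_kernel(size_t n)
{
    double acc = 0.0;
    for (size_t k = 1; k <= n; ++k) {
        acc += (double)k * 0.5;
    }
    benchmark_sink = acc;   /* single observable store after the loop */
    return acc;
}
```

With this shape the timed loop does one memory write in total instead of one per iteration, so the measurement leans less on register-to-memory traffic. How much that changes the 32-bit vs 64-bit comparison would still have to be measured.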
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
As it turns out, floating point on x86 is as ancient as the Intel x87 *FPU*, and the x87 always works in 80 bits. It treats single precision and double precision the same way. Many recent *GPUs* are different: there is a drastic difference between what they can handle in single precision and in double precision. For GPUs the single-precision : double-precision speed ratio can be as much as 32:1, e.g. https://en.wikipedia.org/wiki/GeForce_10_series, while on the Intel x87 *FPU* everything is 80 bits; there is no 'single precision', everything is 'extended double precision' on the x87, all the way to the newest Intel/AMD CPUs (Haswell, Skylake, etc.). See http://home.agh.edu.pl/~amrozek/x87.pdf, section 8.2, "X87 FPU DATA TYPES". The magic of floating point on x86, including x86_64, is all in the *FPU*: everything is 80 bits. This could explain why 32-bit vs 64-bit code didn't matter. In fact the 32-bit code is 'faster', which is likely an artifact / accident of the CPU cache: the 32-bit whetstone benchmark's instructions, and possibly its various local data (variables), may 'live' completely inside the CPU cache due to the smaller code size and data layout. Hence this may explain, to an extent, why the 32-bit whetstone code 'runs faster' and gives more GFLOPS. If this is true, you could possibly run the 32-bit version of the BOINC client (even on a 64-bit OS, including Windows) and get higher GFLOPS on the BOINC whetstone benchmark and possibly higher credits as well, not just on R@h but on everything under BOINC lol |
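One caveat worth checking before leaning on the "everything is 80 bits" explanation: by default gcc targets the x87 unit for `-m32` builds but SSE2 for x86_64 builds, so the two binaries may not be using the same floating-point hardware path at all. The C99 macro `FLT_EVAL_METHOD` from `<float.h>` reports which evaluation mode a given build uses (typically 2 for x87, 0 for SSE2). The helper names below are hypothetical and the sketch only decodes that macro.

```c
#include <assert.h>
#include <float.h>
#include <string.h>

/* Hypothetical helper: turn a FLT_EVAL_METHOD value into a description.
   The meanings of 0, 1 and 2 are fixed by C99; other values are
   indeterminate (-1) or implementation-defined. */
const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:  return "each type evaluated in its own precision";
    case 1:  return "float and double evaluated as double";
    case 2:  return "everything evaluated as long double";
    default: return "indeterminate or implementation-defined";
    }
}

/* Description for the build this file was compiled into. */
const char *this_build_eval_method(void)
{
    return describe_eval_method(FLT_EVAL_METHOD);
}
```

Printing `this_build_eval_method()` from both the `-m32` and the 64-bit binary would show whether the comparison is really x87 vs x87 or x87 vs SSE2.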
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2013 Credit: 9,821,437 RAC: 2,516 |
64-bit is, obviously, the future. All OSes are abandoning the 32-bit architecture (including Android). It's just a matter of time. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2013 Credit: 9,821,437 RAC: 2,516 |
"some of the data processing which takes several 32 bits steps/instructions may simply be done in less steps say in a single instruction as double precision variables fit nicely in 64 bits. this could benefit / accelerate math / codes in those deepest nested parts of the loops which does bulk of computations" I think it's not only about accelerating the math. The admins said, in the past, that they simulate "little" proteins because memory usage is high. With 64 bits that problem is solved. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
"some of the data processing which takes several 32 bits steps/instructions may simply be done in less steps say in a single instruction as double precision variables fit nicely in 64 bits. this could benefit / accelerate math / codes in those deepest nested parts of the loops which does bulk of computations" Yup, that's quite true: with 64 bits it is *easier* to access more than 4 GB of memory, while with 32 bits a somewhat more 'complicated' setup, PAE, is needed for that: https://en.wikipedia.org/wiki/Physical_Address_Extension 32-bit does have an advantage in that in various scenarios it is less 'wasteful' of memory, though. E.g. pointers (and `long` integers on 64-bit Linux) take 4 bytes in a 32-bit build but 8 bytes in a 64-bit build; multiply that by a million items in an array and it becomes 8 MB vs 4 MB. But of course memory is sort of 'cheap' these days lol. If I'm not wrong, one of the big advantages of going 64-bit is that in 32-bit mode there are 8 AVX SIMD registers available, while 64-bit mode makes that 16, which makes it much more flexible (and possibly faster) for vectorised SIMD code that runs on AVX: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions |
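The "8 megs vs 4 megs" arithmetic in the post above can be checked with a few lines of C. Which types actually grow depends on the data model: plain `int` normally stays 4 bytes in both builds, while pointers (and `long` on 64-bit Linux) double in size. The helper names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for `count` elements of `elem_size` bytes. */
size_t array_bytes(size_t count, size_t elem_size)
{
    /* Guard against overflow of the multiplication. */
    assert(elem_size == 0 || count <= SIZE_MAX / elem_size);
    return count * elem_size;
}

/* Footprint of a million-entry table of pointers on the current target:
   4,000,000 bytes when built with -m32, 8,000,000 bytes on x86_64. */
size_t million_pointers_bytes(void)
{
    return array_bytes(1000000u, sizeof(void *));
}
```

An array of a million `int32_t` values costs the same 4,000,000 bytes in either build; it is pointer-heavy structures that pay the 64-bit size penalty.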
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Repeated the previous linpack test (http://www.netlib.org/benchmark/linpackc.new), but this time with compiler optimizations for AVX/AVX2/FMA turned on:
`gcc -o linpack32 -m32 -O3 -mavx -mavx2 -mfma linpack.c -lm`
`gcc -o linpack64 -O3 -mavx -mavx2 -mfma linpack.c -lm`

`./linpack32`
Enter array size (q to quit) [200]: 1000
Memory required: 7824K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
16 0.65 92.87% 0.52% 6.61% 4408256.717
32 1.30 92.87% 0.53% 6.60% 4407119.733
64 2.61 92.87% 0.53% 6.60% 4408167.983
128 5.21 92.90% 0.53% 6.57% 4407866.492
256 10.42 92.90% 0.53% 6.58% 4408174.773

`./linpack64`
Enter array size (q to quit) [200]: 1000
Memory required: 7824K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
16 0.58 94.71% 0.53% 4.76% 4823499.938
32 1.17 94.71% 0.52% 4.76% 4823708.094
64 2.34 94.71% 0.52% 4.76% 4823911.930
128 4.67 94.71% 0.52% 4.76% 4825073.475
256 9.31 94.70% 0.52% 4.78% 4840441.868
512 18.66 94.71% 0.52% 4.77% 4829583.147

Note that these are all single-core figures.

This time round the 64-bit linpack is almost 10 (9.6) percent faster than the similarly optimised (AVX) 32-bit linpack. (This is not yet the fastest possible; it is simply compiler-optimised.) And if the 64-bit AVX/AVX2/FMA linpack is compared to the original 32-bit linpack (without the AVX/AVX2/FMA optimizations), it is a much better 35.4 percent improvement. Note that the original 32-bit app was built with -O2, which also means optimised code.

64-bit + AVX/AVX2/FMA optimizations is a clear winner here.

If pushed (highly optimised) to the metal, the now 'old' Haswell i7-4770K CPU manages a whopping 177 GFLOPS of multi-core vectorized double-precision floating-point performance: https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493

As most GPUs have rather poor *double-precision* floating-point performance, that makes today's Intel i7 Haswell/Skylake/Kaby Lake comparable to the higher-performance (expensive) GPUs sold on the market today in terms of double-precision floating-point performance. And for that matter, the Intel i7 Haswell/Skylake/Kaby Lake use considerably less power, giving a much better performance-per-watt (energy efficiency) score compared to the GPUs. |
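The percentages quoted in the post above come from ratios of the KFLOPS columns. A small C helper makes the arithmetic explicit; `speedup_percent` is a hypothetical name, and which rows to compare is a choice left to the reader.

```c
#include <assert.h>

/* Hypothetical helper: percentage by which `candidate_kflops` exceeds
   `baseline_kflops`. A result of 10.0 means "10 percent faster". */
double speedup_percent(double baseline_kflops, double candidate_kflops)
{
    assert(baseline_kflops > 0.0);
    return (candidate_kflops / baseline_kflops - 1.0) * 100.0;
}
```

Fed the 16-rep rows from the runs above, `speedup_percent(4408256.717, 4823499.938)` comes out near 9.4 and `speedup_percent(3566560.131, 4823499.938)` near 35.2, close to the post's figures; other rows shift the result by a few tenths of a percent.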
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2013 Credit: 9,821,437 RAC: 2,516 |
"as most of the GPUs has rather poor *double precision* floating point performance, that makes today's intel i7 haswell/skylake/kabilake comparable to the higher performance (expensive) GPUs sold on the market today in terms of double precision floating point performance." DP is for HPC GPUs like Nvidia Tesla or AMD FirePro. Home GPUs are great at SP. But these are only academic discussions (very interesting, but theoretical); the fact is that this thread was opened 6 years ago!! Do we have to wait another 6 years to get some results? |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,229,863 RAC: 691 |
"some of the data processing which takes several 32 bits steps/instructions may simply be done in less steps say in a single instruction as double precision variables fit nicely in 64 bits. this could benefit / accelerate math / codes in those deepest nested parts of the loops which does bulk of computations" There is little if any code that would currently benefit from 64-bit integers. I originally thought that the larger number of registers available in 64-bit mode would help, but the increased code and data size of 64-bit code did more damage to the caches than the register SPILLs/FILLs necessary in 32-bit mode (caused by having fewer registers). That is what I measured when I actually recompiled the code as 32-bit AND 64-bit. Rosetta spends a large chunk of its time computing "relationships" between 2 points in 3 dimensions (using floating-point math): it makes an X-dimension 64-bit floating-point calculation, then a Y-dimension one, then a Z-dimension one. You can change the TYPE DEFINITION of that "point" description to add a 4th "dummy" dimension, which allows the compiler to do a SIMD vector load of all 4 dimensions, perform the operation on all 4 dimensions, and then do a SIMD vector store. The compiler will change the 3 sequential SCALAR operations on 3 dimensions into a SINGLE PARALLEL operation on 4 dimensions. If you add the 4th dimension in the TYPEDEF, you do not need to make ANY other source code changes for the compilers to automatically generate the low-level code to perform the parallel LOAD-OPERATION-STORE. VERY low-hanging fruit. The Rosetta developers said they were already "familiar" with this technique when I pointed it out last year. It WOULD be their first, easy step to take if "low" performance was a problem for them. 32-bit integer versus 64-bit integer code really makes no difference unless the Rosetta code undergoes major changes. |
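The padded-point idea in the post above can be sketched in C as follows. This is an illustration only: Rosetta's real point type is not shown in the thread, so `point4`, `point4_sub` and `point4_dist2` are hypothetical names, and whether a given compiler actually emits one 256-bit load/subtract/store for the loop depends on flags such as `-O3 -mavx` and has to be confirmed in the generated assembly.

```c
#include <assert.h>

/* Hypothetical point type: x, y, z plus one unused lane, so the struct is
   exactly four doubles (32 bytes), the width of one 256-bit AVX register. */
typedef struct {
    double v[4];   /* v[0]=x, v[1]=y, v[2]=z, v[3]=padding, kept at 0 */
} point4;

/* Component-wise difference a - b over all four lanes. The fixed trip count
   of 4 gives the auto-vectorizer a loop it can map onto one vector operation. */
point4 point4_sub(point4 a, point4 b)
{
    point4 r;
    for (int i = 0; i < 4; ++i) {
        r.v[i] = a.v[i] - b.v[i];
    }
    return r;
}

/* Squared distance between two points. The padding lane adds nothing
   as long as it is kept at 0 in both inputs. */
double point4_dist2(point4 a, point4 b)
{
    point4 d = point4_sub(a, b);
    double sum = 0.0;
    for (int i = 0; i < 4; ++i) {
        sum += d.v[i] * d.v[i];
    }
    return sum;
}
```

The cost of the design is one extra double per point (a third more memory for coordinates), which is the same cache-footprint trade-off the post measured for 64-bit code in general.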
©2025 University of Washington
https://www.bakerlab.org