Will there be a 64-bit client in the near future?

Christian L.
Joined: 12 Aug 07
Posts: 3
Credit: 203,369
RAC: 0
Message 65192 - Posted: 3 Feb 2010, 19:59:45 UTC

The title says it all.
ID: 65192 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 1
Message 65193 - Posted: 3 Feb 2010, 21:40:47 UTC

So may I take it that none of the ones on the Applications page will do?
Can you explain your problem a bit more?
ID: 65193 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 65214 - Posted: 5 Feb 2010, 18:57:34 UTC - in response to Message 65193.  

So may I take it that none of the ones on the Applications page will do?
Can you explain your problem a bit more?


Those are wrappers: they "run" as if the CPU were 32-bit and don't take advantage of the 64-bit capabilities.

And to the OP: it seems the work required to make a 64-bit version outweighs the "speed" or TFLOPS increase, if any, of the 64-bit version over the 32-bit one.
ID: 65214 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 1
Message 65217 - Posted: 5 Feb 2010, 21:31:29 UTC

Aha, a native 64-bit client.
There are some projects that have them,
and as you say, if there is little to gain from using 64-bit,
they will have more important things to do on their list.
ID: 65217 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 65228 - Posted: 7 Feb 2010, 23:59:40 UTC - in response to Message 65217.  

Aha, a native 64-bit client.
There are some projects that have them,
and as you say, if there is little to gain from using 64-bit,
they will have more important things to do on their list.


There is little gain for Rosetta specifically. There are a bunch of threads explaining 64-bit clients in more detail.

But the bottom line is that they would rather work on other things than develop 64-bit clients. It's way more advantageous to develop a GPU client instead anyway.
ID: 65228 · Rating: 0
DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,235,310
RAC: 11
Message 65234 - Posted: 8 Feb 2010, 19:36:10 UTC

Well-written code shouldn't need much work, if any, to re-compile with an x86_64 compiler. Just compile for multiple architectures (x86 and x86_64) from the same source. True, there are no 64-bit-specific enhancements, but you'll still use the 64-bit CPU feature set available from the compiler. (It will probably help Windows clients more than Linux clients.)
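A minimal sketch of what "same source, two targets" looks like in C. The helper names are made up for illustration, and the predefined macros are the ones GCC/Clang (and MSVC, for the `_M_` variants) normally set; other compilers may differ.

```c
#include <limits.h>

/* Pointer width, in bits, of whatever target this file was compiled for.
   Built with `gcc -m32` this gives 32; built for x86_64 it gives 64. */
int pointer_bits(void) {
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* Hypothetical helper: names the target using compiler-predefined macros. */
const char *target_name(void) {
#if defined(__x86_64__) || defined(_M_X64)
    return "x86_64";
#elif defined(__i386__) || defined(_M_IX86)
    return "x86";
#else
    return "other";
#endif
}
```

Nothing in the source changes between the two builds; only the compiler's target flag does, which is the point being made above.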
ID: 65234 · Rating: 0
ghost
Joined: 16 Oct 16
Posts: 2
Credit: 6,402,194
RAC: 0
Message 80747 - Posted: 17 Oct 2016, 7:03:49 UTC

Just discovered this thread via a search.

When I run Rosetta, the name shown in Task Manager is `minirosetta_3.73_windows_x86_64.exe (32 bits)`, which is kind of weird...
ID: 80747 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1858
Credit: 8,129,799
RAC: 7,872
Message 80748 - Posted: 17 Oct 2016, 18:31:34 UTC

6 years ago.....
ID: 80748 · Rating: 0
Link
Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 80756 - Posted: 19 Oct 2016, 7:49:35 UTC - in response to Message 80747.  

When I run Rosetta, the name shown in Task Manager is `minirosetta_3.73_windows_x86_64.exe (32 bits)`, which is kind of weird...

It's a 32-bit application sent as 64-bit to keep the 64-bit BOINC client happy...
ID: 80756 · Rating: 0
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80761 - Posted: 20 Oct 2016, 7:06:18 UTC
Last modified: 20 Oct 2016, 7:14:03 UTC

my guess would be that the code was compiled with a commercial compiler and linked against various optimised commercial libraries that are 32-bit only and for which source code is not available. this in itself would limit the ability to go 64-bit, as 64-bit code cannot be linked against 32-bit libraries (and it wouldn't make sense to do that anyway: it would possibly be just as 'slow', given that those libraries may be doing quite a bit of the math)

however, on intel (x86_64) / amd64 platforms, 32-bit windows code runs just fine. there are lots of 32-bit apps around and, among other things, they are possibly (much) less memory hungry than 64-bit apps.

the unfortunate downside of staying 32-bit is that 64-bit code tends to do double precision maths faster (possibly, say, 20-30%) than 32-bit code. the notion is that 64-bit code could execute various 64-bit fp instructions in, say, 1 clock cycle (or fewer clock cycles), whereas 32-bit code doing 64-bit (double precision) fp could perhaps take double the clock cycles

just 2c
ID: 80761 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1858
Credit: 8,129,799
RAC: 7,872
Message 80767 - Posted: 21 Oct 2016, 8:31:16 UTC - in response to Message 80761.  

my guess would be that the code was compiled with a commercial compiler and linked against various optimised commercial libraries that are 32-bit only and for which source code is not available. this in itself would limit the ability to go 64-bit, as 64-bit code cannot be linked against 32-bit libraries


This may be an answer, but...
Are they using COMMERCIAL libraries? I thought they had THEIR OWN libraries,
and that they were using open compilers (like gcc).

ID: 80767 · Rating: 0
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80817 - Posted: 1 Nov 2016, 8:33:09 UTC
Last modified: 1 Nov 2016, 8:39:07 UTC

actually i support the notion that r@h should go 64-bit, esp. on the windows platform, as the binaries today, even as of 3.73, are still 32-bit (if i'm right about it).

this is aside from trying other 'esoteric' optimizations such as SSE/AVX/AVX2/FMA or even GPU: just going 64-bit would most likely see immediate gains on windows x86_64 platforms, especially for double precision floating point maths. some of the data processing which takes several 32-bit steps/instructions may simply be done in fewer steps, say in a single instruction, as double precision variables fit nicely in 64 bits. this could benefit / accelerate the math in the most deeply nested parts of the loops, which do the bulk of the computations
ID: 80817 · Rating: 0
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80819 - Posted: 1 Nov 2016, 10:09:50 UTC
Last modified: 1 Nov 2016, 11:09:29 UTC

got very curious about 32 bits vs 64 bits double precision maths so decided to do some tests:

got the plain old codes for linpack & whetstone from here
http://www.netlib.org/benchmark/
http://www.netlib.org/benchmark/linpackc.new
http://www.netlib.org/benchmark/whetstone.c

compile them and run
> gcc -o linpack32 -m32 -O2 linpack.c -lm
> gcc -o linpack64  -O2 linpack.c -lm
> ./linpack32 
Enter array size (q to quit) [200]:  1000
Memory required:  7824K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      16   0.78  96.24%   0.58%   3.18%  3566560.131
      32   1.56  96.22%   0.59%   3.19%  3559853.802
      64   3.13  96.20%   0.63%   3.17%  3535126.088
     128   6.24  96.22%   0.59%   3.19%  3549975.930
     256  12.49  96.22%   0.59%   3.19%  3550935.104

> ./linpack64 
Enter array size (q to quit) [200]:  1000
Memory required:  7824K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      16   0.76  95.14%   0.56%   4.30%  3704148.619
      32   1.51  95.10%   0.57%   4.32%  3702957.305
      64   3.04  95.12%   0.57%   4.32%  3694656.050
     128   6.08  95.13%   0.56%   4.31%  3688275.421
     256  12.21  95.08%   0.60%   4.32%  3674843.370



that works out to a gain of about 4% for linpack (roughly 3.69 vs 3.55 GFLOPS) for a 1000x1000 matrix
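The gain can be checked directly from the KFLOPS columns of the two runs above. A small sketch, averaging each column and taking the ratio (the function names are made up for illustration); with these numbers it comes out at roughly 4%.

```c
#include <stddef.h>

/* KFLOPS columns copied from the two -O2 runs above. */
static const double kflops32[] = {3566560.131, 3559853.802, 3535126.088,
                                  3549975.930, 3550935.104};
static const double kflops64[] = {3704148.619, 3702957.305, 3694656.050,
                                  3688275.421, 3674843.370};

/* Arithmetic mean of n samples. */
double mean(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum / (double)n;
}

/* Relative gain of `improved` over `baseline`, in percent. */
double percent_gain(double baseline, double improved) {
    return (improved / baseline - 1.0) * 100.0;
}

/* 64-bit build versus 32-bit build, using the averaged KFLOPS figures. */
double linpack_gain_64_over_32(void) {
    return percent_gain(mean(kflops32, 5), mean(kflops64, 5));
}
```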

for whetstone, i made some changes to store the results in static variables to prevent GCC from optimizing the code away.

> gcc -o whetstone32 -m32 -O2 whetstone.c -lm
> gcc -o whetstone64 -O2 whetstone.c -lm
> ./whetstone32 -c 100000

Loops: 100000, Iterations: 1, Duration: 2 sec.
C Converted Double Precision Whetstones: 5000.0 MIPS

Loops: 100000, Iterations: 1, Duration: 2 sec.
C Converted Double Precision Whetstones: 5000.0 MIPS

Loops: 100000, Iterations: 1, Duration: 2 sec.
C Converted Double Precision Whetstones: 5000.0 MIPS

Loops: 100000, Iterations: 1, Duration: 1 sec.
C Converted Double Precision Whetstones: 10000.0 MIPS

> ./whetstone64 -c 100000

Loops: 100000, Iterations: 1, Duration: 3 sec.
C Converted Double Precision Whetstones: 3333.3 MIPS

Loops: 100000, Iterations: 1, Duration: 3 sec.
C Converted Double Precision Whetstones: 3333.3 MIPS

Loops: 100000, Iterations: 1, Duration: 2 sec.
C Converted Double Precision Whetstones: 5000.0 MIPS

Loops: 100000, Iterations: 1, Duration: 3 sec.
C Converted Double Precision Whetstones: 3333.3 MIPS


the results *defy intuition*, 32 bits is *faster* lol

the reason could be this:
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/393951
On Intel processors there are the following floating point instruction sets: FPU (8087 emulation), SSE and AVX. All three have access to an internal, very fast, internal floating point processor (engine). The FPU supports 4-byte, 8-byte, and 10-byte floating point formats as single elements (scalars). The SSE and AVX support 4-byte, 8-byte floating point formats as scalars (single variable) or small vectors (2 or more elements). Ignoring the multiple element formats in SSE and AVX, the latency of a floating point multiply is on the order of 4 clock cycles (this will extend for memory references). Throughput can be as little as 1-2 clock cycles.

When the problem involves a large degree of RAM reads and writes, the program is waiting for the memory as opposed to waiting for the floating point operations.

Note, when small vectors can be used, the computation time can be significantly reduced (1/2, 1/4, 1/8) memory subsystem overhead can be reduced per floating operation, but the demands on memory subsystem may increase.

Jim Dempsey


storing data from cpu registers to memory 'killed' the whetstone benchmark: it makes the benchmark dependent on the bottleneck of moving data to memory, so it no longer simply measures floating point calculation speed. it is stalled by memory moves.
but without the code that stores to variables (i.e. ram), GCC is too 'smart' & would *optimise away (remove)* the calculations simply because the results are *not used anywhere*
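One common way around this trade-off is to keep the running value in a local variable (so it can stay in a register) and write it once, at the end, to a `volatile` object. The single volatile store counts as an observable side effect, so the compiler has to keep the loop that produced the value. This is only a sketch with a made-up kernel, not the change that was made to whetstone.c here.

```c
/* A store to a volatile object must actually happen, so whatever
   computes the stored value cannot be discarded as dead code. */
static volatile double sink;

/* Hypothetical stand-in for one benchmark module: a multiply-add loop
   whose running value lives in a local variable. */
double fp_kernel(long loops) {
    double x = 2.0;
    for (long i = 0; i < loops; i++) {
        x = x * 0.999999 + 0.000001;
    }
    return x;
}

/* One store after the loop, rather than one store per iteration. */
void run_kernel(long loops) {
    sink = fp_kernel(loops);
}
```

Timing `run_kernel` then measures mostly the arithmetic, since memory is touched once per call instead of once per operation.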

hence the double precision speedup or slowdown between 32bits vs 64bits could be dependent on processors and manufacturers e.g. amd vs intel
i'd guess the differences may even be noticed between different cpu releases / generations

this has *a lot* of implications considering that boinc awards points based on this very *legacy* whetstone benchmark. it implies the whetstone benchmark measures cpu-memory bottlenecks rather than computational prowess, as in the gflops figure is not how fast the cpu calculates but rather how fast those numbers can be moved between cpu registers & memory lol
ID: 80819 · Rating: 0
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80820 - Posted: 1 Nov 2016, 15:45:34 UTC
Last modified: 1 Nov 2016, 16:36:50 UTC

as it turns out, floating point on x86 is as ancient as the Intel x87 *FPU*.
internally it is always 80 bits: it treats single precision & double precision the same way. this is unlike many recent *GPUs*, where there is a drastic difference between single precision and double precision throughput. for GPUs the single precision : double precision speed ratio can be as much as 32:1
e.g. https://en.wikipedia.org/wiki/GeForce_10_series

while on the Intel x87 *FPU* everything is 80 bits: there is no 'single precision' in the registers, everything is 'double extended precision' on x87, all the way to the newest Intel/AMD cpus (haswell / skylake / etc.)
http://home.agh.edu.pl/~amrozek/x87.pdf
8.2. X87 FPU DATA TYPES
The x87 FPU recognizes and operates on the following seven data types :single-precision floating point, double-precision floating point, double extended-precision floating point, signed word integer, signed doubleword integer, signed quadword integer, and packed BCD decimal integers.

With the exception of the 80-bit double extended-precision floating-point format, all of these data types exist in memory only. When they are loaded into x87 FPU data registers, they are converted into double extended-precision floating-point format and operated on in that format.


the magic of floating point on x86, including x86_64, is all in the *FPU*, where everything is 80 bits. this could explain why 32-bit vs 64-bit code didn't matter. in fact the 32-bit code is 'faster', which is likely an artifact / accident of the cpu cache, i.e. the 32-bit whetstone benchmark's instructions and possibly its various local data (variables) may completely 'live inside the cpu cache' due to the smaller code size and data layout.
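Whether a particular build really goes through the 80-bit x87 path depends on the code the compiler generates. With GCC, `-m32` builds normally default to x87 math, while x86_64 builds normally use SSE2 scalar instructions, which work on 64-bit doubles directly. C99's `FLT_EVAL_METHOD` macro in `<float.h>` reports which evaluation mode a build uses, so compiling a sketch like this both ways is one way to check. The helper names are made up for illustration.

```c
#include <float.h>

/* Describes a value of C99's FLT_EVAL_METHOD. */
const char *eval_method_name(int m) {
    switch (m) {
    case 0:
        return "each operation evaluated in its own type (typical of SSE2 code)";
    case 1:
        return "float and double evaluated as double";
    case 2:
        return "everything evaluated as long double (typical of x87 code)";
    default:
        return "indeterminable or implementation-defined";
    }
}

/* The evaluation mode of this particular build. */
const char *this_build_eval_method(void) {
    return eval_method_name(FLT_EVAL_METHOD);
}

/* Mantissa bits of long double: 64 for the x87 80-bit format. */
int long_double_mantissa_bits(void) {
    return LDBL_MANT_DIG;
}
```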

hence, this may explain to an extent why the 32-bit whetstone code 'runs faster', giving more gflops.

if this is true, you could possibly run the 32-bit version of the boinc client (even on a 64-bit os, including windows) & get higher gflops on the boinc whetstone benchmark and possibly higher credits as well, not just on r@h but on everything under boinc lol
ID: 80820 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1858
Credit: 8,129,799
RAC: 7,872
Message 80823 - Posted: 4 Nov 2016, 8:49:18 UTC
Last modified: 4 Nov 2016, 8:57:09 UTC

64-bit, obviously, is the future.
All OSes are abandoning the 32-bit architecture (including Android).
It's just a matter of time.
ID: 80823 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1858
Credit: 8,129,799
RAC: 7,872
Message 80824 - Posted: 4 Nov 2016, 8:53:33 UTC - in response to Message 80817.  

some of the data processing which takes several 32-bit steps/instructions may simply be done in fewer steps, say in a single instruction, as double precision variables fit nicely in 64 bits. this could benefit / accelerate the math in the most deeply nested parts of the loops, which do the bulk of the computations


I think it's not only "accelerate math".
Admins said, in the past, that they simulate "little" proteins because the memory usage is high. With 64 bits the problem is solved.

ID: 80824 · Rating: 0
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80825 - Posted: 4 Nov 2016, 19:11:21 UTC - in response to Message 80824.  
Last modified: 4 Nov 2016, 19:25:20 UTC

some of the data processing which takes several 32-bit steps/instructions may simply be done in fewer steps, say in a single instruction, as double precision variables fit nicely in 64 bits. this could benefit / accelerate the math in the most deeply nested parts of the loops, which do the bulk of the computations


I think it's not only "accelerate math".
Admins said, in the past, that they simulate "little" proteins because the memory usage is high. With 64 bits the problem is solved.


yup that's quite true, with 64 bits it is *easier* to access more than 4GB of memory. with 32 bits, a somewhat more 'complicated' setup (PAE) is needed for that, and even then a single 32-bit process is still limited to a 4GB address space.
https://en.wikipedia.org/wiki/Physical_Address_Extension

32-bit does have an advantage in that in various scenarios it is less 'wasteful' of memory, e.g. pointers and pointer-sized integers take 4 bytes in 32-bit code but 8 bytes in 64-bit code; multiply that by a million items in an array and that becomes 8 megs vs 4 megs. but of course memory is sort of 'cheap' these days lol
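How much the footprint actually grows depends on which types the code uses. In C, `int` is typically 4 bytes in both 32-bit and 64-bit builds; it is pointers, `size_t` and (on 64-bit Linux, but not 64-bit Windows) `long` that go from 4 to 8 bytes. A small sketch of the million-element arithmetic, with made-up helper names:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for an array of `count` elements of `elem_size` bytes each. */
size_t array_bytes(size_t count, size_t elem_size) {
    return count * elem_size;
}

/* One million fixed-width 32-bit integers: the same size in any build. */
size_t million_int32_bytes(void) {
    return array_bytes(1000000u, sizeof(int32_t));
}

/* One million pointers: 4 MB in a 32-bit build, 8 MB in a 64-bit build. */
size_t million_pointer_bytes(void) {
    return array_bytes(1000000u, sizeof(void *));
}
```

So an array of plain 32-bit integers costs the same either way, while pointer-heavy structures (linked lists, trees, arrays of pointers) are where the doubling shows up.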

if i'm not wrong, one of the big advantages of going 64-bit is that in 32-bit mode there are 8 AVX SIMD registers available, while 64-bit mode makes that 16, which makes things much more flexible (and possibly faster) for vectorised SIMD code that runs on AVX
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
ID: 80825 · Rating: 0
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80826 - Posted: 4 Nov 2016, 19:46:25 UTC
Last modified: 4 Nov 2016, 20:01:40 UTC

repeated the previous linpack test
http://www.netlib.org/benchmark/linpackc.new

but this time turn on compiler optimizations with AVX/AVX2/FMA
> gcc -o linpack32 -m32 -O3 -mavx -mavx2 -mfma linpack.c -lm
> gcc -o linpack64 -O3 -mavx -mavx2 -mfma linpack.c -lm         
> ./linpack32
Enter array size (q to quit) [200]:  1000
Memory required:  7824K.

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      16   0.65  92.87%   0.52%   6.61%  4408256.717
      32   1.30  92.87%   0.53%   6.60%  4407119.733
      64   2.61  92.87%   0.53%   6.60%  4408167.983
     128   5.21  92.90%   0.53%   6.57%  4407866.492
     256  10.42  92.90%   0.53%   6.58%  4408174.773

> ./linpack64
Enter array size (q to quit) [200]:  1000
Memory required:  7824K.

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      16   0.58  94.71%   0.53%   4.76%  4823499.938
      32   1.17  94.71%   0.52%   4.76%  4823708.094
      64   2.34  94.71%   0.52%   4.76%  4823911.930
     128   4.67  94.71%   0.52%   4.76%  4825073.475
     256   9.31  94.70%   0.52%   4.78%  4840441.868
     512  18.66  94.71%   0.52%   4.77%  4829583.147

note all single core figures


this time round the 64bits linpack is almost 10 (9.6) percent faster than the 32bits similarly optimised (AVX) linpack performance. (this is not yet the fastest, it is simply compiler optimised)

and if the 64-bit AVX/AVX2/FMA linpack is compared to the original 32-bit linpack (without the AVX/AVX2/FMA optimizations), it is a much better 35.4 percent improvement. note that the original 32-bit app was built with -O2, so it is also optimised code.

64 bit + AVX/AVX2/FMA optimizations is a clear winner here

if pushed (highly optimised) to the metal, the now 'old' Haswell i7-4770k cpu manages a whopping 177 Gflops multi core vectorized double precision floating point performance.
https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493

as most GPUs have rather poor *double precision* floating point performance, that makes today's intel i7 haswell/skylake/kaby lake comparable to the higher performance (expensive) GPUs sold on the market today in terms of double precision floating point performance.

And for that matter the intel i7 haswell/skylake/kaby lake use considerably less power, giving a much better performance per watt (energy efficiency) score compared to the GPUs
ID: 80826 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1858
Credit: 8,129,799
RAC: 7,872
Message 80827 - Posted: 7 Nov 2016, 9:32:49 UTC - in response to Message 80826.  

as most GPUs have rather poor *double precision* floating point performance, that makes today's intel i7 haswell/skylake/kaby lake comparable to the higher performance (expensive) GPUs sold on the market today in terms of double precision floating point performance.


DP is for HPC GPUs like Nvidia Tesla or AMD FirePro. Home GPUs are great with SP.
But these are only academic discussions (very interesting, but theoretical); the fact is that this thread was opened 6 years ago!!
Do we have to wait another 6 years to get some results?

ID: 80827 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,289,542
RAC: 13,039
Message 80836 - Posted: 11 Nov 2016, 14:57:05 UTC - in response to Message 80825.  

some of the data processing which takes several 32-bit steps/instructions may simply be done in fewer steps, say in a single instruction, as double precision variables fit nicely in 64 bits. this could benefit / accelerate the math in the most deeply nested parts of the loops, which do the bulk of the computations


I think it's not only "accelerate math".
Admins said, in the past, that they simulate "little" proteins because the memory usage is high. With 64 bits the problem is solved.


yup that's quite true, with 64 bits it is *easier* to access more than 4GB of memory. with 32 bits, a somewhat more 'complicated' setup (PAE) is needed for that, and even then a single 32-bit process is still limited to a 4GB address space.
https://en.wikipedia.org/wiki/Physical_Address_Extension

32-bit does have an advantage in that in various scenarios it is less 'wasteful' of memory, e.g. pointers and pointer-sized integers take 4 bytes in 32-bit code but 8 bytes in 64-bit code; multiply that by a million items in an array and that becomes 8 megs vs 4 megs. but of course memory is sort of 'cheap' these days lol

if i'm not wrong, one of the big advantages of going 64-bit is that in 32-bit mode there are 8 AVX SIMD registers available, while 64-bit mode makes that 16, which makes things much more flexible (and possibly faster) for vectorised SIMD code that runs on AVX
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions


There is little if any code that would currently benefit from 64-bit integers. I originally thought that the larger number of registers available in 64-bit mode would help, but the increased code and data size of 64-bit code did more damage to the caches than the register SPILLS/FILLS necessary in 32-bit mode (caused by fewer registers). That is what I measured when I actually recompiled the code as 32-bit AND 64-bit.


Rosetta spends a large chunk of its time computing "relationships" between 2 points in 3 dimensions (using floating point math).
Rosetta makes an X-dimension 64-bit floating point calculation.
Rosetta makes a Y-dimension 64-bit floating point calculation.
Rosetta makes a Z-dimension 64-bit floating point calculation.

You can change the TYPE DEFINITION of that "point" description to just add a 4th "dummy" dimension that will allow the compiler to do a SIMD vector load of all 4 dimensions, perform the operation on all 4 dimensions and then do a SIMD vector store. The compiler will change the 3 sequential SCALAR operations on 3 dimensions to a SINGLE PARALLEL operation on 4 dimensions.

If you add the 4th dimension in the TYPEDEF, you do not need to make ANY other source code changes for the compilers to automatically generate the low level code to perform the parallel LOAD-OPERATION-STORE. VERY low hanging fruit.
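A rough sketch of the padded-point idea in C. This is not Rosetta's actual type; `Point4`, `point_sub` and `dist_sq` are made-up names, and whether the compiler really emits a single vector load/subtract/store depends on the compiler, the optimization level and flags such as `-mavx`.

```c
/* Hypothetical point type: x, y, z plus an unused pad, so the struct is
   exactly four doubles (256 bits) wide. */
typedef struct {
    double x, y, z;
    double pad; /* "dummy" 4th dimension, kept at 0 */
} Point4;

/* Component-wise difference over all four lanes. A vectorizing compiler
   may turn these four scalar subtractions into one 256-bit operation. */
Point4 point_sub(Point4 a, Point4 b) {
    Point4 r;
    r.x = a.x - b.x;
    r.y = a.y - b.y;
    r.z = a.z - b.z;
    r.pad = a.pad - b.pad;
    return r;
}

/* Squared distance; the pad lane contributes 0 as long as it is kept at 0. */
double dist_sq(Point4 a, Point4 b) {
    Point4 d = point_sub(a, b);
    return d.x * d.x + d.y * d.y + d.z * d.z + d.pad * d.pad;
}
```

The cost of the trick is one extra double per point (32 bytes instead of 24), which is the usual trade of memory for alignment-friendly layout.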

The Rosetta developers said they were already "familiar" with this technique when I pointed it out last year. It WOULD be their first, easy step to take if "low" performance was a problem for them.

32-bit integer versus 64-bit integer code really makes no difference unless Rosetta code undergoes major changes.
ID: 80836 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org