Posts by rjs5

1) Message boards : Number crunching : Rosetta@home using AVX / AVX2 ? (Message 90832)
Posted 10 Jun 2019 by rjs5
Post:
GIMPS project introduces AVX512 support

:-P


PrimeGrid uses GIMPS. I am seeing the benefit on my 9980xe machine. I have not measured it accurately, but it seemed like about a 30% improvement on PrimeGrid LLR. When a dense AVX application starts crunching, the CPU will throttle back the clock because of the higher CPU power usage.

This highlighted an issue with some CPUs. Some CPUs are designed with one AVX unit for each core instead of one AVX unit per thread. This means that the AVX unit can only be used by one of the threads on that core at a time. On systems with a single AVX unit per core, the bottleneck that creates can cause the application to run slower than it would without AVX-512.

"Performance" is a fickle thing.
2) Message boards : Number crunching : Hi, new to all this.., I have an I5 8400 on 100mb fibre.., the cpu was at 100% so I knocked back to 70%.., what should it be at? (Message 90763)
Posted 11 May 2019 by rjs5
Post:
The CPU % should be at whatever you set it to. I think 100% CPU would be the default BOINC setting when installed. You can then set BOINC Manager knobs to adjust the number of CPUs and the % of time used as you like with the
OPTIONS -> COMPUTING PREFERENCES -> COMPUTING tab.
You can turn off computing/GPU while you are using the machine too. I prefer limiting the number of CPUs rather than the % because limiting the number allows the CPU to have a constant workload.

Data used:
Rosetta uses a LOT of data compared to other projects. The database they download is about 300 MB and there could be several versions on your machine at any instant. The database is copied to the "slot" directory and unzipped for execution. That unzipped version is about 2x the compressed size. There is 1 slot directory for each running Rosetta WU. The i5-8400 has 6 cores and 6 threads (it has no Hyper-Threading), so I would expect that BOINC would try to start 6 instances of Rosetta. 6 copies of Rosetta will take about 3-4 GB of disk.
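
The disk estimate can be reproduced with a few lines of arithmetic. This is only a rough sketch: the 300 MB archive size and the 2x expansion are the approximate figures from the paragraph above, the function name is made up for illustration, and it ignores the zipped copies kept in the project directory.

```python
def slot_disk_gb(running_tasks, zipped_mb=300, expansion=2.0):
    """Rough disk used by the unzipped database copies in the slot directories."""
    unzipped_mb = zipped_mb * expansion          # about 600 MB per running task
    return running_tasks * unzipped_mb / 1000.0  # MB -> GB (decimal)

# One slot directory per running Rosetta WU, so scale by the task count.
for tasks in (6, 12):
    print(tasks, "tasks:", slot_disk_gb(tasks), "GB")
```

Plug in however many Rosetta WU BOINC actually runs at once on your machine.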

You might check the EVENT LOG to see if there are messages that help.

Never finishing:
I have several suspicions here, but they are only guesses without knowing the OS type, the Rosetta binary version running, the BOINC settings, the amount of memory, disk or ...

1. The Rosetta 4.0 binary seems to need about 1 GB of memory at times, so make sure you have enough memory to run the Rosetta WU. If you don't have enough memory, the Rosetta WU will end up paging/swapping and execution will slow to a crawl. Add more memory or limit the number of concurrent WU. There are some knobs you can add to an app_config.xml file in the Rosetta project directory to limit the max_concurrent number of WU running, or you can just set the max number of CPUs if you are only running Rosetta.

2. I think there are some Rosetta software bugs that allow the code to get into a deadlock condition. SUSPENDING the offending WU and then allowing it to restart might resolve the deadlock.

3. Check that the amount of disk you have allocated for BOINC use is sufficient.
OPTIONS -> COMPUTING PREFERENCES -> DISK AND MEMORY
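
For item 1, a quick way to pick a sensible number of concurrent WU is to budget memory per task. The sketch below is only a rule of thumb: the 1 GB per task comes from the estimate above, while the 2 GB reserved for the OS and other programs and the function name are my own assumptions.

```python
def max_rosetta_tasks(total_ram_gb, reserved_gb=2.0, per_task_gb=1.0):
    """How many Rosetta WU fit in RAM without paging, as a rough budget."""
    usable_gb = total_ram_gb - reserved_gb   # leave room for the OS and desktop
    return max(0, int(usable_gb // per_task_gb))

for ram in (4, 8, 16):
    print(ram, "GB RAM ->", max_rosetta_tasks(ram), "concurrent WU")
```

If some WU peak well above 1 GB, raise per_task_gb and the limit drops accordingly.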
3) Message boards : Number crunching : Rosetta screen saver issue. (Message 90719)
Posted 24 Apr 2019 by rjs5
Post:
Ok, thank you for the input!
I don't think that memory usage is an issue, but I will double check to be sure.
And otherwise I'll limit the number of Rosetta WUs as you said.
Those from "Mapping Cancer Markers" and "Microbiome" don't take more than +-500Mb of RAM.
So if I run 3 of those in parallel with 1 Rosetta WU, there should be no problem.

Thanks again;
Carl


WCG work units seem well behaved and your estimates are close. One of them is actually Rosetta code.
Rosetta is not well behaved, and the WU seem to do very ugly things; I have seen them take 1.5 GB each. The 64-bit binary is the worst.

On my Windows machine, I use BOINCTASKS with the MEMORY column on the TASKS page enabled so I can watch usage.

Another "knob" that you have is the BOINC Manager:
OPTIONS -> COMPUTING PREFERENCES -> DISK AND MEMORY -> PAGE/SWAP FILE: USE AT MOST option.

You can set that to a low % on the PAGE/SWAP memory option, and if one of the WU goes rogue, jobs will fail quickly instead of affecting Windows and stalling it with high disk activity.

I chose a BLANK screen saver. The BOINC screen savers look nice but take some CPU and memory away from crunching.
4) Message boards : Number crunching : Rosetta screen saver issue. (Message 90713)
Posted 23 Apr 2019 by rjs5
Post:
You have 3 Windows machines with 4 processors. 2 have 4gb of memory and the AMD machine has 8gb of memory.
I would suspect that you are running out of physical memory and paging to disk. Rosetta becomes a memory hog at the checkpoint times and uses 1gb of memory or more.

Start up the TASK MANAGER and select the PERFORMANCE tab. When it happens and you get control back, check the memory usage history. I think you will find available memory was exhausted.
The RESOURCE MONITOR also provides some more detail on the disk usage.

Possible solutions:
1. buy more memory
2. don't start unnecessary programs that take memory (screensaver, browser, ...)
3. limit the number of concurrent Rosetta WU with an app_config.xml file
4. Don't run Rosetta
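
For option 3, here is a minimal sketch that generates the app_config.xml text. The <app_config>, <app>, <name> and <max_concurrent> elements are the standard BOINC ones; the application name "rosetta" is an assumption, so check the names your client actually reports before using it.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_names, max_concurrent):
    """Return app_config.xml text limiting each named app to max_concurrent tasks."""
    root = ET.Element("app_config")
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name   # app name as known to the project
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")

# "rosetta" is an assumed app name, used here for illustration only.
xml_text = build_app_config(["rosetta"], 2)
print(xml_text)
```

Save the output as app_config.xml in the Rosetta project directory and have BOINC re-read its config files (or restart the client) for it to take effect.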
5) Message boards : Number crunching : long/large work units, cpu_run_time limit and how to check 'progress'? (Message 90671)
Posted 14 Apr 2019 by rjs5
Post:
R@h should seriously look at resumable checkpoints. Even if the checkpoint interval is 15 or even 30 minutes, it would at least place a savepoint there, so that if the PC is shut down for any reason the WU can be continued. Otherwise some (many?) participants may not be able to run the long jobs.


In early development of a new method of analysis of a large protein, it is pretty common to span long periods of time without checkpoints. If the new method proves useful, and yields better models, then further development is done to improve runtime per model and checkpointing.



Why isn't this new and possibly disruptive work done on RALPH? Seems like RALPH is the place where Rosetta experimentation takes place and not on the main Rosetta@home. The RALPH volunteers are expecting this and it does not disrupt those who don't want to be messed up.

Other projects that perform their development work on their main site have the option for crunchers to opt-out of this testing.
6) Message boards : Number crunching : Computation errors (Message 90599)
Posted 31 Mar 2019 by rjs5
Post:
Hi, since I've come back to this project I've been seeing some strange errors in some of my WUs, especially in the ones that study big proteins, here are a few examples:
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065314770
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065314768
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065460662

How can I keep these errors from happening?


Rosetta developers were quite sloppy in their allocation and use of memory.

Task 1065460662 ran out of memory.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1065460662

The other two error out with "Funzione non corretta" or "incorrect function"

When one WU runs out of memory, other WU may get strange error messages from function calls as developers don't always check the return results of all system calls.

The WU you are running are 64-bit and sometimes take large amounts of memory ... frequently over a GB each.

8gb should be enough to run 4 Rosetta 64-bit WU, so I would examine how memory is being used and change the workload.
Buy more memory if practical.
Lower the number of Rosetta WU running simultaneously with app_config.xml or BOINC -> OPTIONS -> COMPUTING PREFERENCES -> USAGE LIMITS
7) Message boards : Number crunching : Ever restarting tasks! (Message 90586)
Posted 27 Mar 2019 by rjs5
Post:
I would suspect that you are running low on memory and maybe exhausting it. Try "vmstat 1" and watch the amount of swapping the machine is doing.
You want to see zeroes in the si and so columns. Numbers in those 2 columns mean that memory blocks are being temporarily moved to disk and back.

You might try running just one Rosetta work unit and see if it completes it.

vmstat 1
procs -----------memory----------- swap- --io- --system-- -----cpu------
 r  b swpd    free    buff   cache si so bi bo    in   cs us sy id wa st
14  0    0 2437816 1077404 6773368  0  0  0 12     0    3 81  2 16  0  0
12  0    0 2437628 1077404 6773252  0  0  0  0 13061 2510 94  2  4  0  0
13  0    0 2438440 1077404 6773404  0  0  0  0 13180 2733 94  2  5  0  0
11  0    0 2438476 1077412 6772936  0  0  0 16 13016 2403 94  2  4  0  0
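
If you would rather not eyeball the columns, the si/so check can be scripted. This is a small sketch that assumes the usual vmstat layout with a header row naming the columns; the function name is made up, and the sample rows are taken from the output above.

```python
def swap_activity(vmstat_text):
    """Return (si, so) pairs for every data row of vmstat output."""
    lines = [ln.split() for ln in vmstat_text.strip().splitlines()]
    # The column header is the row that names both swap columns.
    header = next(ln for ln in lines if "si" in ln and "so" in ln)
    si, so = header.index("si"), header.index("so")
    rows = [ln for ln in lines if ln and ln[0].isdigit()]
    return [(int(r[si]), int(r[so])) for r in rows]

sample = """\
r b swpd free buff cache si so bi bo in cs us sy id wa st
14 0 0 2437816 1077404 6773368 0 0 0 12 0 3 81 2 16 0 0
12 0 0 2437628 1077404 6773252 0 0 0 0 13061 2510 94 2 4 0 0
"""
activity = swap_activity(sample)
swapping = any(si or so for si, so in activity)
print("swapping" if swapping else "no swap traffic")
```

Any non-zero pair means memory is being pushed out to disk or pulled back in.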


I also like "top ic" which prints more information about the running jobs.
8) Message boards : Number crunching : Computation error - WS_max (Message 90557)
Posted 22 Mar 2019 by rjs5
Post:
Sure, it will not make them all disappear. My experience, however, is that I suffer fewer errors with that approach, but you could be totally right. I've never seen a WU in "waiting mode"; it is very curious, and I do not have an explanation. The "finish file present too long" error happens to me in almost every BOINC CPU project at some moment, and I ended up associating it with this type of conflict, but I am not sure of the actual reason. When I searched/asked (some years ago) I did not find a solution.


Wouldn't you know it. Just after I posted I had a burst of 12 errors. Many were sent at the same time and all had different RUN times, but they finished at the same time. All of these seem to be a WU problem. They all only processed 1 decoy and failed with default.out problems.

BOINC:: CPU time: 43297.4s, 14400s + 28800s[2019- 3-22 13:13: 5:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43297.4 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
13:13:05 (20611): called boinc_finish(0)

</stderr_txt>

1064060602 958528553 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 13:12:34 UTC Completed and validated 39,821.69 38,579.93 116.28 Rosetta v4.08
1064060668 958528619 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:41:55 UTC Completed and validated 44,962.40 43,453.92 20 Rosetta v4.08
1064059944 958528620 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:53:37 UTC Completed and validated 44,850.18 43,373.49 20 Rosetta v4.08
1064060662 958528613 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:53:37 UTC Error while computing 44,762.00 43,410.41 --- Rosetta v4.08

1064059821 958528500 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 15:24:30 UTC Completed and validated 45,084.65 43,506.61 20 Rosetta v4.08
1064059792 958528470 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 16:05:34 UTC Completed and validated 44,989.24 43,510.94 116.49 Rosetta v4.08
1064060259 958528209 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 18:32:44 UTC Error while computing 45,186.26 43,625.06 --- Rosetta v4.08

1064073516 958535377 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 19:47:31 UTC Completed and validated 44,952.97 43,484.82 291 Rosetta v4.08
1064060604 958528555 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 19:54:22 UTC Error while computing 44,777.94 43,274.49 --- Rosetta v4.08

1064035288 958507870 21 Mar 2019, 16:25:45 UTC 22 Mar 2019, 2:11:20 UTC Completed and validated 27,686.13 26,899.81 311.93 Rosetta v4.07
1064035273 958507856 21 Mar 2019, 16:25:45 UTC 22 Mar 2019, 2:19:38 UTC Completed and validated 27,818.37 27,026.80 248.57 Rosetta v4.07
1064035277 958507860 21 Mar 2019, 16:25:45 UTC 22 Mar 2019, 2:20:09 UTC Completed and validated 26,983.10 26,201.80 184.35 Rosetta v4.07
1064060670 958528621 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 20:03:51 UTC Error while computing 44,845.19 43,253.75 --- Rosetta v4.08
1064073472 958535368 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:09:03 UTC Error while computing 44,966.90 43,476.75 --- Rosetta v4.08

1064073428 958535289 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:12:04 UTC Completed and validated 44,720.47 43,254.19 395.21 Rosetta v4.08
1064073481 958535376 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:14:51 UTC Error while computing 44,777.87 43,297.95 --- Rosetta v4.08
1064073479 958535374 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,082.91 43,578.67 --- Rosetta v4.08
1064073494 958535355 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,129.88 43,609.00 --- Rosetta v4.08
1064073496 958535357 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,082.15 43,594.03 --- Rosetta v4.08
1064073508 958535369 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,177.07 43,657.32 --- Rosetta v4.08
1064073408 958535304 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,057.32 43,556.66 --- Rosetta v4.08
1064073440 958535301 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 44,919.55 43,403.97 --- Rosetta v4.08

1064060666 958528617 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 3:09:31 UTC Completed and validated 17,404.94 16,926.27 2.44 Rosetta v4.08
1064060548 958528499 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 3:18:50 UTC Completed and validated 17,603.57 17,097.12 2.63 Rosetta v4.08
9) Message boards : Number crunching : Computation error - WS_max (Message 90554)
Posted 22 Mar 2019 by rjs5
Post:
This error is common on machines with a high number of threads: when several threads finish crunching tasks at the same time, the BOINC program cannot handle all the requests, and for some units it fails with this message.

Your machines do not have such a number of threads, but maybe you started all the units at the same time, and when they finish quite close to each other this error happens. You can see that all units failed with the same time tag, which is indicative of this behaviour.

Try to start units separately and you will see errors decrease.



Starting them at different times does not guarantee finishing at different times in the future. 8-)

I was watching when my 18C/36T machine finished a single 4.08 WU. The time remaining value clocked down to zero. The job did not identify as finished, but went into the Waiting mode. After some time, it restarted and was marked as a Compute Error and aborted. That seems to conflict with the "simultaneous finish" theory. It was curious that the WU went into the wait mode with no time remaining.

Could be a BOINC bug though ...

I sorted the WU by finish time and there were a couple that had another WU that finished at the same time. Maybe the other failing ones did not have a Rosetta WU finishing at the same time, but a WU from another project. This machine has an Nvidia 2080ti and is finishing WU frequently.

1063313212 956527517 17 Mar 2019, 7:56:26 UTC 17 Mar 2019, 16:34:05 UTC Completed and validated 29,041.25 28,562.28 280.28 Rosetta Mini v3.78
1063259978 957819313 16 Mar 2019, 20:52:15 UTC 17 Mar 2019, 7:56:26 UTC Error while computing 24,301.37 23,899.48 --- Rosetta v4.08
1063255548 957815416 16 Mar 2019, 20:18:24 UTC 17 Mar 2019, 7:56:26 UTC Error while computing 26,449.68 25,963.64 --- Rosetta v4.08

1063592535 958114526 19 Mar 2019, 2:13:38 UTC 19 Mar 2019, 10:01:22 UTC Completed and validated 28,038.34 27,607.97 379.43 Rosetta v4.08


1063704726 958214732 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:31:10 UTC Completed and validated 29,219.93 28,565.12 307.87 Rosetta v4.08
1063704662 958214725 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:40:22 UTC Error while computing 29,352.41 28,678.44 --- Rosetta v4.08
1063704674 958214749 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:40:22 UTC Completed and validated 29,336.58 28,735.34 295.1 Rosetta v4.08

1063704370 958214412 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 19:12:16 UTC Completed and validated 29,197.75 28,517.11 296.96 Rosetta v4.08

1063877290 958373683 20 Mar 2019, 17:46:57 UTC 21 Mar 2019, 3:07:19 UTC Completed and validated 29,343.53 28,530.64 281.28 Rosetta v4.08
1063857276 958355886 20 Mar 2019, 16:18:21 UTC 21 Mar 2019, 3:10:03 UTC Error while computing 29,451.14 28,604.48 --- Rosetta v4.08
1063877630 958374044 20 Mar 2019, 17:46:57 UTC 21 Mar 2019, 3:12:16 UTC Completed and validated 29,308.20 28,470.42 275.65 Rosetta v4.08


1064060668 958528619 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:41:55 UTC Completed and validated 44,962.40 43,453.92 20 Rosetta v4.08
1064059944 958528620 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:53:37 UTC Completed and validated 44,850.18 43,373.49 20 Rosetta v4.08
1064060662 958528613 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:53:37 UTC Error while computing 44,762.00 43,410.41 --- Rosetta v4.08

1064059821 958528500 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 15:24:30 UTC Completed and validated 45,084.65 43,506.61 20 Rosetta v4.08
1064059792 958528470 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 16:05:34 UTC Completed and validated 44,989.24 43,510.94 116.49 Rosetta v4.08
1064060259 958528209 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 18:32:44 UTC Error while computing 45,186.26 43,625.06 --- Rosetta v4.08
1064073516 958535377 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 19:47:31 UTC Completed and validated 44,952.97 43,484.82 291 Rosetta v4.08
1064060604 958528555 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 19:54:22 UTC Error while computing 44,777.94 43,274.49 --- Rosetta v4.08
1064035288 958507870 21 Mar 2019, 16:25:45 UTC 22 Mar 2019, 2:11:20 UTC Completed and validated 27,686.13 26,899.81 311.93 Rosetta v4.07
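
The "sort by finish time" check described above can be sketched as follows. The rows are copied from the listing; I am assuming the second timestamp in each row is the time the result was reported and using it as a stand-in for finish time. The variable and function names are made up.

```python
from collections import defaultdict

# (task id, time reported, status) for a few rows from the listing above
tasks = [
    ("1063704726", "20 Mar 2019, 18:31:10 UTC", "Completed and validated"),
    ("1063704662", "20 Mar 2019, 18:40:22 UTC", "Error while computing"),
    ("1063704674", "20 Mar 2019, 18:40:22 UTC", "Completed and validated"),
    ("1063857276", "21 Mar 2019, 3:10:03 UTC", "Error while computing"),
]

def group_by_report_time(rows):
    """Map each report timestamp to the tasks that share it."""
    groups = defaultdict(list)
    for task_id, reported, status in rows:
        groups[reported].append((task_id, status))
    return groups

groups = group_by_report_time(tasks)
# Timestamps shared by more than one task: candidates for the "simultaneous finish" theory.
shared = {t: g for t, g in groups.items() if len(g) > 1}
# Errors with no other task reported at the same instant: these do not fit that theory.
lonely_errors = [task_id for g in groups.values() if len(g) == 1
                 for task_id, status in g if status.startswith("Error")]
print(shared)
print(lonely_errors)
```

Only Rosetta rows are included here, so a failure that looks "lonely" could still have coincided with a WU from another project.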
10) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 90542)
Posted 21 Mar 2019 by rjs5
Post:
I was watching when a couple of the Rosetta WU failed.

They computed properly down until the TIME REMAINING was zero seconds and the compute time was 8 hours and a few minutes. Instead of reporting the completion, the WU was marked as WAITING with zero seconds remaining. When the WU restarted, it indicated a COMPUTE ERROR with the "finish file present too long</message>". The 34 failing WU seemed to all fail at the end and were 4.08 Linux WU.

https://boinc.bakerlab.org/rosetta/result.php?resultid=1063704662
11) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 90533)
Posted 18 Mar 2019 by rjs5
Post:
19gb swap space used is concerning.


19 GB would indeed be a lot of swap in use but haven't you got the unit wrong? It looks like 19 MB to me.


DOH! You are obviously correct. I got units of GB dancing in my head.
12) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 90522)
Posted 15 Mar 2019 by rjs5
Post:
OK, extra memory ordered for both machines so we’ll see if that sorts it.


It seems like Rosetta gets into a state where it consumes 1 GB+ per WU. I am running 35 WU and there are always a couple taking over a GB.

I watch the difference between CPU and RUN times and swap used. As long as the swap used is very low, you are probably not running into memory problems. I tend to buy more GB of memory than threads. I originally got my 36 thread machine with 32GB and that was not enough. You can see that 19gb of my swap space has been used even though the machine has 64gb installed for the 36 threads. 19gb swap space used is concerning.

Based on over a thousand jobs each, the credit difference between the 64-bit Rosetta WU and Minirosetta 32-bit WU is negligible: 44.0 credits/CPU hr for Rosetta 4.08 and 45.7 credits/CPU hr for Minirosetta.

top ic .... sorted by memory use.

top - 10:55:55 up 1 day, 18:24, 0 users, load average: 40.22, 36.72, 36.27
Tasks: 524 total, 37 running, 487 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 1.4 sy, 96.6 ni, 1.1 id, 0.0 wa, 0.4 hi, 0.1 si, 0.0 st
MiB Mem : 64090.7 total, 1051.2 free, 16283.5 used, 46756.0 buff/cache
MiB Swap: 32112.0 total, 32093.0 free, 19.0 used. 45874.0 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24590 boinc 39 19 1722808 1.5g 75400 R 98.3 2.5 219:01.25 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1798_1948__t000__1_C1+
25349 boinc 39 19 1384300 1.2g 75400 R 99.3 1.9 198:57.60 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1798_1948__t000__1_C1+
22988 boinc 39 19 838204 723668 75400 R 97.7 1.1 259:32.35 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_14_1536_1929__t000__0_C1+
24878 boinc 39 19 706928 592640 75784 R 99.3 0.9 211:30.53 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1805_1950__t000__0_C1+
15222 boinc 39 19 605140 491200 76104 R 99.0 0.7 459:54.12 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1674_1946__t000__0_C1+
20625 boinc 39 19 605492 491108 75400 R 99.3 0.7 319:46.33 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1808_1947__t000__0_C1+
16439 boinc 39 19 583112 468876 75784 R 97.4 0.7 428:23.20 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1674_1946__t000__0_C1+
24082 boinc 39 19 583664 465920 68044 R 99.3 0.7 231:28.63 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @flags_rb_03_15_1805_1950__t000__ab_robetta -in:file:boinc_wu_zip+
17334 boinc 39 19 575680 457680 68620 R 99.3 0.7 404:59.21 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
22280 boinc 39 19 543464 425512 68556 R 99.7 0.6 276:44.21 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
19901 boinc 39 19 533536 415428 68556 R 99.7 0.6 338:55.09 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
22209 boinc 39 19 530860 413260 68236 R 99.3 0.6 278:15.90 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @flags_rb_03_15_1808_1947__t000__ab_robetta -in:file:boinc_wu_zip+
25711 boinc 39 19 523612 408668 70668 R 99.3 0.6 190:02.19 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @foldit_2007571_0001_fold_and_dock_flags -silent_gz -mute all -ou+
21481 boinc 39 19 521072 406132 70604 R 99.3 0.6 297:12.91 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @foldit_2007571_0005_fold_and_dock_flags -silent_gz -mute all -ou+
17873 boinc 39 19 516024 398184 68620 R 99.3 0.6 391:55.17 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
15374 boinc 39 19 511956 394116 68556 R 99.3 0.6 455:50.04 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
30825 boinc 39 19 509260 391232 68620 R 99.3 0.6 78:42.82 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
14998 boinc 39 19 508228 390160 68620 R 98.0 0.6 465:27.01 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
18209 boinc 39 19 503324 385500 68620 R 99.0 0.6 383:28.39 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
31538 boinc 39 19 500516 382744 68620 R 99.3 0.6 60:22.53 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
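
The header of the top output above reports memory and swap in MiB. As a small sketch, the "used" swap figure can be pulled out of that line and converted so the unit is explicit. The function name is made up, and the regular expression assumes the line format shown above.

```python
import re

def swap_used_mib(top_header_line):
    """Extract the 'used' value from a top 'MiB Swap:' line, or None if absent."""
    m = re.search(r"MiB Swap:.*?([\d.]+)\s+used", top_header_line)
    return float(m.group(1)) if m else None

line = "MiB Swap: 32112.0 total, 32093.0 free, 19.0 used. 45874.0 avail Mem"
used = swap_used_mib(line)
print(used, "MiB used, which is", round(used / 1024, 3), "GiB")
```

Reading the value this way keeps MiB and GiB from getting mixed up when watching swap over time.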
13) Message boards : Number crunching : Rosetta 4.0+ (Message 90492)
Posted 6 Mar 2019 by rjs5
Post:
The quantity of Validation error units is also increasing in wu's with names starting by "6mers" and "ClpP_peptoids".



I had several that ran about 4000 seconds and errored out. They all had tried 600 decoys.

https://boinc.bakerlab.org/rosetta/result.php?resultid=1061038710

command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu @ClpP_peptoids_1_npropyl_7mer_0001_511_0001.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2973984
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
======================================================
DONE :: 1 starting structures 6360.72 cpu seconds
This process generated 600 decoys from 600 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
12:23:32 (21790): called boinc_finish(0)
14) Message boards : Number crunching : Rosetta@home using AVX / AVX2 ? (Message 90439)
Posted 26 Feb 2019 by rjs5
Post:
So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, and so should be widely available before long. Good luck.


+1
Avx seems to be enough
An example: Acustics


Interesting link.
Since Rosetta is working on only 3-dimensional arrays, it currently performs 3-loads, 3-operations and then 3-stores.
Changing to 4-dimensions will have SSE2 do 2-loads, 2-operations and 2-stores.
AVX-512 would not make much difference and few could run the binary.
I think on Skylake forward, Intel was looking at not stalling on software prefetches. If the code issued a software prefetch and all the read/write buffers were busy, the software prefetch was not executed.
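
The load/operation/store counts above follow from simple lane arithmetic. The sketch below assumes the coordinates are 64-bit doubles (I have not checked which type the Rosetta vector actually uses), and the function name is made up for illustration.

```python
import math

def packed_ops(elements, register_bits, element_bits=64):
    """Packed instructions needed to touch every element of a vector once."""
    lanes = register_bits // element_bits   # elements that fit in one register
    return math.ceil(elements / lanes)

scalar_ops = 3                      # x, y, z handled one at a time
sse2_ops = packed_ops(4, 128)       # padded to 4 elements, 2 doubles per register
avx_ops = packed_ops(4, 256)        # 4 doubles per register
avx512_ops = packed_ops(4, 512)     # 8 lanes, but only 4 elements to fill them
print(scalar_ops, sse2_ops, avx_ops, avx512_ops)
```

With only four elements per vector, the wider AVX-512 registers leave half their lanes empty, which fits the point that AVX-512 would not make much difference here.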

32-bit has a smaller code footprint and smaller data footprint and therefore makes the on-chip caches more effective.

IMO, a 32-bit code version with a 4-dimensional vector so SSE2 does only 2 operations vs three would probably be the fastest. Rosetta probably measures performance running one copy on one of their servers with large caches. The optimizations they chose bloat the runtime size, and running multiple Rosetta binaries stresses the hardware and slows all copies down. They over-tune the binary using a single execution.
15) Message boards : Number crunching : Rosetta@home using AVX / AVX2 ? (Message 90426)
Posted 25 Feb 2019 by rjs5
Post:
I think I might download the rosetta source and see what the developers have done to the code in the last year. 8-)


I'm not so optimistic. Latest version of R@H code (4.07) is almost one year old...

P.S. Great pc!!!


The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing. When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance.

I am just going to try to get some "empirical" data that might interest the developers to introduce similar changes.

BTW, the new machine is still accumulating Rosetta RAC and is around 21,800 and topping out. Top machine #34. With the liquid cooler, it is running about 65 degrees C. The only BIOS change I made was to tell the CPU to not exceed 80 degrees, but Linux tools say that the CPU is running at 3.8 GHz.
I am running predominantly Rosetta with a random WCG "Help TB" thrown in and GPU WUs too.
16) Message boards : Number crunching : Rosetta v4.08 x86_64-pc-linux-gnu or Rosetta v4.07 i686-pc-linux-gnu (Message 90404)
Posted 21 Feb 2019 by rjs5
Post:
There's only mod.Sense, who is a volunteer and so isn't paid by the bakerlab, but has done an incredible job of keeping this place in order for many years, with waaay more patience dealing with all types of people on the forum than I would ever have had.

David Kim also posts occasionally, but less so recently and usually only about tech stuff like hardware issues.


Agree about mod.Sense and David. I have worked with David off line several times and he has been quite knowledgeable and helpful. I have usually talked about performance issues, but the last time was to give him a fix for the Ubuntu 18.04 glibc problem that caused WU to abort. David cleaned up my fix and built the Linux Rosetta 4.08 64-bit version.

The Rosetta developers have been repeatedly skeptical about my performance improvement estimates. That is not a surprise. Developers are sensitive about their work and frequently think they know more than they do. I had to explain to many compiler developers why their "really neat improvement" was not going to make the impact they forecast. The application developers are farther away from performance problems than the compiler developers.

I thought I would look at Rosetta code and see what they have done during the last two years and maybe run some experiments.

... Chilean ... you know more than you think. As with most topics, terminology is the barrier. Start another thread if you have a question and I will see if I can answer.
17) Message boards : Number crunching : Rosetta v4.08 x86_64-pc-linux-gnu or Rosetta v4.07 i686-pc-linux-gnu (Message 90395)
Posted 19 Feb 2019 by rjs5
Post:
I frequently find myself thinking ... "why did the project devote time to that instead of ...?"
Maybe it makes sense if you have a better view of what they are doing. Maybe not. 8-)

LACK OF COMMUNICATION ...
It seems like all their moderators are closely related to the project. It should be fairly easy to recruit some volunteer MODERATORS or people who agree to handle many of the routine comments on the boards. Some of these volunteers could even generate some extra financial support for the project. Many US companies have employee benefits that match cash donations to schools and charities (like U of Washington). Many of these companies will also match volunteer time with cash called a "Matching Volunteer Grant". I retired from a company that extends these benefits to retirees. In theory, I could submit the number of hours that I contribute to Rosetta for a "matching gift". My company matches volunteer time at a rate of $10/hour. A $10 to $20 hourly match rate is common. Apple matches $50 per volunteer hour currently.
https://doublethedonation.com/matching-gifts/apple-inc


APPLICATION BINARIES ...
It seems to me, the biggest problem is how the Rosetta Project "spends" its limited human resources. It may be a problem with matching their "people skill sets" with the "development wish list" and with the available time of those people.

Their efforts to "hyper optimize" the binary by pulling functions "inline" are based on running 1 copy on a large, idle machine. The result is "sub optimized" performance when running 2 or more WU on a machine, which strains a critical resource .. like the instruction cache. I am running 36 copies on a machine and the negative impact of inlining functions is pretty obvious.

Climateprediction@home has WU that run for hundreds of hours, but they "checkpoint" and trickle up the results 12 times during the run. You get partial credit even if the WU aborts deep into execution. That seems like an execution model Rosetta could consider. Rosetta chops up a long running WU and broadcasts the pieces to many machines. If they ran multiple pieces of that one WU on the same machine in parallel, you would only need 1 database file to share and the overall size of the execution footprint would be smaller.

Etc, etc, ...



We don't really know where the limitations are by the way. They don't bother to tell us. So one speculation is as good as another.


I agree with you. Maybe it's a problem of work generation or others.

And I don't like the lack of communication of this project

Returning to the topic of the thread: why not create a native 64-bit app? It seems, reading rjs5, that this is not SO difficult.
18) Message boards : Number crunching : Rosetta v4.08 x86_64-pc-linux-gnu or Rosetta v4.07 i686-pc-linux-gnu (Message 90389)
Posted 18 Feb 2019 by rjs5
Post:
I was surprised to see that Rosetta 4.08 is the only 64-bit binary for Linux. Both copies of Minirosetta are 32-bit.

I noted this (thanks to you) almost 2 years ago. Nothing has changed.
Today 98% of Windows runs 64 bit version.

A 64-bit version will always be faster if compiled properly, unless you aggressively inline functions. The larger code footprint is causing front-end icache miss stalls.
I want to see how hard it is to modify the source to redefine their 3-dimensional "vector" object into a 4-dimensional vector so they can use packed SSE or AVX. Right now all Rosetta computation uses scalar operations.

You are doing what, in Italy, we call "opera meritoria" (something like "meritorious work").
But sometimes, here in Rosetta@Home, it seems to be like Don Quixote of La Mancha, "tilting at windmills".




minirosetta_3.78_x86_64-pc-linux-gnu is a 32-bit binary even though the name implies it is 64-bit. Seems curious that they would build and deploy TWO binaries with only different names. The only difference in the 2 binaries is a different text name in the binary. I wonder if someone goofed on the compile options.

It is beginning to appear that BOINC is becoming too successful and the projects are having a hard time utilizing the compute power. The server infrastructure is creaking under the pressure.
19) Message boards : Number crunching : Rosetta v4.08 x86_64-pc-linux-gnu or Rosetta v4.07 i686-pc-linux-gnu (Message 90367)
Posted 16 Feb 2019 by rjs5
Post:
Well, you can't execute the benchmark since that are all your completed WUs. Whatever you did, will sort itself out with time.
You can force 64-bit only with <no_alt_platform> in cc_config.xml, but I would wait with that and see if the 64-bit application is really faster than the 32-bit; that's not always the case. Just watch the GFLOPS values. If it really is faster, the server will send most WUs to that application anyway, so there is no real need to do anything.


I was surprised to see that Rosetta 4.08 is the only 64-bit binary for Linux. Both copies of Minirosetta are 32-bit.
The 32-bit binaries end up with a smaller code footprint and pass parameters on the stack. The parameter list quickly spills to the stack, but the parameters will be in the L1 cache, which has an access time of only a few cycles.

A 64-bit version will always be faster if compiled properly, unless you aggressively inline functions. The larger code footprint is causing front-end icache miss stalls.
I want to see how hard it is to modify the source to redefine their 3-dimensional "vector" object into a 4-dimensional vector so they can use packed SSE or AVX. Right now all Rosetta computation uses scalar operations.

thanks again.
20) Message boards : Number crunching : Rosetta v4.08 x86_64-pc-linux-gnu or Rosetta v4.07 i686-pc-linux-gnu (Message 90365)
Posted 16 Feb 2019 by rjs5
Post:
Does the researcher specify Rosetta 4.07 or 4.08 ... or does the project code run some detection test on my machine to determine what is supported?

The server will send you both to see which one performs better. After 10 valid WUs for each application, most WUs will be assigned to the faster application and only very few to the slower one (just to check it is still slow).


Thanks
Do you know how I get it to execute the benchmark process and choose again OR explicitly override it for 64-bits?
It made the wrong choice because I was messing with the new machine and changing settings that affected the runs.





©2019 University of Washington
http://www.bakerlab.org