Posts by Mad_Max

41) Message boards : Number crunching : Stalled downloads (Message 91925)
Posted 10 Mar 2020 by Mad_Max
Post:
Yep, got a bunch of stuck downloads on all of my computers today too.

In addition to the files from rb_03_08_ tasks already reported above, there were a few tasks like this:

10/03/2020 23:53:17 | Rosetta@home | Temporarily failed download of twc_method_msd_cpp_10v4nme2_1719_result_0965_msd.zip: transient HTTP error
From task twc_method_msd_cpp_10v4nme2_1719_result_0965_msd_SAVE_ALL_OUT_901017_596_0

11/03/2020 00:31:08 | Rosetta@home | Started download of 11v1nmgb_c17732_11mer_gb_000552.zip
from
11/03/2020 00:34:27 | Rosetta@home | task 11v1nmgb_c17732_11mer_gb_000552_SAVE_ALL_OUT_893258_131_0 aborted by user

It's getting really annoying; this bug has been repeating every few days for at least a month already.
Next time I'll probably just switch my computers from R@H to WCG.

P.S.
Looks like aborting the whole WU with stalled downloads, instead of aborting the download itself (the internet transfer), is a faster/easier way to clear such errors.
BOINC resumes work fetch almost immediately after a WU abort, while after aborting only the download it usually still refuses to get new work for a few more hours (complaining about stalled downloads even after all of them have already been aborted) or until a BOINC restart.
42) Message boards : Number crunching : GPU WU's (Message 91921)
Posted 10 Mar 2020 by Mad_Max
Post:
folding@home has many CPU-only work units too. When asked, FAH says the same thing: not all projects are suitable for GPU folding. Maybe it's too hard for them to code a GPU version.

You can dedicate your GPU to folding@home and your CPU to Rosetta. That's how I have been using my resources for a long time.

A lot of projects can now be run on a modern GPU.
ATI GPUs support double precision, which is probably the only limitation on which projects can only be run on a CPU.
Most RTX and modern higher-end AMD GPUs support double precision.

You have to look at it this way:
A CPU runs instructions out of order, at sub-5 GHz, with mostly 8 cores / 16 threads.
That's 16 threads at 5 GHz, which get a 20-25% boost from out-of-order execution.
Multiply this together and a fictive number of about 100 comes out.

A GPU runs mostly in-order cores, at a much lower 1.5-1.8 GHz (let's say, for the sake of discussion, 1.5 GHz sustained).
But a low-end GPU has a good 384 cores, while a high-end GPU has 4500 cores.
The fictive number for low-end GPUs (like a GT 750) would be 576.
The fictive number for high-end GPUs (like an RTX 2080 Ti) would be 6750.

And while the CPU has many more optimizations, GPUs benefit from their direct access to VRAM (much faster than CPU RAM).

GPU worst-case scenario vs CPU best-case scenario:
A budget GPU is 5x+ faster than a CPU.
A high-end GPU is 66x+ faster.


A fair comparison would say the average GPU is 100x faster than an average CPU, while doing it at a much lower power consumption.

Heck, even the performance per watt (efficiency) of cheap $20 Chinese media players, or cellphones, is much higher than that of x86 CPUs.

LOL, divide the GPU "core" count by a factor of 64 and you will get the REAL GPU core count. A real core is the minimal independent part of an electronic chip which can run its own program/computing thread.
For example, a GT 750 has 8 GPU cores and an RTX 2080 Ti has 68 GPU cores.

What you are referring to are not cores but "shaders", the elementary computation units inside the SIMD engine of a GPU core. Usually 64 shaders per GPU core for AMD and NV GPUs. And most of them are 32-bit compute units; only a minority are capable of 64-bit. Calling shaders "cores" is just marketing bullshit.

x86 cores also have multiple compute units inside each core, and all of them are 64-bit capable. Current standard desktop x86 CPUs from Intel/AMD have 8x 64-bit FPUs plus 4 integer/logic compute units per core, running at 2-3 times higher frequency and with higher efficiency compared to GPU cores. Intel's high-end server CPUs have 16x 64-bit FPUs + 4x 64-bit INT units per core.

As a result, a modern GPU is only a few times faster than a modern CPU, and only on tasks well suited to highly parallel SIMD computation. On tasks not well suited to that style of computation it can even be slower than a CPU.
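
For illustration, here is a rough back-of-the-envelope peak-throughput comparison along those lines, as a Python sketch. All the figures in it are rounded assumptions made for the sake of the argument (an 8-core desktop CPU with two 256-bit FMA units per core, a 68-SM GPU with 64 FP32 shaders per SM, FP64 assumed at 1/32 of the FP32 rate), not measured values:

# Back-of-the-envelope peak throughput: units * clock (GHz) * FLOPs per unit per cycle.
# All numbers below are rounded assumptions for illustration, not benchmarks.
def peak_gflops(units, ghz, flops_per_unit_per_cycle):
    return units * ghz * flops_per_unit_per_cycle

# Desktop CPU: 8 cores at ~4 GHz, 2x 256-bit FMA units = 16 FP64 FLOPs per core per cycle.
cpu_fp64 = peak_gflops(8, 4.0, 16)

# GPU counted by real cores (SMs): 68 SMs at ~1.5 GHz, 64 FP32 shaders * 2 (FMA) per SM per cycle.
gpu_fp32 = peak_gflops(68, 1.5, 64 * 2)
gpu_fp64 = gpu_fp32 / 32  # consumer GPUs typically run FP64 at a small fraction of the FP32 rate

print(f"CPU FP64 ~{cpu_fp64:.0f} GFLOP/s")
print(f"GPU FP32 ~{gpu_fp32:.0f} GFLOP/s ({gpu_fp32 / cpu_fp64:.0f}x the CPU)")
print(f"GPU FP64 ~{gpu_fp64:.0f} GFLOP/s ({gpu_fp64 / cpu_fp64:.1f}x the CPU)")

On these assumed numbers the GPU wins by a large margin in single precision but only roughly matches the CPU in double precision, which is the point above.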
43) Message boards : Number crunching : Stalled downloads (Message 91816)
Posted 1 Mar 2020 by Mad_Max
Post:
Yes, the first link is now working for me too. But it did not work at the time I was writing my previous post (29 Feb 2020, ~22:20 UTC).
44) Message boards : Number crunching : Huge RAM usage by some of latest WUs (Message 91815)
Posted 1 Mar 2020 by Mad_Max
Post:

Isn't the Internet bandwidth the same? With multi-threaded you run fewer work units at a time, but you download/upload correspondingly more often.
No, with R@H it works the other way: almost all WUs here run for exactly the same time (8 hours by the current default) regardless of how powerful a CPU you are using. E.g. if you replace your CPU with a model that is 2 times faster (per core), you do not complete 2x the number of R@H WUs. Instead, each WU on the faster CPU will run for about the same time as on the slower one (still ~8 hours) but will produce more useful results (called models or decoys - variants of a possible protein configuration) from the same input data files downloaded from the server.
The same applies to a multi-threaded app, if one is ever made available for R@H: increasing the number of threads per WU will not reduce the runtime of one WU, it will just "squeeze out" more scientifically useful results from the same starting WU data. So using an MT app would reduce the total number of WUs downloaded, stored on disk and loaded into RAM several-fold (depending on how many threads each WU uses).

I think the only real saving is memory. Most multi-threaded projects now allow you to select how many threads (cores) you want to use on a single work unit. I usually select "1" or "2", since that is usually more efficient. Most MT projects run less efficiently the more threads you use. I am not sure why that is the case, but it is said that on some of them, one thread may finish early before the others, and have nothing to do. There may be other reasons.

I usually have plenty of memory, though having a choice is nice. But I expect that not all tasks are suitable for MT.
Yes, from the CPU side a lot of single-threaded WUs running independently in parallel is usually the most effective variant; MT apps are usually slightly less effective CPU-wise. But swarms of single-threaded tasks are a waste of all the other resources - RAM, bandwidth, and both disk space and disk usage. Each R@H WU unpacks its own copy of the main Rosetta database to work with, which is about ~500 MB and ~4000 files plus ~400 folders per running WU (and it keeps growing as more data is added to the DB) - you can find it in the "minirosetta_database" subfolder of each BOINC "slot" folder occupied by an R@H WU.
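
Just to put a number on that duplication, here is a tiny sketch (the ~500 MB figure is only the approximate size mentioned above, and it keeps growing):

# Disk overhead of the duplicated database: every running WU unpacks its own ~500 MB copy
# of minirosetta_database into its slot folder (rough figure from this thread, not exact).
DB_COPY_GB = 0.5

for running_wus in (4, 8, 16):
    print(f"{running_wus:2d} concurrent WUs -> ~{running_wus * DB_COPY_GB:.1f} GB of duplicated database on disk")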

Yes, an MT app may not be optimal for simple, "small" tasks, but it can be very useful for huge tasks like the "corona virus" simulations described above, which need a lot of RAM and CPU power per WU.
45) Message boards : Number crunching : File transfers. (Message 91814)
Posted 1 Mar 2020 by Mad_Max
Post:
Yes, it will clear itself, but not in a good way - BOINC will just ignore such tasks from a project with a "zero" resource share until they almost hit their deadlines; that triggers "panic mode" and BOINC reallocates all resources to them to be able to finish before the deadline. But sometimes it still misses some deadlines, as task duration estimates are far from perfect and some WUs can take way longer than BOINC thinks.
It also does other stupid things while in "panic mode", like ignoring the CPU core reservation setting (e.g. I set it to use at most 90% of CPUs = 7 of 8 cores, but BOINC in "panic mode" will use all 8) or pausing GPU work to free more CPU cores for CPU WUs, risking missed deadlines there, and other things it was never allowed to do.
46) Message boards : Number crunching : Stalled downloads (Message 91812)
Posted 29 Feb 2020 by Mad_Max
Post:
Yep, I got a bunch of stuck downloads on 28 Feb too.

Latest 2 examples:

http://boinc.bakerlab.org/rosetta/download/fc/rb_02_24_16848_16671_ab_t000__h002_robetta.zip

http://boinc.bakerlab.org/rosetta/download/224/PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip

From BOINC it looks like this (with http_debug):
01/03/2020 00:30:08 | Rosetta@home | Started download of PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip
01/03/2020 00:35:15 | Rosetta@home | Temporarily failed download of PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip: transient HTTP error
01/03/2020 00:35:15 | Rosetta@home | Backing off 05:44:16 on download of PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip
-------- I noticed the stalled download (it had been stuck for about 15-20 hours already), turned http_debug on and pressed "retry" -----------
01/03/2020 00:42:31 |  | Re-reading cc_config.xml
01/03/2020 00:42:31 |  | log flags: file_xfer, sched_ops, task, http_debug, work_fetch_debug
01/03/2020 00:42:31 | Rosetta@home | Found app_config.xml
01/03/2020 00:42:31 | Rosetta@home | [work_fetch] REC 4936.494 prio -0.068 can't request work: some download is stalled
01/03/2020 00:42:31 | Rosetta@home | [work_fetch] share 0.000
01/03/2020 00:42:59 | Rosetta@home | [http] HTTP_OP::init_get(): http://boinc.bakerlab.org/rosetta/download/224/PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip
01/03/2020 00:42:59 | Rosetta@home | [http] HTTP_OP::libcurl_exec(): ca-bundle 'D:\Boinc\ca-bundle.crt'
01/03/2020 00:42:59 | Rosetta@home | [http] HTTP_OP::libcurl_exec(): ca-bundle set
01/03/2020 00:42:59 | Rosetta@home | Started download of PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip
01/03/2020 00:42:59 | Rosetta@home | [http] [ID#10522] Info:  Connection 3013 seems to be dead!
01/03/2020 00:42:59 | Rosetta@home | [http] [ID#10522] Info:  Closing connection 3013
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Info:    Trying 128.95.160.156...
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Info:  Connected to boinc.bakerlab.org (128.95.160.156) port 80 (#3014)
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Sent header to server: GET /rosetta/download/224/PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip HTTP/1.1
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Sent header to server: Host: boinc.bakerlab.org
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.2)
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Sent header to server: Accept: */*
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Sent header to server: Accept-Encoding: deflate, gzip
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Sent header to server: Content-Type: application/x-www-form-urlencoded
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Sent header to server: Accept-Language: en_GB
01/03/2020 00:43:00 | Rosetta@home | [http] [ID#10522] Sent header to server:
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: HTTP/1.1 200 OK
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: Date: Sat, 29 Feb 2020 21:42:58 GMT
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: Server: Apache/2.4.18
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: Last-Modified: Sat, 22 Feb 2020 18:36:23 GMT
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: ETag: "a8a-59f2e6a4792b8"
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: Accept-Ranges: bytes
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: Content-Length: 2698
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: Content-Type: application/zip
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server:
01/03/2020 00:43:01 | Rosetta@home | [http] [ID#10522] Received header from server: PK
01/03/2020 00:48:06 | Rosetta@home | [http] [ID#10522] Info:  Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
01/03/2020 00:48:06 | Rosetta@home | [http] [ID#10522] Info:  Closing connection 3014
01/03/2020 00:48:06 | Rosetta@home | [http] HTTP error: Timeout was reached
01/03/2020 00:48:06 | Rosetta@home | Temporarily failed download of PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip: transient HTTP error
01/03/2020 00:48:06 | Rosetta@home | Backing off 03:56:16 on download of PKY1232uM_gly_00722_127_2_SSC_matched_9_FR_C_R_B_0001_notail.zip 



From a browser or other programs it looks the same: the R@H server responds and the file download begins, but at some point it completely stops until the timeout is triggered. Retries do not help - it just repeats the loop.
47) Message boards : Number crunching : Huge RAM usage by some of latest WUs (Message 91747)
Posted 19 Feb 2020 by Mad_Max
Post:
These are likely jobs that are modeling the Spike complex (http://new.robetta.org/results.php?id=15652) of 2019-nCoV_S, the corona virus. The genome has been sequenced and there is a mad rush to determine structures for possible drug targets.

We are collaborating with a number of different research groups to model corona virus proteins that may be possible drug targets, including the NIH/NIAID and SSGCID https://www.ssgcid.org/.

So it's not a memory leak, it's just an abnormally big (compared to average R@H work) protein model? 1273 amino acid residues, if I get it right?

Is there any work on developing a multi-threaded app for such big targets? So as not to waste huge amounts of RAM on a complete dataset copy for each working thread.
Modern computers get more and more CPU cores/threads, and just running multiple copies, one per thread, means more and more "overhead" in RAM, disk and internet (bandwidth) usage, because the use of all these resources is multiplied by the number of tasks running. A multi-threaded app would share all of this and only need multiple CPUs/threads.

The usual (common) setup for non-server computers is about 1 GB of RAM per CPU thread.
2 GB per thread is a much rarer case, and there are almost no "consumer", "office" or "home" computers with >2 GB of RAM per CPU thread.
So you cannot just throw out tasks which consume >=3 GB of RAM per thread and expect that everything will work OK. There WILL be problems on the majority of computers.

On the other hand, if a multi-threaded app were available, then even 5-10 GB of RAM per single large model would be acceptable for most volunteer computers. It would also help with the runtimes of the biggest models on older CPUs - really big models often get aborted by the watchdog on old (or just slow, like Intel Atom or AMD Puma/Jaguar/Bobcat) CPUs due to exceeding the maximum allowed runtime (8+4 = 12 hours max by default) before the very first model/decoy is calculated, and the CPU time spent is wasted.
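
As a concrete feasibility check under those assumptions (an 8-thread host with 8 GB of RAM, ~3 GB per single-threaded "big" WU, and ~6 GB for one hypothetical multi-threaded WU using all threads; all numbers are illustrative, taken from this discussion):

# Can a typical ~1 GB/thread host actually fit the work? Illustrative numbers only.
HOST_RAM_GB, THREADS = 8, 8

st_demand = THREADS * 3.0   # 8 independent single-threaded "big" WUs at ~3 GB each
mt_demand = 6.0             # one hypothetical multi-threaded WU sharing its data across all 8 threads

print(f"single-threaded: need {st_demand} GB of {HOST_RAM_GB} GB -> fits: {st_demand <= HOST_RAM_GB}")
print(f"multi-threaded:  need {mt_demand} GB of {HOST_RAM_GB} GB -> fits: {mt_demand <= HOST_RAM_GB}")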
48) Message boards : Number crunching : No "finished" file (Message 91744)
Posted 19 Feb 2020 by Mad_Max
Post:
Yes, it is somehow related to disk speed and occurs much less frequently on SSDs, but it still happens sometimes even on SSDs.
On an HDD with a lot of concurrent R@H WUs running it happens much more often.

Looks like the root of the problem is a really old bug somewhere in the Rosetta software which causes the app to crash if it cannot write to disk immediately, instead of just waiting a few seconds while the disk is busy handling other requests.
But the devs have not bothered to track it down and fix it, so it has kept crashing the app and wasting generated results for years now.

Moving data to SSDs, enabling the disk write cache, reducing the max_concurrent running tasks, etc. - these are all just partial workarounds (they help mitigate the problem, but not 100%); they do not fix the problem itself.
49) Message boards : Number crunching : Stalled downloads (Message 91743)
Posted 19 Feb 2020 by Mad_Max
Post:
Yep, same shit here. Stuck downloads (which stop the flow of work for R@H, as BOINC stops getting new work from R@H and switches to the backup project - WCG in my case) every few days.
It has happened 4 or 5 times since the beginning of February.
50) Message boards : Number crunching : Huge RAM usage by some of latest WUs (Message 91680)
Posted 12 Feb 2020 by Mad_Max
Post:
The longer they run, the more RAM they consume.



Now > 3000 MB per WU after ~5 hours of running.
rb_02_08_15652_15556__t000__0_C3_SAVE_ALL_OUT_IGNORE_THE_REST_891233_7217
and
rb_02_08_15652_15556__t000__0_C3_SAVE_ALL_OUT_IGNORE_THE_REST_891233_7469

Looks much like a memory leak. But it is not linear - RAM usage jumps after each stage of computation finishes and a new one begins.
Smells like data/objects not being released properly after use.
51) Message boards : Number crunching : No "finished" file (Message 91679)
Posted 12 Feb 2020 by Mad_Max
Post:
This error has been around for years now.
It happens from time to time. There is no clear way to fix it.

No need to do a full reset of the project.
A simple BOINC restart (not just the manager aka GUI, but a full restart) or a computer reboot fixes it too. But it will return again after some time.
52) Message boards : Number crunching : Huge RAM usage by some of latest WUs (Message 91678)
Posted 12 Feb 2020 by Mad_Max
Post:
Hello.

One of my computers crashed today. I started digging into why - it was out of RAM.
A second one was in a "swap of death" state (swapping non-stop for hours while doing almost no useful work).
More digging - the reason for the out-of-RAM and non-stop swapping was Rosetta.

I see HUGE RAM usage by some of the latest WUs. From 1.5 to 3.5 GB of RAM per working WU.

You can see a lot of tasks using 1400-1600 MB of RAM currently, with ~2800 MB of RAM as a peak value.
A few tasks peaked at ~3200-3500 MB before the system crashed after running out of both RAM and disk swap space.

Usual consumption for R@H is in the 300-1000 MB range. Are these WUs something completely new?
Or is it just a bug like a memory leak?

They are all Rosetta 4.07 WUs and the names start with "rb_02_xx" (where xx = 29, 08, 08 and 10).
I guess they are Robetta WUs generated on 29 Jan, 08 Feb, 09 Feb and 10 Feb.

I was forced to limit the maximum number of concurrently running R@H units using the "max_concurrent" setting in app_config.xml.
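
For anyone who wants to do the same, a minimal app_config.xml along those lines looks roughly like this (a sketch: the app name "rosetta" and the limit of 4 are assumptions - check client_state.xml for the exact app name on your host and pick your own limit):

<app_config>
  <app>
    <name>rosetta</name>
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>

It goes into the Rosetta@home project folder inside the BOINC data directory and takes effect after "Options -> Read config files" in the manager, or after a client restart.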

Some example WUs
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121861215
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121861165
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121861118
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121861128
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121861130
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121861138
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121861090
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121861114
http://boinc.bakerlab.org/rosetta/result.php?resultid=1121613378
53) Message boards : Number crunching : File transfers. (Message 91677)
Posted 12 Feb 2020 by Mad_Max
Post:
I have also had a few stuck files in the last few days.
And BOINC also stopped getting new work from R@H completely until I noticed it today and aborted the stuck file transfers.
54) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 91676)
Posted 12 Feb 2020 by Mad_Max
Post:
If retrying the download does not help, aborting the file transfer will usually work.
The corresponding task will fail, but BOINC is smart enough to abort such tasks without trying to run them.
So no computation is wasted.

P.S.
I have also had a few stuck files in the last few days (the previous such case was about a year ago).
I think one of the files was exactly the same file. And BOINC also stopped getting new work from R@H until I noticed it today and aborted the stuck file transfer.

One of the tasks with "stuck" downloads: http://boinc.bakerlab.org/rosetta/result.php?resultid=1121514493
55) Message boards : Number crunching : Minirosetta 3.73-3.78 (Message 88088)
Posted 17 Jan 2018 by Mad_Max
Post:
I have not seen such memory leaks lately either.

About AMD CPU performance - I do not know. I do not have any of the latest AMD CPUs (from the Ryzen family) yet.
I am still using older CPUs: one Phenom II X6 and two FX-8320 (Vishera/Piledriver), and I have not seen any performance issues with these older AMD CPUs in Rosetta: they are almost on par with corresponding (same generation/age and same core count) Intel CPUs.
56) Message boards : Number crunching : Minirosetta 3.73-3.78 (Message 88042)
Posted 9 Jan 2018 by Mad_Max
Post:
Looks like something is wrong with the rb_01_08_.... series of WUs on minirosetta 3.78 (rb_01_08_77806_122534__t000__2_C1_SAVE_ALL_OUT_IGNORE_THE_REST_541301_331_0 is the latest example).

I have seen some of these tasks consuming a huge amount of RAM - they start in the standard 200-400 MB range, but at some point can hoard up to 1400-1800 MB per task. Maybe even more - it crashed due to running out of RAM (8 GB RAM + 4 GB page/swap file on a 6-core CPU).
57) Message boards : Number crunching : Minirosetta 3.73-3.78 (Message 87546)
Posted 20 Oct 2017 by Mad_Max
Post:
Sometimes minirosetta somehow loses calculated results at task restarts (e.g. a computer or BOINC reboot, or just a switch to another project if a few are running on the same CPU).
I am not talking about checkpoints in the middle of a model calculation, but about entire models which were already successfully calculated but did not get reported to the server.

Here is an example: http://boinc.bakerlab.org/result.php?resultid=948310700
======================================================
DONE :: 133 starting structures 28635.4 cpu seconds
This process generated 133 decoys from 133 attempts
======================================================

But after a task restart (NOT a crash/hang, just a normal correct restart where the task is unloaded from memory and loaded back from disk later) only the one last model (decoy) was reported to the server.
======================================================
DONE :: 1 starting structures 28382.7 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================


All 133 previously calculated decoys were lost.

This does not happen often, but I see such tasks from time to time (maybe 1-2 per week).
The best way to track down (search for) such tasks is to query the database for VALID tasks with abnormally low credit compared to the CPU time used - because credit is calculated in proportion to the number of decoys reported, and if many decoys were lost, the granted credit will be abnormally low.
58) Message boards : Rosetta@home Science : Design of protein-protein interfaces (Message 75686)
Posted 30 May 2013 by Mad_Max
Post:
2 moody
Thanks for the science update!
59) Message boards : Number crunching : Rosetta@home FAQs - They are soooo out of date! (Message 75515)
Posted 27 Apr 2013 by Mad_Max
Post:
These values should be enough, except for RAM.
1 GB of RAM is the minimum for old one-core/one-thread computers only.
The real RAM requirement is 0.5 GB of RAM + 0.5 GB x the number of CPU threads as a very minimum (1.5 GB for 2-core CPUs like Celeron/Pentium/Athlon, 2.5 GB for i3/i5/Phenom X4/FX-4xxx and so on).
And 1 GB of RAM per CPU thread is the recommended value (if the owner wants to use the computer for something other than R@H calculations exclusively).
If the computer is dedicated to R@H, 0.5 GB/thread may be suitable.
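
Expressed as a quick sketch (the constants are just the rule-of-thumb numbers from this post, not official project requirements):

# Rule-of-thumb RAM sizing from the post above, in GB. Illustrative only.
def ram_needed_gb(threads):
    minimum = 0.5 + 0.5 * threads   # very minimum: 0.5 GB base + 0.5 GB per CPU thread
    recommended = 1.0 * threads     # if the computer is also used for other things
    dedicated = 0.5 * threads       # computer used exclusively for R@H
    return minimum, recommended, dedicated

for t in (2, 4, 8):
    mn, rec, ded = ram_needed_gb(t)
    print(f"{t} threads: minimum {mn} GB, recommended {rec} GB, dedicated host {ded} GB")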
60) Message boards : Number crunching : Client errors (Message 75228)
Posted 12 Mar 2013 by Mad_Max
Post:
Another team member tried NV driver 314.07. And it did NOT help in his case: http://boinc.bakerlab.org/rosetta/results.php?hostid=1555324

Will try upgrading BOINC now.

P.S.
The main difference between the computers: GTX 6xx (Kepler) cards in the first one (where the 314.07 drivers helped) and a GTX 580 (Fermi) in the second (where they did not help).

