Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Sorry, I didn't specifically intend FLOPS. Was rather trying to vaguely refer to the clock speed of the CPU involved. The comparisons on number of models processed are only meaningful if the tasks were of the same batch of work. Rosetta Moderator: Mod.Sense |
Simplex0 Joined: 13 Jun 18 Posts: 14 Credit: 1,714,717 RAC: 0 |
If running fewer concurrent tasks improves credit per minute, it would imply either memory contention, or L2 cache contention. If all of the cores are operating in the same L2 cache, then you can see how that would become the constrained resource. I don't mean to say there is anything wrong with a given computer, just that R@h is very memory intensive. Also others have found that machines with larger L2 caches seem to yield more credit per FLOPS rating per runtime minute. Let's compare a Threadripper 1950X (16 cores, L2 cache 16 x 512 KB) and an Intel Celeron G1620 (2 cores, L2 cache 512 KB). The L2 cache per core on the Intel Celeron G1620 is actually half that of the Threadripper 1950X, yet a Threadripper 1950X running Rosetta on all cores is being outperformed by an Intel Celeron G1620. Does that make any sense to you? Take a look at the computer ranked number 51 here: https://boinc.bakerlab.org/rosetta/top_hosts.php?sort_by=expavg_credit&offset=40 My experience with Rosetta while running a Threadripper 1950X @ 4 GHz at full load on 31 threads, 16 hours/day, is as follows. Target CPU run time 1 hour: credit per day = 23400. Target CPU run time 2 hours: credit per day = 28100. Target CPU run time 4 hours: credit per day = 16700. Target CPU run time 8 hours: credit per day = 13700. The credit you get for running a given CPU in Rosetta is a lottery: you have no idea what to expect, yet staff/moderators here claim that the credit given for work done works as it should. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,229,863 RAC: 6,747 |
I agree that the comparison is only meaningful if there is a similar amount of work being done. Glad you didn't really mean FLOPS. Sometimes FLOPS is a good metric. For Rosetta, it seems like decoys is. My problem, after examining this one case, is that the "credit algorithm" is broken. The repeated request from your volunteers seems to be a simple request to examine this more closely. IMO, any computer running any WU for 8 hours, solving 83% as many decoys and taking slightly more memory, would expect to see about 80% of the credit. It is very hard for me to think of any case where that same 8 hours of compute should only be worth 17% of the credit. |
4LG5zSZM7uiF1nVGZVqTRrjkXA6i Joined: 7 Mar 10 Posts: 14 Credit: 111,252,570 RAC: 0 |
I've also seen this, even on Android devices. WU1: Run time 12 hours 3 min 35 sec; CPU time 11 hours 54 min 52 sec; Validate state Valid; Credit 50.04; Peak working set size 337.73 MB; Peak swap size 394.07 MB; Peak disk usage 576.05 MB; DONE :: 1 starting structures 42892.1 cpu seconds; This process generated 121 decoys from 121 attempts. WU2: Run time 12 hours 2 min 22 sec; CPU time 11 hours 54 min; Validate state Valid; Credit 244.01; Peak working set size 333.51 MB; Peak swap size 390.57 MB; Peak disk usage 576.05 MB; DONE :: 1 starting structures 42840.4 cpu seconds; This process generated 117 decoys from 117 attempts. WU3: Run time 12 hours 6 min 5 sec; CPU time 11 hours 57 min 23 sec; Validate state Valid; Credit 239.91; Peak working set size 330.57 MB; Peak swap size 391.07 MB; Peak disk usage 576.06 MB; DONE :: 1 starting structures 43043.8 cpu seconds; This process generated 115 decoys from 115 attempts. So the WU with the most decoys had less than 25% of the credit of the others. The CPU times were all within about 3.5 minutes of each other. All three WUs were part of the same family as well: cispro_backbone_generation_8mers_largerun_PXXPXXXX_SAVE_ALL_OUT. These three were run on the same device at the same time. Sometimes I get two WUs with low credit and one with high; in this case it was two high and one low. The device was not being used at all; it was just sitting on a desk plugged into a charger. |
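The comparison in the post above can be made concrete by normalizing credit by decoys and by CPU time. The following is only a sketch using the figures quoted in the post; the struct and function names are made up for illustration and have nothing to do with how the Rosetta@home server actually computes credit.

```cpp
#include <string>
#include <vector>

// One finished work unit, using the figures quoted in the post above.
// (Hypothetical type, for illustration only.)
struct WorkUnit {
    std::string name;
    double credit;       // granted credit
    double cpu_seconds;  // "cpu seconds" from the DONE line
    int decoys;          // "generated N decoys"
};

// Credit granted per decoy produced.
double credit_per_decoy(const WorkUnit& wu) {
    return wu.credit / wu.decoys;
}

// Credit granted per hour of CPU time.
double credit_per_cpu_hour(const WorkUnit& wu) {
    return wu.credit / (wu.cpu_seconds / 3600.0);
}

// The three tasks reported in the post.
std::vector<WorkUnit> reported_work_units() {
    return {
        {"WU1", 50.04, 42892.1, 121},
        {"WU2", 244.01, 42840.4, 117},
        {"WU3", 239.91, 43043.8, 115},
    };
}
```

With these numbers WU1 comes out at roughly 0.41 credits per decoy, while WU2 and WU3 both come out at roughly 2.09, so the low-credit task earned about a fifth as much per decoy despite near-identical CPU time.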
Simplex0 Joined: 13 Jun 18 Posts: 14 Credit: 1,714,717 RAC: 0 |
In case anyone on the Rosetta staff cares: members' computers are running a lot of workunits for 5 to 6 hours each, and all that work is wasted because the results end up as "Invalid". The workunits all have the word 'aivan' in their name. The last time I spotted them I had downloaded 200 workunits, but I aborted the rest after the first 5 had finished after 5 hours, even though my "Target CPU run time" setting is 2 hours, and they all ended up as "Invalid". Here is one of the tasks. Task 1015234886 Name T1000_full3_aivan_SAVE_ALL_OUT_03_09_677955_4155_0 Workunit 914799018 Created 14 Jul 2018, 5:47:44 UTC Sent 14 Jul 2018, 6:11:13 UTC Report deadline 22 Jul 2018, 6:11:13 UTC Received 14 Jul 2018, 16:21:25 UTC Server state Over Outcome Validate error Client state Done Exit status 0 (0x00000000) Computer ID 3418863 Run time 6 hours 11 min 41 sec CPU time 6 hours 8 min 50 sec Validate state Invalid Credit 0.00 Device peak FLOPS 4.91 GFLOPS Application version Rosetta v4.07 windows_intelx86 Peak working set size 235.55 MB Peak swap size 229.32 MB Peak disk usage 514.59 MB Stderr output <core_client_version>7.10.2</core_client_version> <![CDATA[ <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe @T1000.3.flags -in:file:boinc_wu_zip T1000.3.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3683496 Starting watchdog... Watchdog active. BOINC:: CPU time: 22129.6s, 14400s + 7200s[2018- 7-14 18:21:12:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent.
Writing W_0000001 ====================================================== DONE :: 1 starting structures 22129.6 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 18:21:12 (14532): called boinc_finish(0) </stderr_txt> ]]> |
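One detail worth checking in a stderr dump like the one above is the run time the task was actually launched with. The quoted command line contains `-cpu_run_time 28800`, which is 8 hours rather than the 2 hours the poster says is configured. The sketch below just pulls a numeric flag value out of a command-line string; the helper names are hypothetical, and whether this flag alone explains the long run is not established anywhere in the thread.

```cpp
#include <sstream>
#include <string>

// Return the integer that follows `flag` in a whitespace-separated command
// line, or `fallback` if the flag is absent or is not followed by a number.
// (Hypothetical helper, not part of BOINC or Rosetta.)
long flag_value(const std::string& command_line, const std::string& flag,
                long fallback) {
    std::istringstream tokens(command_line);
    std::string token;
    while (tokens >> token) {
        if (token == flag) {
            long value = 0;
            if (tokens >> value) return value;
            return fallback;
        }
    }
    return fallback;
}

// Convert a duration in seconds to hours.
double seconds_to_hours(long seconds) {
    return seconds / 3600.0;
}
```

Calling `flag_value(cmd, "-cpu_run_time", -1)` on the command line from the post yields 28800, and `seconds_to_hours(28800)` gives 8 hours.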
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I just did a clean install of Ubuntu 18.04.1 on a Ryzen 1700 and thought I would try Rosetta again. To my surprise, all the downloads (two batches of eight each, with both 3.78 and 4.07 in each batch) failed with download errors before even getting a chance to run. I then recalled the fix for the x64 problem with Rosetta 4.07, but that did not affect downloads and did not affect 3.78 (so I thought). https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88954#88954 Wrong. Applying the fix solved the download problem, and now two 3.78 and two 4.07 are running fine. (But it is absurd for crunchers to have to fix a basic problem three months after the release of a major OS upgrade.) |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,229,863 RAC: 6,747 |
I just did a clean install of Ubuntu 18.04.1 on a Ryzen 1700 and thought I would try Rosetta again. To my surprise, all the downloads (two batches of eight each, with both 3.78 and 4.07 in each batch) failed with download errors before even getting a chance to run. I then recalled the fix for the x64 problem with Rosetta 4.07, but that did not affect downloads and did not affect 3.78 (so I thought). I am pretty sure the only people who use this STATIC version with the bug are the volunteer contributors. Rosetta partners build the binaries on their own systems and probably build the dynamic version. If their build caused a problem, it would get fixed immediately. The UoW systems are all running the "mature old" versions of Linux OR an internal dynamic version and don't see the problem. Hmmm. My prediction: When Rosetta builds with the new glibc version that works with the new distributions, they will cause havoc with ALL the old currently installed 2.26 version systems. I think the fix is trivial, but it will be interesting to see how they choose to mess it up. 8-) |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
My prediction: Actually, that is a bit reassuring. I was wondering if they even knew about the bug at all. At least we will know (the hard way) when they attempt a fix. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,229,863 RAC: 6,747 |
My prediction: Interestingly, the 2.27 glibc version contains support for static-linked PIE (position-independent executable) binaries and many glibc AVX/FMA optimizations. Relinking could fix the bug and possibly make execution of some math functions faster. https://www.phoronix.com/scan.php?page=news_item&px=GNU-Glibc-2.27-Released NEWS for version 2.27 ===================== Major new features: * The GNU C Library can now be compiled with support for building static PIE executables (See --enable-static-pie in INSTALL). These static PIE executables are like static executables but can be loaded at any address and provide additional security hardening benefits at the cost of some memory and performance. When the library is built with --enable-static-pie the resulting libc.a is usable with GCC 8 and above to create static PIE executables using the GCC option '-static-pie'. This feature is currently supported on i386, x86_64 and x32 with binutils 2.29 or later, and on aarch64 with binutils 2.30 or later. * Optimized x86-64 asin, atan2, exp, expf, log, pow, atan, sin, cosf, sinf, sincosf and tan with FMA, contributed by Arjan van de Ven and H.J. Lu from Intel. * Optimized x86-64 trunc and truncf for processors with SSE4.1. * Optimized generic expf, exp2f, logf, log2f, powf, sinf, cosf and sincosf. * In order to support faster and safer process termination the malloc API family of functions will no longer print a failure address and stack backtrace after detecting heap corruption. The goal is to minimize the amount of work done after corruption is detected and to avoid potential security issues in continued process execution. Reducing shutdown time leads to lower overall process restart latency, so there is benefit both from a security and performance perspective. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I hope it all works. I am getting into even deeper waters. I attached to Rosetta on my i7-8700 again, and even with "the x64 fix", it still errors out on all 4.07. It is getting beyond my abilities to cope. |
Calcii Joined: 24 Jan 12 Posts: 1 Credit: 10,161,945 RAC: 963 |
If you look at the top ten computers (https://boinc.bakerlab.org/rosetta/top_hosts.php?sort_by=expavg_credit&offset=0), the first 4 places are occupied by [DPC] Nifhack with AMD: AuthenticAMD AMD EPYC 7551P 32-Core Processor [Family 23 Model 1 Stepping 2] (64 processors), with: 1) 387,860.43 PPD 2) 264,083.61 PPD 3) 249,572.42 PPD 4) 187,924.58 PPD. How is that possible? A GenuineIntel Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz [Family 6 Model 62 Stepping 4] (120 processors) gets only 41,616.11 PPD, and a GenuineIntel Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz [Family 6 Model 79 Stepping 1] (72 processors, 2x 18-core processors) gets only 41,217.95 PPD. Maybe AMD's architecture really is 10x better than Intel's? But then, in 34th place, an AuthenticAMD AMD EPYC 7401P 24-Core Processor [Family 23 Model 1 Stepping 2] (48 processors) gets 17,598.95 PPD, and in 32nd place an AuthenticAMD AMD Ryzen Threadripper 1950X 16-Core Processor [Family 23 Model 1 Stepping 1] (32 processors) gets 17,274.88 PPD. Have [DPC] Nifhack's statistics been hacked, or what is going on? Can anyone who understands this explain? |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 10,612 |
Where did all the WUs go? There were loads to download the last time I looked. Now none. |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 10,612 |
Where did all the WUs go? There were loads to download the last time I looked. Now none. Cancel that (maybe). 6k+ just came back |
fcbrants Joined: 25 Mar 13 Posts: 13 Credit: 3,933,177 RAC: 0 |
Hello!! I'm getting an error, "<message>finish file present too long</message> after a WU has completed. Win 7 SP2, Fully up2date 64 Bit, BOINC 7.12.1 x64, Rosetta Mini v3.78 windows_intelx86 This machine is my "daily driver" & I use it for other non-demanding tasks. I WAS running BOINC on 100% of the CPU's & the machine DID get sluggish at times. https://boinc.bakerlab.org/show_host_detail.php?hostid=1606827 https://boinc.bakerlab.org/rosetta/results.php?userid=472655&offset=0&show_names=0&state=6&appid= I've searched this forum & couldn't find an answer... This appears to be an old issue that's ostensibly been resolved, and it hasn't been discussed since 2016: https://www.google.com/search?ei=ApjHW7A955e2BeO1tPAL&q=%22finish+file+present+too+long%22&oq=%22finish+file+present+too+long%22&gs_l=psy-ab.3..0i7i30.967651.971417..972092...0.0..0.54.103.2......0....1..gws-wiz.......0i71j0i22i30.6CEhGn3jHEY <core_client_version>7.12.1</core_client_version> <![CDATA[ <message> finish file present too long</message> I'm backing the "Use at Most" CPU's to 93.75% (30 of 32 threads, 15/16 cores) & setting the project to No New Tasks until I can get this resolved. I did find messages in the error log: 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva126_mut_5_9tcssm_73_K_0251_0001_0006_fragments_relax_SAVE_ALL_OUT_700558_46_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva118_mut_5_8tcssm_73_K_0251_0001_0008_fragments_relax_SAVE_ALL_OUT_700549_46_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva76_mut_5_4tcssm_73_K_0251_0001_0006_fragments_fold_SAVE_ALL_OUT_700633_96_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG99_DEHCFABG_s00880005aa336a56_0002_0001_fragments_fold_SAVE_ALL_OUT_700443_272_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva2_mut_5_10tcssm_61_K_0251_0001_0005_fragments_fold_SAVE_ALL_OUT_700582_96_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG62_CDEFGBAH_s006200026a3c1c99_0001_0001_fragments_fold_SAVE_ALL_OUT_700403_280_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva117_mut_5_8tcssm_73_K_0251_0001_0007_fragments_fold_SAVE_ALL_OUT_700548_115_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG55_BEDGHCFA_s00450008aa336a56_0002_0001_fragments_fold_SAVE_ALL_OUT_700395_280_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG88_DEFGBAHC_s008200032b122311_0001_0001_fragments_fold_SAVE_ALL_OUT_700431_280_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva53_mut_5_2tcssm_73_K_0251_0001_0003_fragments_fold_SAVE_ALL_OUT_700608_140_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva130_mut_5_9tcssm_73_K_0251_0001_0010_fragments_fold_SAVE_ALL_OUT_700563_140_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task TJ54_IGGH_A_B_C_F_G_D_7delE_0251_0001_0004_fragments_fold_SAVE_ALL_OUT_700139_290_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva118_mut_5_8tcssm_73_K_0251_0001_0008_fragments_fold_SAVE_ALL_OUT_700549_140_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG60_BGDEHCFA_s005600026a3c1c99_0002_0001_fragments_fold_SAVE_ALL_OUT_700401_299_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva112_mut_5_8tcssm_73_K_0251_0001_0002_fragments_fold_SAVE_ALL_OUT_700543_164_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva21_mut_5_6tcssm_61_K_0251_0001_0007_fragments_fold_SAVE_ALL_OUT_700573_189_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva94_mut_5_6tcssm_73_K_0251_0001_0004_fragments_fold_SAVE_ALL_OUT_700653_188_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva25_mut_5_8tcssm_61_K_0251_0001_0003_fragments_fold_SAVE_ALL_OUT_700577_226_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG22_ADEHCFGB_s00150009aa336a56_0001_0001_fragments_fold_SAVE_ALL_OUT_700323_321_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva3_mut_5_10tcssm_61_K_0251_0001_0008_fragments_fold_SAVE_ALL_OUT_700593_226_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG240_GDEFCHAB_s015900036a3c1c99_0003_0001_fragments_fold_SAVE_ALL_OUT_700335_321_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva14_mut_5_4tcssm_61_K_0251_0001_0007_fragments_fold_SAVE_ALL_OUT_700565_226_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva53_mut_5_2tcssm_73_K_0251_0001_0003_fragments_fold_SAVE_ALL_OUT_700608_246_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva15_mut_5_4tcssm_61_K_0251_0001_0009_fragments_fold_SAVE_ALL_OUT_700566_247_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task jelva78_mut_5_4tcssm_73_K_0251_0001_0008_fragments_fold_SAVE_ALL_OUT_700635_246_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. 10/22/2018 7:47:43 AM | Rosetta@home | Task TJ73_IGGH_A_B_C_F_G_D_E_8_0251_0008_fragments_fold_SAVE_ALL_OUT_700160_329_0 exited with zero status but no 'finished' file 10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project. Most of the tasks mentioned in the error log do NOT show up here: https://boinc.bakerlab.org/rosetta/results.php?userid=472655&offset=0&show_names=0&state=6&appid= But this one does: https://boinc.bakerlab.org/result.php?resultid=1036002868 Please let me know what I can do to help troubleshoot the problem. Thanks!! Franko |
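When the event log is this long, it can help to tally the warnings per task before comparing against the results page. The sketch below is one way to do that. It assumes one log entry per line in the `date time | project | message` layout shown in the post, and the function name is made up for illustration.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Count, per task name, the event-log lines that report
// "exited with zero status but no 'finished' file".
// Assumes one entry per line, laid out as in the post:
//   <date> <time> | <project> | Task <name> exited with zero status ...
// (Hypothetical helper, not part of the BOINC client.)
std::map<std::string, int> count_missing_finish_file(const std::string& log) {
    static const std::string prefix = "| Task ";
    static const std::string marker =
        " exited with zero status but no 'finished' file";
    std::map<std::string, int> counts;
    std::istringstream lines(log);
    std::string line;
    while (std::getline(lines, line)) {
        const std::size_t start = line.find(prefix);
        const std::size_t end = line.find(marker);
        // Skip lines that are not this warning (e.g. the "reset" hint).
        if (start == std::string::npos || end == std::string::npos) continue;
        const std::size_t name_begin = start + prefix.size();
        if (end <= name_begin) continue;
        ++counts[line.substr(name_begin, end - name_begin)];
    }
    return counts;
}
```

The resulting map can then be compared by task name against the list on the results page, to see which of the warned tasks were actually reported as errors.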
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 10,612 |
Where did all the WUs go? There were loads to download the last time I looked. Now none. Went up to 14k tasks, then all gone again. Something weird happening. |
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,338,560 RAC: 2,014 |
If you look at the top ten computers https://boinc.bakerlab.org/rosetta/top_hosts.php?sort_by=expavg_credit&offset=0, the first 4 places are occupied by [DPC] Nifhack with AMD: [snip] Looks like the main limitation of CPUs with this many processors is not the number of processors, but the speed of the memory that all the processors in the same package share. If so, some of these processors could even be beyond the point where arbitrating which processor gets to make the next memory access takes up enough of the run time to cause a significant slowdown. You might also look up the cache size inside each of these CPUs; competing for cache space could also cause a significant slowdown. |
Sam Joined: 9 Mar 06 Posts: 3 Credit: 3,350,780 RAC: 986 |
Hi Franko, I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' on time for BOINC. You can just ignore it, because most of the time your workunits are fine. Sjmielh |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' on time for Boinc. That is an interesting thought. I have not seen that error for a long time, and I now use only SSDs on all my machines. Also, I usually use a write-cache (or ramdisk), so most of my writes and even reads are from main memory. I think that does it. |
fcbrants Joined: 25 Mar 13 Posts: 13 Credit: 3,933,177 RAC: 0 |
This machine uses two SSDs in RAID 0 on a Dell PERC H710 RAID card with 1 GB of RAM (which could be the source of the problem), with the write policy set to "Write Back", which is defined as, "In Write Back mode the controller sends a data transfer completion signal to the host when the controller cache has received all of the data in a transaction." For some reason, Windows Explorer (Exploder?) hangs when this machine is NOT under load AND I have several Windows Explorer windows open. Is there a way to increase this timeout to accommodate this machine's peculiarities? Thanks!! Franko I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' on time for Boinc. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,229,863 RAC: 6,747 |
The error message is displayed by the BOINC Client. I think it is just a BOINC Client timing issue that they have declared "fixed" several times. I don't think it is ever a problem, just annoying.

client/app_control.cpp

    // Check for finish files every 10 sec.
    // If we already found a finish file, abort the app;
    // it must be hung somewhere in boinc_finish();
    //
    static double last_finish_check_time = 0;
    if (gstate.clock_change || gstate.now - last_finish_check_time > 10) {
        last_finish_check_time = gstate.now;
        for (i=0; i<active_tasks.size(); i++) {
            ACTIVE_TASK* atp = active_tasks[i];
            if (atp->task_state() == PROCESS_UNINITIALIZED) continue;
            if (atp->finish_file_time) {
                // process is still there 10 sec after it wrote finish file.
                // abort the job
                atp->abort_task(EXIT_ABORTED_BY_CLIENT, "finish file present too long");  // <<<<<<<<<<<< line 140
            } else if (atp->finish_file_present()) {
                atp->finish_file_time = gstate.now;
            }
        }
    }

 |
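To see why a slow exit trips this check, it may help to step through the same logic outside the client. The sketch below is a simplified model, not BOINC code: `Task`, its fields, and `check_finish_files` are hypothetical stand-ins, and the `running` flag is an assumption standing in for however the real client stops considering a task once its process has exited. The 10-second interval is the literal from the quoted snippet.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for BOINC's ACTIVE_TASK; only the fields needed here.
struct Task {
    std::string name;
    bool running = true;               // process has not exited yet
    bool finish_file_present = false;  // app has written its finish file
    double finish_file_time = 0;       // when the client first saw the file
    bool aborted = false;
};

// One polling pass, mirroring the quoted logic: a finish file that was
// already seen on an earlier pass, with the process still running,
// leads to an abort. Returns the number of tasks aborted in this pass.
int check_finish_files(std::vector<Task>& tasks, double now,
                       double& last_check_time) {
    const double kCheckInterval = 10.0;  // seconds, as in the quoted code
    if (now - last_check_time <= kCheckInterval) return 0;
    last_check_time = now;
    int aborted = 0;
    for (Task& task : tasks) {
        if (!task.running) continue;
        if (task.finish_file_time != 0) {
            // Still alive a full polling interval after the file was seen.
            task.aborted = true;  // "finish file present too long"
            task.running = false;
            ++aborted;
        } else if (task.finish_file_present) {
            task.finish_file_time = now;
        }
    }
    return aborted;
}
```

In this model a task that writes its finish file and exits before the next pass is never touched, while one whose process lingers (for example because the machine or disk is busy, as suggested earlier in the thread) is aborted on the following pass.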
©2024 University of Washington
https://www.bakerlab.org