Problems and Technical Issues with Rosetta@home

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 89227 - Posted: 6 Jul 2018, 18:35:35 UTC - in response to Message 89225.  
Last modified: 6 Jul 2018, 19:17:11 UTC


BTW, there is NEAR ZERO Floating point operations in Rosetta code.


Sorry, I didn't specifically intend FLOPS. Was rather trying to vaguely refer to the clock speed of the CPU involved.

The comparisons on number of models processed are only meaningful if the tasks were of the same batch of work.
Rosetta Moderator: Mod.Sense
ID: 89227 · Rating: 0
Simplex0
Joined: 13 Jun 18
Posts: 14
Credit: 1,714,717
RAC: 0
Message 89228 - Posted: 6 Jul 2018, 19:00:55 UTC - in response to Message 89220.  
Last modified: 6 Jul 2018, 19:03:13 UTC

If running fewer concurrent tasks improves credit per minute, it would imply either memory contention, or L2 cache contention. If all of the cores are operating in the same L2 cache, then you can see how that would become the constrained resource. I don't mean to say there is anything wrong with a given computer, just that R@h is very memory intensive. Also others have found that machines with larger L2 caches seem to yield more credit per FLOPS rating per runtime minute.


Let's compare

Threadripper 1950X, 16 core, L2-cache size 16 x 512 KB

and

Intel Celeron G1620, 2 core, L2 cache size 512 KB

The L2 cache per core on the Intel Celeron G1620 is actually half that of the Threadripper 1950X, yet a Threadripper 1950X running Rosetta on all cores is being outperformed by an Intel Celeron G1620.
Does that make any sense to you?

Take a look at computer ranked as number 51 here https://boinc.bakerlab.org/rosetta/top_hosts.php?sort_by=expavg_credit&offset=40

My experience with Rosetta while running a Threadripper 1950X @ 4 GHz at full load on 31 threads, 16 hours/day, is as follows:

Target CPU run time 1 hour: credit per day = 23400
Target CPU run time 2 hours: credit per day = 28100
Target CPU run time 4 hours: credit per day = 16700
Target CPU run time 8 hours: credit per day = 13700
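To put the four figures above on a common scale, here is a minimal sketch that expresses each setting as a share of the best one. The numbers are the ones reported in this post; the function name `relative_to_best` is only illustrative, and a single host's sample says nothing about why the settings differ.

```python
# Credit per day reported in the post, keyed by "Target CPU run time" in hours.
credit_per_day = {1: 23400, 2: 28100, 4: 16700, 8: 13700}


def relative_to_best(table):
    """Return each setting's credit as a fraction of the best setting's credit."""
    best = max(table.values())
    return {hours: credit / best for hours, credit in table.items()}


best_hours = max(credit_per_day, key=credit_per_day.get)
for hours, share in sorted(relative_to_best(credit_per_day).items()):
    print(f"{hours} h target: {credit_per_day[hours]} credit/day ({share:.0%} of best)")
print(f"best setting in this sample: {best_hours} h")
```

On these numbers the 8-hour setting earns a little under half of what the 2-hour setting earns per day.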

The credit you get for running a given CPU on Rosetta is a lottery: you have no idea what to expect, yet staff/moderators here claim that the credit awarded for work done works as it should.
ID: 89228 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,229,863
RAC: 6,747
Message 89230 - Posted: 7 Jul 2018, 13:52:06 UTC - in response to Message 89227.  


BTW, there is NEAR ZERO Floating point operations in Rosetta code.


Sorry, I didn't specifically intend FLOPS. Was rather trying to vaguely refer to the clock speed of the CPU involved.

The comparisons on number of models processed are only meaningful if the tasks were of the same batch of work.



I agree that the comparison is only meaningful if there is a similar amount of work being done. Glad you didn't really mean FLOPS. Sometimes FLOPS is a good metric. For Rosetta, decoys seem to be.

My conclusion, after examining this one case, is that the "credit algorithm" is broken.
The repeated request from your volunteers seems to be a simple request to examine this more closely.

IMO, any computer running any WU for 8 hours, solving 83% as many decoys and taking slightly more memory, would expect to see about 80% of the credit. It is very hard for me to think of any case where that same 8 hours of compute should be worth only 17% of the credit.
ID: 89230 · Rating: 0
4LG5zSZM7uiF1nVGZVqTRrjkXA6i
Joined: 7 Mar 10
Posts: 14
Credit: 111,252,570
RAC: 0
Message 89231 - Posted: 7 Jul 2018, 14:59:55 UTC

I've also seen this, even on Android devices.

WU1:
Run time	12 hours 3 min 35 sec
CPU time	11 hours 54 min 52 sec
Validate state	Valid
Credit	50.04
Peak working set size	337.73 MB
Peak swap size	394.07 MB
Peak disk usage	576.05 MB
DONE ::     1 starting structures  42892.1 cpu seconds
This process generated    121 decoys from     121 attempts


WU2:
Run time	12 hours 2 min 22 sec
CPU time	11 hours 54 min
Validate state	Valid
Credit	244.01
Peak working set size	333.51 MB
Peak swap size	390.57 MB
Peak disk usage	576.05 MB
DONE ::     1 starting structures  42840.4 cpu seconds
This process generated    117 decoys from     117 attempts


WU3:
Run time	12 hours 6 min 5 sec
CPU time	11 hours 57 min 23 sec
Validate state	Valid
Credit	239.91
Peak working set size	330.57 MB
Peak swap size	391.07 MB
Peak disk usage	576.06 MB
DONE ::     1 starting structures  43043.8 cpu seconds
This process generated    115 decoys from     115 attempts


So the WU with the most decoys got less than 25% of the credit of the others. CPU times were all within about three and a half minutes of each other. All three WUs were part of the same family as well: cispro_backbone_generation_8mers_largerun_PXXPXXXX_SAVE_ALL_OUT

These three were run on the same device at the same time. There are times when I get two WUs that have low value and one that is high; in this case, it was two high and one low. The device was not being used at all; it was just sitting on a desk plugged into a charger.
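The comparison between the three work units can be made explicit with a little arithmetic. This is a minimal sketch using only the credit, decoy count and CPU seconds listed above; the dictionary layout and function names are made up for illustration, and credit per decoy is simply one convenient yardstick, not the project's actual credit formula.

```python
# Figures copied from the three result listings in this post.
work_units = [
    {"name": "WU1", "credit": 50.04, "decoys": 121, "cpu_seconds": 42892.1},
    {"name": "WU2", "credit": 244.01, "decoys": 117, "cpu_seconds": 42840.4},
    {"name": "WU3", "credit": 239.91, "decoys": 115, "cpu_seconds": 43043.8},
]


def credit_per_decoy(wu):
    """Credit granted divided by the number of decoys produced."""
    return wu["credit"] / wu["decoys"]


def cpu_spread_minutes(wus):
    """Difference between the longest and shortest CPU time, in minutes."""
    seconds = [wu["cpu_seconds"] for wu in wus]
    return (max(seconds) - min(seconds)) / 60.0


for wu in work_units:
    print(f'{wu["name"]}: {credit_per_decoy(wu):.3f} credit per decoy')
print(f"WU1 credit as a share of WU2: {work_units[0]['credit'] / work_units[1]['credit']:.1%}")
print(f"CPU time spread: {cpu_spread_minutes(work_units):.1f} minutes")
```

WU1 works out to roughly 0.41 credit per decoy against roughly 2.09 for each of the other two, on nearly identical CPU time.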
ID: 89231 · Rating: 0
Simplex0
Joined: 13 Jun 18
Posts: 14
Credit: 1,714,717
RAC: 0
Message 89287 - Posted: 15 Jul 2018, 7:58:21 UTC

In case anyone on the Rosetta staff cares: members' computers are running a lot of work units for 5-6 hours each, and all of that work is wasted because the results end up as "Invalid".

The work units all have the word 'aivan' in their name. The last time I spotted them I had downloaded 200 work units, but I aborted all of them after the first 5 had finished. Those 5 ran for more than 5 hours even though my "Target CPU run time" setting is 2 hours, and they all ended up as "Invalid".

Here is one of the tasks.


Task 1015234886

Name T1000_full3_aivan_SAVE_ALL_OUT_03_09_677955_4155_0
Workunit 914799018
Created 14 Jul 2018, 5:47:44 UTC
Sent 14 Jul 2018, 6:11:13 UTC
Report deadline 22 Jul 2018, 6:11:13 UTC
Received 14 Jul 2018, 16:21:25 UTC
Server state Over
Outcome Validate error
Client state Done
Exit status 0 (0x00000000)
Computer ID 3418863
Run time 6 hours 11 min 41 sec
CPU time 6 hours 8 min 50 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 4.91 GFLOPS
Application version Rosetta v4.07
windows_intelx86
Peak working set size 235.55 MB
Peak swap size 229.32 MB
Peak disk usage 514.59 MB

Stderr output
<core_client_version>7.10.2</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe @T1000.3.flags -in:file:boinc_wu_zip T1000.3.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3683496
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 22129.6s, 14400s + 7200s[2018- 7-14 18:21:12:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 22129.6 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
18:21:12 (14532): called boinc_finish(0)

</stderr_txt>
]]>
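For anyone comparing many of these results, the two interesting lines of the stderr output can be pulled out mechanically. The sketch below is only that: the regular expressions are written against the lines shown in this post, and treating the two terms after the comma in the watchdog line as a time limit is an assumption, not something the log itself states.

```python
import re

# Lines copied from the stderr output in this post.
STDERR_SAMPLE = """\
BOINC:: CPU time: 22129.6s, 14400s + 7200s[2018- 7-14 18:21:12:] :: BOINC
InternalDecoyCount: 0 (GZ)
DONE :: 1 starting structures 22129.6 cpu seconds
This process generated 1 decoys from 1 attempts
"""

CPU_LINE = re.compile(r"CPU time:\s*([\d.]+)s,\s*([\d.]+)s\s*\+\s*([\d.]+)s")
DECOY_LINE = re.compile(r"This process generated\s+(\d+)\s+decoys from\s+(\d+)\s+attempts")


def summarize(stderr_text):
    """Extract CPU seconds, the summed terms of the watchdog line, and decoy counts."""
    summary = {}
    cpu = CPU_LINE.search(stderr_text)
    if cpu:
        used, first, second = (float(value) for value in cpu.groups())
        summary["cpu_seconds"] = used
        # Assumption: the two terms add up to a limit the watchdog enforces.
        summary["limit_seconds"] = first + second
        summary["over_limit"] = used > first + second
    decoys = DECOY_LINE.search(stderr_text)
    if decoys:
        summary["decoys"] = int(decoys.group(1))
        summary["attempts"] = int(decoys.group(2))
    return summary


print(summarize(STDERR_SAMPLE))
```

Under that reading, the task had used slightly more CPU time than the sum of the two terms (21600 s) when it finished with a single decoy.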
ID: 89287 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 89369 - Posted: 2 Aug 2018, 18:15:17 UTC

I just did a clean install of Ubuntu 18.04.1 on a Ryzen 1700 and thought I would try Rosetta again. To my surprise, all the downloads (two batches of eight each, with both 3.78 and 4.07 in each batch) failed with download errors before even getting a chance to run. I then recalled the fix for the x64 problem with Rosetta 4.07, but that did not affect downloads and did not affect 3.78 (so I thought).
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88954#88954

Wrong. Applying the fix solved the download problem, and now two 3.78 and two 4.07 are running fine.

(But it is absurd for crunchers to have to fix a basic problem three months after the release of a major OS upgrade.)
ID: 89369 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,229,863
RAC: 6,747
Message 89371 - Posted: 3 Aug 2018, 16:42:05 UTC - in response to Message 89369.  

I just did a clean install of Ubuntu 18.04.1 on a Ryzen 1700 and thought I would try Rosetta again. To my surprise, all the downloads (two batches of eight each, with both 3.78 and 4.07 in each batch) failed with download errors before even getting a chance to run. I then recalled the fix for the x64 problem with Rosetta 4.07, but that did not affect downloads and did not affect 3.78 (so I thought).
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88954#88954

Wrong. Applying the fix solved the download problem, and now two 3.78 and two 4.07 are running fine.

(But it is absurd for crunchers to have to fix a basic problem three months after the release of a major OS upgrade.)


I am pretty sure the only people who use this STATIC version with the bug are the volunteer contributors.
Rosetta partners build the binaries on their own systems and probably build the dynamic version. If their build caused a problem, it would get fixed immediately.
The UoW systems are all running the "mature old" versions of Linux OR an internal dynamic version and don't see the problem.

Hmmm.
My prediction:
When Rosetta builds with the new glibc version that works with the new distributions, they will cause havoc with ALL the old currently installed 2.26 version systems.
I think the fix is trivial, but it will be interesting to see how they choose to mess it up. 8-)
ID: 89371 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 89372 - Posted: 3 Aug 2018, 17:04:50 UTC - in response to Message 89371.  

My prediction:
When Rosetta builds with the new glibc version that works with the new distributions, they will cause havoc with ALL the old currently installed 2.26 version systems.
I think the fix is trivial, but it will be interesting to see how they choose to mess it up. 8-)

Actually, that is a bit reassuring. I was wondering if they even knew about the bug at all. At least we will know (the hard way) when they attempt a fix.
ID: 89372 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,229,863
RAC: 6,747
Message 89374 - Posted: 4 Aug 2018, 16:13:11 UTC - in response to Message 89372.  

My prediction:
When Rosetta builds with the new glibc version that works with the new distributions, they will cause havoc with ALL the old currently installed 2.26 version systems.
I think the fix is trivial, but it will be interesting to see how they choose to mess it up. 8-)

Actually, that is a bit reassuring. I was wondering if they even knew about the bug at all. At least we will know (the hard way) when they attempt a fix.


Interestingly, the 2.27 glibc version contains support for statically linked PIE (position-independent executable) binaries and many glibc AVX/FMA optimizations. Relinking could fix the bug and possibly make execution of some math functions faster.
https://www.phoronix.com/scan.php?page=news_item&px=GNU-Glibc-2.27-Released


NEWS for version 2.27
=====================

Major new features:

* The GNU C Library can now be compiled with support for building static
PIE executables (See --enable-static-pie in INSTALL). These static PIE
executables are like static executables but can be loaded at any address
and provide additional security hardening benefits at the cost of some
memory and performance. When the library is built with --enable-static-pie
the resulting libc.a is usable with GCC 8 and above to create static PIE
executables using the GCC option '-static-pie'. This feature is currently
supported on i386, x86_64 and x32 with binutils 2.29 or later, and on
aarch64 with binutils 2.30 or later.

* Optimized x86-64 asin, atan2, exp, expf, log, pow, atan, sin, cosf,
sinf, sincosf and tan with FMA, contributed by Arjan van de Ven and
H.J. Lu from Intel.

* Optimized x86-64 trunc and truncf for processors with SSE4.1.

* Optimized generic expf, exp2f, logf, log2f, powf, sinf, cosf and sincosf.

* In order to support faster and safer process termination the malloc API
family of functions will no longer print a failure address and stack
backtrace after detecting heap corruption. The goal is to minimize the
amount of work done after corruption is detected and to avoid potential
security issues in continued process execution. Reducing shutdown time
leads to lower overall process restart latency, so there is benefit both
from a security and performance perspective.
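To see which glibc a given Linux host is running, a short check like the following may help. It is a sketch under stated assumptions: it relies on Python's standard `platform.libc_ver()`, the helper names `parse_version` and `describe_glibc` are invented here, and the 2.27 threshold is taken from the discussion in this thread rather than from any confirmed boundary for the bug. It only reports a version; it cannot tell whether a host is affected.

```python
import platform


def parse_version(text):
    """Turn '2.27' into (2, 27) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def describe_glibc(version_text, threshold="2.27"):
    """Describe a glibc version string relative to a threshold (default from this thread)."""
    if not version_text:
        return "glibc version not detected (non-glibc or non-Linux host?)"
    if parse_version(version_text) >= parse_version(threshold):
        return f"glibc {version_text}: at or above {threshold}"
    return f"glibc {version_text}: older than {threshold}"


# platform.libc_ver() returns a (library, version) pair; both are empty if unknown.
lib, version = platform.libc_ver()
print(describe_glibc(version if lib == "glibc" else ""))
```

Comparing tuples instead of strings matters here: as text, "2.9" sorts after "2.27", while (2, 9) correctly sorts before (2, 27).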
ID: 89374 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 89375 - Posted: 4 Aug 2018, 16:43:11 UTC - in response to Message 89374.  

I hope it all works. I am getting into even deeper waters. I attached to Rosetta on my i7-8700 again, and even with "the x64 fix", it still errors out on all 4.07. It is getting beyond my abilities to cope.
ID: 89375 · Rating: 0
Calcii
Joined: 24 Jan 12
Posts: 1
Credit: 10,161,945
RAC: 963
Message 89614 - Posted: 22 Sep 2018, 21:08:31 UTC

If you look at the top ten computers https://boinc.bakerlab.org/rosetta/top_hosts.php?sort_by=expavg_credit&offset=0, the first 4 places are occupied by [DPC] Nifhack with AMD:
AuthenticAMD
AMD EPYC 7551P 32-Core Processor [Family 23 Model 1 Stepping 2]
(64 processors)
With: 1) 387,860.43 PPD
2) 264,083.61 PPD
3) 249,572.42 PPD
4) 187,924.58 PPD
How is that possible?
GenuineIntel Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz [Family 6 Model 62 Stepping 4] (120 processors) only 41,616.11 PPD
or GenuineIntel Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz [Family 6 Model 79 Stepping 1] (72 processors, 2 x 18-core processors) only 41,217.95 PPD
Maybe AMD's architecture really is 10x better than Intel's? OK, but look at 34th place:
AuthenticAMD AMD EPYC 7401P 24-Core Processor [Family 23 Model 1 Stepping 2] (48 processors) 17,598.95 PPD
And in 32nd place: AuthenticAMD AMD Ryzen Threadripper 1950X 16-Core Processor [Family 23 Model 1 Stepping 1] (32 processors) with 17,274.88 PPD.

Are [DPC] Nifhack's statistics hacked, or what is going on? Can someone who understands explain?
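Dividing each host's PPD by its processor count makes the gap easier to see. This is a rough sketch using only the figures quoted in this post; "processors" means the count BOINC reports for the host, the labels are shortened by hand, and PPD per processor is a crude yardstick that ignores run-time settings and how many hours a day each host actually crunches.

```python
# (label, processors reported by BOINC, PPD) as quoted in this post.
hosts = [
    ("AMD EPYC 7551P, 1st place", 64, 387860.43),
    ("AMD EPYC 7551P, 4th place", 64, 187924.58),
    ("Intel Xeon E7-4890 v2", 120, 41616.11),
    ("Intel Xeon E5-2697 v4", 72, 41217.95),
    ("AMD EPYC 7401P", 48, 17598.95),
    ("AMD Ryzen Threadripper 1950X", 32, 17274.88),
]


def ppd_per_processor(ppd, processors):
    """Points per day divided by the number of processors on the host."""
    return ppd / processors


for label, processors, ppd in hosts:
    print(f"{label}: {ppd_per_processor(ppd, processors):,.2f} PPD per processor")
```

By this measure the top EPYC 7551P host earns more than 16 times as much per processor as the EPYC 7401P host, even though both are listed with the same family and model numbers.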
ID: 89614 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 10,612
Message 89776 - Posted: 26 Oct 2018, 12:27:40 UTC

Where did all the WUs go? There were loads to download the last time I looked. Now none.
ID: 89776 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 10,612
Message 89778 - Posted: 26 Oct 2018, 13:01:33 UTC - in response to Message 89776.  

Where did all the WUs go? There were loads to download the last time I looked. Now none.

Cancel that (maybe). 6k+ just came back
ID: 89778 · Rating: 0
fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89779 - Posted: 26 Oct 2018, 14:49:30 UTC

Hello!!

I'm getting an error, "<message>finish file present too long</message>", after a WU has completed.

Win 7 SP2, Fully up2date 64 Bit, BOINC 7.12.1 x64, Rosetta Mini v3.78 windows_intelx86

This machine is my "daily driver" & I use it for other non-demanding tasks.

I WAS running BOINC on 100% of the CPUs & the machine DID get sluggish at times.

https://boinc.bakerlab.org/show_host_detail.php?hostid=1606827

https://boinc.bakerlab.org/rosetta/results.php?userid=472655&offset=0&show_names=0&state=6&appid=

I've searched this forum & couldn't find an answer...

This appears to be an old issue that's ostensibly been resolved, and it hasn't been discussed since 2016:

https://www.google.com/search?ei=ApjHW7A955e2BeO1tPAL&q=%22finish+file+present+too+long%22&oq=%22finish+file+present+too+long%22&gs_l=psy-ab.3..0i7i30.967651.971417..972092...0.0..0.54.103.2......0....1..gws-wiz.......0i71j0i22i30.6CEhGn3jHEY


<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
finish file present too long</message>

I'm backing the "Use at most" CPUs setting down to 93.75% (30 of 32 threads, 15 of 16 cores) & setting the project to No New Tasks until I can get this resolved.

I did find messages in the error log:

10/22/2018 7:47:43 AM | Rosetta@home | Task jelva126_mut_5_9tcssm_73_K_0251_0001_0006_fragments_relax_SAVE_ALL_OUT_700558_46_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva118_mut_5_8tcssm_73_K_0251_0001_0008_fragments_relax_SAVE_ALL_OUT_700549_46_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva76_mut_5_4tcssm_73_K_0251_0001_0006_fragments_fold_SAVE_ALL_OUT_700633_96_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG99_DEHCFABG_s00880005aa336a56_0002_0001_fragments_fold_SAVE_ALL_OUT_700443_272_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva2_mut_5_10tcssm_61_K_0251_0001_0005_fragments_fold_SAVE_ALL_OUT_700582_96_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG62_CDEFGBAH_s006200026a3c1c99_0001_0001_fragments_fold_SAVE_ALL_OUT_700403_280_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva117_mut_5_8tcssm_73_K_0251_0001_0007_fragments_fold_SAVE_ALL_OUT_700548_115_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG55_BEDGHCFA_s00450008aa336a56_0002_0001_fragments_fold_SAVE_ALL_OUT_700395_280_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG88_DEFGBAHC_s008200032b122311_0001_0001_fragments_fold_SAVE_ALL_OUT_700431_280_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva53_mut_5_2tcssm_73_K_0251_0001_0003_fragments_fold_SAVE_ALL_OUT_700608_140_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva130_mut_5_9tcssm_73_K_0251_0001_0010_fragments_fold_SAVE_ALL_OUT_700563_140_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task TJ54_IGGH_A_B_C_F_G_D_7delE_0251_0001_0004_fragments_fold_SAVE_ALL_OUT_700139_290_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva118_mut_5_8tcssm_73_K_0251_0001_0008_fragments_fold_SAVE_ALL_OUT_700549_140_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG60_BGDEHCFA_s005600026a3c1c99_0002_0001_fragments_fold_SAVE_ALL_OUT_700401_299_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva112_mut_5_8tcssm_73_K_0251_0001_0002_fragments_fold_SAVE_ALL_OUT_700543_164_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva21_mut_5_6tcssm_61_K_0251_0001_0007_fragments_fold_SAVE_ALL_OUT_700573_189_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva94_mut_5_6tcssm_73_K_0251_0001_0004_fragments_fold_SAVE_ALL_OUT_700653_188_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva25_mut_5_8tcssm_61_K_0251_0001_0003_fragments_fold_SAVE_ALL_OUT_700577_226_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG22_ADEHCFGB_s00150009aa336a56_0001_0001_fragments_fold_SAVE_ALL_OUT_700323_321_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva3_mut_5_10tcssm_61_K_0251_0001_0008_fragments_fold_SAVE_ALL_OUT_700593_226_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task DSIGG240_GDEFCHAB_s015900036a3c1c99_0003_0001_fragments_fold_SAVE_ALL_OUT_700335_321_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva14_mut_5_4tcssm_61_K_0251_0001_0007_fragments_fold_SAVE_ALL_OUT_700565_226_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva53_mut_5_2tcssm_73_K_0251_0001_0003_fragments_fold_SAVE_ALL_OUT_700608_246_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva15_mut_5_4tcssm_61_K_0251_0001_0009_fragments_fold_SAVE_ALL_OUT_700566_247_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva78_mut_5_4tcssm_73_K_0251_0001_0008_fragments_fold_SAVE_ALL_OUT_700635_246_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task TJ73_IGGH_A_B_C_F_G_D_E_8_0251_0008_fragments_fold_SAVE_ALL_OUT_700160_329_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
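A log like this is easier to reason about once the task names are extracted and counted. The following is a minimal sketch that assumes the line layout shown above (timestamp, project, message separated by " | "); the pattern and the function name `affected_tasks` are written for this example and are not part of BOINC.

```python
import re
from collections import Counter

# Matches only the "no 'finished' file" entries, not the advice line that follows each.
NO_FINISH_FILE = re.compile(
    r"^(?P<when>\S+ \S+ [AP]M) \| (?P<project>[^|]+?) \| "
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file$"
)

# First two entries copied from the log in this post.
SAMPLE_LOG = """\
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva126_mut_5_9tcssm_73_K_0251_0001_0006_fragments_relax_SAVE_ALL_OUT_700558_46_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
10/22/2018 7:47:43 AM | Rosetta@home | Task jelva118_mut_5_8tcssm_73_K_0251_0001_0008_fragments_relax_SAVE_ALL_OUT_700549_46_0 exited with zero status but no 'finished' file
10/22/2018 7:47:43 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.
"""


def affected_tasks(log_text):
    """Return (timestamp, task name) for every 'no finished file' entry."""
    hits = []
    for line in log_text.splitlines():
        match = NO_FINISH_FILE.match(line.strip())
        if match:
            hits.append((match["when"], match["task"]))
    return hits


hits = affected_tasks(SAMPLE_LOG)
print(f"{len(hits)} affected task(s), {len({task for _, task in hits})} unique")
for when, count in Counter(when for when, _ in hits).items():
    print(f"{when}: {count}")
```

In the log above every entry carries the same timestamp, which the per-timestamp count makes easy to spot.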

Most of the tasks mentioned in the error log do NOT show up here:

https://boinc.bakerlab.org/rosetta/results.php?userid=472655&offset=0&show_names=0&state=6&appid=

But this one does: https://boinc.bakerlab.org/result.php?resultid=1036002868

Please let me know what I can do to help troubleshoot the problem.

Thanks!!

Franko
ID: 89779 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 10,612
Message 89781 - Posted: 26 Oct 2018, 22:12:39 UTC - in response to Message 89778.  

Where did all the WUs go? There were loads to download the last time I looked. Now none.

Cancel that (maybe). 6k+ just came back

Went up to 14k tasks, then all gone again. Something weird happening.
ID: 89781 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 89782 - Posted: 26 Oct 2018, 23:34:09 UTC - in response to Message 89614.  

If you look at the top ten computers https://boinc.bakerlab.org/rosetta/top_hosts.php?sort_by=expavg_credit&offset=0, the first 4 places are occupied by [DPC] Nifhack with AMD:

[snip]

Looks like the main limitation of CPUs with this many processors is not the number of processors, but the speed of the memory that all the processors in the same package share.

If so, some of these processors could even be beyond the point where arbitrating which processor gets to make the next memory access takes up enough of the run time to cause a significant slowdown.

You might also look up the cache size inside each of these CPUs - competing for cache space could also cause a significant slowdown.
ID: 89782 · Rating: 0
Sam
Joined: 9 Mar 06
Posts: 3
Credit: 3,350,780
RAC: 986
Message 89794 - Posted: 28 Oct 2018, 15:46:57 UTC - in response to Message 89779.  


I'm getting an error, "<message>finish file present too long</message>", after a WU has completed.


Hi Franko,

I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' in time for BOINC. You can just ignore it, because most of the time your work units are fine.

Sjmielh
ID: 89794 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 89795 - Posted: 28 Oct 2018, 16:32:27 UTC - in response to Message 89794.  

I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' in time for BOINC.

That is an interesting thought. I have not seen that error for a long time, and I now use only SSDs on all my machines.
Also, I usually use a write-cache (or ramdisk), so most of my writes and even reads are from main memory. I think that does it.
ID: 89795 · Rating: 0
fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89805 - Posted: 30 Oct 2018, 17:00:35 UTC - in response to Message 89795.  

This machine uses two SSDs in RAID 0 on a Dell PERC H710 RAID card with 1 GB of RAM (which could be the source of the problem), with the write policy set to "Write Back", which is defined as, "In Write Back mode the controller sends a data transfer completion signal to the host when the controller cache has received all of the data in a transaction."

For some reason, Windows Explorer (Exploder?) hangs when this machine is NOT under load, AND I have several windows explorer windows open.

Is there a way to increase this timeout to accommodate this machine's peculiarities?

Thanks!!

Franko

I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' in time for BOINC.

That is an interesting thought. I have not seen that error for a long time, and I now use only SSDs on all my machines.
Also, I usually use a write-cache (or ramdisk), so most of my writes and even reads are from main memory. I think that does it.
ID: 89805 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,229,863
RAC: 6,747
Message 89812 - Posted: 1 Nov 2018, 4:11:01 UTC - in response to Message 89805.  

The error message is displayed by the BOINC Client.
I think it is just a BOINC Client timing issue that they have declared "fixed" several times.
I don't think it is ever a problem, just annoying.

client/app_control.cpp

    // Check for finish files every 10 sec.
    // If we already found a finish file, abort the app;
    // it must be hung somewhere in boinc_finish();
    //
    static double last_finish_check_time = 0;
    if (gstate.clock_change || gstate.now - last_finish_check_time > 10) {
        last_finish_check_time = gstate.now;
        for (i=0; i<active_tasks.size(); i++) {
            ACTIVE_TASK* atp = active_tasks[i];
            if (atp->task_state() == PROCESS_UNINITIALIZED) continue;
            if (atp->finish_file_time) {
                // process is still there 10 sec after it wrote finish file.
                // abort the job
                atp->abort_task(EXIT_ABORTED_BY_CLIENT, "finish file present too long");  // <-- line 140
            } else if (atp->finish_file_present()) {
                atp->finish_file_time = gstate.now;
            }
        }
    }
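The logic of that excerpt can be modelled in a few lines to show when the abort fires. This is a simplified sketch, not BOINC code: the `Task` and `Poller` classes are invented for illustration, `process_alive` stands in for the `PROCESS_UNINITIALIZED` check, the `aborted` flag is added so a task is reported only once, and the `clock_change` condition is left out.

```python
from dataclasses import dataclass

CHECK_INTERVAL = 10.0  # seconds between finish-file checks, as in the quoted comment


@dataclass
class Task:
    name: str
    process_alive: bool = True       # stand-in for task_state() != PROCESS_UNINITIALIZED
    finish_file_present: bool = False
    finish_file_time: float = 0.0    # 0 means the finish file has not been seen yet
    aborted: bool = False


class Poller:
    def __init__(self):
        self.last_check = 0.0

    def poll(self, tasks, now):
        """Run one finish-file check if more than CHECK_INTERVAL has elapsed."""
        if now - self.last_check <= CHECK_INTERVAL:
            return []
        self.last_check = now
        aborted = []
        for task in tasks:
            if not task.process_alive or task.aborted:
                continue
            if task.finish_file_time:
                # Finish file was already seen on an earlier check and the
                # process is still around: treat it as hung and abort.
                task.aborted = True
                aborted.append(task.name)
            elif task.finish_file_present:
                task.finish_file_time = now
        return aborted


poller = Poller()
slow = Task("slow_exit", finish_file_present=True)
fast = Task("fast_exit", finish_file_present=True)
print(poller.poll([slow, fast], now=100.0))  # first sighting of both finish files
fast.process_alive = False                   # this one exits promptly
print(poller.poll([slow, fast], now=111.0))  # only the lingering process is aborted
```

In this model a task is aborted only if its process is still around at the first check after the one that spotted its finish file, which fits the idea that a slow exit on a busy machine can trigger the message. In the quoted excerpt the interval is a literal 10, so going by that excerpt alone there is no visible setting to raise it.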
ID: 89812 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org