Posts by Ananas

1) Message boards : Number crunching : validate errors (Message 79170)
Posted 9 Dec 2015 by Ananas
Post:
It has the WU state "cancelled" now, one more from the same series has the same outcome and it is cancelled too.

The credits have been granted later on that first one, I guess someone checked the problem and granted manually (thanks for that!), as the new one still shows 0 granted.
2) Message boards : Number crunching : validate errors (Message 79168)
Posted 9 Dec 2015 by Ananas
Post:
rb_12_07_59935_105598_ab_stage0_t000___robetta_IGNORE_THE_REST_12_12_313657_76_0

99 decoys generated (I think that's where all WUs stop), first delivery, no restarts, outcome "invalid"

:-(
3) Message boards : Number crunching : FFD__ crashes with a null pointer access (Message 78563)
Posted 8 Aug 2015 by Ananas
Post:
one more :

FFD__7strand14helixWYYends_125_0001_dock_PD1CancerImmunotherapy_15_08_07_40_57_globalDocking_2_SAVE_ALL_OUT_277454_3

somewhat shorter filename, WU is 680026011
4) Message boards : Number crunching : FFD__ crashes with a null pointer access (Message 78562)
Posted 8 Aug 2015 by Ananas
Post:
FFD__5strand14helixWYR_filteredloops_169_0001_dock_PD1CancerImmunotherapy_15_08_07_40_26_globalDocking_1_SAVE_ALL_OUT_277449_15

WU is 680041644

how long are the strings that you reserve for the filename?

(My BOINC base path is fairly short, so that should not be the problem.)
5) Message boards : Number crunching : look at getting a workstation for crunching (Message 78437)
Posted 12 Jul 2015 by Ananas
Post:
It runs windows and has dual Xeon 5400 CPUs. I plan to add a Tesla K80 GPU to it. ...

The GPU will currently not be used in this project, you can use it on your SETI account though.

The Xeons are from the same generation as the C2Q 95xx desktop CPUs, more efficient than your Q6600 but less efficient than your i7 laptop.

Even though they are a bit outdated, it's not that bad, they will still do good work for this project ... of course I have to say that because I'm running on a C2Q 95xx myself here ;-)
6) Message boards : Number crunching : rsc_fpops_est for certain WU types (ordered_X) (Message 78415)
Posted 6 Jul 2015 by Ananas
Post:
Some WU types report only about 60% of the RSC_FPOPS_EST value they should have. This messes up the cache handling, as de facto they have the same runtime as the other results.

Examples are those with this pattern : tj_8_7_ordered_X*

My preferred runtime setting is 8 hours, after one of those strange ones, the estimated runtime of all "normal" workunits goes up to about 12 hours, even though they usually still finish within my runtime preferences.
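
A rough worked example of why a single underestimated task can inflate every later estimate. This is only a simplified model: it assumes the client scales all estimates by a per-project duration correction factor (DCF) that jumps straight to the actual/estimated ratio when a task overruns. The 60% ratio and the 8-hour preference come from the post; the function names are made up for illustration.

```python
# Simplified model of a per-project duration correction factor (DCF).
# Assumption: the client multiplies every raw estimate by the DCF and
# raises the DCF at once when a task runs longer than estimated.

TARGET_RUNTIME_H = 8.0   # preferred runtime from the post
FPOPS_RATIO = 0.6        # odd WU types report ~60% of the expected rsc_fpops_est


def raw_estimate(fpops_ratio: float) -> float:
    """Runtime estimate in hours before any correction is applied."""
    return TARGET_RUNTIME_H * fpops_ratio


def updated_dcf(dcf: float, estimate_h: float, actual_h: float) -> float:
    """Raise the DCF immediately on an overrun; leave it alone otherwise."""
    return max(dcf, actual_h / estimate_h)


dcf = 1.0
# An underestimated task is estimated at 4.8 h but really runs 8 h.
dcf = updated_dcf(dcf, raw_estimate(FPOPS_RATIO), TARGET_RUNTIME_H)
# Every normal task is now estimated with the inflated factor.
normal_estimate_h = raw_estimate(1.0) * dcf

print(f"DCF after the odd task: {dcf:.2f}")
print(f"Estimate for a normal task: {normal_estimate_h:.1f} h")
```

In this model the estimate for a normal task lands at roughly 13 hours, in the same range as the 12 hours observed.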
7) Message boards : Number crunching : Problem with task "exited with zero status but no 'finished' file" error (Message 78133)
Posted 17 Apr 2015 by Ananas
Post:
I have the same problem on one computer. From what I can see, it is caused by the huge HDD activity when unpacking and starting each Rosetta WU, combined with the outdated API version that still has the first and much too low heartbeat tolerance setting.

If your HDD is not very fast, this slows down the BOINC core client, and the same Rosetta workunit that caused the slowdown doesn't receive the heartbeat in time and restarts itself before it has managed to unzip the result. It also kills other Rosetta WUs in the process, as they run into the same heartbeat error. This goes on for quite a while until the core client gives up on restarting the workunit.
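
The failure mode described above can be sketched as a small simulation. This is a hypothetical model, not the BOINC API: the function name, the stall length, and both tolerance values are invented for illustration. It only shows how a stalled client trips an app with a low heartbeat tolerance while a higher tolerance rides it out.

```python
# Hypothetical simulation of a heartbeat check, not the BOINC API.
# The client is supposed to refresh a timestamp regularly; the app exits
# if the timestamp gets older than its tolerance.

def first_missed_heartbeat(heartbeat_times, timeout_s):
    """Return the time at which the app gives up, or None if it survives.

    heartbeat_times: ascending seconds at which the client refreshed the
    heartbeat; the app is assumed to start at t=0 with a fresh heartbeat.
    """
    last = 0
    for t in heartbeat_times:
        if t - last > timeout_s:
            return last + timeout_s   # tolerance ran out before the refresh
        last = t
    return None


# The client normally refreshes every second, but heavy disk I/O while
# unpacking stalls it for 40 s (made-up numbers).
beats = [1, 2, 3, 43, 44, 45]

print(first_missed_heartbeat(beats, timeout_s=30))   # low tolerance
print(first_missed_heartbeat(beats, timeout_s=60))   # higher tolerance
```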

The only other project that seems to use such an old API is Leiden Classical; their WUs also fall victim to the heartbeat bug now and then. Other projects are more robust when it comes to that problem.

Afaik the heartbeat timeout has been increased considerably in later API versions.

On the box where I have that problem I can start one Rosetta WU, and it usually manages to start, sometimes after one heartbeat crash. Trying to start a second one kills both, and neither of the two will ever recover from that restart/crash loop.

p.s.: another way to fix it might be to split the compressed database into smaller parts and hope that the core client can use the break between the unzips to refresh the heartbeat in the shared memory - or to exclude those .gz files from the .zip file and deliver them (only the needed ones!) separately, as packing it like that is very stupid anyway.

@Erik : I don't think that it is entirely fixed on your system, the WUs still show "unpacking ..." way too often. If it was a clean run, it would unpack the database and unpack the workunit file - two unpack commands per workunit and that's it.
8) Message boards : Number crunching : Validator ignores results (Message 77618)
Posted 29 Oct 2014 by Ananas
Post:
Looks as if it has been fixed :-)
9) Message boards : Number crunching : Validator ignores results (Message 77601)
Posted 23 Oct 2014 by Ananas
Post:
Two of my results are stuck in "pending" :

F5Y_T65S_noE1_L0_fold_SAVE_ALL_OUT_221346_1841 (for ~6 hours)
and
rb_10_21_50349_97079_ab_stage0_h002___robetta_IGNORE_THE_REST_03_06_221453_4 (for ~13 hours)

One result I returned just now has already been validated so the validator basically seems to do its work.
10) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 77224)
Posted 3 Aug 2014 by Ananas
Post:
Once it had a connection, an upload of 1.5MB took just 10 seconds, another indicator that neither the server itself nor the line speed is lagging. So I guess we can assume that the number of connections was the limiting factor.

While I'm typing this, a few more of my uploads went through without retries, so someone must have fixed it :-)
11) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 77204)
Posted 1 Aug 2014 by Ananas
Post:
I just monitored a working download and it was surprisingly fast (~250 KBit/s). This was not a tiny file, where the speed display is more or less random; it was a 200k minirosetta_database ZIP file, so the speed value is relevant.

So it might not be the server speed that causes the trouble but a way too low number of allowed concurrent connections.

Another indicator that it is probably not a speed problem would be the message. From what I have seen, it never said that something had been interrupted or timed out; it says "system connect" only a few seconds after the attempt, just as if it got a physical connection immediately but was then rejected.

Otoh. upload and download might of course behave differently - I just thought I'd mention it as it might help with the analysis.
12) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 77153)
Posted 30 Jul 2014 by Ananas
Post:
... Scheduler request failed: Couldn't connect to server ...

Same here now, but I received three results during the 30.07., one at ~07:00 and two at ~11:30 (UTC)
13) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 77137)
Posted 30 Jul 2014 by Ananas
Post:
The scheduler is back, downloads are working (so we can get new work), uploads are still stuck.
14) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 77072)
Posted 29 Jul 2014 by Ananas
Post:
http://srv4.bakerlab.org/rosetta_cgi/cgi gives me timeouts, for BOINC and the same when I try it in the browser. Uploads need several attempts too. Someone standing on the cable?

p.s.: no trouble connecting to the Rosetta web site and the database seems to be fast also, so it must be only this one server that is in trouble.
15) Message boards : Number crunching : pd1_graftsheet_41limit keep crashing (Message 76965)
Posted 7 Jul 2014 by Ananas
Post:
Error code 0xc0000005 (protection fault / access violation)

Not only for me, my wingmen seem not to have more luck with those.

p.s.: already reported here, sorry, I had not seen this before I posted.
16) Message boards : Number crunching : Validate errors (Message 76656)
Posted 25 Apr 2014 by Ananas
Post:
Hmmm ... from the credits view, invalid ones sometimes even have an advantage ... that's quite mean :´(

CPU time 43483 (preferred 28800), claimed 245, granted 20 because that stupid thing created only one decoy.

It has been a 4oo3_01_S_t000_cstw05_krypton_SAVE_ALL_OUT_161776_567_0
17) Message boards : Number crunching : Validate errors (Message 76613)
Posted 11 Apr 2014 by Ananas
Post:
... So, yes, there is also such a thing as partial success (BOINC does not really support the concept). And learning about what hangs up an algorithm is useful too.

That is very important information: if crashed results are used to improve the methods, they are not really a waste of energy.

Thanks :-)
18) Message boards : Number crunching : Validate errors (Message 76610)
Posted 11 Apr 2014 by Ananas
Post:
Second delivery of one of my invalid results :


Outcome Client error

Validate state Invalid

Granted credit 300


Btw. : My invalid WUs show granted credits too, not in the list but on the details page.


I'm even more confused now - are the results useful and recovered manually, or are they used to fill the trashcan?
19) Message boards : Number crunching : Validate errors (Message 76609)
Posted 11 Apr 2014 by Ananas
Post:
nearly 20% validate errors lately, I guess I'll disable Rosetta until this is solved.

p.s.: my runtime prefs are set to 8 hours and the broken ones all ran over the full timespan, maybe some results cannot handle that?

There are Windows and Linux results with this problem, so it is independent of the OS.
20) Message boards : Number crunching : Validate errors (Message 76593)
Posted 6 Apr 2014 by Ananas
Post:
one more

And this one (not mine)

As Rosetta doesn't have validation on comparison level, I wonder if it could be a problem with the upload handler. If the message "Couldn't resolve host name" is caused by a network problem on the Rosetta site, BOINC server side tasks could be affected as well.

Afaik. "finished upload" just means that the upload handler was able to store the file in a temporary storage place, but then the file has to be moved and if this move fails, the uploading host will probably not notice it (that's how http uploads usually work).
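
A minimal sketch of the store-then-move pattern described above, assuming a handler that writes to a temporary file and then moves it into place. This is not the actual BOINC upload handler; `store_upload` and the file names are hypothetical. The point is that the move is a separate step that can fail after the data has already been received, so its result has to be checked and reported.

```python
import os
import tempfile


def store_upload(data: bytes, final_path: str) -> bool:
    """Write to a temp file, then move it into place; report failures."""
    target_dir = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # A naive handler would already answer "upload finished" here,
        # before knowing whether the move below succeeds.
        os.replace(tmp_path, final_path)
    except OSError:
        # The write or the move failed: clean up and tell the caller.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return False
    return True


with tempfile.TemporaryDirectory() as d:
    ok = store_upload(b"result data", os.path.join(d, "result_0"))
    print("stored:", ok)
```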





©2020 University of Washington
https://www.bakerlab.org