client upgrade, stalled WU's - what is the cause and the fix???

Message boards : Number crunching : client upgrade, stalled WU's - what is the cause and the fix???

PresterJohn
Joined: 4 Nov 05
Posts: 24
Credit: 2,121,609
RAC: 0
Message 3284 - Posted: 15 Nov 2005, 15:40:54 UTC
Last modified: 15 Nov 2005, 15:54:28 UTC

1) for the windows version of the client, is there a way to tell what version number of the software i am running?

i see that my machine downloaded rosetta_4.79_windows_intelx86.exe but how can i tell if it is actually running 4.79? i see no mention of the 4.79 executable being started in my stdoutdae.txt


2) i've skimmed through one or two of the related threads about WU's stuck at 1%, etc. correct me if i'm wrong, but it seems that there are a number of different possibilities and no one seems to know exactly what the cause of the problem is.

since this weekend, i've had approx 5 occurrences of stalled WU's. in two of those cases, the client kept happily trying to finish and eventually wasted 43.8 and 14.5 hrs respectively only to return a client error as the final outcome (see links below).

https://boinc.bakerlab.org/rosetta/result.php?resultid=1626055

https://boinc.bakerlab.org/rosetta/result.php?resultid=1368310

the other three occurrences were cases of 'active' stalled jobs (the latest of which i discovered 90 minutes ago), which were aborted by user intervention. all told, probably over 120 hrs of wasted time and money (electricity in nyc isn't cheap you know) doing absolutely nothing useful.

so understandably, i am not in a particularly happy mood about this and would like to know what is being done to diagnose and fix this problem.

i would rather not hear suggestions about running boincview or checking my boxes more frequently. of the two sites where i run r@h, boincview will not work at one because the highly secured router/switch environment locks out the boinc service port. find-a-drug users are/were accustomed to a client that ran smoothly with a minimum of user intervention and administration. an occasional bad batch of WU's being pushed out to users i can understand and live with, but unexplained, unreproducible errors that might be occurring frequently, and that could leave a machine unproductive for days, are almost untenable.

we have some large crunchers on our team and the extra overhead to manage and check host machines to make sure they are properly working is entirely unsatisfactory and will probably negatively impact the number of participants interested in running rosetta.

[edit] fixed typo in thread subject.
- team XPC - 'Where merry times and good crunching meet head-on!'
ID: 3284 · Rating: 0
Andrew

Joined: 19 Sep 05
Posts: 162
Credit: 105,512
RAC: 0
Message 3285 - Posted: 15 Nov 2005, 15:46:24 UTC
Last modified: 15 Nov 2005, 15:56:11 UTC

Quick Answer:

Those results are from the 4.78 client, not the 4.79 client. Once you start crunching 4.79 you shouldn't get any "stuck" WUs.


Longer Answer:

1) for the windows version of the client, is there a way to tell what version number of the software i am running?

i see that my machine downloaded rosetta_4.79_windows_intelx86.exe but how can i tell if it is actually running 4.79? i see no mention of the 4.79 executable being started in my stdoutdae.txt


On the Work tab of Boinc Manager in the Application column, you'll find what application version is being used.

To actually see what Windows is running, you can open the Task Manager and go to the Processes tab, where you'll see either rosetta_4.78_windows_intelx86.exe or rosetta_4.79_windows_intelx86.exe using most of your CPU.

2) i've skimmed through one or two of the related threads about WU's stuck at 1%, etc. correct me if i'm wrong, but it seems that there are a number of different possibilities and no one seems to know exactly what the cause of the problem is.

since this weekend, i've had approx 5 occurrences of stalled WU's. in two of those cases, the client kept happily trying to finish and eventually wasted 43.8 and 14.5 hrs respectively only to return a client error as the final outcome (see links below).

https://boinc.bakerlab.org/rosetta/result.php?resultid=1626055

https://boinc.bakerlab.org/rosetta/result.php?resultid=1368310

...


The above WU links indicate that they used the 4.78. Once you're using the 4.79 you shouldn't get any WUs with the 1% stalls...

If you can't wait for your cache to deplete the 4.78, just abort them or reset the project.


EDIT: added longer answer
ID: 3285 · Rating: 0
PresterJohn
Joined: 4 Nov 05
Posts: 24
Credit: 2,121,609
RAC: 0
Message 3286 - Posted: 15 Nov 2005, 15:49:22 UTC
Last modified: 15 Nov 2005, 15:49:32 UTC

see my question #1...

the stalled job that i killed on one of my machines this morning had d/l'ed 4.79 yesterday. how can i verify that it is indeed running the new version?
- team XPC - 'Where merry times and good crunching meet head-on!'
ID: 3286 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,504,680
RAC: 1,117
Message 3289 - Posted: 15 Nov 2005, 15:56:16 UTC - in response to Message 3286.  

the stalled job that i killed on one of my machines this morning had d/l'ed 4.79 yesterday. how can i verify that it is indeed running the new version?


In the Work tab, the Application column. Mine shows "rosetta 4.79" at the moment.

ID: 3289 · Rating: 0
PresterJohn
Joined: 4 Nov 05
Posts: 24
Credit: 2,121,609
RAC: 0
Message 3291 - Posted: 15 Nov 2005, 16:10:07 UTC - in response to Message 3289.  
Last modified: 15 Nov 2005, 16:13:26 UTC

the stalled job that i killed on one of my machines this morning had d/l'ed 4.79 yesterday. how can i verify that it is indeed running the new version?


In the Work tab, the Application column. Mine shows "rosetta 4.79" at the moment.


yep, i noticed version # listed in the application column about a minute ago and it did say 4.78. but how exactly does the software know to use 4.79?

just now i attempted to manually force 4.79 to load by renaming the 4.78 exe. it took two restarts of boincmgr to get 4.79 to load, but in the process it cleared out my queue and attempted to download 4.78 again.

--- quoted from message log ----------------------

2005-11-15 11:01:03 [---] request_reschedule_cpus: start failed
2005-11-15 11:01:03 [rosetta@home] Computation for result 1hz7A_abrelaxmode_random_gauss_fix_bb_jitter03_110659_0 finished
2005-11-15 11:01:03 [rosetta@home] Starting result 1n0u__abrelaxmode_random_length20_jitter02_omega_16322_0 using rosetta version 479
2005-11-15 11:01:26 [rosetta@home] Finished download of rosetta_4.78_windows_intelx86.exe
2005-11-15 11:01:26 [rosetta@home] Throughput 209181 bytes/sec
2005-11-15 11:02:01 [rosetta@home] Fetching master file
2005-11-15 11:02:06 [rosetta@home] Master file download succeeded
2005-11-15 11:02:12 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
2005-11-15 11:02:12 [rosetta@home] Reason: To fetch work
2005-11-15 11:02:12 [rosetta@home] Requesting 728251 seconds of new work, and reporting 41 results
2005-11-15 11:02:17 [rosetta@home] Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded
2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded
2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded
2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded
2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded


something does not look right here!
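
For anyone who wants to check this from the log rather than the Work tab: the "Starting result ... using rosetta version 479" line above names the version each result is started with. A minimal sketch that pulls those lines out of a saved message log follows. The function name is made up for illustration, and reading 479 as 4.79 (major * 100 + minor) is an assumption based only on the log and the exe name in this thread.

```python
import re

# Matches lines shaped like the one quoted in this thread:
#   2005-11-15 11:01:03 [rosetta@home] Starting result <name> using rosetta version 479
START_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] Starting result (?P<result>\S+) "
    r"using (?P<app>\S+) version (?P<version>\d+)$"
)


def started_results(log_text):
    """Return (result_name, app, 'major.minor') for each 'Starting result' line.

    Hypothetical helper, not part of BOINC. The 479 -> '4.79' conversion is
    an assumption (major * 100 + minor).
    """
    found = []
    for line in log_text.splitlines():
        m = START_RE.match(line.strip())
        if m:
            v = int(m.group("version"))
            found.append(
                (m.group("result"), m.group("app"), f"{v // 100}.{v % 100:02d}")
            )
    return found


sample = """\
2005-11-15 11:01:03 [rosetta@home] Computation for result 1hz7A_abrelaxmode_random_gauss_fix_bb_jitter03_110659_0 finished
2005-11-15 11:01:03 [rosetta@home] Starting result 1n0u__abrelaxmode_random_length20_jitter02_omega_16322_0 using rosetta version 479
2005-11-15 11:01:26 [rosetta@home] Finished download of rosetta_4.78_windows_intelx86.exe
"""

for name, app, version in started_results(sample):
    print(f"{name}: {app} {version}")
```

Only the "Starting result" line says what is actually being run; the "Finished download" line just shows that the 4.78 exe was fetched again.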
- team XPC - 'Where merry times and good crunching meet head-on!'
ID: 3291 · Rating: 0
Andrew

Joined: 19 Sep 05
Posts: 162
Credit: 105,512
RAC: 0
Message 3293 - Posted: 15 Nov 2005, 16:28:44 UTC - in response to Message 3291.  
Last modified: 15 Nov 2005, 16:34:02 UTC

how exactly does the software know to use 4.79?


This info is stored in the xml files in the boinc main directory. Which file and where, I don't exactly know.

If you want to use 4.79 instead of 4.78 you'll have to create an app_info.xml in {BOINC_INSTALL_DIR}\projects\boinc.bakerlab.org_rosetta

See this link about app_info.xml: link

Basically the app_info.xml file will tell the boinc client what exe to use for 4.78.

I believe your xml would look something like this (although I haven't tested this):

<app_info>
    <app>
        <name>rosetta</name>
    </app>
    <file_info>
        <name>rosetta_4.79_windows_intelx86.exe</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>rosetta</app_name>
        <!-- 478 on purpose: results issued for 4.78 get run with the 4.79 exe -->
        <version_num>478</version_num>
        <file_ref>
            <!-- must match the <name> in <file_info> above exactly -->
            <file_name>rosetta_4.79_windows_intelx86.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>


However, after all this... I'd just abort the 4.78 WUs, and keep the 4.79 WUs. :)
ID: 3293 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,504,680
RAC: 1,117
Message 3294 - Posted: 15 Nov 2005, 16:31:38 UTC - in response to Message 3291.  
Last modified: 15 Nov 2005, 16:34:29 UTC

but how exactly does the software know to use 4.79?


Each result you are assigned has a version that it should be processed with. Thus the earlier statement that you may have to look at your Work tab and "abort" any results which show 4.78 as the required application version.

Note that until all 4.78 results have been processed, you may continue to receive a few. A project can run multiple science apps against different results as needed - this for example lets projects like Predictor have two different types of WUs, processed by two different science apps, yet all still be "Predictor". Deleting (or renaming) a science app will just cause another copy to be downloaded if you get a result assigned that requires it.
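
A toy model of that last point, in case it helps. This is not the BOINC client's code, just a sketch of the behaviour described. `Result`, `exe_name` and `missing_executables` are hypothetical names, and the file naming pattern is assumed from the two exe names seen in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One assigned result and the app version it asks for (illustrative only)."""
    name: str
    app: str
    version_num: int  # e.g. 478 for 4.78 (assumed encoding)


def exe_name(app, version_num, platform="windows_intelx86"):
    """Build a file name following the pattern seen in this thread."""
    return f"{app}_{version_num // 100}.{version_num % 100:02d}_{platform}.exe"


def missing_executables(results, files_on_disk):
    """Executables the queued results need that are not present locally."""
    needed = {exe_name(r.app, r.version_num) for r in results}
    return sorted(needed - set(files_on_disk))


# Queue with one 4.78 result and one 4.79 result; the 4.78 exe was renamed away.
queue = [
    Result("wu_a_0", "rosetta", 478),
    Result("wu_b_0", "rosetta", 479),
]
on_disk = ["rosetta_4.79_windows_intelx86.exe"]

print(missing_executables(queue, on_disk))
```

As long as a 4.78 result sits in the queue, the 4.78 exe counts as missing and gets fetched again, which is why renaming it does not stick.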

EDIT:: I would caution against using an app_info.xml file in this case. It would probably work, but the slightest mistake can result in a large number of "lost" results, more than simply aborting them would. This file is normally used when, for example, you want to use an optimized SETI app in place of the standard app, and you know that the outcome of processing a result with either app is supposed to be identical. Also, you must remember to delete the file when it is no longer needed.
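
One cheap guard against the "slightest mistake" is to check the file for internal consistency before letting the client read it. A minimal sketch using Python's standard xml.etree.ElementTree: it assumes every `<file_ref><file_name>` is meant to match a declared `<file_info><name>`. The function name and the sample, which contains a deliberately mistyped file name, are made up for illustration. It catches mismatched names only and says nothing about whether the client will accept the file.

```python
import xml.etree.ElementTree as ET


def unmatched_file_refs(app_info_xml):
    """Return file_ref names with no matching file_info entry.

    Hypothetical helper; assumes each <file_ref><file_name> should equal
    some <file_info><name> in the same app_info.xml.
    """
    root = ET.fromstring(app_info_xml)
    declared = {fi.findtext("name", "").strip() for fi in root.findall("file_info")}
    problems = []
    for app_version in root.findall("app_version"):
        for ref in app_version.findall("file_ref"):
            name = ref.findtext("file_name", "").strip()
            if name not in declared:
                problems.append(name)
    return problems


# Sample with a deliberate typo (".exee") in the file_ref.
sample = """
<app_info>
    <app><name>rosetta</name></app>
    <file_info>
        <name>rosetta_4.79_windows_intelx86.exe</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>rosetta</app_name>
        <version_num>478</version_num>
        <file_ref>
            <file_name>rosetta_4.79_windows_intelx86.exee</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
"""

print(unmatched_file_refs(sample))
```

An empty list means every referenced file is declared; anything else is a name to fix before restarting the client.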

ID: 3294 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 3300 - Posted: 15 Nov 2005, 18:28:26 UTC

The "stuck at 1%" issue was not directly addressed in the new version, so it may still occur. However, there have been some significant changes in the code from our development team, so I wouldn't be surprised if a fix was made unknowingly. This is a very peculiar and hard-to-find bug, as you may have gathered already from the message board threads.
ID: 3300 · Rating: 0
Andrew

Joined: 19 Sep 05
Posts: 162
Credit: 105,512
RAC: 0
Message 3305 - Posted: 15 Nov 2005, 19:22:11 UTC - in response to Message 3294.  
Last modified: 15 Nov 2005, 19:22:21 UTC

EDIT:: I would caution against using an app_info.xml file in this case. It would probably work, but the slightest mistake can result in a large number of "lost" results, more than simply aborting them would. This file is normally used when, for example, you want to use an optimized SETI app in place of the standard app, and you know that the outcome of processing a result with either app is supposed to be identical. Also, you must remember to delete the file when it is no longer needed.


I agree I would not use the app_info.xml in this case, but I just presented it to him as another option.
ID: 3305 · Rating: 0
AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 3306 - Posted: 15 Nov 2005, 19:28:33 UTC
Last modified: 15 Nov 2005, 19:58:09 UTC

As I have noted in another thread, we have not had any problems with R@H 4.79. It seems very stable. Most of our boxes are running BOINC 5.x with 3 projects; clients stay in memory and switch after 60 mins. R@H client 4.78 (the old version) hung at 1% on a number of occasions, even with 120 mins of continuous run time. Seems encouraging......Cheers, Rog.
ID: 3306 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,504,680
RAC: 1,117
Message 3313 - Posted: 15 Nov 2005, 20:45:37 UTC - in response to Message 3305.  

I agree I would not use the app_info.xml in this case, but I just presented it to him as another option.


Yep! No problem - it's a viable option here. I didn't notice your statement at the bottom that you'd just abort the WUs or I wouldn't have bothered editing. I'm not concerned in this specific case, but I've seen people get "carried away" with this type of thing on the SETI boards - read a recommendation to one person, assume it applies to them even though they have a totally different problem - and then they wind up so messed up it takes a total reinstall to get it straightened up. Editing the xml is sometimes the _only_ way to go, but it's not for the masses; it's like "hit Reset Project" - for a while it seemed like everyone was doing that for every problem, even though it only helped in maybe 10% of the cases, and it caused tons of lost results...

ID: 3313 · Rating: 0
Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 3318 - Posted: 15 Nov 2005, 21:16:11 UTC - in response to Message 3291.  

how exactly does the software know to use 4.79?


I believe the call to use one version over another is actually encoded in the work unit itself (again, I'm only speaking from observation). Once the work unit arrives, it triggers the download of the support files (the actual EXE) that are needed to run it.

Thus aborting is the safest route: abort until you receive 4.79 work units... or (and I STRONGLY don't suggest this) manipulate the xml files to fake BOINC into using 4.79 instead of 4.78.
ID: 3318 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org