Message boards : Number crunching : Problems with version 5.96
Author | Message |
---|---|
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
We are looking into this. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
Are the actual rosetta processes not running and the boinc client stays idle as if it doesn't know that the error occurred? We need some more feedback to assess the situation. There is definitely a problem right now with these jobs that were submitted yesterday. If the client doesn't report back, we can't tell that the errors are occurring. I sent an email to Rom and David Anderson to see if there may have been an issue with the BOINC api. |
Helix Von Smelix Send message Joined: 16 Oct 05 Posts: 12 Credit: 4,030,163 RAC: 6 |
My problem WUs are shown as running, but neither the CPU run time nor the time to run is changing; both are stopped. CPU usage is zero. It seems to happen from 60% complete or higher. Sorry I don't have more information. If you reboot, the WU goes to 100% after a few seconds and then shows a client error. |
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
The processes stay running at or near 100% finished and continue until an abort, restart, or deadline. Further, a restart will either finish without recrunching the last decoy or will start the last one over with (most likely) the same hang at or near 100%. The hang is accompanied by the CPU/core usage dropping to zero, yet BOINC is unable to assign any other crunching job to that CPU/core. The core becomes useless, except for handling system overhead. The watchdog process is either unable to correct the issue or fails to recognize that there is an issue. I will open up a system to more jobs (I had discontinued all my Rosetta crunching) to see if I can capture any more information about this. I agree that this would be a difficult one to track, since the problem only shows in aborted jobs or, in the case of a successful restart intervention, a completed job. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
Is this platform specific? If people are getting this client hanging error, please post here: (1) the platform, (2) the client version, (3) the work unit, and (4) any additional details that may help. |
EW-3 Send message Joined: 1 Sep 06 Posts: 27 Credit: 2,561,427 RAC: 0 |
Not exactly germane to this particular issue, but given that I have several computers back at the office that should be cranking right now but appear to be dead, I was wondering whether there might be a method that could permit a remote reset/restart in situations like this? |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
Are people seeing this type of hanging error with minirosetta version 1.28 jobs? |
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
Mine are all Linux, but several above have been reported with Windows. Most of my machines use the 5.10.21 client, with 3 or 4 at 5.10.45. The symptoms are the same for each client. I have already posted a half dozen work units to check, and a couple of dozen were posted by others earlier in the thread. I would do the debug work for you, but I don't have all the required code, so please read what others have posted. |
Allan Hojgaard Send message Joined: 4 May 08 Posts: 9 Credit: 591,749 RAC: 0 |
These 3 WUs showed the same behavior as the 3 other WUs I reported earlier, but since I cannot find my post in this thread I will report them again and add these 3 to the list. The latest: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_65018_0; t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_74793_0; t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_77384_0. The previous: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_10358_0; t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_28688_0; t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_38843_0. What happens is that these WUs reach 100% but do not use any CPU power at all. BOINC still reports them as running and does not release, report, or abort them, so one of my cores sits idle and does nothing. The only solution is to manually abort the WUs and let BOINC take the next one. This is very annoying, as I could be crunching WUs with all that idle time instead of having my other core wait indefinitely for an already finished WU. Laptop specs: Ubuntu Linux 8.04 (2.6.24-19-generic kernel), BOINC 5.10.45, Intel Core 2 Duo T7300 @ 2GHz, 2GB RAM. I would like to hear a response or workaround from the developers or forum administrators so that I, and surely many others, can better navigate the beta pitfalls and spend more time crunching and less time not crunching. |
eric Send message Joined: 2 Jan 07 Posts: 23 Credit: 815,696 RAC: 0 |
Calculation froze on WU 1wrp__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1wrp_-native__3004_6 (https://boinc.bakerlab.org/rosetta/workunit.php?wuid=137744826). I'm getting the same problem as you are: WUs show as running but the CPU is at 0%. This is happening on a Phenom 9850 and an Intel Core Duo. I am now going to move most of my crunching to WCG until this gets fixed. I have also gotten a lot of validation errors and such. |
Allan Hojgaard Send message Joined: 4 May 08 Posts: 9 Credit: 591,749 RAC: 0 |
Yet another WU that takes roughly 3 hours to crunch, gets stuck at 100% and doesn't actually finish. Roughly 3 hours wasted again. I am going to crunch WUs for Spinhenge@home until the matter has been resolved. I am also going to periodically check this thread and the main page for updates on this issue. Please fix this issue quickly because so far Rosetta@Home is my favorite distributed computing project. |
sslickerson Send message Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
"Is this platform specific?" Windows XP, BOINC version 5.10.45, client version 5.96. I've just aborted all of my remaining WUs. I checked CPU usage and I was using one core with a minirosetta WU (everything OK). The 5.96 WU was idle. I restarted BOINC and the WU actually started crunching again, but I aborted anyhow, given that I have seen it start normally after a restart and then hang shortly after. The restart always tends to crash one or both WUs (even the nominal WUs such as minirosetta). That's all I have, hope it helps. I am going to suspend for now given the many lost hours over the past few days, but I will be back when the problem is fixed. Tim |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 2 |
I had already posted the required info. "There is definitely a problem right now with these jobs that were submitted yesterday." Are they gone now? I don't want to reconnect to Rosetta and find idle cores at remote sites requiring an hour's drive to press a button. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
We haven't been able to reproduce this behavior yet. Tomorrow I'll update rosetta with the latest boinc api and double check the source code to see if there were any changes between versions that could be causing this. We are seeing an odd error at the end of a local run on our linux machines that suggests an api issue but it may or may not be related. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 2 |
Thanks David, I'll stay suspended for now then. |
eberndl Send message Joined: 17 Sep 05 Posts: 47 Credit: 3,076,784 RAC: 461 |
Hello, I've had at least 3 units go to 100% and then just sit there. I'm running the Linux version of 5.10.45 on Ubuntu 8.04. The most recent unit I've had to kill is t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_69982 (https://boinc.bakerlab.org/rosetta/result.php?resultid=172177470). My computer is a dual core with 2GB of RAM. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I'm running Linux (various versions). The recent spate of stuck WUs has been t405 and Rosetta 5.96. Before this I had seen a rare stuck WU. This seems to happen sometimes when Rosetta exits in a non-normal way, such as being ended by the watchdog. The t405 WUs seem to get stuck after they try to end. They are usually at 100% in BOINC manager. A few show a percentage short of 100%, but I think that's just a race between getting stuck and the final update of the percentage. They are shown as "running", but the CPU time isn't increasing. The stderr file in the stuck WU's slot directory shows the number of decoys produced, the message about the watchdog shutting down, and the call to boinc_finish (all of which looks normal). It then shows a stack trace, and finally says "Exiting...". It says all that while the WU is stuck and before anything has been done about the WU. If I stop and restart BOINC, the WU usually crunches another decoy and then tries to exit and gets stuck again. You can see I did this a few times here. Looking at the node the WU is running on shows the CPU is idle. The Rosetta process shows up with the 'ps' command. It shows as having three threads. (A normally running WU has four. Perhaps the watchdog thread is the one that disappeared?) All three threads seem to be "sleeping" ('S' status). If the Rosetta process is "kill"ed, BOINC says it exited with no finished file, and then restarts it. I didn't have a System.map file on my node, but I dug up what I think is the right one and added it. That lets me get the WCHAN info for the Rosetta process (assuming I did things right). Note that this is with 32-bit Gentoo Linux 2.4.31-gentoo-r1. Normally the main CPU-using thread of Rosetta is active, and the other threads are sleeping in "schedule_timeout". With a stuck WU the other threads are still in "schedule_timeout", but the main thread is shown as sleeping in "rt_sigsuspend". I hope all this is helpful. |
vicel Send message Joined: 28 Mar 06 Posts: 5 Credit: 957,142 RAC: 0 |
Again, another WU named t405_...: after progress reached 100%, the task status was not set to "Ready to report" and stayed at "Running" (CPU not used). WU: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=157148293. Rosetta Beta 5.96, Ubuntu 8.04 (kernel 2.6.24-19-generic, GNOME 2.22.2), BOINC 5.10.45, Core 2 Duo E4500, Memory 3.2Mb, HDD 200 Gb (available 89 Gb). |
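
The manual check described in the thread (a task shown as "running" whose threads are all sleeping and whose CPU time is not increasing) could be approximated with a short script. This is only a minimal sketch, not project or BOINC tooling. It assumes a Linux kernel that exposes `/proc/<pid>/task/<tid>/stat` and `wchan` (2.6 or later, so not the 2.4 kernel mentioned above, where a System.map file was needed). The function names `parse_stat`, `thread_snapshot`, and `looks_stuck`, and the 10-second sampling interval, are hypothetical choices made for illustration.

```python
import os
import sys
import time


def parse_stat(text):
    """Parse one /proc/<pid>/stat line into (comm, state, cpu_ticks)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    head, _, tail = text.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    state = fields[0]  # field 3 of the stat line: R, S, D, Z, ...
    # Fields 14 (utime) and 15 (stime) are CPU time in clock ticks.
    cpu_ticks = int(fields[11]) + int(fields[12])
    return comm, state, cpu_ticks


def thread_snapshot(pid):
    """Return {tid: (state, cpu_ticks, wchan)} for every thread of pid."""
    snapshot = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        with open(f"{task_dir}/{tid}/stat") as f:
            _, state, ticks = parse_stat(f.read())
        try:
            with open(f"{task_dir}/{tid}/wchan") as f:
                wchan = f.read().strip() or "-"
        except OSError:
            wchan = "?"  # wchan may be unreadable without privileges
        snapshot[int(tid)] = (state, ticks, wchan)
    return snapshot


def looks_stuck(before, after):
    """True if no thread is runnable and no thread gained CPU time."""
    missing = (None, -1, None)  # a thread absent from `before` counts as progress
    return bool(after) and all(
        after[tid][0] != "R" and after[tid][1] == before.get(tid, missing)[1]
        for tid in after
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])
    first = thread_snapshot(pid)
    time.sleep(10)  # arbitrary sampling interval
    second = thread_snapshot(pid)
    for tid, (state, ticks, wchan) in sorted(second.items()):
        print(tid, state, ticks, wchan)
    print("possibly stuck" if looks_stuck(first, second) else "making progress")
```

Run it with the PID of the Rosetta process as the only argument. A process that is merely waiting on I/O can also show no CPU progress over a short interval, so a "possibly stuck" result is a hint to look closer, not proof of the hang reported here.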
©2024 University of Washington
https://www.bakerlab.org