Problems with version 5.96

Message boards : Number crunching : Problems with version 5.96

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 53793 - Posted: 18 Jun 2008, 21:35:51 UTC

We are looking into this.
ID: 53793 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 53798 - Posted: 18 Jun 2008, 21:56:03 UTC

Are the actual Rosetta processes not running while the BOINC client stays idle, as if it doesn't know that the error occurred? We need some more feedback to assess the situation. There is definitely a problem right now with these jobs that were submitted yesterday. If the client doesn't report back, we can't tell that the errors are occurring. I sent an email to Rom and David Anderson to ask whether there may have been an issue with the BOINC API.
ID: 53798 · Rating: 0
Helix Von Smelix
Joined: 16 Oct 05
Posts: 12
Credit: 4,030,163
RAC: 6
Message 53799 - Posted: 18 Jun 2008, 22:01:39 UTC - in response to Message 53798.  
Last modified: 18 Jun 2008, 22:05:51 UTC

My problem WUs are shown as running, but neither the CPU run time nor the time to run is changing; both are stopped. CPU usage is zero.

It seems to happen from 60% complete or higher. Sorry, I don't have more information.

If you reboot, the WU goes to 100% after a few seconds and then shows a client error.
ID: 53799 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 53805 - Posted: 18 Jun 2008, 22:48:11 UTC

--

The processes stay running at or near 100% finished and continue until an abort, restart, or deadline. Further, a restart will either finish without recrunching the last decoy or start the last one over with (most likely) the same hang at or near 100%.

While this happens, CPU usage on that CPU/core drops to zero, yet BOINC is unable to assign any other crunching job to it. That core becomes useless, except for handling system overhead.

The watchdog process is either unable to correct the issue or fails to recognize that there is an issue.

I will open up a system to more jobs (I had discontinued all my Rosetta crunching) to see if I can capture any more information about this.

I agree that this would be a difficult one to track since the problem only shows in aborted jobs or, in the case of a successful restart intervention, a completed job.


Looking for a team ??? Join BoincSynergy!!


ID: 53805 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 53807 - Posted: 18 Jun 2008, 22:57:49 UTC
Last modified: 18 Jun 2008, 23:07:41 UTC

Is this platform specific?

If people are getting this client hanging error please post here:
1. the platform
2. client version
3. work unit
4. additional details that may help
ID: 53807 · Rating: 0
EW-3
Joined: 1 Sep 06
Posts: 27
Credit: 2,561,427
RAC: 0
Message 53808 - Posted: 18 Jun 2008, 23:07:14 UTC - in response to Message 53807.  

This is not exactly germane to this particular issue, but since I have several computers back at the office that should be cranking right now and appear to be dead, I was wondering whether there might be a method that permits a remote reset/restart in situations like this?
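One possible approach to the remote reset/restart asked about here is the boinccmd command-line tool that ships with BOINC, which can talk to a client on another machine. The sketch below only builds and runs such a command from Python. It assumes a recent boinccmd with the `--host`, `--passwd`, `--task` and `--project` options, and it assumes the remote client has been configured to accept remote GUI RPC connections. Older clients such as the 5.10.x series discussed in this thread may name the tool or its options differently, so check the local help output first. The helper names, host name and password are hypothetical.

```python
import subprocess


def boinccmd_args(host, password, *command):
    """Build a boinccmd argument list that targets a remote client.

    Assumption: the tool is named 'boinccmd' and accepts --host/--passwd.
    """
    return ["boinccmd", "--host", host, "--passwd", password, *command]


def abort_task(host, password, project_url, task_name, run=subprocess.run):
    """Ask the remote client to abort one task (hypothetical helper)."""
    args = boinccmd_args(host, password, "--task", project_url, task_name, "abort")
    return run(args, check=True)


def reset_project(host, password, project_url, run=subprocess.run):
    """Ask the remote client to reset a whole project (hypothetical helper)."""
    args = boinccmd_args(host, password, "--project", project_url, "reset")
    return run(args, check=True)
```

A call such as `abort_task("office-pc-1", "secret", "https://boinc.bakerlab.org/rosetta/", "<task name>")` would then abort a single hung task without a trip to the machine. The `run` parameter exists only so the command can be inspected or replaced in a dry run.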


ID: 53808 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 53809 - Posted: 18 Jun 2008, 23:17:44 UTC

Are people seeing this type of hanging error with minirosetta version 1.28 jobs?
ID: 53809 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 53811 - Posted: 19 Jun 2008, 0:10:24 UTC - in response to Message 53805.  

Mine are all Linux, but several above have been reported on Windows.

Most of my machines use the 5.10.21 client, with 3 or 4 at 5.10.45. The symptoms are the same for each client. I have already posted half a dozen work units to check, and a couple of dozen more were posted by others earlier in the thread.

I would do the debug work for you, but I don't have all the required code, so please read what others have posted.


Looking for a team ??? Join BoincSynergy!!


ID: 53811 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 53813 - Posted: 19 Jun 2008, 0:18:46 UTC
Last modified: 19 Jun 2008, 0:19:10 UTC

ID: 53813 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 53814 - Posted: 19 Jun 2008, 0:21:30 UTC - in response to Message 53813.  

ID: 53814 · Rating: 0
Allan Hojgaard
Joined: 4 May 08
Posts: 9
Credit: 591,749
RAC: 0
Message 53815 - Posted: 19 Jun 2008, 2:17:13 UTC

These 3 WUs showed the same behavior as the 3 other WUs I reported earlier, but since I cannot find my post in this thread, I will report them again and add these 3 to the list:

The latest:
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_65018_0
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_74793_0
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_77384_0

The previous:
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_10358_0
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_28688_0
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_38843_0

What happens is that these WUs reach 100% but do not use any CPU power at all. BOINC still reports them as running and does not release, report, or abort them, so one of my cores sits idle and does nothing. The only solution is to manually abort the WUs and let BOINC take the next one. This is very annoying, as I could be crunching WUs with all that idle time instead of having my other core wait indefinitely for an already finished WU.
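The symptom described above (a task reported as running whose CPU time never advances) could be spotted automatically by comparing two snapshots of per-task CPU time taken some minutes apart. The following is a minimal sketch under that assumption. How the snapshots are collected is left open, and the function name and the task names are hypothetical.

```python
def find_stalled(before, after, min_gain=0.0):
    """Return names of tasks whose CPU time did not advance between snapshots.

    `before` and `after` map task name -> accumulated CPU seconds.
    Tasks missing from `after` (finished or aborted) are ignored.
    """
    stalled = []
    for name, cpu_before in before.items():
        cpu_after = after.get(name)
        if cpu_after is not None and cpu_after - cpu_before <= min_gain:
            stalled.append(name)
    return sorted(stalled)


# Hypothetical snapshots taken one minute apart.
before = {"wu_a": 10800.0, "wu_b": 3600.0}
after = {"wu_a": 10800.0, "wu_b": 3660.0}
print(find_stalled(before, after))  # ['wu_a']
```

A task flagged this way is only a candidate: a task that is suspended or waiting for memory would look the same, so the result should be checked against the task status before aborting anything.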

Laptop specs:
Ubuntu Linux 8.04 (2.6.24-19-generic kernel)
BOINC 5.10.45
Intel Core 2 Duo T7300 @ 2GHz
2GB RAM

I would like to hear a response or workaround from the developers or forum administrators so that I, and surely many others, can better navigate the beta pitfalls and spend more time crunching and less time not crunching.
ID: 53815 · Rating: 0
eric
Joined: 2 Jan 07
Posts: 23
Credit: 815,696
RAC: 0
Message 53816 - Posted: 19 Jun 2008, 2:20:39 UTC - in response to Message 52124.  

Calculation froze on WU 1wrp__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1wrp_-native__3004_6 (https://boinc.bakerlab.org/rosetta/workunit.php?wuid=137744826).
Today I restarted my computer. After that, the process for this WU started with status "Running", but the calculation is not actually progressing (CPU time does not change, and System Monitor shows 0% for every "rosetta_beta_5.96..." process).
The message log has only "resuming task 1wrp_..." without any errors.
I tried restarting this WU with suspend and resume, but progress is still stopped. (Another WU did run.)

Linux Ubuntu 7.10 (32-bit), BOINC 5.10.45, Rosetta Beta 5.96 (for this WU).

Intel Core 2 Duo E4500 (2.2 GHz). RAM: 4 GB (3 GB actually used). HDD: 33 GB free.
Motherboard: Intel DG33BU (internal video).

SETI@Home/Rosetta@Home - 50/50.



I'm getting the same problem as you are, with WUs running but the CPU at 0%. This is happening on a Phenom 9850 and an Intel Core Duo. I am now going to move most of my crunching to WCG until this gets fixed. I have also gotten a lot of validation errors and such.
ID: 53816 · Rating: 0
Allan Hojgaard
Joined: 4 May 08
Posts: 9
Credit: 591,749
RAC: 0
Message 53820 - Posted: 19 Jun 2008, 3:02:37 UTC

Yet another WU that takes roughly 3 hours to crunch, gets stuck at 100% and doesn't actually finish. Roughly 3 hours wasted again. I am going to crunch WUs for Spinhenge@home until the matter has been resolved. I am also going to periodically check this thread and the main page for updates on this issue. Please fix this issue quickly because so far Rosetta@Home is my favorite distributed computing project.
ID: 53820 · Rating: 0
sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 53821 - Posted: 19 Jun 2008, 4:43:34 UTC - in response to Message 53807.  
Last modified: 19 Jun 2008, 4:45:50 UTC

Windows XP, BOINC version: 5.10.45, Client Version: 5.96

I've just aborted all of my remaining WUs. I checked CPU usage and I was using one core with a minirosetta WU (everything OK). The 5.96 WU was idle.

I restarted BOINC and the WU actually started crunching again, but I aborted it anyhow, given that I have seen WUs start normally after a restart and then hang shortly after.

A restart tends to crash one or both WUs (even normally behaving WUs such as minirosetta). That's all I have; I hope it helps.

I am going to suspend for now, given the many lost hours over the past few days, but I will be back when the problem is fixed.

Tim



ID: 53821 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 2
Message 53822 - Posted: 19 Jun 2008, 6:07:16 UTC

I had already posted the required info.

David E K wrote: "There is definitely a problem right now with these jobs that were submitted yesterday."

Are they gone now? I don't want to reconnect to Rosetta and find idle cores at remote sites requiring an hour's drive to press a button.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 53822 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 53823 - Posted: 19 Jun 2008, 6:28:46 UTC

We haven't been able to reproduce this behavior yet. Tomorrow I'll update Rosetta with the latest BOINC API and double-check the source code to see if there were any changes between versions that could be causing this. We are seeing an odd error at the end of a local run on our Linux machines that suggests an API issue, but it may or may not be related.
ID: 53823 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 2
Message 53824 - Posted: 19 Jun 2008, 8:45:05 UTC

Thanks David, I'll stay suspended for now then.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 53824 · Rating: 0
eberndl
Joined: 17 Sep 05
Posts: 47
Credit: 3,076,784
RAC: 461
Message 53827 - Posted: 19 Jun 2008, 12:13:57 UTC

Hello, I've had at least 3 units go to 100% and then just sit there. I'm running the Linux version of BOINC 5.10.45 on Ubuntu 8.04.
The most recent unit I've had to kill is t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_69982 (https://boinc.bakerlab.org/rosetta/result.php?resultid=172177470)

My computer is a dual core with 2GB of RAM...


Questions? Try the Wiki!
Take a look inside my brain
ID: 53827 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53829 - Posted: 19 Jun 2008, 13:23:08 UTC

I'm running Linux (various versions). The recent spate of stuck WUs has been t405 WUs running Rosetta 5.96.

Before this I've seen a rare stuck WU. This seems to happen sometimes when Rosetta exits in a non-normal way, such as being ended by the watchdog.

The t405 WUs seem to get stuck after they try to end. They are usually at 100% in BOINC manager. A few show a percentage short of 100%, but I think that's just a race between getting stuck and the final update of the percentage. They are shown as "running", but the CPU time isn't increasing.

The stderr file in the stuck WU's slot directory shows the number of decoys produced, the message about the watchdog shutting down, and the call to boinc_finish (all of which looks normal). It then shows a stack trace, and finally says "Exiting...". It says all that with the WU stuck and before anything has been done about the WU.

If I stop and restart BOINC, the WU usually crunches another decoy and then tries to exit and gets stuck again. You can see I did this a few times here

Looking at the node the WU is running on shows the CPU is idle. The Rosetta process shows up with the 'ps' command. It shows as having three threads. (A normally running WU has four. Perhaps the watchdog thread is the one that disappeared?) All three threads seem to be "sleeping" ('S' status). If the Rosetta process is "kill"ed, BOINC says it exited with no finished file, and then restarts it.

I didn't have a System.map file on my node, but I dug up what I think is the right one and added it. That lets me get the WCHAN info for the Rosetta process (assuming I did things right). Note that this is with 32bit Gentoo Linux 2.4.31-gentoo-r1. Normally the main CPU-using thread of Rosetta is active, and the other threads are sleeping in "schedule_timeout". With a stuck WU the other threads are still in "schedule_timeout", but the main thread is shown as sleeping in "rt_sigsuspend".
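The per-thread check described above (thread count, 'S' state, and the WCHAN wait channel) can be sketched in Python by reading the Linux /proc filesystem directly instead of using 'ps'. This is an illustration under assumptions: it relies on /proc/<pid>/task/<tid>/status and /proc/<pid>/task/<tid>/wchan, which may not exist on older kernels such as the 2.4 kernel mentioned in this post, and the function names are hypothetical.

```python
import os


def parse_state(status_text):
    """Extract the one-letter state (e.g. 'S' or 'R') from /proc status text."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return None


def thread_states(pid, proc_root="/proc"):
    """Map each thread id of `pid` to a (state, wchan) pair."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    result = {}
    for tid in sorted(os.listdir(task_dir), key=int):
        with open(os.path.join(task_dir, tid, "status")) as fh:
            state = parse_state(fh.read())
        try:
            with open(os.path.join(task_dir, tid, "wchan")) as fh:
                wchan = fh.read().strip() or None
        except OSError:
            wchan = None  # wchan may be unavailable or unreadable
        result[int(tid)] = (state, wchan)
    return result


def looks_hung(states):
    """True if there is at least one thread and every thread is sleeping."""
    return bool(states) and all(state == "S" for state, _ in states.values())
```

Calling `looks_hung(thread_states(pid))` on the Rosetta process id gives a quick yes/no, and printing the mapping shows which wait channel each thread is in. All threads sleeping is only a hint, since a healthy process can briefly look the same while waiting on I/O, so the check is best repeated over a minute or two.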

I hope all this is helpful.

ID: 53829 · Rating: 0
vicel
Joined: 28 Mar 06
Posts: 5
Credit: 957,142
RAC: 0
Message 53830 - Posted: 19 Jun 2008, 13:32:34 UTC

Again, another WU named t405_...: after progress reached 100%, the task status was not set to "Ready to report" and stayed at "Running" (with no CPU being used).
WU: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=157148293

Rosetta Beta 5.96
Ubuntu 8.04 (kernel 2.6.24-19-generic, GNOME 2.22.2)
BOINC 5.10.45
Core 2 Duo E4500, Memory 3.2 GB, HDD 200 GB (available 89 GB)
ID: 53830 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org