Message boards : Number crunching : Problems with version 5.90/5.91
Author | Message |
---|---|
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Thanks for continuing to post bugs. We'd be particularly grateful if users who were noticing memory hog issues with 5.89 could post if the newer app is better! |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
resultid=128108283 This is but one example of why it appears that 5.90 doesn't run on Linux machines (at least mine). I watched it switch from 5.89 to 5.90. Gkrellm shows 100% CPU usage on both cores of my AMD64 X2 6000; however, the CPU time and progress indicators DO NOT progress/count up. I have aborted 3 different jobs so far in the last 10 minutes for this reason, and have found none so far that run properly. I use this BOINC version on all machines and am waiting for the other machines to finish 5.89 work to see what happens: 5.10.21 X86-64. NOTE: NONE of them ran on my machine. I ended up aborting them all and rebooting to Windows on this machine. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
I'm seeing the exact same thing with my AMD64 2800 and my AMD64 X2 4800 as well. I let them work on the tasks for 15 min and still only have --- as the CPU time. Not even 00:00:00, although I did see the zeros on a couple of the ones from the 6000. Also, after suspending the already running 5.89 tasks, both of them changed to "computation error". The 4800 is the only one that produced the "computation error" after suspension. Looks like I'll be Windows only after these 5.89s run dry. NOTE: 5.90 does run on my AMD64 3700 under Windows. Hmmm, after 15 min and before I could abort the ones on my 4800, one of them switched to 19.864% done, but it still shows --- as the CPU time. |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
This seems very odd. Thanks a lot for posting, especially the link to the workunit. I checked here that the %CPU usage is fine for other platforms, so I fear that this is a Linux-specific issue. Anyone else out there noticing success or failure with Linux? Astro, do other apps (e.g., SETI) run fine? Also, do you happen to know what version of BOINC you are using? [Quote:] "I'm seeing the exact same thing with my AMD64 2800 and my AMD64 X2 4800 as well. I let them work on the tasks for 15 min and still only have --- as a cpu time. Not even 00:00:00. Although I did see the zeros on a couple of the ones from the 6000. Also, after suspending the already running 5.89 tasks both of them changed to "computation error". The 4800 is the only one that produced the "computation error" after suspension. Looks like I'll be windows only after these 5.89's run dry." |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
[Quote:] "This seems very odd." Mandriva spring free 2007, BOINC (official 64-bit version) 5.10.21. I watched it switch from 5.89 to 5.90 on the 6000 machine, so I watched it stop working. I have aborted all 5.90 tasks except the two still running on my AMD64 X2 4800. I watched the progress jump from zero to 19, then later to 26 percent, but it hasn't moved in some time. I thought these might be "checkpoints" where it updates the progress. On the other WU on that machine it took a long time to go from 0 to .010%, and it has just recently switched to .020% (nearly 40 min of run time, and CPU run time is set to 1 hour). I am trying to see if those two will finish and upload normally. Time will tell LOL. I'm sure they'll report ZERO CPU time, and therefore Zero Claimed Credit. Some users might not care for zero credit. LOL HMMM I just looked over and see that the one that was at 26 percent just switched to 66.929% and shows 10 min remaining, but still has zero CPU time. Also, To Completion is lowering with each update to the percentage. Oops, yes, it used to do SETI fine, but not since the 26th of November, as I stopped doing SETI. [edit] Both now show 75 and 79% done. I'll get you some links when they're uploaded and reported. Neither shows CPU time. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
My AMD64 X2 4800 in question is hostid=692483. At some point a minute or so from ending, the cpu time "flashed" 01:03:34 then back to ---. I just watched the 2 mit BOINC SYMM Fold and dock switch from 92% with 00:01:34 remaining to 0.029% with 01:38:05 remaining. As if it just restarted over again from scratch. Should I abort these, or let them run a bit longer??? The second/other wu was at 85% with 2 min remaining, and switched to actually displaying 00:00:00 cpu time, 0.000% done, and 01:38:05 remaining. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
I just aborted them and two 5.89s started right up, working normally. The goofy ones were resultid=128129057 and resultid=128128049. I'm now free of any 5.90s and will be Windows only for a while (once I'm out of 5.89s). |
sslickerson Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
|
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
I currently have both a Ralph 5.90 and a Rosetta 5.90 running on my home system (SuSE Linux 32-bit dual cpu). The Ralph 5.90 task shows cpu time 00:00:00 and progress 0.000% and 3:59:02 time to completion (I run Ralph with 4 hour workunits). This task seems to have run for over 24 hours and will probably never finish. The Rosetta 5.90 task shows cpu time --- and progress 0.040% and 7:54:07 time to completion (I run Rosetta with 8 hour workunits). It has only run for about 10 minutes, so there is no telling yet how it will behave. While I'm typing this the progress and time to completion values have jumped up and down a couple of times (by as much as 2.5% progress and 30 minutes of time to completion), but cpu time remains just dashes. Team Helix |
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
Shortly after posting the above message I saw a brief flash of the cpu time for the Rosetta 5.90 task (which was confirmed by ps), but the display went back to --- immediately afterwards. Here are all the lines starting with BOINC in the stdout.txt file for the indefinite Ralph 5.90 task. It seems actual cpu time is stuck at 0.000999 and therefore never approaches 14400 (4 hours). The Watchdog timer isn't kicking in because the client is making progress completing more and more decoys.
BOINC :: [2007-12-19 21:25:55:] :: mode: pose1 :: nstartnum: 1 :: number_of_output: 9999 :: num_decoys: 0 :: pct_complete: 0
BOINC :: [2007-12-19 22:24:24:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 1 :: num_decoys: 1 :: farlx_stage: 0
BOINC :: [2007-12-19 23:17:12:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 2 :: num_decoys: 2 :: farlx_stage: 0
BOINC :: [2007-12-19 23:17:12:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.0004995
BOINC :: [2007-12-20 0: 3:15:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 3 :: num_decoys: 3 :: farlx_stage: 0
BOINC :: [2007-12-20 0: 3:15:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.000333
BOINC :: [2007-12-20 0:51:41:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 4 :: num_decoys: 4 :: farlx_stage: 0
BOINC :: [2007-12-20 0:51:41:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.00024975
BOINC :: [2007-12-20 1:49:10:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 5 :: num_decoys: 5 :: farlx_stage: 0
BOINC :: [2007-12-20 1:49:10:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.0001998
BOINC :: [2007-12-20 2:43:43:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 6 :: num_decoys: 6 :: farlx_stage: 0
BOINC :: [2007-12-20 2:43:43:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.0001665
BOINC :: [2007-12-20 3:34:52:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 7 :: num_decoys: 7 :: farlx_stage: 0
BOINC :: [2007-12-20 3:34:52:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.000142714
BOINC :: [2007-12-20 4:24:11:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 8 :: num_decoys: 8 :: farlx_stage: 0
BOINC :: [2007-12-20 4:24:11:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.000124875
BOINC :: [2007-12-20 5:12:53:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 9 :: num_decoys: 9 :: farlx_stage: 0
BOINC :: [2007-12-20 5:12:53:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.000111
BOINC :: [2007-12-20 5:55:19:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 10 :: num_decoys: 10 :: farlx_stage: 0
BOINC :: [2007-12-20 5:55:19:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 9.99e-05
BOINC :: [2007-12-20 6:41:22:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 11 :: num_decoys: 11 :: farlx_stage: 0
BOINC :: [2007-12-20 6:41:22:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 9.08182e-05
BOINC :: [2007-12-20 7:46: 5:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 12 :: num_decoys: 12 :: farlx_stage: 0
BOINC :: [2007-12-20 7:46: 5:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 8.325e-05
BOINC :: [2007-12-20 8:30:52:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 13 :: num_decoys: 13 :: farlx_stage: 0
BOINC :: [2007-12-20 8:30:52:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 7.68462e-05
BOINC :: [2007-12-20 9:19:13:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 14 :: num_decoys: 14 :: farlx_stage: 0
BOINC :: [2007-12-20 9:19:13:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 7.13571e-05
BOINC :: [2007-12-20 10: 5:29:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 15 :: num_decoys: 15 :: farlx_stage: 0
BOINC :: [2007-12-20 10: 5:29:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 6.66e-05
BOINC :: [2007-12-20 11: 3:21:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 16 :: num_decoys: 16 :: farlx_stage: 0
BOINC :: [2007-12-20 11: 3:21:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 6.24375e-05
BOINC :: [2007-12-20 11:53:49:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 17 :: num_decoys: 17 :: farlx_stage: 0
BOINC :: [2007-12-20 11:53:49:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 5.87647e-05
BOINC :: [2007-12-20 12:40:32:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 18 :: num_decoys: 18 :: farlx_stage: 0
BOINC :: [2007-12-20 12:40:32:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 5.55e-05
BOINC :: [2007-12-20 13:24:42:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 19 :: num_decoys: 19 :: farlx_stage: 0
BOINC :: [2007-12-20 13:24:42:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 5.25789e-05
BOINC :: [2007-12-20 14:14:23:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 20 :: num_decoys: 20 :: farlx_stage: 0
BOINC :: [2007-12-20 14:14:23:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 4.995e-05
BOINC :: [2007-12-20 15: 6:58:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 21 :: num_decoys: 21 :: farlx_stage: 0
BOINC :: [2007-12-20 15: 6:58:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 4.75714e-05
BOINC :: [2007-12-20 15:51: 4:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 22 :: num_decoys: 22 :: farlx_stage: 0
BOINC :: [2007-12-20 15:51: 4:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 4.54091e-05
BOINC :: [2007-12-20 16:35: 6:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 23 :: num_decoys: 23 :: farlx_stage: 0
BOINC :: [2007-12-20 16:35: 6:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 4.34348e-05
BOINC :: [2007-12-20 17:24: 1:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 24 :: num_decoys: 24 :: farlx_stage: 0
BOINC :: [2007-12-20 17:24: 1:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 4.1625e-05
BOINC :: [2007-12-20 18: 7:38:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 25 :: num_decoys: 25 :: farlx_stage: 0
BOINC :: [2007-12-20 18: 7:38:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 3.996e-05
BOINC :: [2007-12-20 19: 1: 9:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 26 :: num_decoys: 26 :: farlx_stage: 0
BOINC :: [2007-12-20 19: 1: 9:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 3.84231e-05
BOINC :: [2007-12-20 19:50:50:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 27 :: num_decoys: 27 :: farlx_stage: 0
BOINC :: [2007-12-20 19:50:50:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 3.7e-05
BOINC :: [2007-12-20 20:35:27:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 28 :: num_decoys: 28 :: farlx_stage: 0
BOINC :: [2007-12-20 20:35:27:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 3.56786e-05
BOINC :: [2007-12-20 21:31:26:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 29 :: num_decoys: 29 :: farlx_stage: 0
BOINC :: [2007-12-20 21:31:26:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 3.44483e-05
BOINC :: [2007-12-20 22:17: 3:] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 30 :: num_decoys: 30 :: farlx_stage: 0
BOINC :: [2007-12-20 22:17: 3:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 3.33e-05
Team Helix |
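The symptom in the log excerpt above can be checked mechanically. The following is a minimal Python sketch, not part of the Rosetta application, that parses `cpu_time` report lines in the format shown and flags a run whose reported CPU time stays constant while the wall-clock timestamps advance. The names `parse_cpu_reports` and `cpu_time_stalled` and the 600-second threshold are hypothetical; the two sample lines are copied from the excerpt. The logged values also look consistent with `cpu_time_per_nstruct` being `cpu_time` divided by the decoy count (0.000999 / 2 = 0.0004995), though that is an inference from the numbers, not something the thread states.

```python
import re
from datetime import datetime

# Matches report lines such as:
# BOINC :: [2007-12-20 0: 3:15:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 :: cpu_time_per_nstruct: 0.000333
# Time fields may be space-padded ("0: 3:15"), so optional whitespace is allowed.
REPORT_RE = re.compile(
    r"BOINC :: \[(\d{4})-(\d{1,2})-(\d{1,2})\s+(\d{1,2}):\s*(\d{1,2}):\s*(\d{1,2}):\]"
    r" :: cpu_time_pref: (\S+) :: cpu_time: (\S+) :: cpu_time_per_nstruct: (\S+)"
)

def parse_cpu_reports(text):
    """Return a list of (timestamp, cpu_time_pref, cpu_time, cpu_time_per_nstruct)."""
    reports = []
    for match in REPORT_RE.finditer(text):
        groups = match.groups()
        stamp = datetime(*(int(g) for g in groups[:6]))
        pref, cpu, per_struct = (float(g) for g in groups[6:])
        reports.append((stamp, pref, cpu, per_struct))
    return reports

def cpu_time_stalled(reports, min_wall_seconds=600):
    """True if wall-clock time advanced but the reported CPU time did not."""
    if len(reports) < 2:
        return False
    first, last = reports[0], reports[-1]
    wall_seconds = (last[0] - first[0]).total_seconds()
    return wall_seconds >= min_wall_seconds and last[2] <= first[2]

# Two report lines copied from the stdout.txt excerpt in this post.
sample = (
    "BOINC :: [2007-12-19 23:17:12:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 "
    ":: cpu_time_per_nstruct: 0.0004995 "
    "BOINC :: [2007-12-20 0: 3:15:] :: cpu_time_pref: 14400 :: cpu_time: 0.000999 "
    ":: cpu_time_per_nstruct: 0.000333"
)

reports = parse_cpu_reports(sample)
print(cpu_time_stalled(reports))  # True: about 46 minutes elapsed, CPU time unchanged
```

Because the application compares `cpu_time` against `cpu_time_pref` (14400 here), a counter frozen at 0.000999 would never reach the preference, which matches the behaviour described in the post.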
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi sslickerson. I'm no X pert, but I had a look at your results, and the ones I saw were all for the 5.89 app. pete. |
Marky-UK Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
[Quote:] "This seems very odd. Thanks a lot for posting, especially the link to the workunit. I checked here that the %cpu usage is fine for other platforms, so I fear that this is a linux-specific issue." Yes, I have two Linux boxes. Both are using 100% CPU time on Rosetta as seen in the process list, but BOINC Manager shows 0 progress and 0 CPU time on the WUs. After I stopped and started the BOINC client, though, the WUs completed and uploaded OK. One had a runtime of 9 hrs 50 min (my preference is 3 hrs). |
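Several posts compare what the process list shows with what BOINC Manager reports. On Linux, the kernel's own accounting for a process can be read from `/proc/<pid>/stat`. The sketch below, in Python, shows one way to extract the process state and accumulated CPU seconds from that file. The helper names and the sample line (including the process name `rosetta_5.90`) are hypothetical illustrations, not taken from the thread.

```python
import os

def parse_proc_stat(stat_line, ticks_per_second=100):
    """Extract (pid, state, cpu_seconds) from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces,
    # so split on the last ')' rather than on whitespace alone.
    head, _, tail = stat_line.rpartition(")")
    pid = int(head.split("(", 1)[0])
    fields = tail.split()
    state = fields[0]               # field 3: R, S, D, Z (zombie), ...
    utime_ticks = int(fields[11])   # field 14: time spent in user mode
    stime_ticks = int(fields[12])   # field 15: time spent in kernel mode
    return pid, state, (utime_ticks + stime_ticks) / ticks_per_second

def read_cpu_seconds(pid):
    """Read the live values for one process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        line = handle.read()
    return parse_proc_stat(line, os.sysconf("SC_CLK_TCK"))

# Hypothetical sample line, shortened to the first 15 fields.
sample = "12345 (rosetta_5.90) R 1 12345 12345 0 -1 4194304 100 0 0 0 354000 1200"
print(parse_proc_stat(sample))
```

If a value read this way keeps growing while the manager shows 0 or ---, the task is consuming CPU and only the reporting path is affected, which is what the reports in this thread suggest.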
BitSpit Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
I just finished going through the top 1000 computers and checking the results of the Linux systems. It's a very sad state. I only found 4 jobs that ran 100% properly. The rest either: (1) were marked invalid because the CPU time wasn't recorded; (2) ignored the CPU runtime preference and ran up to four times the preference; or (3) reported properly because BOINC was restarted. I've stopped all job requests here on all systems (Windows too). I've doubled my runtime preference to squeeze more out of the 5.89 jobs. When those are done and if there's no fix for the Linux systems, I'm finished with Rosetta. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
[Quote:] "I just finished going through the top 1000 computers and checking the results of the Linux systems. It's a very sad state. I only found 4 jobs that ran 100% properly. The rest either:" 5.90 works so far on my Windows machines. It's just Linux that has issues, and as evidenced earlier, Rhiju is watching, responding, and is involved with correcting this. I'll just run Windows until a patch can be applied. If you (or anyone) are Linux-only, then increasing the run time is a good solution. |
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 2 |
I am running Fedora 7 Linux x86_64 with 32-bit BOINC 5.8.16. I presume 5.90 is still beta? All I'm getting from the scheduler is beta 5.90. Anyway, yes, it uses less memory. GOOD. But when I tried to abort one, it became stuck in memory and became a zombie process. I had to kill -9 it. BAD. I will let 3 run to completion and see if they are OK. |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,124,428 RAC: 3,579 |
[Quote:] "I just finished going through the top 1000 computers and checking the results of the Linux systems. It's a very sad state. I only found 4 jobs that ran 100% properly. The rest either:" Hello Astro, what you have described for Rosetta, I am getting on Ralph. The WU appears not to be doing anything, but my processors are running at 100% on all 4 cores. BOINC Manager shows nothing happening except that the WU is running at High Priority. Stopping and starting BM will give the current state of the WU, but then it won't keep updating. My WUs ran for 9 to 11 hours on a 6 hour preference and produced 6 to 7 decoys in that time. Rhiju is aware of it, and my latest 4 WUs are doing the same thing. I have not noticed it on Rosetta yet, but I probably have not finished all the 5.89 WUs yet. I am running Linux Fedora Core versions 3 and 6. |
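To put the reported overruns side by side, the runtime can be expressed as a multiple of the runtime preference. This is a small Python sketch of that arithmetic using only figures posted in this thread (9 hrs 50 min against a 3 hour preference, and 9 to 11 hours against a 6 hour preference); the function name `overrun_ratio` is hypothetical.

```python
def overrun_ratio(runtime_hours, preference_hours):
    """Return how many times the runtime preference a task actually ran."""
    if preference_hours <= 0:
        raise ValueError("preference_hours must be positive")
    return runtime_hours / preference_hours

# 9 hrs 50 min on a 3 hour preference
print(round(overrun_ratio(9 + 50 / 60, 3), 2))  # 3.28
# 9 to 11 hours on a 6 hour preference
print(round(overrun_ratio(9, 6), 2))            # 1.5
print(round(overrun_ratio(11, 6), 2))           # 1.83
```

These ratios fall within the "up to four times the preference" range described earlier in the thread.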
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 2 |
[Quote:] "What you have described for Rosetta, I am getting on Ralph. WU appears not to be doing anything but my processors are running 100% on all 4 cores." Yes, I am seeing the same thing now too. It doesn't show any progress, but CPU is at 100%. I assume it's still working somehow? Correction: it does eventually update the tasks pane/status, but it's taking 10 minutes or so just to update the % done. This may be because of compiling with the newest BOINC API (as stated in the release notes thread). |
vicel Joined: 28 Mar 06 Posts: 5 Credit: 957,142 RAC: 0 |
I'm running Ubuntu 7.10 on a Core2Duo. The progress indicators DO NOT progress/count up here either. I have aborted 3 jobs. For the first WU I waited three hours: progress was 0, but the CPU was in use. |
dcdc Joined: 3 Nov 05 Posts: 1830 Credit: 119,208,549 RAC: 2,278 |
if it's so common, why wasn't the linux problem picked up on RALPH??? |
lusvladimir Joined: 18 Oct 05 Posts: 12 Credit: 1,784,854 RAC: 0 |
Ubuntu 7.10 and Core2Duo. The progress indicators do not progress and show 0% and 0.014% on my two WUs. I have waited 5 hours: progress is frozen, and CPU usage is 100% for both WUs. |
©2024 University of Washington
https://www.bakerlab.org