Help us solve the 1% bug!

Message boards : Number crunching : Help us solve the 1% bug!

Rom Walton (BOINC)
Volunteer moderator
Project developer

Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 12741 - Posted: 28 Mar 2006, 3:53:48 UTC

What is the size of your BOINC directory?

How many days' worth of workunits do you have? Which projects are attached?

Would you be willing to make a copy of the directory, abort all of the workunits in the copy except the one that is stalling, then zip everything up and send it to me?

----- Rom
My Blog

genes
Joined: 8 Oct 05
Posts: 60
Credit: 524,083
RAC: 1,534
Message 12742 - Posted: 28 Mar 2006, 4:26:41 UTC - in response to Message 12741.  
Last modified: 28 Mar 2006, 4:30:55 UTC

What is the size of your BOINC directory?

How many days' worth of workunits do you have? Which projects are attached?

Would you be willing to make a copy of the directory, abort all of the workunits in the copy except the one that is stalling, then zip everything up and send it to me?


Hi Rom,

My BOINC directory is 1.3GB. I am attached to CPDN (regular and seasonal), Rosetta, Ralph, Einstein, Seti, and Seti Beta. I currently have a CPDN seasonal and a CPDN sulphur WU, a ready-to-report Rosetta and the suspended Rosetta, a Seti Beta and an Einstein. I've set everything to "no new tasks" for now.

Running BOINC CC 5.3.28.

I keep a 0.1 day cache, so I don't have a lot of WU's around. I would not be happy to abort the CPDN WU's. I don't mind suspending everything for the time it takes to zip, etc., or aborting the other WU's.

[edit]
Wait a minute. Did I misunderstand you -- you mean abort the other WU's *after making a copy*, then send you that copy, then go about my merry way... sure, I'll do that.

Please advise on where and how to send...
[/edit]

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12743 - Posted: 28 Mar 2006, 4:37:50 UTC

All work units sent out since Friday have a maximum time limit of roughly 24 hours, so no computers should be getting stuck much longer than this.

genes
Joined: 8 Oct 05
Posts: 60
Credit: 524,083
RAC: 1,534
Message 12746 - Posted: 28 Mar 2006, 4:58:02 UTC
Last modified: 28 Mar 2006, 5:05:59 UTC

Rom-

I tried this: First I suspended network activity and work. I made a backup copy of my BOINC directory, then I restarted BOINC in its original directory. I aborted everything but the stuck Rosetta. I let the Rosetta go, and it passed the stuck point.

I killed BOINC, then deleted everything from Program Files\BOINC. Copied back the contents of BOINC_backup, started up again. Unsuspended the stuck Rosetta, it got stuck again.

So, I have this backup copy of my BOINC directory where this Rosetta WU will stick, but it seems to require the other processes to be running. I can burn this backup to a DVD-R and send it to you, how about that?

[edit] BTW, 4 at a time, dual Xeon with HT. [/edit]

[edit] Going to sleep now, will check again in the AM...[/edit]
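The copy-and-restore cycle described in this post could be scripted roughly as follows. This is only a sketch under stated assumptions: the directory names and the `client_state.xml` stand-in file are illustrative, the demo runs against a throwaway temporary directory, and BOINC itself must be stopped before touching a real installation.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(src: Path, dst: Path) -> None:
    """Copy the whole directory tree at src to a new directory dst."""
    shutil.copytree(src, dst)


def restore(backup: Path, live: Path) -> None:
    """Delete live (if present) and replace it with a fresh copy of backup."""
    if live.exists():
        shutil.rmtree(live)
    shutil.copytree(backup, live)


if __name__ == "__main__":
    # Demonstrate on a temporary directory, not on a real BOINC install.
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "BOINC"            # hypothetical live directory
        backup = Path(tmp) / "BOINC_backup"   # hypothetical backup location
        live.mkdir()
        (live / "client_state.xml").write_text("state before the test run")

        snapshot(live, backup)
        # Simulate the test run changing the live directory.
        (live / "client_state.xml").write_text("state after aborting tasks")

        restore(backup, live)
        print((live / "client_state.xml").read_text())
```

Because `restore` removes the live directory first, files created during the test run do not survive, which matches the "delete everything, copy the backup back" step above.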

Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 12747 - Posted: 28 Mar 2006, 5:05:26 UTC

Contact me offline and I'll let you know where to send it.

----- Rom
My Blog

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 12748 - Posted: 28 Mar 2006, 7:33:48 UTC - in response to Message 12743.  

All work units sent out since Friday have a maximum time limit of roughly 24 hours, so no computers should be getting stuck much longer than this.

Not so. Today I aborted 3 that had been at 1% for 28 to 38 hours. Your self-abort is not working. I hope it at least sends you back data on why it did not abort and why it got stuck.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 12760 - Posted: 28 Mar 2006, 15:28:01 UTC - in response to Message 12748.  

Not so. Today I aborted 3 that had been at 1% for 28 to 38 hours. Your self-abort is not working. I hope it at least sends you back data on why it did not abort and why it got stuck.

If you look at the WU ID page (NOT the result ID) it gives a creation date for the WU. What are the creation dates for those stuck WUs? The "All work units sent out since Friday" would refer to the creation date, not when you actually got the WU.

Your computers are hidden, so I couldn't figure out which WUs you are talking about.

dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 12763 - Posted: 28 Mar 2006, 18:06:21 UTC
Last modified: 28 Mar 2006, 18:07:41 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=12293043
https://boinc.bakerlab.org/rosetta/result.php?resultid=15133540

dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12764 - Posted: 28 Mar 2006, 18:57:35 UTC
Last modified: 28 Mar 2006, 19:04:08 UTC

I have one stuck at 1% (7:50:25) that was created on the 25th (Sat) at 22:19 UTC... I suspect the fix is unfixed...

(Oooops! I just noticed David's comment about 24 hours)

Result ID 14982362
Name HB_BARCODE_30_5croA_351_22702_0
Workunit 12161077
Created 25 Mar 2006 22:19:20 UTC
Sent 26 Mar 2006 8:33:29 UTC
Received ---
Server state In Progress
Outcome Unknown
Client state New
Exit status 0 (0x0)
Computer ID 159713
Report deadline 9 Apr 2006 8:33:29 UTC
CPU time 0
stderr out

Validate state Initial
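Following AMD_is_logical's point that the creation date, not the send date, is what matters, here is a small sketch of pulling both timestamps out of a listing like the one above. The `field` and `parse_utc` helpers are hypothetical names made up for this example, and the listing is pasted in by hand rather than fetched from the project site.

```python
from datetime import datetime, timezone

# A few lines copied from the result listing in this post.
LISTING = """\
Result ID 14982362
Name HB_BARCODE_30_5croA_351_22702_0
Workunit 12161077
Created 25 Mar 2006 22:19:20 UTC
Sent 26 Mar 2006 8:33:29 UTC
"""


def field(listing: str, label: str) -> str:
    """Return the text that follows `label` on the line starting with it."""
    for line in listing.splitlines():
        if line.startswith(label + " "):
            return line[len(label) + 1:].strip()
    raise KeyError(label)


def parse_utc(text: str) -> datetime:
    """Parse a timestamp such as '25 Mar 2006 22:19:20 UTC'."""
    parsed = datetime.strptime(text, "%d %b %Y %H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc)


created = parse_utc(field(LISTING, "Created"))
sent = parse_utc(field(LISTING, "Sent"))
print(field(LISTING, "Name"), "created", created.isoformat())
print("time between creation and sending:", sent - created)
```

The gap between the two timestamps shows why a work unit received after a server-side change can still predate it.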


David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12779 - Posted: 29 Mar 2006, 3:21:01 UTC - in response to Message 12764.  

I have one stuck at 1% (7:50:25) that was created on the 25th (Sat) at 22:19 UTC... I suspect the fix is unfixed...

(Oooops! I just noticed David's comment about 24 hours)

Result ID 14982362
Name HB_BARCODE_30_5croA_351_22702_0
Workunit 12161077
Created 25 Mar 2006 22:19:20 UTC
Sent 26 Mar 2006 8:33:29 UTC
Received ---
Server state In Progress
Outcome Unknown
Client state New
Exit status 0 (0x0)
Computer ID 159713
Report deadline 9 Apr 2006 8:33:29 UTC
CPU time 0
stderr out

Validate state Initial



Jobs beginning HB_BARCODE... were queued before we reduced the maximum CPU time, and we can't change the time limit retroactively. If you are having a lot of trouble with stuck WUs, you can delete these work units.


Pappateam
Joined: 9 Jan 06
Posts: 2
Credit: 1,610,324
RAC: 0
Message 12892 - Posted: 31 Mar 2006, 22:34:32 UTC

Still having WU's stuck every day at 1%. Computers range from Duron800 to T2300 (most of them are AMD) and there is no difference between them. Sometimes I notice the problem only after about 50 hours, so this problem is very bad.
Is there a solution on the horizon?

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12893 - Posted: 1 Apr 2006, 0:18:00 UTC - in response to Message 12892.  
Last modified: 1 Apr 2006, 1:35:38 UTC

Still having WU's stuck every day at 1%. Computers range from Duron800 to T2300 (most of them are AMD) and there is no difference between them. Sometimes I notice the problem only after about 50 hours, so this problem is very bad.
Is there a solution on the horizon?



The new work units should not be getting stuck at 1%. Could you try removing all pre-4.83 (on Windows) work units and let us know what happens?

[DPC]TeamHC~LostPoints
Joined: 19 Mar 06
Posts: 1
Credit: 272,665
RAC: 0
Message 12908 - Posted: 1 Apr 2006, 14:29:46 UTC

Got the same 1% problem over here.
Killing the WU didn't help, the next one also got the 1% problem.
Then I reset the project. (I have a Dutch version, so I don't know the exact English name for the button.)

After resetting the project all WU's were deleted and new ones were downloaded.
Now the system runs perfectly and since then no 1% errors occurred.

Pappateam
Joined: 9 Jan 06
Posts: 2
Credit: 1,610,324
RAC: 0
Message 12974 - Posted: 3 Apr 2006, 9:04:26 UTC - in response to Message 12893.  

The new work units should not be getting stuck at 1%. Could you try removing all pre-4.83 (on Windows) work units and let us know what happens?

This really seems to have solved the problem! Big thanks David!

Osku87
Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 13096 - Posted: 5 Apr 2006, 21:01:33 UTC

Nicely done, except there is one little flaw. It may be called the 1.042% bug. (The last number can be found in the graphics.) The WU stopped after about fifteen minutes of crunching. Rebooting the client or suspending and resuming the WU doesn't help. Now aborting. There went 9 hours of crunching...

Stage: Full atom relax
Model: 1 Step: 320044

Program version is 4.83

https://boinc.bakerlab.org/rosetta/result.php?resultid=16235196

Hope this was the only one.

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13102 - Posted: 6 Apr 2006, 0:12:12 UTC

The 042 in the 1.042% is supposed to give the programmers a much better idea of where the program is getting stuck. But there are a few other numbers being passed around, so .042 may not be the only sticking point. By reporting the whole number of where the WU was stuck, they'll hopefully kill off the last traces of this bug.
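For anyone collecting these reports, here is a rough sketch of splitting the displayed percentage into its whole part and the digits after the decimal point. It assumes the progress line looks like "1.042% Complete", as shown in the graphics window; `split_progress` is a made-up helper name, and what each digit sequence means inside Rosetta is not something this example knows.

```python
import re


def split_progress(text: str) -> tuple[int, str]:
    """Return (whole_percent, fraction_digits) from text like '1.042% Complete'.

    The fraction is kept as a string so that leading zeros survive.
    """
    match = re.search(r"(\d+)\.(\d+)%", text)
    if match is None:
        raise ValueError(f"no fractional percentage in {text!r}")
    return int(match.group(1)), match.group(2)


whole, fraction = split_progress("1.042% Complete")
print(whole, fraction)
```

Keeping the fraction as text avoids floating-point rounding and preserves the exact digits that were on screen.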

Corgi
Joined: 17 Oct 05
Posts: 2
Credit: 389,209
RAC: 0
Message 13146 - Posted: 7 Apr 2006, 2:44:51 UTC

I've got another one. Here's a copy of all the text on the BOINC display, plus the URL of a screenshot of the same. When I ran the test from the command prompt, it stopped at exactly the same point -- 39 min+ so far at time of this posting.

FA_RLXpt_hom006_1ptq__361_426_1 (left in memory)
------------------------
1.042% Complete
CPU time: 9 hr 13 min 58 sec

Corgi - Total credit: 1064.71 - RAC: 16.7777
GasBuddy

Stage: Full atom relax
Model: 1 Step: 314653
Accepted RMSD: 10.78
Accepted Energy: -51.5163

Rosetta@home v4.83 [URL]

Screenshot: http://pics.livejournal.com/sff_corgi/pic/000k21q6 (39.6Kb)
------------------------
PC ID: 23940 'Sothis'
GenuineIntel Intel(R) Pentium(R) M processor 1500MHz
Microsoft Windows XP Home Edition, Service Pack 2, (05.01.2600.00)
Corgi


Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13147 - Posted: 7 Apr 2006, 3:19:12 UTC - in response to Message 13146.  

I've got another one. Here's a copy of all the text on the BOINC display, plus the URL of a screenshot of the same. When I ran the test from the command prompt, it stopped at exactly the same point -- 39 min+ so far at time of this posting.

FA_RLXpt_hom006_1ptq__361_426_1 (left in memory)
------------------------
1.042% Complete
CPU time: 9 hr 13 min 58 sec

Model: 1 Step: 314653
Accepted RMSD: 10.78


Apparently this is one of the "old" pre-4.83 WUs (its date is 22-Mar-06) which obviously has a problem, as it failed on another PC:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11819500

I would just abort it.

PS: AFAIK, the only info needed when reporting a stuck WU, is just WU number e.g. #11819500 in this case (or just its name). If you just abort it, the project will also know the random-seed (it shows in stderr.txt output in resultid)
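If you are copying the WU number out of a workunit link by hand, a sketch like the one below does the same thing. It only assumes the link carries the number in a `wuid` query parameter, as the URL above does; `workunit_id` is a made-up helper name.

```python
from urllib.parse import parse_qs, urlparse


def workunit_id(url: str) -> int:
    """Extract the numeric wuid query parameter from a workunit URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["wuid"][0])


print(workunit_id("https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11819500"))
```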
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity

Mike
Joined: 21 Dec 05
Posts: 9
Credit: 35,252
RAC: 0
Message 13162 - Posted: 7 Apr 2006, 11:02:39 UTC
Last modified: 7 Apr 2006, 11:07:45 UTC

Hi All. I have a 2.4 gb PC with 256 MB of RAM, running Windows XP Home with SP2. I have had no failures since I turned off all screen savers (I turn the display off) and leave unfinished WU in memory (i.e. Hard drive.) I run Rosetta, Seti and Predictor. No failures since 17/03/06.



AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13195 - Posted: 7 Apr 2006, 22:23:04 UTC - in response to Message 13147.  

PS: AFAIK, the only info needed when reporting a stuck WU, is just WU number e.g. #11819500 in this case (or just its name). If you just abort it, the project will also know the random-seed (it shows in stderr.txt output in resultid)


I believe they would like to know the exact percentage complete that the WU was stuck at.



©2024 University of Washington
https://www.bakerlab.org