Report "hombench_..." issues here!

Message boards : Number crunching : Report "hombench_..." issues here!

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 49901 - Posted: 21 Dec 2007, 20:29:15 UTC
Last modified: 3 Oct 2008, 19:54:17 UTC

Report WU issues for the hombench_XXX WUs here please!

We'd like to keep the science-related thread on topic (i.e. about the science of this project), but we'd love to hear your feedback on the actual WUs as well, so please post it here.

I've already had some info come back on spurious WUs that take waaay too long. We're looking into it now and will hopefully resolve that soon!


Thanks, Mike Tyka
Rosetta Moderator: Mod.Sense

adrianxw

Joined: 18 Sep 05
Posts: 652
Credit: 11,662,550
RAC: 1,276
Message 56058 - Posted: 27 Sep 2008, 16:45:04 UTC

I have one of these hombench_ units now, this one. I have 3 hours set as my run length, but this thing has been crunching for approaching 6 hours now. I opened the graphics (something I rarely do) and saw that the structure it was currently "accepting" was a long way (not kilometers, light years!) from the native.

The time to completion is 00:09:51, but then it has been that for the last 2-3 hours...
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

adrianxw

Joined: 18 Sep 05
Posts: 652
Credit: 11,662,550
RAC: 1,276
Message 56062 - Posted: 27 Sep 2008, 19:43:39 UTC

Something is seriously wrong there. This machine typically claims and is granted 50-60 credits for a 3-hour WU; this WU ran over 7 hours, claimed 148, and was granted 20!!!!!

R.L. Casey

Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 56082 - Posted: 29 Sep 2008, 15:54:19 UTC - in response to Message 56062.  

Something is seriously wrong there. This machine typically claims and is granted 50-60 credits for a 3-hour WU; this WU ran over 7 hours, claimed 148, and was granted 20!!!!!


Hi adrianxw,
I can't help but wonder whether the "large" size of the model required much more cache memory to run efficiently. The computer in question shows only 244 kilobytes of cache (per CPU, I assume) and returned only 2 decoys. I wish I knew more that might be helpful. Good luck, and keep on crunching!

Hubington

Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56115 - Posted: 30 Sep 2008, 15:13:02 UTC - in response to Message 56082.  

Something is seriously wrong there. This machine typically claims and is granted 50-60 credits for a 3-hour WU; this WU ran over 7 hours, claimed 148, and was granted 20!!!!!


I had a similar problem with minirosetta 1.34: hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_724

It took 9 hours 25 mins to complete, showing 9 mins 52 seconds left for at least 3 hours of that. I'd quite like to get into the top 100,000 contributors, so I decided to check the credit for this: claimed credit was 127 and granted was 22, when I normally get around 35-40 for a 3-hour packet.

At the end of the day I don't really care about the credits; they're just a nice little motivating factor. But I'm sure there are people who do care and will start to abort these units so as not to lose credits, especially if their system doesn't put out that much each day to start with.

Nils

Joined: 27 Feb 08
Posts: 1
Credit: 7,593
RAC: 0
Message 56130 - Posted: 30 Sep 2008, 20:27:30 UTC - in response to Message 56115.  

I had the same problem as adrianxw.

After at least 20 hours of crunching (a WU usually takes 3 or 4 hours), I aborted this WU

hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t326__IGNORE_THE_REST_2A9VA_9_5040_1

after I noticed that it had been stuck at 98% for hours.

adrianxw

Joined: 18 Sep 05
Posts: 652
Credit: 11,662,550
RAC: 1,276
Message 56136 - Posted: 1 Oct 2008, 7:38:13 UTC

The computer in question shows only 244 kilobytes of cache (per CPU, I assume) and returned only 2 decoys.

The machine has a Q6600 processor i.e. 2MB of cache per CPU.

I maintain that this and other WUs of this type are screwed up somehow, but nobody on the project seems to want to run with this particular ball... as is becoming the norm.

R.L. Casey

Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 56145 - Posted: 1 Oct 2008, 13:19:57 UTC - in response to Message 56136.  

The computer in question shows only 244 kilobytes of cache (per CPU, I assume) and returned only 2 decoys.

The machine has a Q6600 processor i.e. 2MB of cache per CPU.

I maintain that this and other WUs of this type are screwed up somehow, but nobody on the project seems to want to run with this particular ball... as is becoming the norm.


Hi adrianxw,
In looking at your WU and the one reported by Hubington, they are both of the form 'hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_nnn', where 'nnn' changes. It does appear that for this 't328' case, the code is getting wrapped around the axle, causing a Q6600 core to run around 13,000 CPU seconds per decoy! This will no doubt be interesting to Mike Tyka, since the whole point of his new code (only operating since Sept. 21) is to benchmark various methods. Your WU could be invaluable to him in eliminating a strategy that breaks down in some cases. This is research, and I am reminded of the saying "If we (really) knew what we were doing, it wouldn't be research."!
The good news in this is that three other 'hombench...' WUs you crunched did work in a 'nominal' way, producing many more decoys and being awarded your more typical amount of credit per CPU time.

Hopefully, Mike T. will be able to look into this case soon, if he hasn't already. Until then, know that you are contributing greatly!
Thanks for crunching Rosetta!

Dalton

Joined: 30 Nov 05
Posts: 2
Credit: 27,777,725
RAC: 0
Message 56169 - Posted: 2 Oct 2008, 15:10:19 UTC

I also had the same problem as adrianxw and Nils.

After at least 52 hours of crunching, I gave up on this WU

hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t313__IGNORE_THE_REST_1D9XA_4_4571_5_0

after I noticed that it had been stuck at 98% for hours. After reaching 98% it would get maybe 0.0001% done per day, whereas a normal WU takes 3-4 hours on a T7300. I'm now noticing in the messages that this WU keeps restarting.



Barraud Denis

Joined: 8 May 06
Posts: 6
Credit: 1,258,677
RAC: 0
Message 56174 - Posted: 2 Oct 2008, 20:21:07 UTC
Last modified: 2 Oct 2008, 20:22:45 UTC

hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t328__IGNORE_THE_REST_1ET0A_11_4578_8_0
using minirosetta version 134

Q6600, XP 32-bit, 2*2 GB DDR2-8500 -> 21:52:20 at 99.243%, 00:09:56 remaining
on my BOINC 6.3.10
I had to suspend the WU to check that it isn't defective. Apparently it isn't, but I'm waiting to see whether it resumes running.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56175 - Posted: 2 Oct 2008, 20:44:10 UTC

Barraud is running BOINC 6.3.10 and had a hombench task that ran nearly 22 hrs and still shows about 10 minutes remaining. He suspended it to see whether it would complete or keep running. It sounds like BOINC is still running other tasks and hasn't come back to it.

From looking at other tasks of Barraud's it seems he has a 3 hour runtime preference.

Thanks, Barraud. I'd say stop it; that's too much time.


Hubington

Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56182 - Posted: 3 Oct 2008, 1:29:43 UTC

New one on the way

minirosetta 1.34: hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t286___4580_1561_0

It has currently been running for 36 hours and 5 mins! 99.540% complete.

OK, I just noticed something VERY worrying while trying to see how long it took to tick over 0.001%: the run time jumped back 6 mins?!?!?! And now it has lost 0.001% of progress, taking it back to 99.539%.

It's running on an AMD dual core at 2.41 GHz (4800+ combined), if that makes a difference: a 64-bit chip with a 32-bit OS.

When it is making progress, it looks as though it's taking 5 mins to gain 0.001%, but if the CPU run time is constantly jumping back as I observed, then who can say what the run time really is!
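As a rough illustration of why the progress figure is unhelpful here, the sketch below linearly extrapolates the remaining time from the rate quoted in this post. The helper name `hours_to_finish` is hypothetical, and the linear-rate assumption is exactly what the jumping run time and progress call into question.

```python
def hours_to_finish(percent_done: float, step_percent: float, minutes_per_step: float) -> float:
    """Hypothetical helper: linear extrapolation of the remaining run time.

    Assumes progress keeps advancing by `step_percent` every
    `minutes_per_step` minutes until it reaches 100%.
    """
    remaining_steps = (100.0 - percent_done) / step_percent
    return remaining_steps * minutes_per_step / 60.0


if __name__ == "__main__":
    # Figures from the post: 99.540% done, roughly 0.001% gained per 5 minutes.
    print(f"{hours_to_finish(99.540, 0.001, 5):.1f} hours")
```

At 0.001% per 5 minutes, the remaining 0.46% works out to about 38 more hours. The task actually finished about 3.5 hours later (at 39 hours 35 mins, with a validate error), so a linear reading of the progress bar was no guide to the real remaining time.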

Hubington

Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56186 - Posted: 3 Oct 2008, 8:12:20 UTC - in response to Message 56182.  


currently been running for 36 hours & 5 mins! 99.540% complete


Finished at 39 hours 35 mins

credit claimed: --
credit granted: --
outcome: Validate error


Barraud Denis

Joined: 8 May 06
Posts: 6
Credit: 1,258,677
RAC: 0
Message 56199 - Posted: 3 Oct 2008, 17:21:33 UTC
Last modified: 3 Oct 2008, 17:34:36 UTC

Continued from: Message 56174 - Posted 2 Oct 2008 20:21:07 UTC

hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t328__IGNORE_THE_REST_1ET0A_11_4578_8_0
using minirosetta version 134

Q6600, XP 32-bit, 2*2 GB DDR2-8500, on my BOINC 6.3.10
I had to suspend the WU to check that it isn't defective. Apparently it isn't, but I'm waiting to see whether it resumes running.

Now, more information about my BOINC/Rosetta parameters:

My BOINC preferences: switch between applications every 80 minutes
BOINC -> project -> 'resource share': 6.25%
My Rosetta preferences: Target CPU run time: 4 hours

The WU restarts when its turn comes up in BOINC! It always runs again with:

now:
26:57:00 at 99.385%, 00:09:53 remaining
26:47:00 at 99.380%, 00:09:56 remaining
before:
21:52:20 at 99.243%, 00:09:56 remaining

The only thing I changed was the OS priority, from Low to Normal in XP's Task Manager, to see if that changes anything for the WU.

More to follow.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56201 - Posted: 3 Oct 2008, 17:33:57 UTC

Barraud, with your preferences, that task should have been ended by the watchdog, because it has been running for more than 5 times your normal run-time preference.
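The rule stated here can be sketched as a simple check. This is a minimal, hypothetical illustration built only from what the post says (a task should be ended once it has run more than 5 times the target run-time preference); it is not the actual Rosetta watchdog code, and the names `WATCHDOG_FACTOR` and `watchdog_should_end` are made up for the example.

```python
# Hypothetical constant: "more than 5 times your normal runtime preference", per the post.
WATCHDOG_FACTOR = 5


def watchdog_should_end(elapsed_hours: float, target_runtime_hours: float,
                        factor: float = WATCHDOG_FACTOR) -> bool:
    """Hypothetical check: True once elapsed time exceeds factor x target run time."""
    return elapsed_hours > factor * target_runtime_hours


if __name__ == "__main__":
    # Barraud's figures: 4-hour target run time, about 27 hours elapsed.
    limit = WATCHDOG_FACTOR * 4
    print(f"limit = {limit} h, end task: {watchdog_should_end(27.0, 4)}")
```

With the 4-hour target Barraud reports, the limit works out to 20 hours; the 3-hour preference inferred earlier would give 15 hours. Either way, a task past 26 hours should already have been stopped.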

Please abort it.

Barraud Denis

Joined: 8 May 06
Posts: 6
Credit: 1,258,677
RAC: 0
Message 56202 - Posted: 3 Oct 2008, 17:48:38 UTC - in response to Message 56201.  
Last modified: 3 Oct 2008, 18:20:32 UTC

Barraud, with your preferences, that task should have been ended by the watchdog, because it has been running for more than 5 times your normal run-time preference.

Please abort it.



For the moment I prefer to let it run; I notice the WU's deadline is 6 Oct 2008, 21:40.
But I'd like to know whether you will need any of the WU's files for analysis.
I could zip the task slot directory and send it to you later if you need it.

INFORMATION: in BOINC -> Messages, it seems the WU restarts around every 15 minutes? Confirmed in Task Manager and the BOINC messages: the WU restarts in a loop at the same point every 15 minutes??

Now: 27:03:30 - 99.387% - 00:09:56

I will abort the WU later; given how long I've already crunched it, I can let it run a little longer before aborting it.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56203 - Posted: 3 Oct 2008, 18:47:42 UTC

I could zip the task slot directory and send it to you later if you need it.

No, all that should be needed is the task name, and the description of what seems abnormal about it, which you have already provided. Thanks.

Conan

Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 888
Message 56271 - Posted: 7 Oct 2008, 11:54:12 UTC - in response to Message 56203.  

I could zip the task slot directory and send it to you later if you need it.

No, all that should be needed is the task name, and the description of what seems abnormal about it, which you have already provided. Thanks.


G'Day Mod.Sense,

Can you let the powers that be know that this "hombench" thing is still occurring with the latest 1.36 Ralph work units as well?

I have just reported on the Ralph forum the same problem that adrianxw has already reported.
My preferences are set to 6 hours, but I had 3 WUs go to 11 hours and one past 8 hours; the 11-hour WUs then reset themselves to zero and started again.
I did not want to waste another day on 4 WUs, so I aborted them after another half an hour.
They all show 9 minutes 57 seconds still to go for over 4 hours.

I might abort the lot of them yet if all I am going to get for 11 plus hours of processing is 20 points.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56275 - Posted: 7 Oct 2008, 14:19:38 UTC

Thanks, Conan. From what I'm seeing, how long a hombench task takes depends greatly on which protein it is for. If you can avoid aborting them, I would let them run unless/until they take longer than the roughly 2 hrs per model mentioned in the long-running models thread.
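One way to apply that 2-hours-per-model guideline is to divide a task's CPU time by the number of models (decoys) it has finished. The helpers below are a hypothetical sketch: the 2-hour threshold is the rough figure from this post, and the example inputs are illustrative values chosen to match the roughly 13,000 CPU seconds per decoy that R.L. Casey estimated for adrianxw's task.

```python
# Rough figure from the moderator's post; treated here as a hypothetical default.
GUIDELINE_HOURS_PER_MODEL = 2.0


def hours_per_model(cpu_seconds: float, models_completed: int) -> float:
    """Average CPU hours spent per completed model (decoy)."""
    if models_completed <= 0:
        raise ValueError("need at least one completed model to average over")
    return cpu_seconds / models_completed / 3600.0


def exceeds_guideline(cpu_seconds: float, models_completed: int,
                      limit_hours: float = GUIDELINE_HOURS_PER_MODEL) -> bool:
    """True if the average time per model is above the guideline."""
    return hours_per_model(cpu_seconds, models_completed) > limit_hours


if __name__ == "__main__":
    # Illustrative inputs: ~13,000 CPU seconds per decoy, 2 decoys returned.
    per_model = hours_per_model(26_000, 2)
    print(f"{per_model:.1f} h per model, over guideline: {exceeds_guideline(26_000, 2)}")
```

A task that has not finished any model yet raises an error here, since there is nothing to average; for a task stuck on its first model, total elapsed time is the only signal available.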

Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 56310 - Posted: 10 Oct 2008, 1:59:52 UTC

I've tracked down some problems with the hombench_ WUs, so the next batches going out soon (in preparation now) should take considerably less time: certainly less than 2-3 hours per model, and more likely (depending on the size of the proteins) less than half an hour.

The reasons for the long WUs had to do with the size of the proteins, which is why the problem was much worse for some of the guys than for others.
Some of the proteins in the hombench WUs are larger than the usual stuff we had run on BOINC before. The refinement stage of the code was using an older algorithm that turned out to scale poorly with protein size.
I've replaced that part with an almost as effective, but much, much more efficient algorithm.

Thanks for alerting us to this problem. For some of the smaller WUs I've sent out since noticing (e.g. hombench_mtyka_looprelax_ccd_moves_2_looprelax_ccd_moves_t302_),
I'm seeing as many as 10 models/hr now!
The larger proteins (e.g. _t293, which was previously causing trouble) are now down to an acceptable 2 hours per model.


This is exciting! There'll be a bunch of stuff going out soon. Once we've got some preliminary results, we'll display them in the science thread.



Mike
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/




©2024 University of Washington
https://www.bakerlab.org