Question for developers - Do the new versions released on the 20th have the stuck-at-1% fix?

Message boards : Number crunching : Question for developers - Do the new versions released on the 20th have the stuck-at-1% fix?

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 6844 - Posted: 20 Dec 2005, 8:26:50 UTC
Last modified: 20 Dec 2005, 8:30:35 UTC

Did the new versions issued on the 20th contain the hoped-for fix for the stuck-at-1% problem?

Secondly, were you folks aware that BOINC is supposed to be able to invalidate the prior versions so that work not started can run with the new executable? At least it used to have that feature.

It may be too late for now, but this may be something to keep in mind for the future. For many of us, this is/was a show-stopper kind of error, and I would much prefer not to risk another "hang" ...

==== edit

Oh, and an announcement might have been friendly too ...

Nothing in news, technical news, or the forums ... oh well, more important things to do I suppose ...


Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 6845 - Posted: 20 Dec 2005, 8:44:16 UTC

Yup, did not know there was a new one........

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 6846 - Posted: 20 Dec 2005, 9:06:17 UTC

I only noticed it because I have multiple computers showing in BOINC View and I saw the different version numbers.

Another developer question:

It seems that I am seeing a slightly higher client error rate. I have had 3-4 work units error out within seconds (which is less annoying than dying after consuming 4 hours, I suppose). I don't know if this is a common experience for other people, but it does seem strange to see this many errors when that has not been the case before ...

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 6847 - Posted: 20 Dec 2005, 9:28:48 UTC

Looking at my results I had one on the 16th Dec and one way back in November, so nothing seen here yet!

FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 6858 - Posted: 20 Dec 2005, 12:04:12 UTC

Cool, soon I'll be doing jobs on 4.79, 4.80 and now 4.??

Team mauisun.org

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 6862 - Posted: 20 Dec 2005, 12:53:36 UTC - in response to Message 6846.  
Last modified: 20 Dec 2005, 12:56:44 UTC

I only noticed it because I have multiple computers showing in BOINC View and I saw the different version numbers.

Another developer question:

It seems that I am seeing a slightly higher client error rate. I have had 3-4 work units error out within seconds (which is less annoying than dying after consuming 4 hours, I suppose). I don't know if this is a common experience for other people, but it does seem strange to see this many errors when that has not been the case before ...


I'm glad (sort of) to see you make this claim. I got my FIRST client error ever on Dec 19th, and I've been running Rosetta since the beta testing phase. I was shocked. I checked the WU 3721807 and saw that it had been sent out to someone else before me who also had a client error. The WU was re-issued a third time. Now I see this morning another client error on WU 3745860 and it also errored out twice. There must be a problem. (edited to add urls)

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 6863 - Posted: 20 Dec 2005, 13:25:43 UTC

Gee thanks.......got 3 Client Errors in a row just after I posted.....

Basilaris
Joined: 2 Nov 05
Posts: 4
Credit: 17,014
RAC: 0
Message 6864 - Posted: 20 Dec 2005, 13:28:03 UTC
Last modified: 20 Dec 2005, 13:31:07 UTC

My first job with v4.81 errored out as well. Before it did, I noticed that the native structure can be rotated now. The second seems to be running fine so far.

Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 6865 - Posted: 20 Dec 2005, 13:51:52 UTC - in response to Message 6862.  
Last modified: 20 Dec 2005, 13:57:30 UTC

I'm glad (sort of) to see you make this claim. I got my FIRST client error ever on Dec 19th, and I've been running Rosetta since the beta testing phase. I was shocked. I checked the WU 3721807 and saw that it had been sent out to someone else before me who also had a client error. The WU was re-issued a third time. Now I see this morning another client error on WU 3745860 and it also errored out twice. There must be a problem. (edited to add urls)

1hz6A_topology_sample_203_2788 failed on four computers (Linux/Windows/Intel/AMD/old and new application version).

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 7
Message 6866 - Posted: 20 Dec 2005, 14:16:33 UTC

I had to try it... the first one, 1n0u__topology_sample_204_12744, errored after 13 seconds. I was the second to get it, it's about to go to a third.

You may want to lower the "error/total/success" limits. If a WU is bad, it doesn't make sense to send it to 10 people. I would think 3 or 4 would be enough to be pretty sure there's a problem.
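
A minimal sketch of how such limits could stop a bad WU from being reissued. The field names mirror BOINC's per-workunit settings max_error_results, max_total_results and max_success_results; the comparison logic, the function name and the numbers are simplifying assumptions for illustration, not the project's server code or its actual configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkunitLimits:
    # Names follow BOINC's workunit fields; the values used below are hypothetical.
    max_error_results: int
    max_total_results: int
    max_success_results: int


def reason_to_stop(limits: WorkunitLimits, n_error: int,
                   n_success: int, n_total: int) -> Optional[str]:
    """Return why a workunit should no longer be reissued, or None to keep going.

    Simplified model: a limit is exceeded when the count is strictly greater.
    """
    if n_error > limits.max_error_results:
        return "too many error results"
    if n_success > limits.max_success_results:
        return "too many success results"
    if n_total > limits.max_total_results:
        return "too many total results"
    return None


# With an error limit of 3, the fourth client error retires the workunit.
tight = WorkunitLimits(max_error_results=3, max_total_results=10, max_success_results=5)
print(reason_to_stop(tight, n_error=4, n_success=0, n_total=4))
```

With a limit of 10, the same four errors would return None and the WU would go out again, which is the behaviour described above.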

And I have to ask - what happened to the communication? We get a new application downloaded, and I have no idea whether it's Windows-only, Windows/Linux, or Windows/Linux/Mac, or what has changed in it (other than rotating the protein in the graphics on Windows). Based on thread posting timestamps, I have to assume it was released just as everybody left for the day...

Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 6867 - Posted: 20 Dec 2005, 14:27:45 UTC - in response to Message 6866.  
Last modified: 20 Dec 2005, 14:41:06 UTC

I had to try it... the first one, 1n0u__topology_sample_204_12744, errored after 13 seconds. I was the second to get it, it's about to go to a third.

So, what are these *topology_sample_nnn_nnnnn units anyway -- are they any different from the recent *topology_sample_nnnnnn ones ? ;-)

Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 6876 - Posted: 20 Dec 2005, 15:29:08 UTC - in response to Message 6866.  
Last modified: 20 Dec 2005, 15:31:45 UTC

I had to try it... the first one


Me too. One is still running OK after 20 minutes (1n0u__topology_sample_204_14869_0). The other (1ogw__topology_sample_204_3923_4) errored out in less than 20 seconds (and had crashed on 4 other machines before mine). Both on Rosetta 4.81.

Observation: the error number (0xc0000005) is the same one that occurs when Rosetta is switched out of memory.
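
For reference, 0xc0000005 is the Windows NTSTATUS code STATUS_ACCESS_VIOLATION. Below is a small sketch for decoding such exit codes, including the signed decimal form in which they sometimes show up. The lookup table is a tiny hand-picked subset, and the function name is made up for illustration.

```python
# A few well-known Windows NTSTATUS values (not an exhaustive list).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def describe_exit_code(code: int) -> str:
    """Format an exit code as 32-bit hex and attach a name when it is known."""
    unsigned = code & 0xFFFFFFFF  # fold a negative (signed) value into 32 bits
    name = NTSTATUS_NAMES.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"


print(describe_exit_code(-1073741819))  # 0xc0000005 (STATUS_ACCESS_VIOLATION)
```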

[EDIT]Either we have a bad batch of work units or the new app is broken[/EDIT]
*** Join BOINC@Australia today ***

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 6884 - Posted: 20 Dec 2005, 15:59:30 UTC - in response to Message 6876.  

I had to try it... the first one


Observation: the error number (0xc0000005) is the same one that occurs when Rosetta is switched out of memory.

[EDIT]Either we have a bad batch of work units or the new app is broken[/EDIT]


As Hoelderlin said, client errors occurred both before and after the new version of the app, so this seems to reflect on the WU, not the app?

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 6890 - Posted: 20 Dec 2005, 17:04:00 UTC

The new applications have modifications in the scientific code. The modifications were, again, made to allow increased diversity in the search. The app can now read in larger protein fragment libraries, and the number of score-specific cycles can be increased. The screen saver was slightly modified to allow rotation of the native structure. Bin in our group was able to find a bug that may have caused an infinite loop in certain very infrequent circumstances, but we do not know if this is the bug people are seeing. There is no difference between the *topology_sample_nnnnn and the *topology_sample_nnn_nnnnn work units. The additional number (nnn) specifies a batch number used in our new work generator.
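
A minimal sketch of how the two naming schemes could be told apart, assuming the part before "_topology_sample_" is the target ID and that the name is a work unit name without a trailing result suffix. The regular expression, function name and dictionary keys are illustrative assumptions, not the work generator's actual format specification.

```python
import re
from typing import Optional

# <target>_topology_sample_[<batch>_]<serial>, where the batch part is optional
WU_NAME = re.compile(
    r"^(?P<target>.+?)_topology_sample_(?:(?P<batch>\d+)_)?(?P<serial>\d+)$"
)


def parse_wu_name(name: str) -> Optional[dict]:
    """Split a topology_sample work unit name into target, batch and serial."""
    m = WU_NAME.match(name)
    if m is None:
        return None
    batch = m.group("batch")
    return {
        "target": m.group("target"),
        "batch": int(batch) if batch is not None else None,  # None for the older scheme
        "serial": int(m.group("serial")),
    }


print(parse_wu_name("1hz6A_topology_sample_203_2788"))
print(parse_wu_name("1n0u__topology_sample_204_12744"))
```

Result names that carry an extra suffix (for example a trailing _0) are deliberately not handled here, because a suffix cannot be distinguished from a serial number by pattern alone.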

I do not know what is causing these errors, but I will look into it. I would try restarting BOINC and seeing what happens. I'll be posting something on the web site soon about these recent changes.

Lee Carre
Joined: 6 Oct 05
Posts: 96
Credit: 79,331
RAC: 0
Message 6898 - Posted: 20 Dec 2005, 18:14:43 UTC
Last modified: 20 Dec 2005, 18:15:08 UTC

I've got a "DEFAULT_2reb_205_29_0" unit crawling along very slowly with the v4.81 app.
It's at 3.8% after 11 hours, running at 19 MFLOPS according to BoincView,
but then the deadline is 17/01/2006.
Is this normal?
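
A rough back-of-the-envelope check of those figures, assuming progress is linear and that the 11 hours are the run time spent on this unit so far. Both are assumptions, and the function name is made up for illustration.

```python
def estimate_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linearly extrapolate total run time from the progress made so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


total = estimate_total_hours(11, 0.038)  # 3.8% done after 11 hours
remaining = total - 11
print(f"{total:.0f} h total (~{total / 24:.1f} days), {remaining:.0f} h remaining")
# 289 h total (~12.1 days), 278 h remaining
```

On this estimate, about 12 days of continuous crunching would be needed, which is still inside a deadline of 17/01/2006 when counted from 20 Dec 2005.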

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 6908 - Posted: 20 Dec 2005, 18:57:26 UTC

IF ANYONE SEES A "DEFAULT_xxxxx_205_.........." (batch 205) WORKUNIT PLEASE ABORT IT.

An explanation will be posted soon, but in short, we accidentally sent out 1100 work units with very long run times (1000 structures to be made instead of 10).

Sorry about this problem for those who have been crunching these since last night.
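
A minimal sketch for picking the affected names out of a task list, assuming the layout DEFAULT_<target>_<batch>_... suggested by the one example name in this thread (DEFAULT_2reb_205_29_0) and a target ID that contains at most one trailing underscore. The pattern and function name are illustrative assumptions, and the sketch only selects names; it does not abort anything.

```python
import re

# DEFAULT_<target>_205_..., where <target> may end in a single underscore
BAD_BATCH = re.compile(r"^DEFAULT_[^_]+_?_205_")


def needs_abort(task_name: str) -> bool:
    """True when the name looks like a DEFAULT work unit from batch 205."""
    return BAD_BATCH.match(task_name) is not None


tasks = ["DEFAULT_2reb_205_29_0", "1n0u__topology_sample_204_14869_0"]
to_abort = [t for t in tasks if needs_abort(t)]
print(to_abort)  # ['DEFAULT_2reb_205_29_0']
```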

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 6912 - Posted: 20 Dec 2005, 19:22:29 UTC - in response to Message 6908.  

IF ANYONE SEES A "DEFAULT_xxxxx_205_.........." (batch 205) WORKUNIT PLEASE ABORT IT.

An explanation will be posted soon, but in short, we accidentally sent out 1100 work units with very long run times (1000 structures to be made instead of 10).

Sorry about this problem for those who have been crunching these since last night.

Nonetheless, if for some reason the whole workunit is processed (over days and days), might we assume the results would be valid for you (it's just that the workunit is 100X larger than normal)? Assuming it is completed before the one-month deadline, of course. :)
Regards,
Bob P.

Lee Carre
Joined: 6 Oct 05
Posts: 96
Credit: 79,331
RAC: 0
Message 6913 - Posted: 20 Dec 2005, 19:28:21 UTC - in response to Message 6908.  
Last modified: 20 Dec 2005, 19:28:51 UTC

IF ANYONE SEES A "DEFAULT_xxxxx_205_.........." (batch 205) WORKUNIT PLEASE ABORT IT.

An explanation will be posted soon, but in short, we accidentally sent out 1100 work units with very long run times (1000 structures to be made instead of 10).

Sorry about this problem for those who have been crunching these since last night.

lol, oops, made me smile though
Doesn't seem to be a good day for projects: SIMAP posted a news item saying that stats were available on boincsimap.com instead of boincstats.com

As rbpeake asked: would the results actually be useful? I don't mind letting it run, and I'm pretty sure it'll meet the deadline (it's only 1 of 2 projects running on that dual-core host).

If it's not going to be useful, I'm happy to abort; I'm here for the science, not the credits, so it doesn't bother me at all.

Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 6914 - Posted: 20 Dec 2005, 19:28:27 UTC - in response to Message 6912.  


Nonetheless, if for some reason the whole workunit is processed (over days and days), might we assume the results would be valid for you (it's just that the workunit is 100X larger than normal)? Assuming it is completed before the one-month deadline, of course. :)


The results would be valid, though I'm not sure how your or our computers would like the files that are 100 times as big. (All of the structures are concatenated in one file.) It's probably better for everybody to abort and get new WUs.

We are still investigating the issue with the WUs that finish too quickly.

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 6915 - Posted: 20 Dec 2005, 19:31:04 UTC

I found one waiting to run and deleted it. Can we now assume that there are no more waiting to be downloaded, just in case I go to bed and get one overnight?

©2024 University of Washington
https://www.bakerlab.org