Problems with Rosetta version 5.93

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 51000 - Posted: 26 Jan 2008, 18:49:20 UTC

resultid=135831728

CPU time 127601.71875 (35.44 HOURS)
Claimed credit 501.989998586804
Granted credit 20

Mod Sense. I'm pretty sure there's something wrong here. Anyone else spot the problem???? It's not like this issue wasn't posted about early enough on Friday for someone at the project to comment upon it.
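
For anyone who wants to put numbers on the gap, here is a rough sketch of the arithmetic using only the three figures from the result page above. The variable names are made up for illustration, and this says nothing about how the project actually computes granted credit.

```python
# Figures copied from the result page for resultid=135831728.
cpu_seconds = 127601.71875
claimed_credit = 501.989998586804
granted_credit = 20.0

# Convert CPU time to hours, then compare credit per hour.
cpu_hours = cpu_seconds / 3600.0
claimed_per_hour = claimed_credit / cpu_hours
granted_per_hour = granted_credit / cpu_hours

# Difference between what was claimed and what was granted.
shortfall = claimed_credit - granted_credit

print(f"CPU hours:         {cpu_hours:.2f}")
print(f"Claimed per hour:  {claimed_per_hour:.2f}")
print(f"Granted per hour:  {granted_per_hour:.2f}")
print(f"Credit shortfall:  {shortfall:.2f}")
```

The CPU time works out to the 35.44 hours shown in the result, and the shortfall is a little under 482 credits.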
ID: 51000 · Rating: 0
JEklund
Joined: 24 Sep 06
Posts: 7
Credit: 105,447
RAC: 0
Message 51001 - Posted: 26 Jan 2008, 19:37:32 UTC - in response to Message 51000.  

resultid=135831728

CPU time 127601.71875 (35.44 HOURS)
Claimed credit 501.989998586804
Granted credit 20

Mod Sense. I'm pretty sure there's something wrong here. Anyone else spot the problem???? It's not like this issue wasn't posted about early enough on Friday for someone at the project to comment upon it.


Based on the info in the log, it seems that it was stuck and the watchdog killed it (and valued your work at 20 credits, which is not fair for 35 hours of work, IMHO).

No clue what is wrong with that work unit, though.

-- Lundi --

ID: 51001 · Rating: 0
mhhall
Joined: 28 Mar 06
Posts: 7
Credit: 10,188,899
RAC: 3
Message 51007 - Posted: 26 Jan 2008, 22:03:39 UTC - in response to Message 50335.  

Please post problems and/or bugs with rosetta 5.93. Thanks for your
support!

My slower computer (ID #187636, an older Linspire Linux box) is set to accept
jobs of approx. 14 hours. I have a job on the machine at this time which says it
is 99.67% completed with 50:16:19 of CPU time. For the time being, I've suspended
the job. Name starts "2h4o_BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK"
(Work unit 123162090).

Don't know if this is a Rosetta issue or a problem w/ this specific job.
I know that I have another of the same name in my queue (135883853).

Just wondering if someone else has seen a similar issue/problem.

Hope this helps!!

ID: 51007 · Rating: 0
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 51008 - Posted: 26 Jan 2008, 22:58:53 UTC - in response to Message 51000.  

resultid=135831728

CPU time 127601.71875 (35.44 HOURS)
Claimed credit 501.989998586804
Granted credit 20

Mod Sense. I'm pretty sure there's something wrong here. Anyone else spot the problem???? It's not like this issue wasn't posted about early enough on Friday for someone at the project to comment upon it.


Oh no, you did get 20. You should have got at least an extra 100 for all the effort you put into it.
ID: 51008 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,241,429
RAC: 45,634
Message 51009 - Posted: 26 Jan 2008, 23:51:20 UTC

I've got one here:

https://boinc.bakerlab.org/rosetta/result.php?resultid=135314464

Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 58569.2 seconds. Greater than 4X preferred time: 14400 seconds

Claimed credit 211.010587329225
Granted credit 80
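
Reading that watchdog line literally, the check appears to compare CPU time against four times the preferred run time. A minimal sketch of that comparison follows. It assumes 14400 seconds is the preferred run time and that the "4X" in the log multiplies it; the function name is hypothetical and this is not taken from the Rosetta source.

```python
def exceeds_watchdog_limit(cpu_seconds, preferred_seconds, multiplier=4):
    """Hypothetical helper: True when CPU time is past multiplier x preferred run time.

    Assumption: mirrors the wording of the log line, not the real watchdog code.
    """
    return cpu_seconds > multiplier * preferred_seconds


# Values taken from the stderr lines quoted above.
cpu_seconds = 58569.2
preferred_seconds = 14400  # assumed to be the 4-hour run time preference

limit_seconds = 4 * preferred_seconds
over_limit = exceeds_watchdog_limit(cpu_seconds, preferred_seconds)

print(f"limit: {limit_seconds} s, cpu: {cpu_seconds} s, over limit: {over_limit}")
```

Under that reading the limit is 57600 seconds, so the task was ended roughly 16 minutes of CPU time after crossing it.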
ID: 51009 · Rating: 0
[AF>France>TDM>Centre]Jeannot Le Tazon
Joined: 8 Dec 05
Posts: 6
Credit: 153,161
RAC: 0
Message 51013 - Posted: 27 Jan 2008, 7:51:31 UTC

I've aborted this one https://boinc.bakerlab.org/rosetta/result.php?resultid=135287253 after 11h (prefs set to 12h).
11h of crunching, then a CPU benchmark, and then back to 10% complete. :(
It seemed to do nothing interesting after maybe 1h and 1 decoy
(Model 1, Step 27091, Accepted RMSD 9124, Accepted energy 6.65805)
Nothing displayed on "Searching", "Accepted", nothing moving after 1 decoy on "RMSD" & "Accepted Energy".
ID: 51013 · Rating: 0
Paul
Joined: 29 Oct 05
Posts: 193
Credit: 65,736,681
RAC: 460
Message 51019 - Posted: 27 Jan 2008, 20:25:18 UTC

I started getting lots of computation errors today. I did make 1 change to the system but it should not have caused this problem. Most of the time the CPU cranks on the WU for 50+ min. before the error.

Is there a problem with some of the WUs in the 5.93 beta? I just installed the newest BOINC Client (5.10.30) and I guess it could be at fault as well.

Any insight is greatly appreciated.

Paul
Thx!

Paul

ID: 51019 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,691,837
RAC: 1,806
Message 51028 - Posted: 27 Jan 2008, 21:19:40 UTC - in response to Message 51019.  

Paul - do the group a favor and tell us which one of your many computers is having fits and which work units, as you have a lot of different computers and lots of work units in queue. It's not the BOINC program that has the errors, but rather the project work units themselves. You will probably notice that you have errors on RAH vs. the other projects you are working on. If it were a BOINC program error, you would have errors on all your projects.

I started getting lots of computation errors today. I did make 1 change to the system but it should not have caused this problem. Most of the time the CPU cranks on the WU for 50+ min. before the error.

Is there a problem with some of the WUs in the 5.93 beta? I just installed the newest BOINC Client (5.10.30) and I guess it could be at fault as well.

Any insight is greatly appreciated.

Paul


ID: 51028 · Rating: 0
PieBandit
Joined: 17 Apr 07
Posts: 6
Credit: 228,220
RAC: 0
Message 51029 - Posted: 28 Jan 2008, 0:08:43 UTC

Several of my WUs are also failing with compute errors:

Result ID 136334535
Result ID 136319412
Result ID 136308989
Result ID 136258153
Result ID 135343580
Result ID 135260720
Result ID 134993972

Since January 21st, I've had about a 50% success rate.
ID: 51029 · Rating: 0
Paul
Joined: 29 Oct 05
Posts: 193
Credit: 65,736,681
RAC: 460
Message 51030 - Posted: 28 Jan 2008, 0:31:32 UTC - in response to Message 51028.  

Paul - do the group a favor and tell us which one of your many computers is having fits and which work units, as you have a lot of different computers and lots of work units in queue. It's not the BOINC program that has the errors, but rather the project work units themselves. You will probably notice that you have errors on RAH vs. the other projects you are working on. If it were a BOINC program error, you would have errors on all your projects.

I started getting lots of computation errors today. I did make 1 change to the system but it should not have caused this problem. Most of the time the CPU cranks on the WU for 50+ min. before the error.

Is there a problem with some of the WUs in the 5.93 beta? I just installed the newest BOINC Client (5.10.30) and I guess it could be at fault as well.

Any insight is greatly appreciated.

Paul



Greg:

Thanks for the note. I do have lots of WUs checked out and it takes a long time to find the issues.

The computer is 591177 and it has more compute errors than successes. I will keep fighting with the hardware but I think it is OK now. All of my temps are well in spec and I don't have any other issues.

I run 100% R@H, so I cannot compare these WUs to anything else. I did notice that none of my other systems have the same issues, so, one BIOS upgrade later, I think we may have some stability.

Thx

Paul

Thx!

Paul

ID: 51030 · Rating: 0
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 980
Message 51039 - Posted: 28 Jan 2008, 11:06:18 UTC

The problems I was getting over at Ralph appear to have carried over to Rosetta.

The Wu's starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.

They have a habit of running well past your preference time (up to 21 hours with preference time of 6 hours),
All seem to get to just over 97% completed with 9 minutes 59 seconds to go and just sit there for hours,
Says 100% completed but still shows "Waiting to Run" in Boinc Manager,
Often giving computation errors after the extra long run time (this was mainly on Ralph),
If it does complete after the extra long run time will only give a very poor amount of credit because usually only 1 decoy has been produced in all this time.

I have just aborted two of these WU's
WU 135437069 ran for over 3 1/2 hours and got to 100% but was still "Waiting to Run" in BM; after aborting, the results show zero (0) time taken on the job.
WU 135437323 was already over an hour past my preference time of 6 hours and still grinding away with 9 minutes 59 seconds to go at 97% completed; it had been this way for quite some time.
WU 135372094 completed after more than 21 hours, returning just 2.5 cr/h.

If I see any more of these WU type then I will be aborting them.
ID: 51039 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 51046 - Posted: 28 Jan 2008, 18:22:36 UTC - in response to Message 51039.  



The Wu's starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



I'm seeing the same problems as Conan on a number of my servers. The trouble work units are 2h4o and 1zpy, and all require manual abortion. Restarting BOINC will just reset the amount of time already spent on them and start them again.

The 2h4o units in particular tend to stay at 100% Completed but state "Running" with no increase in the amount of CPU time spent. Looking at the stdout.txt/stderr.txt files shows that there was an attempt by the watchdog to shut down the client (and as far as I know that has never worked properly for Rosetta on Linux).
Team Helix
ID: 51046 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 51052 - Posted: 28 Jan 2008, 21:06:46 UTC

I aborted them all as well. Still waiting on my 480 missing credits too...

I wonder when the staff gets in to work? These have really got to be affecting the total rate of return (i.e. work done).
ID: 51052 · Rating: 0
FalconFly
Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 51070 - Posted: 29 Jan 2008, 8:16:33 UTC - in response to Message 51052.  

Same here, had to abort the last 2h4o Model.
One of my faster Hosts effectively stopped working, as the hourly rotation of the last 2h4o__BOINC_TWIST_RINGS WorkUnit apparently reset CPU time over and over, while making zero progress.

As a side effect, the Rosetta Long Term Debt of the affected clients rocketed up to -90000s (lots of work but almost no progress made).
ID: 51070 · Rating: 0
MerePeer
Joined: 6 Nov 05
Posts: 3
Credit: 1,787,446
RAC: 0
Message 51086 - Posted: 29 Jan 2008, 23:41:11 UTC - in response to Message 51070.  

Same here. Same problem with 2h4o__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK* just hanging. Restarting boinc results in same problem 8 hours later. Linux box.

ID: 51086 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 51087 - Posted: 30 Jan 2008, 1:11:08 UTC
Last modified: 30 Jan 2008, 1:36:10 UTC

I'm not sure what to think. Complaints about the 2h4o WUs started at least 5 days ago. I ran a test on one of mine starting 5 days ago, which leaves 3 full business days and two weekend days for management to make a statement. I've seen or heard nothing. How often do they monitor these boards? Are they of any importance? I'm feeling a bit like any "beta" tests or any other tests are really a waste of our man hours and CPU seconds. Perhaps I'll be considered impatient... hmmmm... How long must one wait before one isn't considered as such???

I don't know. I know I've stopped ALL Rosetta work. It really isn't what I wanted, but I don't wanna "Pi**" away my CPU time for nothing when it might be spent more wisely. (i.e., if my machines are just going to use electricity without scientific benefit, what's the point of leaving them on?)

tony

I started at 200K and was shooting for 600K before stopping, but I guess 350K is OK, if that's what they want. (Well, it would stay at 350K, but I loaned out a machine before I knew the score, so I have to await its return before I remove it.)
ID: 51087 · Rating: 0
j2satx
Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 51089 - Posted: 30 Jan 2008, 2:24:58 UTC - in response to Message 51039.  

The problems I was getting over at Ralph appear to have carried over to Rosetta.

The Wu's starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



Were you "really" surprised?
ID: 51089 · Rating: 0
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 980
Message 51090 - Posted: 30 Jan 2008, 8:08:24 UTC - in response to Message 51089.  

The problems I was getting over at Ralph appear to have carried over to Rosetta.

The Wu's starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



Were you "really" surprised?


G'Day j2satx,
No, I guess I was not, considering there was no response over on Ralph either. A lot of wasted time when these things run to over 21 hours and then often error out.
It is a shame. I do like the project and its goals; it was one of the best monitored and most responsive projects for a good while.
ID: 51090 · Rating: 0
j2satx
Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 51094 - Posted: 30 Jan 2008, 15:00:39 UTC - in response to Message 51090.  

The problems I was getting over at Ralph appear to have carried over to Rosetta.

The Wu's starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



Were you "really" surprised?


G'Day j2satx,
No, I guess I was not, considering there was no response over on Ralph either. A lot of wasted time when these things run to over 21 hours and then often error out.
It is a shame. I do like the project and its goals; it was one of the best monitored and most responsive projects for a good while.


I know... I started crunching Ralph again when it looked like they were making a change with the "minis", but it seems that was short-lived also.
ID: 51094 · Rating: 0
hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,150,682
RAC: 620
Message 51096 - Posted: 30 Jan 2008, 18:30:29 UTC

The interesting thing with all this is that, after that one bad day a couple of weeks ago, I made a minor adjustment to the amount of memory (from 90% to 85% when computer is not in use) and CPU (from 100% to 90%) allowed, and since that time my WUs have been cranking happily away, finishing in the normal 2-4 hours of CPU time, and not overwhelming my Pentium IV. And no errors. Maybe I'm just lucky.
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

ID: 51096 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org