minirosetta 2.05

Message boards : Number crunching : minirosetta 2.05

Aroundomaha

Joined: 11 Sep 08
Posts: 14
Credit: 55,623,619
RAC: 0
Message 64996 - Posted: 15 Jan 2010, 21:46:29 UTC - in response to Message 64951.  

For the past two days my Windows 7 machine has been bombing with occasional blue screen of death crashes. I ran the Microsoft debugger and it points to an issue with minirosetta 2.05.


--------- enclosed debug information -----------------
3: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

MULTIPLE_IRP_COMPLETE_REQUESTS (44)
A driver has requested that an IRP be completed (IoCompleteRequest()), but
the packet has already been completed. This is a tough bug to find because
the easiest case, a driver actually attempted to complete its own packet
twice, is generally not what happened. Rather, two separate drivers each
believe that they own the packet, and each attempts to complete it. The
first actually works, and the second fails. Tracking down which drivers
in the system actually did this is difficult, generally because the trails
of the first driver have been covered by the second. However, the driver
stack for the current request can be found by examining the DeviceObject
fields in each of the stack locations.
Arguments:
Arg1: fffffa800afb3320, Address of the IRP
Arg2: 0000000000000eae
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------


IRP_ADDRESS: fffffa800afb3320

CUSTOMER_CRASH_COUNT: 1

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0x44

PROCESS_NAME: minirosetta_2.

CURRENT_IRQL: 2

LAST_CONTROL_TRANSFER: from fffff8000285fb95 to fffff80002875f00


ID: 64996 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,382,979
RAC: 11,480
Message 65002 - Posted: 16 Jan 2010, 3:02:32 UTC - in response to Message 64953.  
Last modified: 16 Jan 2010, 3:04:02 UTC

Hi,

I'll be resubmitting the *gbnnotyr* protein design trajectories to boinc over the next few hours. The tests I ran on ralph showed that the checkpointing issue is resolved. To make sure that there are no other issues, I will submit these trajectories 'slowly' starting with a modest sized batch, and according to the responses I get on the thread I will increase the number of work units over the next few days. Please keep me posted about these problems. Your reports have been invaluable in tracking this problem down!

Sarel.


I have finally received enough WUs of this type to check. My finding: there are still problems with checkpointing. Unlike in version 2.03, the "CPU time at last checkpoint" is now displayed correctly, which lets the BOINC client switch between projects, but after a restart the calculation still starts from the beginning.
Here is an example task that I watched: 8gbnnotyr_3gbn_2iug_9Jan2010_16915_7_0
Before the restart it had used 0:33 hours of CPU time with 27 models done; after the restart it used another 1:27 hours and calculated 72 more models.
But apparently only the 72 models computed after the restart appear in the report; the first 27 models are missing, and the task also finished with a Validate error.

Here is another example: 8gbnnotyr_3gbn_1ijt_9Jan2010_16915_1_0
The same result: the report contains only the models computed after the restart, and there is a Validate error too.

For comparison, here is a task of this type that ran without breaks: 8gbnnotyr_3gbn_1woj_9Jan2010_16909_12_0
Without interruption, 2 hours of CPU time yielded 94 models (compare with 72 and 67 in the previous cases for the same 2 hours of CPU time), and Validate state = Valid.
The difference corresponds to roughly 0.5 hours of CPU time, which is about how much time had passed before the restarts.
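
As a rough cross-check of the numbers above, the model rate of the uninterrupted task can be used to estimate how many models 0:33 hours of lost work should correspond to. This is only a sketch: the function name is made up for illustration, and a constant model rate for this task type is an assumption, not something the project has confirmed.

```python
def models_per_hour(models: int, cpu_hours: float) -> float:
    """Average model throughput; assumes a roughly constant rate (hypothetical helper)."""
    return models / cpu_hours


# Figures quoted in this post.
uninterrupted_models = 94   # task that ran 2 hours without a break
interrupted_reported = 72   # models in the report of the first restarted task
lost_cpu_hours = 33 / 60    # CPU time spent before the restart (0:33)

rate = models_per_hour(uninterrupted_models, 2.0)
expected_lost = rate * lost_cpu_hours
observed_gap = uninterrupted_models - interrupted_reported

print(f"rate: {rate:.1f} models/hour")
print(f"expected models lost: {expected_lost:.1f}")
print(f"observed gap: {observed_gap}")
```

The estimate of about 26 lost models is close to both the 27 models seen before the restart and the gap of 22 models to the uninterrupted task, which fits the idea that the work done before the restart is simply dropped.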
ID: 65002 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,382,979
RAC: 11,480
Message 65003 - Posted: 16 Jan 2010, 3:22:05 UTC - in response to Message 64995.  

Please don't presume that the information from the Project Team is an inaccurate description and that your memory observations are a new and permanent condition for all to enjoy going forward. As Sarel points out, they introduced a new type of work unit which has a new low-memory phase to execution. And so you are only going to see the lower memory usage when that specific type of task is being worked on. And this new type of work unit was introduced in prior versions, so the actual delta to v2.05 is small. Since this new type of work is a current area of review, you may see a high concentration of this type of work for a period of time. But it doesn't mean we can presume more than was stated.

Yes, here I was mistaken. With the new version 2.05, for some time at the beginning I received ONLY the new types of WU, which use little RAM. From that I came to a (wrong) conclusion.
But now some WUs of the old types are arriving, and for them memory usage is about the same as in version 2.03.
ID: 65003 · Rating: 0
Evan

Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 65011 - Posted: 16 Jan 2010, 23:27:01 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=310901552
This one stalled twice at about 5 hrs 35 mins but was running for over 9 hours. I restarted boinc and it then stalled again in the same place.
ID: 65011 · Rating: 0
Mike_Solo

Joined: 16 Nov 09
Posts: 2
Credit: 67,261
RAC: 0
Message 65013 - Posted: 17 Jan 2010, 11:06:30 UTC

Soooo... this new version hangs too often. 2.03 was much more stable.
It hangs on my 2xAthlonMP 2800 as well as on the Intel E8400, so the CPU is not the issue.
I think 15% of tasks get stuck in the middle, consuming >200 MB of RAM but no CPU.
I'm thinking of leaving Rosetta for a while until a new version is ready, as I'm tired of kicking off broken tasks every morning :(

ID: 65013 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65015 - Posted: 17 Jan 2010, 11:47:55 UTC

Looks like Mike Solo has 3 machines:
One WinXP using BOINC version 6.10.18
One WinXP using BOINC version 6.10.18
One WinServer 2003 using BOINC version 6.10.18
Rosetta Moderator: Mod.Sense
ID: 65015 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,382,979
RAC: 11,480
Message 65020 - Posted: 17 Jan 2010, 18:11:01 UTC
Last modified: 17 Jan 2010, 18:12:57 UTC

2 more tasks of type *gbnnotyr* with the same result: when run without stops everything works normally, but if there was a break during the calculation, the results from before the break disappear and the task ends with a validate error.
In total I have:
2 WUs processed without stops; both seem to be OK:
https://boinc.bakerlab.org/rosetta/result.php?resultid=310752146
https://boinc.bakerlab.org/rosetta/result.php?resultid=311145245

And 3 WUs with a break in processing; all completed with validate errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=310935403
https://boinc.bakerlab.org/rosetta/result.php?resultid=310946429
https://boinc.bakerlab.org/rosetta/result.php?resultid=311163725

P.S.
The last of these 3 (id 311163725) was stopped at the very beginning of its run, before the 1st checkpoint had been written. Nevertheless, after the restart it still completed with a validate error.
So it is possible that validate errors in this type of WU are not directly linked to checkpoints, and that these are 2 different bugs.
ID: 65020 · Rating: 0
Sarel

Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 65021 - Posted: 17 Jan 2010, 19:06:18 UTC - in response to Message 65020.  

Thanks! We'll have a look at this as soon as possible and let you know what we find. Best, Sarel.

ID: 65021 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 65022 - Posted: 17 Jan 2010, 19:40:25 UTC

In the last week I've had to abort 11 tasks on W7 because the tasks are hung consuming 0% CPU time. I was hoping that the combination of upgrading to the latest BOINC and the new 2.05 version of R@h would fix the problem but no: it continues as before. Tasks on Mac OS X seem to be unaffected by this problem. Until there's some indication this problem is fixed I'm not getting any more tasks for W7.
ID: 65022 · Rating: 0
AdeB

Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 65023 - Posted: 17 Jan 2010, 21:10:47 UTC

Task: 311103842
Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00003_0000018_07.pdb_00003.pdb.JOB_16835_29

ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

AdeB
ID: 65023 · Rating: 0
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65024 - Posted: 17 Jan 2010, 21:17:05 UTC
Last modified: 17 Jan 2010, 22:15:46 UTC

Here's another Validate error, it didn't seem to have any problems running.

Edit/ This was on 64bit linux.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283574991

8gbnnotyr_3gbn_1s68_9Jan2010_16915_22_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 37 starting structures 14469.9 cpu seconds
This process generated 37 decoys from 37 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Validate error__Done__14,470.06
=========================================================================
Edit/ added this.

This one was on linux 32bit, again didn't seem to have a problem.

Very low credits.

8gbnnotyr_3gbn_1opd_9Jan2010_16915_42_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283817716

# cpu_run_time_pref: 14400
======================================================
DONE :: 8 starting structures 12134.6 cpu seconds
This process generated 8 decoys from 8 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Success__Done__12,135.35__28.60__4.61
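
To compare the two runs above, the CPU time per decoy can be read off the `DONE ::` summary lines. A minimal sketch, assuming the summary line always has the form shown in these two logs; the regular expression and the function name are illustrative and are not part of BOINC or Rosetta:

```python
import re

# Matches lines such as "DONE :: 37 starting structures 14469.9 cpu seconds".
# The pattern is an assumption based only on the two logs quoted in this post.
DONE_RE = re.compile(r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds")


def seconds_per_decoy(log_text: str) -> float:
    """Return CPU seconds spent per starting structure (hypothetical helper)."""
    match = DONE_RE.search(log_text)
    if match is None:
        raise ValueError("no DONE summary line found")
    structures = int(match.group(1))
    cpu_seconds = float(match.group(2))
    return cpu_seconds / structures


linux64 = "DONE :: 37 starting structures 14469.9 cpu seconds"
linux32 = "DONE :: 8 starting structures 12134.6 cpu seconds"

print(f"64-bit task: {seconds_per_decoy(linux64):.1f} s per decoy")
print(f"32-bit task: {seconds_per_decoy(linux32):.1f} s per decoy")
```

On these figures the 32-bit task spent nearly four times as long per decoy (about 1517 s versus about 391 s). Whether that is connected to the low credit is only a guess.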
ID: 65024 · Rating: 0
Admin

Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 65025 - Posted: 17 Jan 2010, 23:06:39 UTC

Validate Error on Win7, successfully completed by a wingman on win xp
https://boinc.bakerlab.org/rosetta/result.php?resultid=311128874
name: 8gbnnotyr_3gbn_1iuk_9Jan2010_16915_131_0
ID: 65025 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 1990
Credit: 38,536,805
RAC: 15,887
Message 65026 - Posted: 18 Jan 2010, 1:15:29 UTC
Last modified: 18 Jan 2010, 1:17:58 UTC

About time I updated my recent fault lists. I've had several errors under 2.03, but only this under 2.05:

On Intel T5500 laptop running W7 and Boinc 6.10.18

Outcome Validate error
8gbnnotyr_3gbn_2onu_9Jan2010_16909_17_0
# cpu_run_time_pref: 28800
======================================================
DONE :: 345 starting structures 28787.1 cpu seconds
This process generated 345 decoys from 345 attempts
======================================================


Note: On several occasions the following line appears:

No heartbeat from core client for 30 sec - exiting


Edit: Wingman running XP also received a validate error on apparently successful completion.
ID: 65026 · Rating: 0
MVeiga

Joined: 15 Oct 07
Posts: 1
Credit: 2,012,011
RAC: 425
Message 65029 - Posted: 18 Jan 2010, 12:24:34 UTC

Hi guys, let me just tell you.
If you're using Windows 7, the BOINC beta version 6.10.24, or even the new beta 6.10.29, is much more stable.
I've used the beta 6.10.24 for a long time and had no problems at all with Rosetta.
For me it's much more stable than 6.10.18, on Windows 7 of course. Anyway, that's just my experience.
ID: 65029 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,382,979
RAC: 11,480
Message 65031 - Posted: 18 Jan 2010, 13:45:20 UTC - in response to Message 65023.  

Task: 311103842
Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00003_0000018_07.pdb_00003.pdb.JOB_16835_29

ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

AdeB


I had the same error in this type of WU too: https://boinc.bakerlab.org/rosetta/result.php?resultid=310238605
And so did the 2nd computer that processed this WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=310471681
True, that was still version 2.03, so I did not write about it, but above is an example of the same error in version 2.05.
ID: 65031 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,382,979
RAC: 11,480
Message 65032 - Posted: 18 Jan 2010, 14:51:56 UTC - in response to Message 65024.  

Here's another Validate error, it didn't seem to have any problems running.

Edit/ This was on 64bit linux.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283574991
8gbnnotyr_3gbn_1s68_9Jan2010_16915_22_0

There seems to be only one problem with that WU: it also had a restart (maybe a switch to another project?) and hit the bug related to that.


This one was on linux 32bit, again didn't seem to have a problem.

Very low credits.

8gbnnotyr_3gbn_1opd_9Jan2010_16915_42_0
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283817716

I have such an example too: https://boinc.bakerlab.org/rosetta/result.php?resultid=311202691
Claimed credit = 54.35 vs Granted credit = 1.83 (about 30 times lower)
And I can even tell exactly what happened with it:
Usually in this type of WU, models are computed very fast, around one to several minutes per model. This task started that way: in approximately 15 minutes, 13 models were calculated (about 500 steps each). But on the 14th something happened: the calculation did not stop at the 500th step and went on much longer. I saw the counter pass 40000 steps and did not watch any further (I think it was about 60000-70000 steps in total).
I was already thinking of aborting this task, since I thought the calculation had gone into a loop, but after 5 hours (instead of several minutes) the calculation of the 14th model completed after all. I.e. 13 models took about 15 minutes, and the 14th about 5 hours.
Hence such a small Granted credit, since it is calculated in proportion to the number of models. (If not for this 14th model, about 300 models would have been calculated in 5 hours instead of 14, and Granted credit would be close to Claimed credit.)
I think the same thing most likely happened in your task too...

P.S.
Quite probably this is NOT an error but a feature of the algorithm: if it finds something interesting, a more detailed calculation of that model probably starts. It would be good if the scientists responsible for this type of WU could clarify this.
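
The numbers above can be checked with a few lines of arithmetic. This is only a sketch under the assumption stated in this post, namely that granted credit scales with the number of models; the variable names are made up for illustration, and the actual credit formula is not given anywhere in this thread.

```python
# Figures quoted in this post.
claimed_credit = 54.35
granted_credit = 1.83
credit_ratio = claimed_credit / granted_credit

early_models = 13     # models finished in the first ~15 minutes
early_hours = 0.25
total_hours = 5.25    # ~15 minutes plus ~5 hours for the 14th model
actual_models = 14

# Projection: how many models the task would have produced at the early rate.
early_rate = early_models / early_hours
projected_models = early_rate * total_hours
model_ratio = projected_models / actual_models

print(f"claimed / granted credit: {credit_ratio:.1f}")
print(f"projected models at early rate: {projected_models:.0f}")
print(f"projected / actual models: {model_ratio:.1f}")
```

The projection gives roughly 270 models, in line with the "about 300" estimate. The two ratios (about 30 for credit, about 20 for models) are of the same order but not equal, so a purely proportional rule would explain most of the gap, though not all of it.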
ID: 65032 · Rating: 0
Sarel

Joined: 11 May 06
Credit: 81,712
RAC: 0
Message 65034 - Posted: 18 Jan 2010, 18:39:22 UTC

Hello,

based on the reports of validator issues, David Kim has now fixed the validator. He also asked me to remind people that credit is granted based on the client's claimed credit, regardless of validator results.

Let us know if you see more such problems.

Thanks, Sarel.
ID: 65034 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 1990
Credit: 38,536,805
RAC: 15,887
Message 65035 - Posted: 18 Jan 2010, 19:27:57 UTC

Thanks for the information Sarel - and David for the fix.

No further errors today, but a cursory check has revealed I haven't re-booted my desktop since Dec 15th! I'm sure I've had various updates since then, but that's a ridiculous amount of uptime for me... Back in 5... ;)
ID: 65035 · Rating: 0
Link

Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 65036 - Posted: 18 Jan 2010, 19:57:23 UTC - in response to Message 65034.  

credit is granted based on the client's claimed credit, regardless of validator results.

Does that not apply only to results with compute errors or validate errors?
.
ID: 65036 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,382,979
RAC: 11,480
Message 65037 - Posted: 18 Jan 2010, 23:49:52 UTC - in response to Message 64959.  
Last modified: 18 Jan 2010, 23:56:15 UTC

hellotheworld wrote:

Hi,
I have a strange graphic I wanted to show you... I *think* there *might* be a problem...
Please have a look at this screenshot:
http://www.flickr.com/photos/37828392@N08/4273113531/
(Capitain Flam is my account on Flickr)

Possible bug for the application BOINC / ROSETTA, because the protein is *completely* folded, in a tiny meat ball ;-)
I hope this is NOT a bug, or even, I hope it will help you to solve it ;)

Oxfez wrote:
One of my tasks has "meatballed" too:

lr5_no_pro_close_no_dun_A_rlbd_1rnb_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16701_583_0

Running new 2.05
According to the time to completion, it's going to be a long old process too.


I have another "meatball" too.
Task: https://boinc.bakerlab.org/rosetta/result.php?resultid=311361747
Some screenshots:
http://s001.radikal.ru/i193/1001/1f/cffd2181b53b.jpg
http://i073.radikal.ru/1001/d9/c87d3083bfb9.jpg
http://s41.radikal.ru/i094/1001/8e/a86dfd3a7d6a.jpg
Plus, for about the last 2 hours of computation (or ~20 steps) there were no changes in Energy or RMSD at all. (I did not take more screenshots, since nothing changed further except CPU Time and Steps count.)

I do not think that it is an error in the software, but probably a weak spot in the scientific algorithm itself, so it should be addressed not to the programmers but to the scientists.
ID: 65037 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org