Report long-running models here

Message boards : Number crunching : Report long-running models here



Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58756 - Posted: 12 Jan 2009, 14:34:38 UTC

I think you're correct, AMD: Rosetta may have to resume the task, or reach the end of a model, before the revised runtime preference takes effect. Thanks for pointing this out.
Rosetta Moderator: Mod.Sense
Aegis Maelstrom

Joined: 29 Oct 08
Posts: 61
Credit: 2,137,555
RAC: 0
Message 58787 - Posted: 13 Jan 2009, 20:25:18 UTC
Last modified: 13 Jan 2009, 20:42:21 UTC

Yet another long running model: cc2_1_8_mammoth_mix_cen_cst_hb_t303__IGNORE_THE_REST_2FDRA_5_5873_80.

After more than 6.5 hours of work (my preferred runtime is 4 hours per WU) it had reported no result; then, unfortunately, the laptop was shut down. After the restart it resumed from about 35%, on model 2.
Model 2 took at least 5 hours on a decent laptop without finishing, and I am afraid it was even more (the first model was probably completed earlier).

To be honest, it is a nuisance to explain to people that they need to keep their computer on for at least 13 hours just to let a WU crash (and no, they obviously don't like suspend-to-RAM). After such an experience, most people just want this BOINC thing uninstalled.

Without proper checkpointing, Rosetta@Home can hardly be considered a project that is ready to run on a laptop without direct micromanagement or extra effort. The waste and annoyance are too great.

Would it be that difficult to set a curfew for the models, after which the model is ended and all the parameters needed to resume from exactly the same point are sent to the server? Or maybe to save even the huge log files to disk, just for safety, to upload or come back to later?

The long-running models can (and should) be run in the lab, by dedicated users who can crunch 24 hours a day, or resumed from the log file.

I know the team has a lot of work. But if you ask others for their help, please respect their efforts and use this help wisely.


P.S. How do I know how long the computer was on? Because of the problems with Rosie (and the lack of WUs), it has a second project attached: PrimeGrid. It isn't flawless; so far I have had two minor issues with it. But there is no comparison between the great stability of the supposedly beta PrimeGrid and the performance of the officially stable R@H. :/

Pity. I would love to do something even more "scientific", so maybe QMC or Spinhenge are the way... So far, Rosetta is troublesome. :/
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58788 - Posted: 13 Jan 2009, 20:43:33 UTC
Last modified: 13 Jan 2009, 20:50:22 UTC

Aegis, I made all of the same suggestions to the Project Team about a month ago. You should be pleased by the changes in the next release, which will be rolled out on Ralph first.

[edit]
Oh! Mike just posted details. The first point, faster loop closure, addresses what was found to be the cause of many of the long-running models. And "Increased the density of checkpoints" means that when you shut down your laptop and actually move to another location with it, you will not lose as much of the crunch time you've invested, because work is saved more frequently.
Rosetta Moderator: Mod.Sense
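To see why denser checkpoints matter, here is a minimal Python sketch. It assumes a simplified model that is not taken from Rosetta's code: checkpoints are written at a fixed interval, and after a shutdown the task resumes from the most recent one, so everything computed since then is lost. The function name and the interval values are hypothetical.

```python
def work_lost_on_shutdown(elapsed_s: float, checkpoint_interval_s: float) -> float:
    """CPU seconds thrown away if a task is stopped after elapsed_s seconds.

    Hypothetical model: a checkpoint is written every checkpoint_interval_s
    seconds, and a restart resumes from the most recent one.
    """
    if checkpoint_interval_s <= 0:
        raise ValueError("checkpoint interval must be positive")
    # Work done since the last checkpoint is the remainder of the division.
    return elapsed_s % checkpoint_interval_s


if __name__ == "__main__":
    elapsed = 5 * 3600 + 50 * 60  # laptop shut down 5h50m into the task
    for interval_min in (60, 15):  # sparse vs. denser checkpoints (illustrative)
        lost = work_lost_on_shutdown(elapsed, interval_min * 60)
        print(f"checkpoint every {interval_min} min -> lose {lost / 60:.0f} min")
```

Under this model, if checkpoints exist only at model boundaries, the interval is effectively the length of a whole model, so a single shutdown can discard hours of work on a long model.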
Aegis Maelstrom

Send message
Joined: 29 Oct 08
Posts: 61
Credit: 2,137,555
RAC: 0
Message 58789 - Posted: 13 Jan 2009, 22:24:46 UTC

Thanks a lot, Mod.Sense! In Mod We Trust. ;)

I guess I will give RALPH a try on my personal computer. I will tell my team about the changes as well.
Aegis Maelstrom

Send message
Joined: 29 Oct 08
Posts: 61
Credit: 2,137,555
RAC: 0
Message 58791 - Posted: 14 Jan 2009, 0:11:40 UTC
Last modified: 14 Jan 2009, 0:12:58 UTC

UPDATE regarding the latest Work Unit.

The laptop user wasn't happy but left the computer on for the night.
The task ran again... and it completed!

The official CPU time is only 9925.47 seconds and the claimed credit was 33.30. However, the granted credit for 2 decoys is a whopping 139.84!
See here.

Personally, I'm puzzled; maybe the laptop was in suspend mode (that works for Rosetta, doesn't it?) and not shut down after all... However, if the credit wasn't granted manually, it would mean the program somehow recognized that a large amount of crunching went into this WU.

If the program was only suspended/frozen, it would mean the BOINC manager has trouble estimating elapsed time. If that was not the case and the WU had been restarted, it would suggest that long-running WUs can be run smoothly after all: from some other starting point, with some other pseudorandom number, etc.

Anyway, that should be interesting for the team.

The only other possibility is that this WU was not processed adequately; however, the BOINC settings, the even intervals between reported PrimeGrid units, and the reported lack of any heavy activity on the laptop make this very unlikely.

I hope this info helps.
Best of luck!
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58792 - Posted: 14 Jan 2009, 0:33:31 UTC

Keep in mind that granted credit is based on the average credit claims of all the other machines out there crunching the same specific protein. So if their models took even longer than yours, that would explain why the models for this protein are so valuable in credit terms.

A user with a 12- or 24-hour runtime could easily produce only a couple of models without noticing. For them, that amount of credit would seem reasonable as well, so it goes unnoticed.
Rosetta Moderator: Mod.Sense
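A minimal Python sketch of the averaging described above, under assumptions the project has not confirmed here: credit is averaged per decoy (model) over every host's claim for the same protein, and a result is granted that average times the number of decoys it returned. The `Report` type and all numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Report:
    claimed_credit: float  # credit one host claimed for its result
    decoys: int            # models ("decoys") that host returned


def credit_per_decoy(reports: list[Report]) -> float:
    """Average claimed credit per decoy across all hosts on one protein."""
    return mean(r.claimed_credit / r.decoys for r in reports if r.decoys > 0)


def granted_credit(decoys: int, reports: list[Report]) -> float:
    """Hypothetical grant: decoys returned x the cross-host per-decoy average."""
    return decoys * credit_per_decoy(reports)


if __name__ == "__main__":
    # Two slower hosts and one fast host (the last one claims only 30 credits).
    reports = [Report(60.0, 1), Report(150.0, 2), Report(30.0, 2)]
    print(credit_per_decoy(reports), granted_credit(2, reports))
```

Under these assumptions, the fast host that claimed 30 credits for two decoys is granted 100, because the other hosts' slower models raise the per-decoy average; that is the same pattern as the 33.30 claimed vs. 139.84 granted report above.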
Virtual Boss*

Joined: 10 May 08
Posts: 35
Credit: 713,981
RAC: 0
Message 58870 - Posted: 17 Jan 2009, 13:46:39 UTC

Two recent long-running models:

Pref. runtime (hrs) : Actual runtime (hrs) : Models done : Workunit

4 : 5.9 : 1 : 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_193920_1

10 : 5.2 : 1 : 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_121304_1
LizzieBarry

Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 59042 - Posted: 26 Jan 2009, 21:54:11 UTC

First long-running model in a while:

Task ID: 223722299
Name: 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_11368_0
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 966658
Report deadline 5 Feb 2009 1:29:30 UTC
CPU time 35239.55

stderr out <core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 35237.5 seconds. Greater than 3X preferred time: 10800 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>

Validate state Valid
Claimed credit 98.8352645051024
Granted credit 80
application version 1.47
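The watchdog line in this log can be checked with a few lines of Python. This assumes the 10800 seconds shown is the preferred runtime itself (presumably the 3-hour default) and that Mini 1.47 compared total CPU time against three times that value, as the message wording suggests; the function is illustrative, not minirosetta's code.

```python
HOUR = 3600


def watchdog_3x_tripped(cpu_time_s: float, preferred_s: float) -> bool:
    """Hypothetical reading of the 1.47 watchdog message: end the run once
    CPU time exceeds three times the preferred runtime."""
    return cpu_time_s > 3 * preferred_s


if __name__ == "__main__":
    preferred = 3 * HOUR   # 10800 s, as printed in the log
    limit = 3 * preferred  # 32400 s, i.e. 9 hours
    cpu = 35237.5          # CPU time reported by the watchdog
    print(limit, watchdog_3x_tripped(cpu, preferred))
```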
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 59071 - Posted: 27 Jan 2009, 21:01:56 UTC
Last modified: 27 Jan 2009, 21:03:08 UTC

This one took 10 hours to finish with a 6-hour runtime preference. It had started the last model before it hit 6 hours, so the last model took over 4 hours. I think that counts as long.

It was done with mini 1.47.

1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_69044_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=204078438

pete.
Sid Celery

Joined: 11 Feb 08
Posts: 2117
Credit: 41,140,182
RAC: 15,917
Message 59120 - Posted: 29 Jan 2009, 2:48:40 UTC
Last modified: 29 Jan 2009, 2:56:01 UTC

I think I've got a long-running model in MiniRosetta 1.54

1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_66995_1

Starting work on structure: _00001
# cpu_run_time_pref: 14400
Starting work on structure: _00002
======================================================
DONE :: 1 starting structures 26242.9 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

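Preferred runtime is 4 hours, so the first decoy must have run in under 2 hours (assuming Rosetta only starts another decoy if one more of the same length still fits in the preferred runtime), which means the second decoy ran for more than 5 hours. Total runtime is about 7h 20m, which is less than preferred + 4 hours, so the watchdog correctly would not have intervened.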
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59147 - Posted: 29 Jan 2009, 17:09:27 UTC

The watchdog will step in and complete a task any time it runs 4 hours past your preferred runtime. We see in the output you posted that yours was:
# cpu_run_time_pref: 14400
(4hrs, as you say)

And so the watchdog will only step in if a task runs for your 4-hour runtime plus 4 hours (the two 4s are the same by coincidence). So you shouldn't see any task run significantly longer than 8 hours. If you had a 12-hour preference, the watchdog would ensure tasks take no longer than 16 hours in total.

Your task ran for 7.3hrs, and apparently completed that second model. So, I don't think the watchdog even stepped in here. The second model completed normally and the task was completed.

The great news is that all the details on your models are now automatically reported back with the results. So if there are problems, the team can see them and study the data.

From your description, it sounds like you think the watchdog is watching each model. But that's not what they've done here. It simply compares total runtime to the target runtime. And at the end of each model, the system always considers the runtime target before beginning another.

So, regardless of how long that first model took, this task looks like it behaved as expected. It did not exceed a runtime of target plus 4hrs.
Rosetta Moderator: Mod.Sense
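The behaviour described above, a watchdog on total runtime plus a runtime check between models, can be simulated in a short Python sketch. The 4-hour grace period comes from the posts in this thread; the rule for starting another model (only if one more model of average length still fits in the preferred runtime) is an assumption borrowed from Sid Celery's reasoning, not confirmed minirosetta behaviour. `run_task` and the per-model durations are hypothetical.

```python
HOUR = 3600
WATCHDOG_GRACE_S = 4 * HOUR  # "preferred runtime plus 4hrs"


def run_task(model_times_s: list[float], preferred_s: float) -> tuple[int, float, bool]:
    """Simulate one task.

    model_times_s: CPU seconds each successive model would need (hypothetical).
    Returns (models completed, CPU seconds used, ended by watchdog?).
    """
    limit = preferred_s + WATCHDOG_GRACE_S
    elapsed = 0.0
    done = 0
    for t in model_times_s:
        # Between models (assumed rule): start another only if a model of
        # average length would still finish within the preferred runtime.
        if done > 0 and elapsed + elapsed / done > preferred_s:
            break
        # The watchdog ignores model boundaries: it ends the task at the limit.
        if elapsed + t > limit:
            return done, limit, True
        elapsed += t
        done += 1
    return done, elapsed, False


if __name__ == "__main__":
    # Like Sid Celery's task: 4 h preference, short first model, >5 h second model.
    print(run_task([7000, 19243, 3000], 4 * HOUR))
    # Like AdeB's second task: 12 h preference, stopped during model 2.
    print(run_task([15000, 50000], 12 * HOUR))
```

The first call completes two models in 26,243 s without the watchdog; the second is cut off at 57,600 s (16 hours). AdeB's reported 57,846 s is slightly above that, presumably because the real watchdog does not fire at the exact second.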
AdeB

Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 59155 - Posted: 29 Jan 2009, 18:36:04 UTC
Last modified: 29 Jan 2009, 18:38:59 UTC

Here is one where the watchdog stepped in: it was stopped after 16 hours with a 12-hour preference:

task: 224203414

CPU time 57633.47

stderr out:
...
Starting watchdog...
Watchdog active.
Starting work on structure: S_shuffle_00001 <--- F_00008_0003416_0
Fullatom mode ..
# cpu_run_time_pref: 43200
Starting work on structure: S_shuffle_00002 <--- F_00001_0000109_0
Fullatom mode ..
Starting work on structure: S_shuffle_00003 <--- F_00002_0003276_0
Fullatom mode ..
Hbond tripped.
====>
called boinc_finish
...


AdeB
Aegis Maelstrom

Send message
Joined: 29 Oct 08
Posts: 61
Credit: 2,137,555
RAC: 0
Message 59168 - Posted: 29 Jan 2009, 21:57:19 UTC
Last modified: 29 Jan 2009, 22:32:15 UTC

Hi There!

I have yet another long unit, but with different behaviour than before. I'm not sure whether it's specific to 1.54 or happened before too.

Task 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_146489_1 is just being crunched on my laptop.
It has been running for 6h 50min (and there is more to go!), while my preference is 6 hours. Still on model 1.

The graphics show strange behaviour. The accepted structure is about 3/4 folded, but with a positive energy (71.4777; the low energy is at 30.80764), and it still looks like an earlier stage, without the fine detail typical of the last steps. The searching protein is a different beast: it can be truly folded and always shows fine detail. I am on "ShearMoverMoverBase+Minimization" right now (I think I've seen SmallMoverMoverBase-Minimization as well).

And now the most important part: the Searching protein moves about every 2 seconds, visibly changing, but the "Step" number is virtually stalled: it crawls, increasing by one every 40 seconds or more! I am now on step 316,605.

It looks as if most of the search attempts were not counted as "steps".

I've been watching the task for 40 minutes. Maybe there were some changes in the accepted model, but I don't think so. Now I am waiting for this task to finish or be killed by the watchdog.

The WU is using a bit more memory than usual: 160+ MB of RAM, a 338 MB peak, and 305 MB of VM. Another strange thing: I can't find in the BOINC logs (version 6.2.19) when the WU started, or whether it was checkpointed and paused to run another project (QMC); I have no logs from 9:05 to 20:06. o.O

Do you know what can make a WU long-running, and what is happening here?

a.m.
BOINC@Poland

EDIT: The machine, as you can see: Win XP SP2, 512 MB RAM minus what the integrated graphics uses, and enough free space on the HDD for the swap file.

EDIT2: The task was suspended after 07:19:22 and BOINC restarted a QMC workunit. I don't keep WUs in RAM while switching projects (too little RAM), and I have set the switching interval to 3 hours.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59170 - Posted: 29 Jan 2009, 22:53:54 UTC

Aegis, that does sound a bit odd, but please allow it to run further. It should be caught by the watchdog once it has run for your 6hr preference plus 4 hours. And this is why it is having some trouble showing you any faster change in the % completed.

It is a good thing there are more checkpoints now, the task should be able to pick up more or less where it left off when BOINC moved to run QMC.
Rosetta Moderator: Mod.Sense
Aegis Maelstrom

Send message
Joined: 29 Oct 08
Posts: 61
Credit: 2,137,555
RAC: 0
Message 59179 - Posted: 30 Jan 2009, 8:10:20 UTC - in response to Message 59170.  
Last modified: 30 Jan 2009, 8:11:21 UTC

[quote]Aegis, that does sound a bit odd, but please allow it to run further. It should be caught by the watchdog once it has run for your 6hr preference plus 4 hours. And this is why it is having some trouble showing you any faster change in the % completed.

It is a good thing there are more checkpoints now, the task should be able to pick up more or less where it left off when BOINC moved to run QMC.[/quote]


Hi Mod! I restarted my computer, of course, and let it crunch the WU to the bitter end. :]

The first thing I noticed: I am not sure the checkpointing is working correctly...
After initialization, the searching structure started as a straw-like chain that went into SmallMoverEnergyCutRotamerTrials+Minimization. After a couple of seconds I got something quasi-folded with a lot of fine detail (in fact only fine detail: only the thin chains, no thick ones!), and it was accepted as the low-energy state. This low energy was over 300,000 energy units, and it was step 1.

After that the search continued, and after several seconds I got step 2 with a different low energy ("only" over 150,000 units).

Finally, after only 7:34:56 and, I guess, fewer than 200 steps, the WU finished and reported success. Stderr out:

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-29 15:40:18:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
# cpu_run_time_pref: 21600
BOINC:: Initializing ... ok.
[2009- 1-30 8:32:28:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_1 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_2 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_1 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_2 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_3 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_4 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_5 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_6 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_7 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_8 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_9 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage_3_iter1_10 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage4_kk_1 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage4_kk_2 ... success!
Continuing computation from checkpoint: chk_S_00000001_ClassicAbinitio__stage4_kk_3 ... success!
# cpu_run_time_pref: 21600
Continuing computation from checkpoint: chk_stage_1_ClassicRelax__S_00000001_fa ... success!
Continuing computation from checkpoint: chk_stage_2_ClassicRelax__S_00000001_fa ... success!
======================================================
DONE :: 1 starting structures 27296.2 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


All of this looks quite similar to the behaviour I saw on RALPH; please read the second part of this report.

I haven't had any feedback on my checkpointing test. In the 1.54 summary Mike wrote that he had fixed some checkpointing issues, but no one responded directly to my results with "confirmed" or "it works here", so I'm not sure whether it was ignored, solved, or something else.

I know that Mike is working very hard and needs to set priorities, but this one doesn't look good. I am not sure it is a bug (my knowledge of how minirosetta actually works is quite limited), but then again, if we knew more, we could be of much more help. While watching the graphics, we would know much better what is strange behaviour and what is not.

I do appreciate your work (and my humble crunching devices) and I would like to help.

All the best,
a.m.
BOINC@Poland
pramo

Joined: 21 Oct 05
Posts: 4
Credit: 312,015
RAC: 0
Message 59185 - Posted: 30 Jan 2009, 12:36:42 UTC

Should I let this run?
Any other information I can provide?

Full WU name (you can copy the BOINC message from when the task completes).
1/30/2009 6:39:29 AM|rosetta@home|Restarting task 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_63390_0 using minirosetta version 147

Type of operating system (version of Windows, Linux distribution, or Mac info.)
Windows XP SP3
BOINC version (see BOINC Manager "About" page).
6.2.18
Rosetta version (see BOINC Manager "tasks" page).
Rosetta Mini 1.47
A link to the task's results page.

https://boinc.bakerlab.org/rosetta/result.php?resultid=223918616

There are no results yet; it's still running, with 34+ hours of CPU time. These haven't changed in a good while: progress is at 99.522%, with 10 minutes to completion.

stdout.txt
[2009- 1-26 17:33: 1:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-26 19:34:22:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-26 21:41:33:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-26 23:49:23:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 2: 6: 1:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 4:14:55:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 6:22:16:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 8:51:35:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 10:52:53:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 12:55:56:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 15:10: 7:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 15:32:20:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 17:41:28:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 19:50:45:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-27 22: 0:44:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 0: 9:27:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 2:17:51:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 4:18: 6:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 6:27:13:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 9:36:43:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 11:36:51:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 11:40:41:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 13:44:32:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 15:54:10:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 18: 3:15:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 20:12:20:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-28 22:13:10:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 0:13:24:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 3:22:48:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 7:32:28:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 9:45:11:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 11:58:10:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 14: 7:52:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 16:13:20:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 19:32:21:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-29 23:35:37:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-30 2:37:25:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1-30 6:39:30:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...

Stderr.txt
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
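For reports like this one, a small Python script can turn the stdout.txt restart lines into a count and the gaps between restarts. It assumes each `boinc_init()` line marks one start of the application, and it handles the space-padded timestamp fields (e.g. `17:33: 1`) seen in this log; the function names are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as "[2009- 1-26 17:33: 1:] :: BOINC :: boinc_init()".
# Date and time fields are padded with spaces, not zeros.
INIT_LINE = re.compile(
    r"\[(\d{4})-\s*(\d+)-\s*(\d+)\s+(\d+):\s*(\d+):\s*(\d+):\]"
    r"\s*::\s*BOINC\s*::\s*boinc_init\(\)"
)


def restart_times(stdout_text: str) -> list[datetime]:
    """Timestamps of every boinc_init() call, i.e. every (re)start of the app."""
    return [datetime(*map(int, m.groups())) for m in INIT_LINE.finditer(stdout_text)]


def gaps_in_hours(times: list[datetime]) -> list[float]:
    """Hours between consecutive starts."""
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    sample = "\n".join([
        "[2009- 1-26 17:33: 1:] :: BOINC :: boinc_init()",
        "Created shared memory segment",
        "Starting watchdog...",
        "[2009- 1-26 19:34:22:] :: BOINC :: boinc_init()",
        "[2009- 1-26 21:41:33:] :: BOINC :: boinc_init()",
    ])
    times = restart_times(sample)
    print(len(times), [round(g, 2) for g in gaps_in_hours(times)])
```

Run over the full stdout.txt above, it would find 38 starts, matching the 38 `# cpu_run_time_pref` lines in stderr.txt, with most starts roughly two hours apart.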

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59186 - Posted: 30 Jan 2009, 14:26:23 UTC

pramo, no, please abort that one. The newer version of Rosetta should resolve such problems in the future.

Aegis, yes, it looks like it took longer than most, but it completed OK. The new program isn't perfect, but it has far fewer of these long models. And when one does occur, it detects it and ends the task if necessary. Even if it doesn't come to that, it reports the time taken so the Project Team can study why it took so much longer than the others (i.e. a continual improvement cycle).
Rosetta Moderator: Mod.Sense
pramo

Send message
Joined: 21 Oct 05
Posts: 4
Credit: 312,015
RAC: 0
Message 59187 - Posted: 30 Jan 2009, 14:59:00 UTC - in response to Message 59186.  

[quote]pramo, no, please abort that one. The newer version of Rosetta should resolve such problems in the future.[/quote]


Aborted, Thank you!
AdeB

Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 59195 - Posted: 31 Jan 2009, 0:31:43 UTC

Another one that was stopped after preferred runtime + 4hrs.

CPU time 57846.37

stderr out:
...
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
# cpu_run_time_pref: 43200
Starting work on structure: _00002
====>
called boinc_finish
...


AdeB
karlandellie

Joined: 25 Sep 07
Posts: 1
Credit: 39,393
RAC: 0
Message 59198 - Posted: 31 Jan 2009, 11:16:10 UTC

I am currently crunching abinitio_abrelax_nohomfrag_129_B_2hkvA_6224_3670. So far the CPU time is 29:49:50, and it is still crunching, but it seems to have slowed down at 99.44%.

I am using
Windows XP, 1.8 GHz
BOINC 6.4.5
Rosetta Mini 1.47




©2024 University of Washington
https://www.bakerlab.org