Minirosetta 3.14

Message boards : Number crunching : Minirosetta 3.14



Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,702,763
RAC: 2,140
Message 70960 - Posted: 7 Aug 2011, 20:37:00 UTC

8 memory error tasks and 2 validate error tasks.
Getting old, guys!

No tasks queued, so you're working on this?
Why not say something?
Ed
Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 70979 - Posted: 8 Aug 2011, 18:57:01 UTC

Thanks for the comments guys.

I am not currently getting any Rosetta work units. All my CPU time is going to SETI. At least that one runs consistently.
Sid Celery
Joined: 11 Feb 08
Posts: 1982
Credit: 38,456,205
RAC: 14,686
Message 70984 - Posted: 9 Aug 2011, 2:46:20 UTC - in response to Message 70950.  

I've been good all week so maybe fate will smile on me and they will end up on Sid's bucket.

Oi! Blooming cheek! ;)

The project runs a script, nightly I think, to grant credit for workunits that have failed validation. You won't find it on your tasks lists or workunits pages. You have to scroll to the bottom of the task details page to see it. All the invalid flxdsgn I've looked at have received credit (after a day or so).

Validate errors are not client errors and don't necessarily mean the workunits have failed and the results are useless. I see no reason to abort them. I would though, very much like the project to chime in here and tell us what invalidation means in this particular case.

Beat me to it - agree. Not a problem for me. I don't need to be told why, as long as the project team see it and know why. It's not of any concern to me.
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70985 - Posted: 9 Aug 2011, 3:32:54 UTC

@Sid: Yes, I know about the script, and quite frankly I really don't give a rat's posterior about the credits - lost or gained.

My logic in aborting these tasks while they were still in the queue was based on efficiency. To me it made no sense to run a work unit that was predestined to generate a validation error just to have it routed to a wingman to be recomputed.

Why compute the same work units twice or even three times (if the wingman also had a fast processor)?

As far as being cheeky? I think that you are just showing your insecurity in the face of American exceptionalism. We may be the upstarts on the block, but we're catching up. For years you could proudly claim to have the world leader with the biggest ears.

But I think that Prince Charles has now been eclipsed by Obama.
Sid Celery
Joined: 11 Feb 08
Posts: 1982
Credit: 38,456,205
RAC: 14,686
Message 70988 - Posted: 9 Aug 2011, 10:38:53 UTC - in response to Message 70985.  

My logic in aborting these tasks while they were still in the queue was based on efficiency. To me it made no sense to run a work unit that was predestined to generate a validation error just to have it routed to a wingman to be recomputed.

I think there's some value in a job reporting a failure to run (presumably in a pattern already detected by crunchers), especially if it only runs 20 minutes, rather than reporting as aborted.

As far as being cheeky? I think that you are just showing your insecurity in the face of American exceptionalism. We may be the upstarts on the block, but we're catching up. For years you could proudly claim to have the world leader with the biggest ears.

But I think that Prince Charles has now been eclipsed by Obama.

lol ;) Where did you hear Charles was a leader of anything?
Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,811,598
RAC: 764
Message 70989 - Posted: 9 Aug 2011, 13:19:26 UTC - in response to Message 70985.  


My logic in aborting these tasks while they were still in the queue was based on efficiency. To me it made no sense to run a work unit that was predestined to generate a validation error just to have it routed to a wingman to be recomputed.

Why compute the same work units twice or even three times (if the wingman also had a fast processor)


I was under the impression, quite possibly inaccurate, that resends are not exact duplicates but are additional copies. The second cruncher is not in fact rerunning the exact same models as the first cruncher but rather running additional models. Perhaps Mod.Sense can clarify.

Further, if invalidation does not prevent interesting models from being more closely examined by the project scientists, then there's no reason not to continue running these types of tasks even in the face of frequent invalidation.

Although, one, frequent invalidation is annoying, they really should fix that; and two, I could be wrong on either or both points.

Best,
Snags
Ed
Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 70998 - Posted: 10 Aug 2011, 12:17:25 UTC

I would be interested in some information about resends. My understanding, frankly from SETI and other grid projects I have been part of, is that each WU is sent to multiple computers. When the work comes back, the results are compared to validate them.

The 3-way comparison is the one I seem to recall as being most common. If they send out three and two match, those two are marked good and the third as not good.

I realize this is not a requirement, but would be interested to understand how this project works.
robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,848,401
RAC: 2,043
Message 71002 - Posted: 10 Aug 2011, 13:54:04 UTC
Last modified: 10 Aug 2011, 13:55:57 UTC

The last I knew, Rosetta@Home usually sent out only one of each workunit. If it came back with an error or was sent to a computer considered unreliable, they would send another. Also another for a small fraction of the computers considered reliable. If those two agreed, no need for a third one. If they disagreed, then a third one was sent. For most of the workunits, they had a fairly quick way of calculating how good the outputs were and could use that as part of deciding whether to send another copy. However, this was months ago, so it might not describe the current setup.
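The decision procedure described in the post above (one copy first; a second on error, for unreliable hosts, or as an occasional spot check; a third only on disagreement) can be sketched roughly as follows. The function names and the spot-check rate are invented for illustration, and per the correction later in the thread this may describe malariacontrol.net-style adaptive replication rather than what R@h actually does:

```python
import random

SPOT_CHECK_RATE = 0.05  # hypothetical fraction of reliable hosts double-checked

def extra_copies(first_result_ok, host_reliable):
    """How many additional copies to issue after the first result comes back.
    Illustrative sketch of the scheme described above; the names and the
    spot-check rate are invented, not taken from any project's code."""
    if not first_result_ok:
        return 1            # errored result: rerun it
    if not host_reliable:
        return 1            # unreliable host: always double-check
    if random.random() < SPOT_CHECK_RATE:
        return 1            # occasional spot check of a reliable host
    return 0                # single result accepted as-is

def need_third_copy(result_a, result_b):
    """A third copy goes out only when the two results disagree."""
    return result_a != result_b
```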
Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,811,598
RAC: 764
Message 71008 - Posted: 10 Aug 2011, 22:33:43 UTC - in response to Message 71002.  

The last I knew, Rosetta@Home usually sent out only one of each workunit. If it came back with an error or was sent to a computer considered unreliable, they would send another. Also another for a small fraction of the computers considered reliable. If those two agreed, no need for a third one. If they disagreed, then a third one was sent. For most of the workunits, they had a fairly quick way of calculating how good the outputs were and could use that as part of deciding whether to send another copy. However, this was months ago, so it might not describe the current setup.



Robert, I don't recall this ever being the procedure on rosetta. Perhaps you are thinking of malariacontrol.net? They use adaptive replication.


Ed, my speculations earlier in the thread regarding resends apply to rosetta @home only. If you go to your tasks list and click on the "workunit id" link you'll see that initial replication=1 and minimum quorum=1. Another copy will be sent only if the original is returned with an error, fails to validate or misses its deadline.

Some projects use multiple replications to prevent cheating or to discard results that produce the wrong answer but don't throw client errors. As I understand it, the method rosetta uses does not depend on finding the single right answer but is collecting best guesses. For each experiment the project sends out hundreds (thousands?) of workunits in order to create tens (hundreds?) of thousands of models which they can then analyze statistically. A single computer returning garbage should not affect the results. Likewise the failure of a single workunit or single models within workunits is not a cause for concern.


Best,
Snags
robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,848,401
RAC: 2,043
Message 71012 - Posted: 10 Aug 2011, 23:07:41 UTC - in response to Message 71008.  

The last I knew, Rosetta@Home usually sent out only one of each workunit. If it came back with an error or was sent to a computer considered unreliable, they would send another. Also another for a small fraction of the computers considered reliable. If those two agreed, no need for a third one. If they disagreed, then a third one was sent. For most of the workunits, they had a fairly quick way of calculating how good the outputs were and could use that as part of deciding whether to send another copy. However, this was months ago, so it might not describe the current setup.



Robert, I don't recall this ever being the procedure on rosetta. Perhaps you are thinking of malariacontrol.net? They use adaptive replication.


Ed, my speculations earlier in the thread regarding resends apply to rosetta @home only. If you go to your tasks list and click on the "workunit id" link you'll see that initial replication=1 and minimum quorum=1. Another copy will be sent only if the original is returned with an error, fails to validate or misses its deadline.

Some projects use multiple replications to prevent cheating or to discard results that produce the wrong answer but don't throw client errors. As I understand it, the method rosetta uses does not depend on finding the single right answer but is collecting best guesses. For each experiment the project sends out hundreds (thousands?) of workunits in order to create tens (hundreds?) of thousands of models which they can then analyze statistically. A single computer returning garbage should not affect the results. Likewise the failure of a single workunit or single models within workunits is not a cause for concern.


Best,
Snags


Possibly - I have my computers participating in most of the BOINC projects I've found connected to medical research, and it's often hard to keep track of which project is doing what.
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 71016 - Posted: 11 Aug 2011, 0:16:31 UTC - in response to Message 70989.  


My logic in aborting these tasks while they were still in the queue was based on efficiency. To me it made no sense to run a work unit that was predestined to generate a validation error just to have it routed to a wingman to be recomputed.

Why compute the same work units twice or even three times (if the wingman also had a fast processor)


I was under the impression, quite possibly inaccurate, that resends are not exact duplicates but are additional copies. The second cruncher is not in fact rerunning the exact same models as the first cruncher but rather running additional models. Perhaps Mod.Sense can clarify.

Further, if invalidation does not prevent interesting models from being more closely examined by the project scientists, then there's no reason not to continue running these types of tasks even in the face of frequent invalidation.

Although, one, frequent invalidation is annoying, they really should fix that; and two, I could be wrong on either or both points.

Best,
Snags


When tasks are resent, the second person gets the same task... and the same random seed that defines which exact models to run, but the second machine may have a different runtime preference. So they will start out crunching exactly the same models, but may run more or fewer models than the first machine.
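A toy sketch of what Mod.Sense describes: the seed pins down the exact model sequence, while each host's runtime preference decides how far along that sequence it gets. The generator and the per-model cost below are invented purely for illustration:

```python
import random

def models_crunched(seed, runtime_hours, hours_per_model=1):
    """Toy sketch: the task's random seed fixes the exact model sequence,
    and the host's runtime preference fixes how many models it completes.
    The generator and per-model cost are invented for illustration."""
    rng = random.Random(seed)
    n = runtime_hours // hours_per_model
    return [rng.random() for _ in range(n)]

first = models_crunched(seed=12345, runtime_hours=4)   # original host, 4 h preference
resend = models_crunched(seed=12345, runtime_hours=8)  # resend host, 8 h preference

assert resend[:len(first)] == first   # same seed: identical models to start
assert len(resend) > len(first)       # longer preference: more models
```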

Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 71017 - Posted: 11 Aug 2011, 0:19:53 UTC

Right, R@h just sends the task to a single machine, and only issues resends when that task is not returned before deadline, or is returned with an error.

R@h does not define "reliable hosts". Some other projects do.

Bottom line, rather than have one machine waste its time simply double-checking the work of another, it crunches new models no one else has done. Net result: the project gets a wider sampling of the search space.
Rosetta Moderator: Mod.Sense
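The issue policy described above (initial replication 1, minimum quorum 1, a resend only when the single result is unusable) boils down to a simple check. This is an illustrative sketch, not actual BOINC scheduler code:

```python
def needs_resend(returned_by_deadline, client_error, validated):
    """Quorum-of-one policy: issue a second copy only when the single
    result cannot be used. Illustrative sketch, not BOINC scheduler code."""
    return (not returned_by_deadline) or client_error or (not validated)
```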
Ed
Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 71020 - Posted: 11 Aug 2011, 3:07:28 UTC

Thanks for the clarifications.
Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,811,598
RAC: 764
Message 71030 - Posted: 11 Aug 2011, 16:12:37 UTC - in response to Message 71016.  


When tasks are resent, the second person gets the same task... and the same random seed that defines which exact models to run, but the second machine may have a different runtime preference. So they will start out crunching exactly the same models, but may run more or fewer models than the first machine.



So I was just crazy talking. Shucks, I really liked that idea. Ah, well.

Thanks, Mod.Sense, for clearing that up.

Best,
Snags
svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 71031 - Posted: 11 Aug 2011, 19:44:48 UTC

Task 440910744 (T0423_3d01.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_30148_3650_0) gave a Validate error on Mac after completing one decoy.
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,702,763
RAC: 2,140
Message 71033 - Posted: 12 Aug 2011, 2:56:06 UTC
Last modified: 12 Aug 2011, 2:57:34 UTC

This crashed and burned: T0409_3d0f.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_30145_9598_0

It ran only 50% of its allotted time and as far as I can tell produced no decoys.
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 71042 - Posted: 13 Aug 2011, 5:47:06 UTC

I am sure others have noted it already but tasks T0423* appear to be behaving in the exact same manner as the flxdsgn tasks of the past 10 days or so.

They are designed to generate only one decoy and if the system completes it in less than 1201 seconds it gets a validate error and is then sent to a second system.

Task ID 440948630 is an example where both I and my wingman completed the task in less than 1201 seconds and we both got a validate error.

Task ID 440943948 is an example where my system completed the task in less than 1201 seconds and got a validate error while my wingman took 3350 seconds and got a success.
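A quick triage check for the symptom reported here (one decoy, finished in under 1201 seconds, marked invalid). This only describes the observed pattern; as Mod.Sense points out later in the thread, the validator itself does not test runtime:

```python
def matches_1201_pattern(cpu_seconds, decoys, validated):
    """True when a result fits the symptom reported in this thread:
    exactly one decoy, finished in under 1201 s, and marked invalid.
    Descriptive only -- the real validator does not test runtime."""
    return decoys == 1 and cpu_seconds < 1201 and not validated
```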


Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 71043 - Posted: 13 Aug 2011, 5:55:54 UTC

I'm sorry, I think I need to editorialize a little bit. The T0423* tasks in my post were generated this past Thursday, a full week after the "1201 second" problem was spotted by another participant here.

Yet here we go again? Does anyone at the project read this forum? Better yet, does anyone at the project do anything to verify that a known problem is not propagated into a new batch of tasks before they are released into the wild?

While the cause of the problem behind the "1201 second" issue may be complex and as yet not identified, its signature is easy to spot - and could have been picked up in even the most rudimentary dry runs.

Dang!
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 71050 - Posted: 13 Aug 2011, 13:23:56 UTC

The validation has no way to determine a specific number of seconds that would be valid, or invalid. The same amount of effort from a slow machine might mean it takes twice as long to reach that same point of execution. So any such signature is a red herring.

The true problem would seem to lie elsewhere, in how the tasks are being processed, which would explain why both machines that crunched it had the same problem.
Rosetta Moderator: Mod.Sense
robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,848,401
RAC: 2,043
Message 71066 - Posted: 16 Aug 2011, 12:47:37 UTC

Another workunit that's stopped using any CPU time:

minirosetta_3.14_windows_x86_64.exe
working set 618,000K
peak working set 651,888K

T0441_3d8u.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_29961_12936
max RAM usage 95 MB
CPU time at last checkpoint 02:23:47
CPU time 02:24:11
Elapsed time 28:40:13
Estimated time remaining 59:43:59
Fraction done 17.166%

8/13/2011 2:07:36 PM | | Starting BOINC client version 6.12.33 for windows_x86_64
8/13/2011 2:07:36 PM | | log flags: file_xfer, sched_ops, task
8/13/2011 2:07:36 PM | | Libraries: libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.5
8/13/2011 2:07:36 PM | | Data directory: C:\ProgramData\BOINC
8/13/2011 2:07:36 PM | | Running under account Bobby
8/13/2011 2:07:36 PM | | Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
8/13/2011 2:07:36 PM | | Processor: 6.00 MB cache
8/13/2011 2:07:36 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
8/13/2011 2:07:36 PM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
8/13/2011 2:07:36 PM | | Memory: 8.00 GB physical, 15.66 GB virtual
8/13/2011 2:07:36 PM | | Disk: 919.67 GB total, 541.21 GB free
8/13/2011 2:07:36 PM | | Local time is UTC -5 hours
8/13/2011 2:07:36 PM | | NVIDIA GPU 0: GeForce GTS 450 (driver version 28026, CUDA version 4000, compute capability 2.1, 993MB, 476 GFLOPS peak)

8/16/2011 12:17:04 AM | | Number of CPUs: 3
8/16/2011 12:17:04 AM | | 3026 floating point MIPS (Whetstone) per CPU
8/16/2011 12:17:04 AM | | 8778 integer MIPS (Dhrystone) per CPU
8/16/2011 12:17:05 AM | | Resuming computation
8/16/2011 2:04:43 AM | | Project communication failed: attempting access to reference site
8/16/2011 2:04:44 AM | | Internet access OK - project servers may be temporarily down.
8/16/2011 2:34:09 AM | | Project communication failed: attempting access to reference site
8/16/2011 2:34:10 AM | | Internet access OK - project servers may be temporarily down.
8/16/2011 2:48:39 AM | | Project communication failed: attempting access to reference site
8/16/2011 2:48:40 AM | | Internet access OK - project servers may be temporarily down.
8/16/2011 3:54:44 AM | rosetta@home | Sending scheduler request: To fetch work.
8/16/2011 3:54:44 AM | rosetta@home | Requesting new tasks for CPU
8/16/2011 3:54:45 AM | rosetta@home | Scheduler request completed: got 1 new tasks
8/16/2011 3:54:47 AM | rosetta@home | Started download of 2011_8_15_mini_s016_folding.zip
8/16/2011 3:54:54 AM | rosetta@home | Temporarily failed download of 2011_8_15_mini_s016_folding.zip: HTTP error
8/16/2011 3:54:55 AM | rosetta@home | Started download of 2011_8_15_mini_s016_folding.zip
8/16/2011 3:54:58 AM | | Project communication failed: attempting access to reference site
8/16/2011 3:54:59 AM | | Internet access OK - project servers may be temporarily down.
8/16/2011 3:55:09 AM | rosetta@home | Finished download of 2011_8_15_mini_s016_folding.zip

Requested runtime 12 hours
BOINC 6.12.33
64-bit Windows Vista Home Premium SP2
8 GB memory; BOINC allowed to use 40% of it
Set to keep workunits in memory when suspended

Now suspended; should I allow it to resume? Should I abort it? Is it best to set R@H to no new tasks?
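For what it's worth, the figures in the post above (CPU time 02:24:11 against 28:40:13 elapsed) can be turned into a quick stalled-task check from the numbers BOINC displays. The 10% threshold is an arbitrary choice for illustration, not a BOINC rule:

```python
def hms_to_seconds(hms):
    """Convert BOINC's HH:MM:SS display to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def looks_stalled(cpu_hms, elapsed_hms, threshold=0.10):
    """Flag a running task whose CPU/elapsed ratio is tiny (likely hung).
    The 10% threshold is an arbitrary illustration, not a BOINC rule."""
    return hms_to_seconds(cpu_hms) / hms_to_seconds(elapsed_hms) < threshold
```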




©2024 University of Washington
https://www.bakerlab.org