minirosetta 2.05

TomaszPawel

Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 65250 - Posted: 9 Feb 2010, 19:02:22 UTC - in response to Message 65248.  

ID: 65250 · Rating: 0
Craig Dickinson

Joined: 7 May 07
Posts: 8
Credit: 924,823
RAC: 547
Message 65251 - Posted: 9 Feb 2010, 22:16:02 UTC - in response to Message 65244.  

Anyone else seeing the following consistent error:-

File - minirosetta_graphics_1.92_windows_x86_64.exe stops downloading at 4.57/5.10 MB

The Messages section shows this as an HTTP error, followed by "Internet access OK - project servers may temporarily be down."

I have reset the project (more than once), and also detached and waited until the next PC boot to re-attach. None of this had any impact, and it's been doing this for several days now, so I am unable to process any work units because the application hasn't finished downloading.

Running on Boinc 6.10.18 for Windows 64Bit on Windows 7, AMD 64Bit Dual Core, 4GB RAM

I am also running Seti@Home and this is running error free in both the standard and astropulse projects.


It should recover the transfer from where it left off and get the rest of the file. But it seems it must have a hiccup along the way. Are you using a caching proxy server or something?

Sounds like you've enabled the http tracing. Which Rosetta server does it say it is trying to get the file from? It should actually cycle through all of them as it does the retries. This should confuse a proxy enough that it would start fresh.

You could always download it with your browser and drop it in the rosetta project directory. Here is one of the direct URLs:
http://srv4.bakerlab.org/download/minirosetta_graphics_1.92_windows_x86_64.exe


It does not show which server it is connecting to. I tried the URL you provided, and srv3 as well. Both download at a consistent 136KB/sec until the download hits 4.57MB; then the transfer rate drops to 0KB/sec, and after 5 minutes or so the connection times out.
ID: 65251 · Rating: 0
Evan

Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 65252 - Posted: 9 Feb 2010, 23:21:22 UTC

This one failed after about 20 seconds

lr15clusfa_opt_.1ail.1ail.IGNORE_THE_REST.c.1.24.pdb.pdb.JOB_17559_3

Exit status -1073741819 (0xc0000005)
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000
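
For anyone wondering how the two forms of that exit status relate: the negative decimal is the same 32-bit value read as a signed integer. A minimal Python sketch of the conversion (the helper name is made up for illustration and is not part of BOINC or Rosetta):

```python
def exit_status_to_hex(status: int) -> str:
    """Show a signed 32-bit exit status as its unsigned hexadecimal form."""
    # Masking with 0xFFFFFFFF reinterprets the two's-complement bits as unsigned.
    return f"0x{status & 0xFFFFFFFF:08x}"


print(exit_status_to_hex(-1073741819))
```

The mask only changes how the bits are displayed; the underlying status code is unchanged.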

ID: 65252 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65254 - Posted: 10 Feb 2010, 3:55:31 UTC - in response to Message 65251.  


It does not show which server it is connecting to. I tried the URL you provided, and srv3 as well. Both download at a consistent 136KB/sec until the download hits 4.57MB; then the transfer rate drops to 0KB/sec, and after 5 minutes or so the connection times out.


...even so, the reaction on the client should be to see that it did not receive the entire length of the file, and then try a second request for just the remainder of the file (this is done with an HTTP range header).
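
As a rough illustration of what such a resume request looks like, here is a minimal Python sketch. It only builds the request and never sends it; the helper name and the byte offset are assumptions for the example, and this is not how the BOINC client itself is written.

```python
import urllib.request


def build_resume_request(url: str, bytes_received: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for the part of a file not yet received."""
    request = urllib.request.Request(url)
    # "bytes=N-" asks the server for everything from offset N to the end of the file.
    request.add_header("Range", f"bytes={bytes_received}-")
    return request


# Hypothetical offset: roughly 4.57 MB already on disk.
req = build_resume_request(
    "http://srv4.bakerlab.org/download/minirosetta_graphics_1.92_windows_x86_64.exe",
    4_790_000,
)
print(req.get_header("Range"))
```

A server that honors the header answers with status 206 (Partial Content) and sends only the requested tail of the file.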

I just tried the link from home and it worked fine for me: I got the full 5.1MB. So that would explain why others are not seeing the same.

Is it possible your ISP is limiting the time of each connection or something? Even so, I'm still puzzled why it doesn't sound like it is doing a retry from where it left off. What is the reaction on the client after the connection gets the 4.57M and then times out?? It should schedule a retry on the file until it gets it. And the retry should pick up where the first attempt left off at the 4.57M. You should see this in the advanced view, in the transfers tab.

Perhaps there is something up with Win7? I'd be curious to have a look at a Wireshark trace if you would take the time to gather one.
Rosetta Moderator: Mod.Sense
ID: 65254 · Rating: 0
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65255 - Posted: 10 Feb 2010, 5:54:05 UTC

Hi jcorn.

Either this is an old task or the memory limit hasn't been changed yet; this one had the same problem on the same rig, would you believe!

Only ran for 10min this time.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=288907163

igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0


Wed 10 Feb 2010 16:19:03 EST|rosetta@home|Aborting task igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0: exceeded memory limit 910.28MB > 909.78MB

Wed 10 Feb 2010 16:19:05 EST|rosetta@home|Output file igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0_0 for task absent
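
To see how far over the limit a task went, the two figures can be pulled out of an abort message like the one above. This is a minimal Python sketch that assumes the message always has the "exceeded memory limit X MB > Y MB" shape seen here; the function name is made up for illustration.

```python
import re

# Matches the "used > limit" pair in the abort message quoted above.
LIMIT_PATTERN = re.compile(r"exceeded memory limit ([\d.]+)MB > ([\d.]+)MB")


def memory_overshoot_mb(log_line: str):
    """Return (used_mb, limit_mb, overshoot_mb), or None if the line does not match."""
    match = LIMIT_PATTERN.search(log_line)
    if match is None:
        return None
    used, limit = float(match.group(1)), float(match.group(2))
    return used, limit, used - limit


line = "Aborting task example_task: exceeded memory limit 910.28MB > 909.78MB"
print(memory_overshoot_mb(line))
```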

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
Maximum memory exceeded
</message>



ID: 65255 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 65263 - Posted: 10 Feb 2010, 15:45:34 UTC

ID: 65263 · Rating: 0
Craig Dickinson

Joined: 7 May 07
Posts: 8
Credit: 924,823
RAC: 547
Message 65264 - Posted: 10 Feb 2010, 16:52:10 UTC - in response to Message 65254.  


It does not show which server it is connecting to. I tried the URL you provided, and srv3 as well. Both download at a consistent 136KB/sec until the download hits 4.57MB; then the transfer rate drops to 0KB/sec, and after 5 minutes or so the connection times out.


...even so, the reaction on the client should be to see that it did not receive the entire length of the file, and then try a second request for just the remainder of the file (this is done with an HTTP range header).

I just tried the link from home and it worked fine for me: I got the full 5.1MB. So that would explain why others are not seeing the same.

Is it possible your ISP is limiting the time of each connection or something? Even so, I'm still puzzled why it doesn't sound like it is doing a retry from where it left off. What is the reaction on the client after the connection gets the 4.57M and then times out?? It should schedule a retry on the file until it gets it. And the retry should pick up where the first attempt left off at the 4.57M. You should see this in the advanced view, in the transfers tab.

Perhaps there is something up with Win7? I'd be curious to have a look at a Wireshark trace if you would take the time to gather one.




I have the Wireshark trace from re-attaching the Rosetta project to the BOINC client. It shows the original download failure of the graphics file and the first 2 attempts to resume the download. Once again, all other Rosetta files and the first work unit have downloaded successfully.

Where do you want me to send the Wireshark trace?
ID: 65264 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,333,868
RAC: 11,943
Message 65267 - Posted: 11 Feb 2010, 2:15:03 UTC - in response to Message 65248.  

This is a good idea, but I think the specific WU I mentioned had another problem. It continued to take memory until the maximum available was reached, so maybe it would have taken more RAM if I had more in my PC.
So far I'm the only one who has noticed this problem, so maybe it is an isolated case.

By the way - it looks like a typical memory leak...
A fairly common error in computer programs

Hi jcorn.
Either this is an old task or the memory limit hasn't been changed yet, this one
had the same problem on the same rig, would you believe!
Only ran for 10min this time.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=288907163
igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0

Yes, it's old.
Hint: the name of the task contains the date it was scheduled, 2 Feb 2010 in this case.
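
If you want to pull that date out automatically, here is a minimal Python sketch. It assumes the date always appears as an underscore-delimited DayMonYear token such as _2Feb2010_, which is only what the task names in this thread suggest; the function name is made up for illustration.

```python
import re
from datetime import date

# English month abbreviations, mapped by hand so the result does not depend on locale.
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

# Assumed token shape: _<day><Mon><year>_, for example _2Feb2010_.
DATE_PATTERN = re.compile(r"_(\d{1,2})([A-Z][a-z]{2})(\d{4})_")


def task_name_date(task_name: str):
    """Return the date embedded in a task name, or None if no such token is found."""
    match = DATE_PATTERN.search(task_name)
    if match is None or match.group(2) not in MONTHS:
        return None
    day, month, year = int(match.group(1)), MONTHS[match.group(2)], int(match.group(3))
    return date(year, month, day)


name = "igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0"
print(task_name_date(name))
```

Task names without such a token (the lr15clusfa ones, for example) simply return None.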
ID: 65267 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5662
Credit: 5,703,329
RAC: 2,182
Message 65270 - Posted: 11 Feb 2010, 8:35:40 UTC

Credit-wise, this task: https://boinc.bakerlab.org/rosetta/result.php?resultid=315895911 (igfhum_looprefine_placestub2_2dsrI_2WA0_ProteinInterfaceDesign_2Feb2010_17660_334_0) wasn't even worth my CPU time. I claimed 99 credit and got only 3!
It ran the full length of time, 15000+ seconds, ran 44 models and generated 2 decoys.

Something is wrong with those numbers. Especially granted credit.
ID: 65270 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5662
Credit: 5,703,329
RAC: 2,182
Message 65271 - Posted: 11 Feb 2010, 8:36:58 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=315583449
lr15clusfa_opt_.1dhn.1dhn.IGNORE_THE_REST.c.14.1.pdb.pdb.JOB_17574_1_0

Compute error -177 (0xffffff4f)

Got full credit though.
ID: 65271 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 1982
Credit: 38,461,917
RAC: 15,153
Message 65276 - Posted: 11 Feb 2010, 11:34:26 UTC - in response to Message 65270.  

Credit-wise, this task: https://boinc.bakerlab.org/rosetta/result.php?resultid=315895911 (igfhum_looprefine_placestub2_2dsrI_2WA0_ProteinInterfaceDesign_2Feb2010_17660_334_0) wasn't even worth my CPU time. I claimed 99 credit and got only 3!
It ran the full length of time, 15000+ seconds, ran 44 models and generated 2 decoys.

Something is wrong with those numbers. Especially granted credit.

But the times you were awarded more than claimed credit weren't a problem? Funny how that works.

It's an average and you're ahead of average generally. I am too but I thought best not to mention it ;)
ID: 65276 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65277 - Posted: 11 Feb 2010, 11:57:27 UTC

Let's not get testy, Sid. It looks like he ran 46 models and got credit for only the last 2. I've asked the Project Team to look into these "double headers", as I call them. Thanks for reporting it, Greg. If you have any hints about any rare events that may have occurred on your PC about the time those last two models would have been run, that would be great. Did you happen to power off or shut down BOINC about that time?
Rosetta Moderator: Mod.Sense
ID: 65277 · Rating: 0
Evan

Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 65278 - Posted: 11 Feb 2010, 12:41:47 UTC

It would appear that some of these lr15clusfa.. work units have a problem.

lr15clusfa_opt_.2cmx.2cmx.SAVE_ALL_OUT_IGNORE_THE_REST.c.2.28.pdb.pdb.JOB_17759_1_0

The previous one I reported has also failed on its second attempt
ID: 65278 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65286 - Posted: 11 Feb 2010, 17:45:13 UTC - in response to Message 65264.  


I have the Wireshark trace from re-attaching the Rosetta project to the BOINC client. It shows the original download failure of the graphics file and the first 2 attempts to resume the download. Once again, all other Rosetta files and the first work unit have downloaded successfully.

Where do you want me to send the Wireshark trace?


Craig and I arranged exchange of the trace via PMs... Craig, I've reviewed the trace and it seems the BOINC client is requesting a partial transfer from the next server in Rosetta's cluster, just as expected. But your machine never acknowledges receipt of any of the packets, so the 3rd and 4th tries ask for the same point in the file that the second did. Plenty of time elapses, and the frame is resent about 10 times, but it is never acknowledged. Eventually the connection is closed.

The only times I've seen such fundamental things not work properly is when (bear with me :) there is a fundamental problem. Some examples from my past... bad ethernet port in the hub. Bad cable. But most commonly there's a router between the client and the server that's in need of a code refresh. That, or perhaps Win7 itself or the driver for your LAN adapter need an update. Because it sounds like it is happening so consistently at the same point, I tend to think there is a code problem here, because a hardware problem like a bad cable or port would probably be more intermittent.

The only other thing I can think of to try relies on the fact that the downloads started out looking good. After that first one gets most of the file and fails, shut down BOINC, shut down the LAN adapter ("disable"), re-enable the adapter, and fire up BOINC again. Now the retry will occur with a fresh start on the adapter and perhaps get you over the hump. But ultimately, that is just a workaround. I think the true fix is going to be drivers, Win7, or router/firewall code updates. You might also try disabling the IPv6 support on the adapter.

Is anyone aware of any specific TCP fixes for Win7?

Notes to self (cuz this little scrap of paper is sure to disappear as soon as anyone asks me anything else about it!): the client starts on srv1, and retries go to srv5, srv6, and srv3. Local ports used are 51221, 51229, 51235, and 51241. These make good filters in the Wireshark display, such as: tcp.port == 51235
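
To look at all four attempts at once, the per-port filters can be OR-ed together. A minimal Python sketch that builds the filter string (the function name is made up for illustration; the output is ordinary Wireshark display-filter syntax):

```python
def wireshark_port_filter(ports):
    """Join TCP ports into one Wireshark display filter that matches any of them."""
    return " || ".join(f"tcp.port == {port}" for port in ports)


# Local ports noted above for the first attempt and the three retries.
print(wireshark_port_filter([51221, 51229, 51235, 51241]))
```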
Rosetta Moderator: Mod.Sense
ID: 65286 · Rating: 0
mfbabb2

Joined: 10 Oct 08
Posts: 4
Credit: 10,345
RAC: 0
Message 65289 - Posted: 12 Feb 2010, 0:43:24 UTC

What is up with the low credit?
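Task ID: 316913595 · Work unit ID: 289066389 · Sent: 10 Feb 2010 15:09:57 UTC · Reported: 12 Feb 2010 0:36:02 UTC · Server state: Over · Outcome: Success · Client state: Done · CPU time (sec): 12,371.91 · Claimed credit: 36.80 · Granted credit: 2.13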
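
To put a number on "low", it helps to compare CPU seconds per credit for the claimed and granted figures. This is a minimal Python sketch that assumes the last three values in the row above are CPU time in seconds, claimed credit, and granted credit; the function name is made up for illustration.

```python
def seconds_per_credit(cpu_seconds: float, credit: float) -> float:
    """Return how many CPU seconds were spent per credit."""
    if credit <= 0:
        raise ValueError("credit must be positive")
    return cpu_seconds / credit


# Assumed reading of the row above: CPU seconds, claimed credit, granted credit.
cpu_seconds, claimed, granted = 12371.91, 36.80, 2.13
print(round(seconds_per_credit(cpu_seconds, claimed), 1))
print(round(seconds_per_credit(cpu_seconds, granted), 1))
```

The larger the gap between the two results, the further the granted credit fell below the claim.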

ID: 65289 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 1982
Credit: 38,461,917
RAC: 15,153
Message 65290 - Posted: 12 Feb 2010, 2:39:21 UTC - in response to Message 65277.  

Let's not get testy Sid.

I didn't mean it that way - sorry if that's how it came across. I just recalled Sarel's comment way up the thread that "The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details" so I'm pretty much ignoring all the vagaries of credit awards against claims. It averages out so we win some, we lose some. Is that not right?

If it's not then I can report quite a few too, for what it's worth.

Probably of more benefit I should report some compute errors, much the same as reported by others:

BOINC client version 6.10.18 for windows_x86_64
Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU T6600@2.20GHz [Intel64 Family 6 Model 23 Stepping 10]
OS: Microsoft Windows 7: Home Premium x64 Edition, (06.01.7600.00)
Memory: 4.00 GB physical

# cpu_run_time_pref: 28800
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000

CPU time 20.65453
lr15clusfa_opt_.1ctf.1ctf.IGNORE_THE_REST.c.18.2.pdb.pdb.JOB_17573_10_0

BOINC client version 6.10.18 for windows_x86_64
Processor: AMD Phenom(tm) 9850 Quad-Core Processor [AMD64 Family 16 Model 2 Stepping 3]
OS: Microsoft Windows Vista Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
Memory: 8.00 GB physical

# cpu_run_time_pref: 28800
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000

CPU time 14.77329
lr15clusfa_opt_.1scj.1scj.IGNORE_THE_REST.c.2.32.pdb.pdb.JOB_17610_1_0
CPU time 15.2101
lr15clusfa_opt_.1iib.1iib.IGNORE_THE_REST.c.9.2.pdb.pdb.JOB_17588_5_1
CPU time 15.1477
lr15clusfa_opt_.1ttz.1ttz.IGNORE_THE_REST.c.0.27.pdb.pdb.JOB_17619_4_1
CPU time 15.0073
lr15clusfa_opt_.1ail.1ail.IGNORE_THE_REST.c.4.11.pdb.pdb.JOB_17559_8_1
ID: 65290 · Rating: 0
Copelco

Joined: 11 Feb 10
Posts: 1
Credit: 8,097
RAC: 0
Message 65292 - Posted: 12 Feb 2010, 4:20:36 UTC

I'm a new user running the latest version. The first work unit you sent ran fine to about 70%, then stopped and dropped off the task list as submitted. My account shows no work units submitted. This may be a problem.

Thanks,
TC
ID: 65292 · Rating: 0
Link

Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 65298 - Posted: 12 Feb 2010, 14:33:07 UTC

Now I've also got quite low credit: WU 288293546.
I usually need something like 450-650 CPU-seconds for 1Cr; on this WU I got 1Cr/1972sec.
ID: 65298 · Rating: 0
Admin

Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 65300 - Posted: 12 Feb 2010, 15:59:56 UTC

Compute error - exit status 1 lrmixclus_opt_.1hz6.1hz6.SAVE_ALL_OUT_IGNORE_THE_REST.c.20.2.pdb.pdb.JOB_17816_1_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=317250268

ERROR: start_res != middle_res
ERROR:: Exit from: ..\..\src\protocols\moves\KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out
ID: 65300 · Rating: 0
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65302 - Posted: 12 Feb 2010, 20:35:24 UTC

This one failed after just 14 sec.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=289171483

lr15clusfa_opt_.1ail.1ail.SAVE_ALL_OUT_IGNORE_THE_REST.c.4.20.pdb.pdb.JOB_17676_7_0

Fri 12 Feb 2010 21:40:02 EST|rosetta@home|Output file lr15clusfa_opt_.1ail.1ail.SAVE_ALL_OUT_IGNORE_THE_REST.c.4.20.pdb.pdb.JOB_17676_7_0_0 for task absent

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
SIGSEGV: segmentation violation
Stack trace (8 frames):
[0x96c49b3]
[0x96ee888]
[0xb7fd1420]
[0x80a8721]
[0x808fcc1]
[0x804985f]
[0x974c15c]
[0x8048121]

Exiting...

</stderr_txt>



ID: 65302 · Rating: 0


