At a bit of a loss

Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68651 - Posted: 17 Nov 2010, 13:10:54 UTC
Last modified: 17 Nov 2010, 13:14:47 UTC

Yesterday I had a system start aborting all of its tasks. When new tasks were downloaded, I was getting "file size or signature" errors on the minirosetta_2.17_x86_64-pc-linux-gnu file.

However, I compared the size and checksum of this file to the same file on several "working" systems and everything matched.
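A comparison like this can be scripted. The following is only a minimal sketch: `fingerprint` is a hypothetical helper, not part of BOINC, and it uses a SHA-256 digest from Python's standard library as a stand-in for whatever checksum tool was used on these systems. Run it on each machine against the same file and compare the printed lines.

```python
import hashlib
import os


def fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python fingerprint.py FILE [FILE ...]
    for name in sys.argv[1:]:
        size, sha = fingerprint(name)
        print(f"{sha}  {size:>12}  {name}")
```

Matching output on two machines only shows that the file contents are identical; it says nothing about whether the client's signature check on that file will pass.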

This system is normally rack mounted and runs "headless" over a VNC connection, so the network is functioning OK.

dmesg and the messages file show no errors reported at the system level.

I deleted the minirosetta_2.17_x86_64-pc-linux-gnu file and reinstalled BOINC - when it was brought back up it tried to download the file and got the same error.

I took the system out of the rack and put it on the bench with a real monitor and keyboard, ran fsck a number of times, and brought BOINC back up.

Still getting the checksum / signature error.

I once again cleaned up the project file, deleting minirosetta*, and started over. I'm now getting the checksum / signature error on both the minirosetta_2.17_x86_64-pc-linux-gnu file and the minirosetta_database_rev39052.zip file.

Other downloaded files, such as abrelax.default.v1 and 2RN2_pcs_cst_files.r2.pnoe.V1, download without this error.

The error is the same with the image file verification feature turned on or turned off.

The following types of messages are shown in the job output for all these jobs:

<file_xfer_error>
<file_name>minirosetta_graphics_1.92_x86_64-pc-linux-gnu</file_name>
<error_code>-200</error_code>
</file_xfer_error>

<file_xfer_error>
<file_name>minirosetta_2.17_x86_64-pc-linux-gnu</file_name>
<error_code>-200</error_code>
</file_xfer_error>

Once again I stress, it appears that the files are making it into the ../projects/boinc.bakerlab.org_rosetta directory with the correct file size and checksum (as calculated using "sum").

Any ideas?
[AF>france>pas-de-calais]symaski62
Joined: 19 Sep 05
Posts: 47
Credit: 33,871
RAC: 0
Message 68652 - Posted: 17 Nov 2010, 14:24:22 UTC

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68653 - Posted: 17 Nov 2010, 16:17:54 UTC

If the files are arriving uncorrupted, as you seem to have verified in as many ways as possible, and you are still getting the error, the only culprit left would seem to be BOINC, which would be reporting the error erroneously. What BOINC version are you running on that machine? One other possibility would be a piece of the project configuration. Have you tried completely detaching and reattaching R@h?
Rosetta Moderator: Mod.Sense
Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68658 - Posted: 17 Nov 2010, 23:52:36 UTC
Last modified: 17 Nov 2010, 23:57:16 UTC

What BOINC version are you running on that machine?


It was running 6.10.56 at the time of the failure, but I upgraded to 6.10.58 as a part of the diagnostic process. That did not help.

I snagged a copy of the downloaded files that were marked as being in error before they were cleaned up, and manually unzipped them, assuming that I might see an indication of what was causing the length / signature errors. They unzipped clean.
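The same integrity check can be done without extracting anything. This is a rough sketch using Python's standard `zipfile` module; `first_bad_member` is a hypothetical helper name, and the path you pass in is whatever copy of the archive you saved.

```python
import zipfile


def first_bad_member(zip_path):
    """Return the name of the first corrupt member, or None if all pass."""
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() reads every member and checks its CRC and file header.
        return archive.testzip()
```

A result of `None` means every member passed its CRC check, which is consistent with "unzipped clean". If the file is not a readable zip archive at all, `zipfile.ZipFile` raises `zipfile.BadZipFile` instead.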

I brought up network tools and started to look to see if I had a ratty network connection - which I doubted, since after the move to the bench this morning it was connected directly to the router with a new cable instead of being on the downstream switch.

0 Xmit errors, 0 Recv Errors, 0 Collisions

Still getting length / signature errors.

In desperation I tried your suggestion of completely disconnecting from the project and reconnecting - bang, everything started working.

As Gomer Pyle would say - Shazam.

Want to give me a clue as to what kind of magic is at work here? Could it have been a fat electron?
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68659 - Posted: 18 Nov 2010, 0:24:39 UTC

Am I correct to presume that R@h previously had been running fine on that machine? If so, then it almost sounds like there's a project key used to test the signature validity, and it somehow became corrupted on your hard drive. With a corrupted key you can't come up with a matching signature, hence they never match.
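To illustrate why a damaged key would produce this symptom, here is a small sketch. It is not BOINC's actual signing scheme: it uses HMAC-SHA256 from Python's standard library as a stand-in, and the key and file contents are made up. The point is only that an intact file with a valid signature still fails verification once a single bit of the locally stored key changes.

```python
import hashlib
import hmac


def sign(key: bytes, data: bytes) -> bytes:
    """Compute a signature over `data` (HMAC stand-in, not BOINC's scheme)."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify(key: bytes, data: bytes, signature: bytes) -> bool:
    """Check `signature` against `data` using `key`."""
    return hmac.compare_digest(sign(key, data), signature)


# Hypothetical values for illustration only.
file_bytes = b"contents of a downloaded application file"
server_key = b"project-key-as-issued"
signature = sign(server_key, file_bytes)

# Simulate on-disk corruption: flip one bit in the local copy of the key.
damaged = bytearray(server_key)
damaged[0] ^= 0x01
local_key = bytes(damaged)

intact_ok = verify(server_key, file_bytes, signature)   # key as issued
corrupt_ok = verify(local_key, file_bytes, signature)   # key after bit flip
print(intact_ok, corrupt_ok)
```

The file bytes and the signature are identical in both checks; only the key differs. That would fit a case where sizes and checksums match on disk but every signature check fails until a fresh key is fetched by reattaching.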
Rosetta Moderator: Mod.Sense
Profile Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 68660 - Posted: 18 Nov 2010, 1:48:27 UTC

Sweet, maybe this way I might be able to catch up with you. lol
Now I just have to multiply my RAC by 13 and voila!
Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68661 - Posted: 18 Nov 2010, 2:15:23 UTC

Am I correct to presume that R@h previously had been running fine on that machine?


Most certainly, this was one of my early machines and had been running fine until sometime yesterday while I was at work. It had garnered a cumulative credit of over 350K.

Corruption of a file is always a possibility here - especially since, when I got home from work yesterday, it was clear that the local electric company had once again "shown its affection" for those of us living in the complex by "blipping" our electricity. The timer on the microwave always rats them out when it flashes "reset".

Your idea of a corrupted key does make sense.
Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68662 - Posted: 18 Nov 2010, 2:22:44 UTC
Last modified: 18 Nov 2010, 2:27:53 UTC

Chile Man - they say that everything is bigger up here in Texas, including the RAC. You know, I'm pretty good with Photoshop; do you want me to try fixing that Lone Star on your avatar? Something just seems slightly out of place.

Have a great day, my friend.