Message boards : Number crunching : Problems with Minirosetta v1.54
Author | Message |
---|---|
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. I turned it off and nothing changed. I use the free version of Avast. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. got any firewalls active? |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
mikey, I don't know why I didn't think of this before... Change #3....I downloaded and installed the latest version of DirectX, no changes noted. Change #4....I installed Boinc 6.6.3, got this message "1/29/2009 8:28:31 AM|rosetta@home|Scheduler request completed: got 0 new tasks". I may have errored out all my available work for the day. No files downloading, so maybe it will take this time? No clue. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the: Okay, I have downloaded the file and put it in the BoincData directory. I took out the asterisks and changed the <file_xfer_debug> line to a 1; it was a zero. As for the http setting, I use Firefox 3.0.5 and do not see that setting. I know it is/was in IE, but I do not see it in Firefox. |
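For readers following along, here is a minimal sketch of what such a cc_config.xml might contain, written as a small Python helper that generates the text. The thread does not show the original file, so the "first three flags" used here (`task`, `file_xfer`, `sched_ops`) are an assumption; only `file_xfer_debug` is named in the post above. The helper name `build_cc_config` is hypothetical.

```python
def build_cc_config(flags):
    """Return cc_config.xml text with one <log_flags> entry per flag.

    `flags` maps a flag name to a truthy/falsy value; each is written as 1 or 0.
    """
    lines = ["<cc_config>", "  <log_flags>"]
    for name, value in flags.items():
        lines.append(f"    <{name}>{int(bool(value))}</{name}>")
    lines += ["  </log_flags>", "</cc_config>"]
    return "\n".join(lines) + "\n"


# Assumed flag set: the first three are a guess at the defaults mentioned in
# the thread; file_xfer_debug is the one mikey changed from 0 to 1.
FLAGS = {"task": 1, "file_xfer": 1, "sched_ops": 1, "file_xfer_debug": 1}

config_text = build_cc_config(FLAGS)
print(config_text)
```

The printed text would be saved as cc_config.xml in the BOINC data directory, after which the client has to re-read its config file (or be restarted) before the extra messages appear.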
Scott A. Howard* Send message Joined: 16 Oct 05 Posts: 2 Credit: 11,005,855 RAC: 0 |
Hello, Here's the problem in a nutshell. On my Dell Precision T5400 with dual Xeon E5410 2.33 GHz chips (for a total of 8 cores) running on XP Pro SP3, almost every one of the Rosetta jobs (minirosetta version 1.54) fails. The typical failure mode is that they exceed their CPU time allocation. For example, if a job is estimated to require 4 hours of CPU time, it is killed at something like 20 hours. Sometimes the tasks show progress, other times they are stuck at zero. Also, the exe is not removed from memory when the computer is in use. I have reset the project and detached and attached again but it continues to happen. Nothing like this happens with the lhcathome, QMC@HOME, Docking@Home, or boincsimap tasks. I also don't see this behavior on any of my other machines. Do you guys produce any diagnostic logs that might be of use in troubleshooting the problem? Maybe it's my configuration - maybe a coding error showing up when running 6 or 8 of these tasks simultaneously. (It appears to occur with any number running, from 1 - 8). I have a full development environment and debuggers if you want some traces. Scott Howard Addendum: Now that I thought about it a little more, does the app use any global resource locking? E.g., mutexes, semaphores, file access? Maybe that's why the progress is halted, it's deadlocked - but I am not sure why the task would continue to use CPU time though. Just some random thoughts... |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file. Once you have the file in the directory, abort the transfer. You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state. The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list. Does each try go for 5 minutes before waiting again? Does any data come down in that period of time? Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results? Rosetta Moderator: Mod.Sense |
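The ping and tracert step requested above can be sketched as follows. This is only an illustration: the download URL is a made-up placeholder (the thread never shows the real server name), and the helper `diagnostic_commands` is hypothetical. It pulls the host name out of a URL taken from the file transfer debug messages and builds the two Windows commands to run against it.

```python
from urllib.parse import urlparse


def diagnostic_commands(download_url):
    """Return the ping and tracert command lines for the URL's host name."""
    host = urlparse(download_url).hostname
    if host is None:
        raise ValueError(f"no host name found in {download_url!r}")
    return [["ping", host], ["tracert", host]]


# Placeholder URL, not a real Rosetta server; substitute the one from your log.
EXAMPLE_URL = "http://download.example.org/minirosetta_1.54_windows_intelx86.exe"

commands = diagnostic_commands(EXAMPLE_URL)
for cmd in commands:
    print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run` on a Windows machine, or simply typed at a command prompt, and the output pasted back into the thread.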
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
Hello, the version 1.47 was very well for me with 151 Workunits and 0 errors and an average CPU time 2.8 hours. Hope that the new version 1.54 will be as well... 1.47 worked rather well for me, with perhaps one out of ten workunits giving an error. Not enough 1.54 workunits yet to say whether 1.54 is better. I'm asking for 14 hour workunits, so it will take me longer to run that many. |
Scott A. Howard* Send message Joined: 16 Oct 05 Posts: 2 Credit: 11,005,855 RAC: 0 |
Here's a follow up. I did the following: 1) detached from the project. 2) removed the Rosetta project folder from under BOINC... 3) removed all files from a slot that contained Rosetta data 4) reattached to the project 5) allowed for 50% of the cpus to be used (4 in this case) 6) allowed the four projects to run - each expected to take about 4 hours Observed results: The status for the projects are "Running, high priority", each has used about 20 minutes of cpu time, the progress is 0.000% Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory. It looks like that's all I can do. If there are no suggestions from your end, I'll need to stay detached from the project so I don't waste cycles. I see the thread that's consuming the CPU has a pretty regular call stack. Here is the call stack. If you have your debug symbols for your build, you should be able to locate the routine and line at which the program is hung... ntkrnlpa.exe!KiSwapContext+0x2f ntkrnlpa.exe!KiSwapThread+0x8a ntkrnlpa.exe!KeWaitForSingleObject+0x1c2 ntkrnlpa.exe!KiSuspendThread+0x18 ntkrnlpa.exe!KiDeliverApc+0x124 hal.dll!HalpApcInterrupt+0xc6 minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here minirosetta_1.54_windows_intelx86.exe+0x17d3 minirosetta_1.54_windows_intelx86.exe+0x1afcd minirosetta_1.54_windows_intelx86.exe+0x9289e minirosetta_1.54_windows_intelx86.exe+0x4a4bc3 minirosetta_1.54_windows_intelx86.exe+0xb0892 minirosetta_1.54_windows_intelx86.exe+0x3e0c24 |
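For anyone who wants to match those frames against a map file or debug symbols, here is a rough sketch of pulling the module names and offsets out of a pasted stack like the one above. The regular expression and the helper `parse_frames` are illustrative assumptions about the `module!symbol+0xoffset` layout seen in this post, not the output format of any particular debugger, and only four of the frames are repeated here.

```python
import re

# Matches "module!symbol+0xNN" as well as "module+0xNN" (no symbol available).
FRAME_RE = re.compile(r"(?P<module>[\w.]+)(?:!(?P<symbol>\w+))?\+0x(?P<offset>[0-9a-fA-F]+)")


def parse_frames(stack_text):
    """Return a list of (module, symbol or None, offset as int) tuples."""
    return [
        (m.group("module"), m.group("symbol"), int(m.group("offset"), 16))
        for m in FRAME_RE.finditer(stack_text)
    ]


# A few frames copied from the post above.
STACK = (
    "ntkrnlpa.exe!KiSwapContext+0x2f "
    "hal.dll!HalpApcInterrupt+0xc6 "
    "minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here "
    "minirosetta_1.54_windows_intelx86.exe+0x17d3"
)

frames = parse_frames(STACK)
# Keep only the frames inside the application itself.
app_offsets = [off for mod, sym, off in frames if mod.startswith("minirosetta")]
print([hex(off) for off in app_offsets])
```

The resulting offsets are relative to the module's load address, so they can be compared directly with a linker map for the same build if one is available.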
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file. I changed the dual core settings to use both cores, this is a laptop and I do not like stressing it that much, and set the other project to no new work. I updated Rosetta and it proceeded to download new work. The same file stopped at the same place, 89.25%. I aborted it, after all other files were done downloading, and no new entries showed up in the cc_config.xml file. I was browsing thru the stdout.txt file and found this: 9:21:33 AM: Error: can't open file 'C:Boinc\RebootPending.txt' (error 2: the system cannot find the file specified.) [01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect 2: Winsock error '10061' [01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect on 444 returned -1 [01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init boinc_socket returned 444 [01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init connect returned -1 [01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init attempting connect [01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init_poll sock = 444 It is in there many, many times. I do not see what server I am downloading from, and only use the Windows firewall, so unless I could block thru the Hosts files, I do not know how to block that particular server anyway. Yes each retry deferral is about 4 minutes. I did find one more thing in that stdoutgui.txt file: [01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect 2: Winsock error '10061' [01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect on 524 returned -1 It is also in there many, many times. I did a search and found where it said to change the attributes for the Boinc directory and all subdirectories. It was set to read only and when I unchecked that and changed it also for all subdirectories, Boinc will not run. It also auto defaults back to read only after it errors out.
DO NOT DO THIS LAST PART It crashed my whole Boinc setup and I had to delete the Boinc directory, and all subdirectories, then reboot and then reinstall Boinc from scratch. FORTUNATELY it did a repair install instead of a brand new install from scratch! I lost all workunits from all projects though!!!! I attached to Rosetta and guess what? The EXACT SAME FILE is stuck at the EXACT SAME PLACE!!! A TON of files are downloading besides just that one, but that one is stuck all over AGAIN!!! |
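A note on the trace lines quoted in this post: Winsock error 10061 is WSAECONNREFUSED, meaning the target actively refused the connection. Since the lines come from an RPC client connect attempt, they may describe a local connection rather than the stalled download itself, though that is a guess. The sketch below, with a hypothetical helper `count_winsock_errors`, tallies how often each Winsock code shows up in a log so that "many, many times" can be reported as a number.

```python
import re
from collections import Counter

# Two well-known Winsock codes; anything else is reported as "unknown".
WINSOCK_NAMES = {
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

ERROR_RE = re.compile(r"Winsock error '(\d+)'")


def count_winsock_errors(log_lines):
    """Return a Counter mapping Winsock error code -> number of occurrences."""
    counts = Counter()
    for line in log_lines:
        for code in ERROR_RE.findall(line):
            counts[int(code)] += 1
    return counts


# Lines copied from the post above.
SAMPLE = [
    "[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect 2: Winsock error '10061'",
    "[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect on 444 returned -1",
    "[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect 2: Winsock error '10061'",
]

counts = count_winsock_errors(SAMPLE)
for code, n in counts.items():
    print(code, WINSOCK_NAMES.get(code, "unknown"), n)
```

To run it against a real log, read the file's lines with `open(path).readlines()` and pass them in place of `SAMPLE`.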
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. No I use the Windows one, I have Windows XP Media Center on this laptop. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
mikey, I don't know why I didn't think of this before... What antivirus program do you have, and what version? Some antivirus programs don't fully turn off when you try to turn them off; they stop reporting that they have found a virus, but don't stop looking for a virus. I'm also running Ad-Aware, but without this problem, so this antispyware program is less likely to be causing the problem. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using? |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!! I am also using the 4.8 Home version of Avast. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Scott Howard: Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory. There are many many BOINC settings possible and you've not described any of yours. When you set BOINC to run based on preference, you are telling it to only use CPU on the days and during the hours you've configured. If you've configured it to not be running at the current time or day of the week, it will suspend the currently active tasks. Any time a task is suspended, it will not make any progress. And there is a memory setting for whether or not tasks should remain "in memory" (virtual memory) while suspended. Doing so preserves the work done since the last checkpoint taken by the task. ...so major portions of what you are reporting may be exactly what you have configured BOINC to do. You have 4 hosts, three are Windows XP and one is Win Vista. Which one is having problems? Is it this one? There are many failed tasks there with access violations. Are you overclocking this machine? Other than more CPUs and a different CPU type, what is different about this machine compared with your others that have been running fine? Rosetta Moderator: Mod.Sense |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. I also use the Windows firewall, but the Vista SP1 version. Laptops sometimes have problems with overheating when running BOINC workunits set to use 100% of the CPU time, and I think I've read that minirosetta is likely to have problems when set to run at less than 100% of the CPU time. What is your setting of what fraction of the CPU time to use? I've installed the SpeedFan program on my machine to check for overheating, but don't have the file needed to show results with proper labels for my motherboard yet. The highest temperature it shows is 109F, though. http://www.almico.com/speedfan.php |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
To Scott_A_Howard, I notice that your 8 core machine only has 3GB. That's a bit small for 8 rosetta tasks. In your BOINC preferences what percent of memory are you allowing the machine to use when the machine is/isn't in use? You might try setting both to 100% on that machine and see if it makes any difference. |
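The arithmetic behind that concern can be sketched quickly. The function name and the 50% figure below are illustrative assumptions; the thread gives only the 3 GB total and the 8 cores, and does not state how much memory a minirosetta task actually needs.

```python
def memory_per_task_mb(total_gb, percent_allowed, tasks):
    """Memory available to each task, in MB, under a BOINC percent-of-RAM limit."""
    return total_gb * 1024 * percent_allowed / 100 / tasks


# 3 GB shared by 8 tasks with BOINC allowed to use 100% of memory.
at_full = memory_per_task_mb(3, 100, 8)
# Same machine if the preference were set to 50% (a hypothetical setting).
at_half = memory_per_task_mb(3, 50, 8)

print(at_full, at_half)
```

Even with the limit at 100%, the operating system and other programs take their share first, so the real figure per task would be lower than this upper bound.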
rembertw Send message Joined: 21 Apr 07 Posts: 14 Credit: 628,529 RAC: 0 |
Scott Howard: I seem to have the same problem. No special settings in Rosetta preferences, all kinds of computers under XP, and tasks running 100+ hours with 0% progress. |
Nothing But Idle Time Send message Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
resultid=224470749 Reason: Access Violation (0xc0000005) at address 0x00467846 read attempt to address 0x11B524C4 This task was running fine but after I suspended it, rebooted my system, and restarted the task it terminated almost immediately with access violation. Maybe restarts don't work very well or something is flakey with my hard drive or system. Having some troubles with access violations on Einstein tasks as well. But I've run memtest86 and prime95 and CHKDSK and none of them indicate any local computer problems. I'm just shaking my head in disgust. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. I've recently run Poem workunits on one core of my HP Compaq Presario PC, model SR5125CL, and minirosetta 1.54 workunits at the same time on the other core, without problems. I previously ran Malaria workunits on one core and earlier minirosetta workunits on the other without problems, back when I could still get Malaria workunits. My machine also has 2 GB. I haven't tried ABC workunits, so you might want to try running yours a while with no ABC workunits active. I also run WCG workunits (all active projects there except beta test), with Ralph workunits and Boincsimap workunits when I can get them. I used to run Cels workunits, back when that project was active. My CPU is an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ 1.90 GHz; what's yours? Also, you may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem. |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
This task was aborted after my preferred runtime + 4 hours. It was working on the 3rd model. stderr out: ... AdeB |
©2024 University of Washington
https://www.bakerlab.org