Message boards : Number crunching : All tasks failed : finish file present too long
Author | Message |
---|---|
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
Separate unrelated issue, anyone know what would cause this?: https://boinc.bakerlab.org/rosetta/result.php?resultid=1162348400 Peak working set size 1,821.06 MB Peak swap size 6,092.32 MB Could it be just a memory allocation failure? How much RAM/swap is this host allowed, and how many tasks does it run at once? |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Peak working set size 1,821.06 MB Host in question: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=4022775 It's set to use 90% of available memory when unused (which it was over the weekend when the error happened). It has 32GB installed. At the time of the error it had 8 tasks assigned to run on its 8 cores (this machine has its cache setting to 0). I'm winding down remaining tasks on it at the moment as this machine is only a weekend cruncher. |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
It has 32GB installed. At the time of the error it had 8 tasks assigned to run on its 8 cores (this machine has its cache setting to 0). And how is swap (file or partition; I haven't dealt with OS X) configured? Can it grow up to ~50GB? |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
It has 32GB installed. At the time of the error it had 8 tasks assigned to run on its 8 cores (this machine has its cache setting to 0). I have no idea how OSX handles swap, but there is over 250GB free on the (only) drive, so it's not lacking in hard drive space. Also 32GB RAM should be more than enough for 8 tasks with zero cache and no other programs running besides the core OS and BOINC. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
If you could post that with a copy of the Stderr output of a couple of those Tasks over at the BOINC forums, it would let them know there is still an issue & to check that the fix was actually included in the latest released version. And if so, that it needs further investigation. But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end. It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet. If all 24 threads finish WU's within a very short time of each other I still get the errors mentioned above even with the latest version of BOINC. So last night I started 24 tasks at once when I left work, and sure enough even with my 96GB RAM upgrade I still managed to get some of the problems we were discussing when they all mostly finished at once. https://boinc.bakerlab.org/rosetta/result.php?resultid=1162604713 https://boinc.bakerlab.org/rosetta/result.php?resultid=1162626369 https://boinc.bakerlab.org/rosetta/result.php?resultid=1162615133 Host in question: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3752305 As discussed this only happens when a large number of tasks finish simultaneously. The host is running an SSD. The new version of BOINC has made this less likely to happen, but it is still a thing. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
I had it happen again when my second set of 24 8 hour WU's completed. They were a little more spread out this time so only 1 error out of 24. https://boinc.bakerlab.org/rosetta/result.php?resultid=1163051734 |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The new version of BOINC has made this less likely to happen, but it is still a thing. To my thinking, the change being tested on Ralph to have all slots share the read-only database will resolve these issues, or at least avoid ever placing 5,000 things into the slot directory that need removing before the task can end. Rosetta Moderator: Mod.Sense |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
The new version of BOINC has made this less likely to happen, but it is still a thing. Well, while not fully loaded (50/50 split Ralph/Rosetta) I have 12 Ralph test units that are all due to complete at the same time, so that will be a decent mini-test. /edit I suspended some tasks to get them to line up better, and now I'll have 15 tasks end at once. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Perfect! Ralph picked the right machine to send those to. Rosetta Moderator: Mod.Sense |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Perfect! Ralph picked the right machine to send those to. Checkpointing made it pretty tough to get them lined up, this is the best I could do. The ones at the bottom started later. Hopefully they end soon, I don't want to re-activate my Rosetta threads until the Ralph ones are done, but at the same time, I don't really want to sit here at the office waiting for them to finish just so I can restart the rosetta app. Otherwise this machine will sit idle for most of the night until I come back in tomorrow AM and restart everything I idled for Ralph. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
All the tasks finished more or less at the same time and didn't throw up any errors. So that's good. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1679 Credit: 17,826,032 RAC: 22,941 |
</stderr_txt> <message> Process still present 5 min after writing finish file; aborting</message> ]]> Odd that this would still occur after 5min. I'd have thought that would be plenty of time for things to have been cleaned up even on a HDD, let alone an SSD. Have you ever benchmarked the SSD on that system? Years back I had disk issues on a Windows system because the motherboard chipset drivers hadn't installed themselves properly, in particular the disk controller driver. I'm not at all familiar with OSX; does it have an equivalent to Windows Device Manager, so you can check whether the storage controller driver is set up and running correctly? (That's in the case of a SATA SSD. For an NVMe SSD you need the OS or the drive manufacturer's driver to support it fully.) Grant Darwin NT |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Part of why I use OSX is because I don't need to deal with drivers for 90% of the hardware installed. 8-). Yes that can be limiting in some respects, and some hardware does still need them, but basic stuff (networking, SSDs, GPU's) does not need drivers in OSX as they're baked into the OS. GPU drivers are a sore spot due to Nvidia GPU's not working after OSX 10.13. Regardless, the BOINC drive is a 1TB SATA III Samsung 850 EVO SSD with the typical 500MB/sec read/writes that SATA III caps you on. It's 75% empty so space isn't the issue. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1679 Credit: 17,826,032 RAC: 22,941 |
Regardless, the BOINC drive is a 1TB SATA III Samsung 850 EVO SSD with the typical 500MB/sec read/writes that SATA III caps you on. It's 75% empty so space isn't the issue. I'm wondering if the new BOINC Manager actually got the fix, or the fixed value isn't the 5 minutes mentioned there. Even with 2 dozen Tasks all finishing at exactly the same time, I just can't see things taking more than a minute or 2 to sort themselves out (let alone 5) on that drive (the 120GB drive would struggle, but it should still be able to deal with that sort of load in well under 5min). Grant Darwin NT |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
Regardless, the BOINC drive is a 1TB SATA III Samsung 850 EVO SSD with the typical 500MB/sec read/writes that SATA III caps you on. It's 75% empty so space isn't the issue. I'm wondering if the new BOINC Manager actually got the fix, or the fixed value isn't the 5 minutes mentioned there. I would try the same on the Ralph project currently. 4.17/4.18 already has shared DB usage, so it would be interesting to see if the error can be reproduced there. EDIT: though it seems that was done and all was OK: All the tasks finished more or less at the same time and didn't throw up any errors. So that's good. Perhaps we just need to wait until 4.18 arrives on the main project. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
I only have a few Windows boxes and I can't recall it happening on them with the new version, but they are all lower core count machines. There is one Windows Xeon 24-core machine that I run, but it's at 33% CPU use for BOINC as it does other things so the mass start/finish doesn't happen there. Again this is a semi-rare occurrence. It would really only pop up when a high core count machine starts a project up fresh and (in my case) grabs 24 units at the same time, all starting at once. Running 24/7 they do slowly drift and after a few 8 hour cycles the work units start to get spread a bit and so the finish times separate and it becomes less of an issue. All my machines have Ralph on them as well, and so far <knock on wood> the mass start/finish issue seems to have gone away on 4.17. |
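
For anyone following the memory side-question earlier in the thread, here is a rough sketch of the arithmetic behind the "~50GB" swap figure and the 32GB RAM comparison. It only uses the numbers quoted in the posts. It assumes all 8 tasks hit their peak at the same moment, that the reported "peak swap size" is memory that would really need backing (it may just be address space), and that 1 GB = 1024 MB. None of that is confirmed by the project, so treat it as a back-of-envelope check rather than a diagnosis.

```python
# Figures quoted in the thread for the failed task and its host.
PEAK_WORKING_SET_MB = 1821.06   # "Peak working set size"
PEAK_SWAP_MB = 6092.32          # "Peak swap size"
TASKS = 8                       # 8 tasks on 8 cores
INSTALLED_RAM_GB = 32
BOINC_RAM_FRACTION = 0.90       # "use 90% of available memory when unused"


def mb_to_gb(mb: float) -> float:
    """Convert MB to GB, assuming 1 GB = 1024 MB (an assumption)."""
    return mb / 1024


# Worst case (assumed): every task peaks simultaneously at the quoted size.
total_working_set_gb = mb_to_gb(PEAK_WORKING_SET_MB * TASKS)
total_swap_gb = mb_to_gb(PEAK_SWAP_MB * TASKS)

# RAM that BOINC is allowed to use under the 90% preference.
boinc_ram_budget_gb = INSTALLED_RAM_GB * BOINC_RAM_FRACTION

print(f"Combined peak working set: {total_working_set_gb:.1f} GB")
print(f"Combined peak swap size:   {total_swap_gb:.1f} GB")
print(f"BOINC RAM budget:          {boinc_ram_budget_gb:.1f} GB")
print("Working set fits in budget:", total_working_set_gb < boinc_ram_budget_gb)
```

Under those assumptions the combined working set sits well inside the RAM budget, while the combined "swap" figure is what lands near the ~50GB asked about above.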
©2024 University of Washington
https://www.bakerlab.org