All tasks failed : finish file present too long

Message boards : Number crunching : All tasks failed : finish file present too long

Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 95427 - Posted: 27 Apr 2020, 16:00:10 UTC - in response to Message 95421.  

Separate unrelated issue, anyone know what would cause this?: https://boinc.bakerlab.org/rosetta/result.php?resultid=1162348400

Seems to be a one-off. The machine in question is running OS X, with 0 cache, dedicated, 24/7 operation. It's the first time I've seen this error, and looking around on the web for the error code provided a few answers that didn't seem to apply to me.


Peak working set size 1,821.06 MB
Peak swap size 6,092.32 MB

Could it just be a memory allocation failure? How much RAM/swap is this host allowed, and how many tasks does it run at once?
ID: 95427 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95432 - Posted: 27 Apr 2020, 17:05:12 UTC - in response to Message 95427.  
Last modified: 27 Apr 2020, 17:05:59 UTC

Peak working set size 1,821.06 MB
Peak swap size 6,092.32 MB

Could it just be a memory allocation failure? How much RAM/swap is this host allowed, and how many tasks does it run at once?



Host in question: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=4022775

It's set to use 90% of available memory when unused (which it was over the weekend when the error happened). It has 32GB installed. At the time of the error it had 8 tasks assigned to run on its 8 cores (this machine has its cache set to 0).

I'm winding down remaining tasks on it at the moment as this machine is only a weekend cruncher.
ID: 95432 · Rating: 0
Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 95434 - Posted: 27 Apr 2020, 17:24:10 UTC - in response to Message 95432.  
Last modified: 27 Apr 2020, 17:24:28 UTC

It has 32GB installed. At the time of the error it had 8 tasks assigned to run on its 8 cores (this machine has its cache set to 0).

And how is swap configured (file or partition? I haven't dealt with OS X)? Can it grow up to ~50GB?
ID: 95434 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95441 - Posted: 27 Apr 2020, 19:17:33 UTC - in response to Message 95434.  

It has 32GB installed. At the time of the error it had 8 tasks assigned to run on its 8 cores (this machine has its cache set to 0).

And how is swap configured (file or partition? I haven't dealt with OS X)? Can it grow up to ~50GB?


I have no idea how OSX handles swap, but there is over 250GB free on the (only) drive, so it's not lacking in hard drive space. Also, 32GB of RAM should be more than enough for 8 tasks with zero cache and no other programs running besides the core OS and BOINC.
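
As a rough sanity check on those numbers, here is a minimal Python sketch that multiplies the per-task peaks quoted earlier in the thread by the 8 concurrent tasks. It assumes every task reaches its peak at the same moment and that the reported "peak swap size" is space the OS must actually back, which may overstate the real demand. The 1 GB = 1024 MB conversion is also an assumption.

```python
# Figures quoted from the task report earlier in the thread (per task, in MB).
PEAK_WORKING_SET_MB = 1821.06
PEAK_SWAP_MB = 6092.32

TASKS = 8                   # one task per core on this host
INSTALLED_RAM_GB = 32
BOINC_RAM_FRACTION = 0.90   # "use 90% of available memory when unused"
FREE_DISK_GB = 250          # "over 250GB free" on the only drive


def worst_case_gb(per_task_mb: float, tasks: int) -> float:
    """Total in GB if every task hits its peak at once (assumes 1 GB = 1024 MB)."""
    return per_task_mb * tasks / 1024


ram_needed = worst_case_gb(PEAK_WORKING_SET_MB, TASKS)
swap_needed = worst_case_gb(PEAK_SWAP_MB, TASKS)
ram_allowed = INSTALLED_RAM_GB * BOINC_RAM_FRACTION

print(f"RAM needed  : {ram_needed:.1f} GB of {ram_allowed:.1f} GB allowed")
print(f"Swap needed : {swap_needed:.1f} GB of {FREE_DISK_GB} GB free disk")
```

Under those assumptions the host needs roughly 14.2 GB of RAM against the 28.8 GB BOINC may use, and about 47.6 GB of swap, which is in line with the ~50GB estimate above and well within the free disk space.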
ID: 95441 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95476 - Posted: 28 Apr 2020, 13:05:35 UTC - in response to Message 95375.  

But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end.
It's still a problem for high core-count machines. My 24-core Xeon machines have SSD storage and gigabit internet. If all 24 threads finish WUs within a very short time of each other, I still get the errors mentioned above even with the latest version of BOINC.
If you could post that with a copy of the Stderr output of a couple of those Tasks over at the BOINC forums, it would let them know there is still an issue & to check that the fix was actually included in the latest released version. And if so, that it needs further investigation.


So last night I started 24 tasks at once when I left work, and sure enough, even with my 96GB RAM upgrade, I still got some of the errors we were discussing when most of them finished at once.

https://boinc.bakerlab.org/rosetta/result.php?resultid=1162604713
https://boinc.bakerlab.org/rosetta/result.php?resultid=1162626369
https://boinc.bakerlab.org/rosetta/result.php?resultid=1162615133

Host in question: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3752305

As discussed, this only happens when a large number of tasks finish simultaneously. The host is running an SSD. The new version of BOINC has made this less likely to happen, but it is still a thing.
ID: 95476 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95484 - Posted: 28 Apr 2020, 15:40:25 UTC - in response to Message 95476.  

I had it happen again when my second set of 24 8-hour WUs completed. They were a little more spread out this time, so only 1 error out of 24.

https://boinc.bakerlab.org/rosetta/result.php?resultid=1163051734
ID: 95484 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95505 - Posted: 28 Apr 2020, 21:57:53 UTC - in response to Message 95476.  

The new version of BOINC has made this less likely to happen, but it is still a thing.


To my thinking, the change being tested on Ralph to have all slots share the read-only database will resolve these issues, or at least avoid ever placing 5,000 things into the slot directory that need removing before the task can end.
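
To get a feel for how long that kind of cleanup takes on a given drive, a small Python sketch along these lines could create 5,000 small files in a scratch directory and time their removal. Everything here is illustrative: the directory and file names are made up, the 1 KB file size is a guess, and this neither touches nor emulates a real BOINC slot directory.

```python
import shutil
import tempfile
import time
from pathlib import Path


def time_slot_cleanup(file_count: int = 5000, file_size: int = 1024) -> float:
    """Create `file_count` small files in a scratch directory, then time their removal."""
    # Hypothetical stand-in for a slot directory; created under the system temp dir.
    slot = Path(tempfile.mkdtemp(prefix="fake_slot_"))
    payload = b"\0" * file_size
    for i in range(file_count):
        (slot / f"db_file_{i:05d}.dat").write_bytes(payload)

    start = time.perf_counter()
    shutil.rmtree(slot)  # remove the directory and everything in it
    elapsed = time.perf_counter() - start
    assert not slot.exists()
    return elapsed


if __name__ == "__main__":
    seconds = time_slot_cleanup()
    print(f"Removed 5000 files in {seconds:.3f} s")
```

Running several copies of this at the same time would be a closer match to many tasks ending together, though it still would not reproduce whatever else the application does while shutting down.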
Rosetta Moderator: Mod.Sense
ID: 95505 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95508 - Posted: 28 Apr 2020, 22:32:07 UTC - in response to Message 95505.  
Last modified: 28 Apr 2020, 22:34:36 UTC

The new version of BOINC has made this less likely to happen, but it is still a thing.


To my thinking, the change being tested on Ralph to have all slots share the read-only database will resolve these issues, or at least avoid ever placing 5,000 things into the slot directory that need removing before the task can end.


Well, while not fully loaded (50/50 split ralph/rosetta) I have 12 Ralph test units that are all due to complete at the same time, so that will be a decent mini-test.

/edit I suspended some tasks to get them to line up better, and now I'll have 15 tasks end at once.
ID: 95508 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95509 - Posted: 28 Apr 2020, 22:35:19 UTC - in response to Message 95508.  

Perfect! Ralph picked the right machine to send those to.
Rosetta Moderator: Mod.Sense
ID: 95509 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95514 - Posted: 28 Apr 2020, 23:51:46 UTC - in response to Message 95509.  

Perfect! Ralph picked the right machine to send those to.


Checkpointing made it pretty tough to get them lined up; this is the best I could do. The ones at the bottom started later. Hopefully they end soon. I don't want to re-activate my Rosetta threads until the Ralph ones are done, but at the same time I don't really want to sit here at the office waiting for them to finish just so I can restart the Rosetta app. Otherwise this machine will sit idle for most of the night until I come back in tomorrow AM and restart everything I idled for Ralph.

ID: 95514 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95516 - Posted: 29 Apr 2020, 0:27:03 UTC - in response to Message 95509.  

All the tasks finished more or less at the same time and didn't throw up any errors. So that's good.
ID: 95516 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,745,395
RAC: 22,930
Message 95531 - Posted: 29 Apr 2020, 6:44:46 UTC

</stderr_txt>
<message>
Process still present 5 min after writing finish file; aborting</message>
]]>

Odd that this would still occur after 5 min.
I'd have thought that would be plenty of time for things to have been cleaned up even on an HDD, let alone an SSD.
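
If you want to check a batch of saved task reports for this failure, a short Python sketch like the following searches the text for the message quoted above and pulls out the number of minutes. The function name is hypothetical, and it assumes the wording stays exactly as shown here; other BOINC versions may phrase it differently.

```python
import re
from typing import Optional

# Wording copied from the task output quoted above; the minutes are captured
# in case a different client version reports a different timeout.
FINISH_FILE_ERROR = re.compile(
    r"Process still present (\d+) min after writing finish file; aborting"
)


def finish_file_timeout(task_output: str) -> Optional[int]:
    """Return the reported timeout in minutes, or None if the error is absent."""
    match = FINISH_FILE_ERROR.search(task_output)
    return int(match.group(1)) if match else None


sample = (
    "</stderr_txt>\n"
    "<message>\n"
    "Process still present 5 min after writing finish file; aborting</message>\n"
    "]]>"
)
print(finish_file_timeout(sample))            # 5
print(finish_file_timeout("task completed"))  # None
```

Capturing the number rather than hard-coding 5 makes it easy to see whether a given client is really using the 5-minute value.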


Have you ever benchmarked the SSD on that system?
Years back I had disk issues on a Windows system because the motherboard chipset drivers hadn't installed themselves properly, in particular the disk controller driver.

I'm not at all familiar with OSX; does it have an equivalent to Windows Device Manager, so you can check whether the storage controller driver is set up and running correctly (in the case of a SATA SSD; for an NVMe SSD you need the OS or the drive manufacturer's driver to support it fully)?
Grant
Darwin NT
ID: 95531 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95552 - Posted: 29 Apr 2020, 16:26:14 UTC - in response to Message 95531.  

Part of why I use OSX is that I don't need to deal with drivers for 90% of the hardware installed. 8-) Yes, that can be limiting in some respects, and some hardware does still need them, but basic stuff (networking, SSDs, GPUs) does not need drivers in OSX, as support is baked into the OS.

GPU drivers are a sore spot due to Nvidia GPUs not working after OSX 10.13.



Regardless, the Boinc drive is a 1TB SATA III Samsung 850 EVO SSD with the typical 500MB/sec read/writes that SATA III caps you on. It's 75% empty so space isn't the issue.
ID: 95552 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,745,395
RAC: 22,930
Message 95581 - Posted: 30 Apr 2020, 6:40:14 UTC - in response to Message 95552.  

Regardless, the Boinc drive is a 1TB SATA III Samsung 850 EVO SSD with the typical 500MB/sec read/writes that SATA III caps you on. It's 75% empty so space isn't the issue.
I'm wondering whether the new BOINC Manager actually got the fix, or whether the fixed value isn't the 5 minutes mentioned there.
Even with 2 dozen Tasks all finishing at exactly the same time, I just can't see things taking more than a minute or 2 to sort themselves out (let alone 5) on that drive (the 120GB drive would struggle, but it should still be able to deal with that sort of load in well under 5 min).
Grant
Darwin NT
ID: 95581 · Rating: 0
Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 95592 - Posted: 30 Apr 2020, 11:02:03 UTC - in response to Message 95581.  
Last modified: 30 Apr 2020, 11:03:49 UTC

Regardless, the Boinc drive is a 1TB SATA III Samsung 850 EVO SSD with the typical 500MB/sec read/writes that SATA III caps you on. It's 75% empty so space isn't the issue.
I'm wondering whether the new BOINC Manager actually got the fix, or whether the fixed value isn't the 5 minutes mentioned there.
Even with 2 dozen Tasks all finishing at exactly the same time, I just can't see things taking more than a minute or 2 to sort themselves out (let alone 5) on that drive (the 120GB drive would struggle, but it should still be able to deal with that sort of load in well under 5 min).

I would try the same on the Ralph project now.
4.17/4.18 already use the shared DB; it would be interesting to see whether the error can be reproduced there.

EDIT: though it seems this was already done and all was OK:
All the tasks finished more or less at the same time and didn't throw up any errors. So that's good.

Perhaps we just need to wait until 4.18 arrives on the main project.
ID: 95592 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95603 - Posted: 30 Apr 2020, 13:14:06 UTC
Last modified: 30 Apr 2020, 13:16:29 UTC

I only have a few Windows boxes and I can't recall it happening on them with the new version, but they are all lower core-count machines. There is one Windows 24-core Xeon machine that I run, but it's at 33% CPU use for BOINC as it does other things, so the mass start/finish doesn't happen there.

Again, this is a semi-rare occurrence. It would really only pop up when a high core-count machine starts a project up fresh and (in my case) grabs 24 units at the same time, all starting at once. Running 24/7 they do slowly drift, and after a few 8-hour cycles the work units start to spread out a bit, so the finish times separate and it becomes less of an issue.
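
One way to put a number on that drift is to count how many tasks finish within the same short window. The sketch below is only an illustration: the function name, the 60-second window and the sample finish times are all made up, and real finish times would have to come from your own logs.

```python
from bisect import bisect_right
from typing import Sequence


def max_simultaneous_finishes(finish_times: Sequence[float], window_s: float) -> int:
    """Largest number of finish times that fall within any span of `window_s` seconds."""
    times = sorted(finish_times)
    best = 0
    for i, start in enumerate(times):
        # Index just past the last finish time that is <= start + window_s.
        j = bisect_right(times, start + window_s)
        best = max(best, j - i)
    return best


# Hypothetical fresh start: 24 tasks finishing one second apart after 8 hours.
fresh = [8 * 3600 + i for i in range(24)]
# Hypothetical later cycle: the same 24 tasks drifted 10 minutes apart.
drifted = [8 * 3600 + 600 * i for i in range(24)]

print(max_simultaneous_finishes(fresh, 60))    # 24
print(max_simultaneous_finishes(drifted, 60))  # 1
```

A large result for a short window would flag the kind of mass finish described here, and a small one would suggest the tasks have spread out enough.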

All my machines have Ralph on them as well, and so far <knock on wood> the mass start/finish issue seems to have gone away on 4.17.
ID: 95603 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org