Posts by FalconFly

1) Message boards : Number crunching : Rosetta and Android (Message 77518)
Posted 28 Sep 2014 by Profile FalconFly
Post:
Generally I wouldn't discount the Android Client - not for its per-device performance potential, but because of the sheer number of devices.

But of course, if Rosetta could be optimized for the latest SIMD or even OpenCL, that would be a massive performance gain...
2) Message boards : Number crunching : Rosetta and Android (Message 77515)
Posted 27 Sep 2014 by Profile FalconFly
Post:
Alright, the conclusion after a day of testing is: the Android Client just doesn't work.

The 2nd workunit apparently completed without issues (nothing bad visible in the log file either) but ended up as a computing error as well - just like my wingman's.
3) Message boards : Number crunching : Rosetta and Android (Message 77514)
Posted 27 Sep 2014 by Profile FalconFly
Post:
Got the Android Client running overnight - and the results aren't good.

The Application basically starts off fine but apparently stops using the CPU within seconds.
After that, the Task sits at what looks like idle and eventually times out after wasting about 150% of the allotted target CPU time.

It basically doesn't process anything, just blocks a CPU core and fails.

I've just restarted a Task that was sitting idle for >9 hours on a 6-hour task - at least right now it seems to actually be processing data. Let's see for how long...
RAM usage seems very low, my Android Device has plenty of free RAM remaining while processing 2x SIMAP + 1x Rosetta.

With a bit of luck it was only the first Tasks that failed (maybe some project init problems on the device).
4) Message boards : Number crunching : Rosetta and Android (Message 77513)
Posted 26 Sep 2014 by Profile FalconFly
Post:
Rosetta has released an official app for Android (version 3.58)
https://boinc.bakerlab.org/rosetta/apps.php

Has anyone tried it?


I'll give it a shot for a few Tasks, will report back tomorrow :)
5) Message boards : Number crunching : Problems with Minirosetta 1.75 (Message 61787)
Posted 16 Jun 2009 by Profile FalconFly
Post:
Hi FalconFly -

I don't see the validation problems you're reporting over on RALPH. Could you join over there for a little while? Otherwise I cannot actually see the log files your machine is producing.

M


Alright, I can't get that Host out of its massive Validation problems and have now joined RALPH (I'm letting its Rosetta work deplete for now, since it's wasting >50% of its computing time).
Host there is this one.

The System is on 24/7 and, for now, is working RALPH full-time. As said, I've never had validation problems with it on any other project, so I'm clueless right now.

PS.
Would it help you guys more if I steered my Network completely from Rosetta to RALPH? You seem to need a hand there, as the Minirosetta application hasn't been quite "friendly" over the last year or so.
6) Message boards : Number crunching : Problems with Minirosetta 1.75 (Message 61773)
Posted 15 Jun 2009 by Profile FalconFly
Post:
Hi FalconFly -

I don't see the validation problems you're reporting over on RALPH. Could you join over there for a little while? Otherwise I cannot actually see the log files your machine is producing.

M


I'm presently not attached to RALPH@Home, but if the problems persist after my current attempt at a fix, I'll join there.

The stderr_txt looks completely normal to me, except for ending with the invalid status.

I've exchanged the RAM, set Vcore to Auto again and improved the CPU cooling a bit. Maybe that helps. I'll report back on that in about 24 hrs; with a bit of luck I've smashed that elusive cause of failure (right now I'm betting on defective RAM).
7) Message boards : Number crunching : Problems with Minirosetta 1.75 (Message 61766)
Posted 15 Jun 2009 by Profile FalconFly
Post:
There is nothing on your end to look for or correct here.


Okidok, no Problem :)
I just wasn't sure whether that was a MiniRosetta problem or some other quirk.
8) Message boards : Number crunching : Problems with Minirosetta 1.75 (Message 61761)
Posted 15 Jun 2009 by Profile FalconFly
Post:
WorkUnit mentioned is this one here.

Perhaps worth mentioning: this System is seeing some strange Validation problems which I'm still trying to pinpoint.

It has worked flawlessly so far for SETI, SIMAP, POEM and LHC, so seeing it suddenly fail to validate caught me off guard.

Hopefully a warning will soon be implemented in all validating projects that gives an adequate alert when a Host starts to fail validation, as I requested at Berkeley a year ago.
9) Message boards : Number crunching : Huge Upload data sizes / Upload Problems? (Message 58904)
Posted 18 Jan 2009 by Profile FalconFly
Post:
Apparently all solved overnight, all is back to normal here :)
10) Message boards : Number crunching : Huge Upload data sizes / Upload Problems? (Message 58890)
Posted 17 Jan 2009 by Profile FalconFly
Post:
That would be unusually large, FalconFly. The good news is that BOINC is able to make partial transfers and continue from where it left off if necessary.

Check the transfers tab and see if you've actually got any data moving, or if the upload server is really the current problem.

Both issues are certainly possible. If several others have these large result files, that's going to bog down the upload server. On the other hand, the Project Team will need to see what they are filled with in order to address the problem.


Alright, as of now I'm seeing a total of 230MB in the Upload Queue.
None of the transfers in my current upload queue (not even the very small ones) has been able to transfer more than ~3 kB before stalling.

Additionally, my BOINC 5.10.45 installations do not seem to download new WorkUnits at a normal rate; I'm hardly receiving any.
The Message tabs are full of lines like these:
Temporarily failed upload of _CAPRI17_T38_2_.sjf_br_docking.protocol__6221_13586_0_0: http error
Backing off 24 min 46 sec on upload of _CAPRI17_T38_2_.sjf_br_docking.protocol__6221_13586_0_0


Additionally, I'm getting a few of these:
Sending scheduler request: To fetch work. Requesting 19271 seconds of work, reporting 0 completed tasks
Scheduler request failed: HTTP gateway timeout



==========================
I have Climate Prediction working in parallel; its transfers work normally on all Systems, and the Internet connection is working fine as well.

--- edit ---

Just now I'm seeing a single large Result being transferred at good speed (21:08 UTC).
According to my Results table, however, my Network actually has been reporting results throughout the day.
But apparently reporting/downloading took longer than computing fresh ones, leading to my Network running dry and the upload queue filling up.
11) Message boards : Number crunching : Huge Upload data sizes / Upload Problems? (Message 58872)
Posted 17 Jan 2009 by Profile FalconFly
Post:
Hm, maybe I just never noticed it before, but after seeing that an Upload Transfer Queue had built up, I had a look at it.

While some Results (jump-neg-****) are only 30-50 kB, I also have a series of _CAPRI17_T38*** that clock in between 7,500 kB and over 10,000 kB *ugh*

Considering I can't seem to upload at the moment: is that a normal Upload data size for that type of WorkUnit, or is that an anomaly (and possibly the cause of my stuck Upload Queue)?

My ADSL Upload is up to 45 kB/s, but squeezing a Queue of (currently) ~166 MB through it really takes some time, even if it were working normally :P
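(As a rough estimate, assuming the full 45 kB/s were actually sustained: 166 MB is roughly 170,000 kB, and 170,000 kB / 45 kB/s ≈ 3,800 s - a bit over an hour of continuous uploading even in the best case.)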
12) Message boards : Number crunching : No Work Units (Message 58549)
Posted 6 Jan 2009 by Profile FalconFly
Post:
In 5 years of BOINC I have only witnessed four typical reasons for normally running Systems to run "dry" despite a good Internet connection and a working Project Scheduler server:

1) LTD (long-term debt) of work-starved projects exceeds ~2,000,000 s on a Host

After some discussion with the Berkeley Devs, I assume they still deny this inherent Client Scheduler malfunction, although they've repeatedly implemented dirty workarounds which somewhat slow the build-up rate of LTD. Looking at his Project participation, I deem this unlikely though, as all of his Projects usually have work available.

2) Deferred Communication

This mostly affected older 4.x Clients, as the deferred communication could reach ~7 days within a relatively short time. As he's using BOINC 6.x, I don't think this is an issue either.

3) Packet Loss/Broken Lines towards Project Server

Usually caused either by routing errors due to backbone failure or manual cuts among carriers (the latter occurred several times in Europe), a shaky Internet connection (pure packet loss), or a provider filtering/blocking/slowing the ports required for BOINC communication; basically poor QoS in favor of classic HTTP/FTP/SMTP customer traffic.

"Should" be unlikely but can be checked by running a continuous Ping and a TraceRoute from a console (see the small check script sketched at the end of this post).

As I'm seeing srv4.bakerlab.org (140.142.20.112) here, the Windows console commands would read
ping -t srv4.bakerlab.org
tracert srv4.bakerlab.org

4) Bad luck of the draw

After every outage of a larger or I/O-intensive Project (considering the file sizes, I consider Rosetta I/O-intensive), some Users get a comparatively quick restart, while others (even with a fleet of Hosts) have to wait longer.
Not sure how to explain this from the technical side, but it's just something I've witnessed over and over in several Projects. Call it "Murphy's Law" if you wish; apart from the hardcore Update/Reset Project methods (not desired or recommended), there's nothing a BOINC user can do about it except let backup Projects run and patiently wait.

5) Optional and considered unlikely during normal operations

Human error ;) or Host failure :(

For the first, I'd check whether BOINC is really set to have Network Connectivity and the Project is set to accept new work (restarting the System is sometimes also a suitable fix). In some cases a Scandisk after a few 'hard' shutdowns is also in order. As a last resort, a "reset Project" has sometimes helped a quirky BOINC setup back to life - after all, it's just Software that sometimes has the "ghost within". If a new Router, separate Firewall or new Network Device Drivers have been installed, that's frequently a good place to look.

Otherwise, Worms/Viruses can screw up the Windows Network settings while the rest of the System still apparently works normally. Deep scans with dedicated Tools against Spyware/Adware/Malware, Viruses/Worms/Trojans and Backdoors/Rootkits can reveal a hidden Problem once in a while.
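For point 3, a small throwaway script along the following lines can run the same checks from one place. This is only a rough sketch: the host name srv4.bakerlab.org and port 80 are simply the example values quoted above (not an official endpoint), so adjust them to whatever your own client log shows.

# connectivity_check.py - rough sketch of the checks from point 3; adjust HOST/port as needed.
import socket
import subprocess
import sys

HOST = "srv4.bakerlab.org"   # example host from the post above, not an official endpoint

def resolve(host):
    # A failure here points at DNS rather than at routing or the project server.
    try:
        return socket.gethostbyname(host)
    except socket.gaierror as exc:
        sys.exit("DNS lookup for %s failed: %s" % (host, exc))

def tcp_reachable(host, port=80, timeout=10):
    # Plain TCP connect to the HTTP port the BOINC client talks to.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("%s resolves to %s" % (HOST, resolve(HOST)))
    print("TCP port 80 reachable: %s" % tcp_reachable(HOST))
    # Packet-loss check via the OS ping (same idea as the ping/tracert commands above).
    count_flag = "-n" if sys.platform.startswith("win") else "-c"
    subprocess.run(["ping", count_flag, "10", HOST])

If DNS and the TCP connect succeed but the ping shows heavy packet loss, the problem most likely sits somewhere on the route rather than on the Project server itself.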
13) Message boards : Number crunching : NIS 2009 does not like some work units (Message 57758)
Posted 9 Dec 2008 by Profile FalconFly
Post:
On a side note, I read an article from the CCC about Firewalls a while ago.

Those guys were stunned that Norton Firewall created in excess of 20000 (twenty thousand!) entries in the Windows Registry after installation.

Bloatware stays bloatware IMHO, even if it works thanks to fast CPUs and Gigabytes of RAM.
14) Message boards : Number crunching : Active WorkUnits table stuck ? (Message 57757)
Posted 9 Dec 2008 by Profile FalconFly
Post:
It's been a while and the Active WorkUnits table is still stuck...

Can someone give it a kick to update please?
15) Message boards : Number crunching : can not upload files (Message 57388)
Posted 1 Dec 2008 by Profile FalconFly
Post:
Same here: I'm getting HTTP errors on all Uploads, and all transfers die off before transferring more than 2 kB.

Downloads are very difficult/near impossible; most of my Systems have run out of Work already.

Maybe the Server is being overrun with Uploads/Downloads after the outage...?
16) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57304)
Posted 28 Nov 2008 by Profile FalconFly
Post:
FalconFly, I noticed that you are crunching for LHC@home as well.
It might be that LHC@home is causing your crashes. I've had some crashes too this week. Next time it happens, check your boinc.log file; the last message there, before the SIGSEGV and the stack trace, is probably: [lhcathome] Scheduler request
A few weeks ago this was also mentioned by several people on the LHC@home message boards.

AdeB


Darn, it seems you could be right on the spot with that. Nice catch!
I haven't seen any anomalies for >24 hrs now, as the most recent batch of LHC WorkUnits has been processed.

Given the somewhat shaky state of LHC@Home, I'd say Rosetta is off the hook concerning my recent problems :)
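In case anyone wants to automate AdeB's suggestion, here is a rough sketch of a script that prints the messages immediately preceding each SIGSEGV entry in the log. The file name boinc.log is only an assumption taken from AdeB's post; point it at whatever log file your BOINC installation actually writes.

# scan_sigsegv.py - rough sketch: show the lines leading up to each SIGSEGV in the client log.
import sys
from collections import deque

LOG = sys.argv[1] if len(sys.argv) > 1 else "boinc.log"  # assumed name, taken from AdeB's post
CONTEXT = 5  # how many preceding log lines to show

recent = deque(maxlen=CONTEXT)
with open(LOG, errors="replace") as fh:
    for number, line in enumerate(fh, start=1):
        if "SIGSEGV" in line:
            print("--- SIGSEGV at line %d, preceding messages: ---" % number)
            for prev in recent:
                print(prev.rstrip())
            print(line.rstrip())
        recent.append(line)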
17) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57258)
Posted 26 Nov 2008 by Profile FalconFly
Post:
For the team to know what is going on, please post links to your affected work units in your next message.


This is going to be a tedious task, as most of the WorkUnits complete normally once the deadlock is resolved.
And after BOINC has crashed, I have no way of telling which WorkUnit may have caused it, since I'm looking at up to 8 WorkUnits per Host, which will all restart normally when re-launching BOINC.

For now I'm afraid I'm best off just resolving the deadlocks; I've had to do that ~8 times today already.

(the only real solution I'd see is to run BOINC in debug mode to get behind the BOINC crashes or the MiniRosetta Client failures, which I'm very hesitant to do on 24 active production Systems running 24/7 at full speed - sounds like loads of work :p )

Anyway, I haven't seen any such behaviour on my 32-bit Win32 Systems so far; only my Linux Systems seem randomly affected.

-- edit --

Oh, forgot :
How does Rosetta react to undervolted CPUs?

Most of my Systems run with a reduced Vcore that was tested stable with Prime95 (with a small safety buffer on top) and have 100% validation on other Projects (Einstein, MalariaControl, SETI, LHC).

I'm very careful before I blame anything on a Project Client when I'm not running hardware 100% to its specifications.
18) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57250)
Posted 26 Nov 2008 by Profile FalconFly
Post:
I'm seeing a significantly above-average number of failures, which result in the shutdown/crash of BOINC (MiniRosetta 1.40).

This happens across all my Linux Systems with no determinable pattern (64-bit BOINC V5.10.45) and naturally results in a loss of computing power (I need to restart BOINC or, for simplicity, the whole System).

Otherwise, I repeatedly see above-average numbers of WorkUnits stuck at a certain percentage, with their MiniRosetta Task either failed or using 0% CPU power, effectively blocking a CPU core each. This also requires a BOINC restart to kick the affected WorkUnits into action again.
19) Message boards : Number crunching : Active WorkUnits table stuck ? (Message 57203)
Posted 24 Nov 2008 by Profile FalconFly
Post:
The Active WorkUnits table on my Account Page says it is normally updated daily.

However, mine hasn't updated since 11 Nov?

Is that intentional (to reduce server load), or does the routine on the server need a small kick? ;)
20) Message boards : Number crunching : Discussion on increasing the default run time (Message 57012)
Posted 16 Nov 2008 by Profile FalconFly
Post:
I don't mind 6h default runtime, as that's what I'm using right now anyway.

I also wouldn't mind setting it higher, but:
Is it still correct that the Rosetta Client can enter a deadlock and will not abort the WorkUnit until 2x (or even 4x?) of the scheduled runtime has elapsed?

At least that's what I remember from reading the Q&A a long time ago.
I don't have any problem with the occasional Computing Error or stalled WorkUnit, but I would mind wasting 24 h (or even more) of runtime.

If that's all history already and not valid anymore, I'd happily switch to 24h runtime.

Just thought I'd ask, as I'm about to set Rosetta to full throttle in my network.

-- edit --
I'm also seeing h001b_BOINC_ABRELAX_RANGE_yebf failing with Compute Errors (on different Systems including other Hosts of the Quorum)... Losing 2-5h of work is one thing, losing 12-23h would be more disappointing.

Right now (pending any "max time exceeded" related problems), that would be my only concern about increasing the runtime significantly beyond what I have right now.

(would be cool if the correct/complete predictions of a failed WorkUnit made before the error occurred could be credited and counted - that way a model-induced compute error wouldn't really matter anymore, regardless of runtime)

