qa-float crash

Started by newdep, September 22, 2009, 03:17:55 AM

Lutz

#45
The version I uploaded to Ted contains all of Norman's modifications.

TedWalther

#46
Thanks Lutz.  I was just sort of excited by the way I saw the bug being chased down.  I thought it would be a good climax to the detective novel to see the solution via the diff.  I'll check it out in the 10.1.6 tarball.  Thank you!



Ted
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

newdep

#47
To make a long story even longer...

It might make sense, it might not. I'm sure I forgot most
parts of the story, but here is the ending. At least the
end of the ending, not the full ending, I mean. What's
the start of it anyway?


Last week I had to come out and tell everyone on the internet:
"I have a problem, I want NaNs on my newLISP prompt in OS/2."

I use the OS/2 GCC port and the kLIBC build and want NaNs.

I actually had 3 problems: (1) Why is there no default IEEE
FP return like NaN and Inf on GCC for OS/2? (2) Why does
newLISP need a double input to display the result of a
floating-point exception? (3) Why do I need to dig into my
long-gone C knowledge to get this working?

First... well, the simplest way is to point your finger.
I learned very quickly that this finger-pointing is a
nice start to a 2-week struggle. Actually, I got very
frustrated with the whole GCC port on OS/2 while trying
to get this little irritating thing working.

I'm using a P4 system here that runs OS/2. Fast enough
for coding and very nice for speed comparisons.

So why did newLISP not return the NaN or Inf? Could it
be GCC, could it be the P4? I had to rule things out.
First the P4. Early bugs from Intel? I tested for these.
No problems on my P4. That left me with GCC and the
precompiled libc. So I dug into the C code of GCC and
libc. Well, it was not what I hoped to find. Speaking
of programming, I have never seen so many exceptions
in a bunch of code as here. Tracing was the only option.

I came across the solution by pure luck, actually by
experimenting with floating point in C and newLISP.
From what I read, it seemed the original GCC on
GNU systems has this IEEE NaN/Inf return behaviour
by default and doesn't use SIGFPE for this.

As OS/2 isn't a pure GNU system but uses a GCC port to
compile GNU applications, the results and methods are
actually in a dark area; it's not well documented yet.
Also, I could not find a good description anywhere of
what a SIGFPE actually does in this OS/2 port, so I
assumed it all depends on the OS... period.

I could not find anywhere in the bug tracking system
of OS/2 GCC a problem with SIGFPE related to
triggering NaN results. So I had to simulate it as a
proof of concept.

The behaviour of SIGFPE on OS/2 GCC is:
- Set up the SIGFPE signal handler
- Trigger it
- Restore the stack, et voilà... your application has IEEE.

To get back from SIGFPE to operational mode inside your
application, the only safe way is the longjmp way.
(Thanks Ritchie, for the only decent C book.)
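
For readers who want to see "the longjmp way" in code, here is a minimal sketch. It is not the newLISP source: `fpe_recover`, `on_sigfpe`, `guarded_call`, `always_traps` and `twice` are hypothetical names, it assumes a POSIX system, and it uses `sigsetjmp`/`siglongjmp` (rather than the plain `setjmp`/`longjmp` described above) so that the signal mask is restored after each recovery. The trap is faked with `raise(SIGFPE)`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for sigsetjmp/siglongjmp */
#endif
#include <math.h>
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf fpe_recover;    /* hypothetical recovery point */

/* SIGFPE handler: never returns normally, jumps back to the recovery point. */
static void on_sigfpe(int sig)
{
    (void)sig;
    siglongjmp(fpe_recover, 1);
}

/* Calls op(x); if SIGFPE fires during the call, the result is NaN instead. */
double guarded_call(double (*op)(double), double x)
{
    void (*old_handler)(int) = signal(SIGFPE, on_sigfpe);
    volatile double result;

    if (sigsetjmp(fpe_recover, 1) == 0) {
        result = op(x);           /* normal path */
    } else {
        result = NAN;             /* reached via siglongjmp from the handler */
    }
    signal(SIGFPE, old_handler);  /* put the previous handler back */
    return result;
}

/* Hypothetical helpers: one fakes the trap, one is an ordinary function. */
double always_traps(double x) { (void)x; raise(SIGFPE); return 0.0; }
double twice(double x) { return x * 2.0; }
```

Calling `guarded_call(always_traps, 1.0)` yields NaN, while `guarded_call(twice, 2.0)` goes through untouched. Jumping out of the handler is the point: returning normally from a handler for a hardware-generated SIGFPE is undefined behaviour in C.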

From my tests I could see they all worked out of the
box, but not so inside newlisp.

(2) Why did they work in my C-code and not in newlisp?

I now have a good idea of when the NaN is triggered,
but why doesn't it work in newLISP? newLISP uses some
very simple but effective stack returns to recover from
errors.

I adjusted the newLISP code over and over, and when you
look at code long enough you finally don't see the
solution anymore. Finally I was on the right track:
a sync problem with the stacks created with jmp_buf.

newLISP uses an initial stack at startup. SIGFPE, which
can occur anytime and is only triggered when the
exception happens, does a longjmp to that stack to
recover from the exception the FP caused and returns
after the longjmp to the first setjmp. (Officially there
is no return from longjmp, but I just call it that.)
Now these stacks were not in sync. I did the longjmp
to the default stack all the time.

I was even at flag and coprocessor level at one point,
far too deep! And I did not want to go there either.

I had to figure out the sequence newLISP follows in
error handling using the stack created with jmp_buf.
There was a problem at initialization which I had
already found very early on, but I did not
realize I was so close to the solution at that time.

Also I still had to find a way to trigger that SIGFPE,
which wasn't that easy because newLISP has tight
error checking. One wrongly placed fake sqrt(-1)
SIGFPE trigger inside the newLISP code and newLISP
would quit on me. I chose to fake the SIGFPE; there
was no other working way.

As SIGFPE does a longjmp and is initialized when the
exception occurs, it will always longjmp to the stack
newLISP started with, because the problem only happens
the first time the SIGFPE happens. This stack does
not contain any SIGFPE signals/results/flags, so when
the SIGFPE happened the return to the newLISP console was
empty, which is correct because newLISP did just that:
restore the original stack. So I had to make sure that
the SIGFPE restored the correct stack when the exception
happened, but now including its own handler in the stack.
And that was the solution.
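
The "stacks not in sync" idea can be sketched roughly as follows. This is an illustration under assumptions, not the actual newlisp.c change: `active_env`, `eval_guarded`, `trapping_expr` and `quiet_expr` are hypothetical names, POSIX `sigsetjmp`/`siglongjmp` stand in for the plain `setjmp`/`longjmp` used in the post, and the trap is faked with `raise(SIGFPE)`. The handler jumps to whichever recovery point is currently active instead of always jumping to the one recorded at startup.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for sigsetjmp/siglongjmp */
#endif
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>

/* Points at the innermost live recovery point, or NULL if there is none. */
static sigjmp_buf *active_env = NULL;

static void on_sigfpe(int sig)
{
    (void)sig;
    if (active_env == NULL)
        abort();                  /* nowhere safe to jump to */
    siglongjmp(*active_env, 1);   /* jump to the *current* recovery point */
}

/* Evaluates one "expression" under its own recovery point.
   Returns 0 on success and 1 if SIGFPE was caught here. */
int eval_guarded(void (*expr)(void))
{
    sigjmp_buf local_env;
    sigjmp_buf *volatile saved = active_env;  /* outer recovery point */
    volatile int caught = 0;

    signal(SIGFPE, on_sigfpe);
    if (sigsetjmp(local_env, 1) == 0) {
        active_env = &local_env;  /* keep handler and stack in sync */
        expr();
    } else {
        caught = 1;               /* arrived here from the handler */
    }
    active_env = saved;           /* restore the outer recovery point */
    return caught;
}

/* Hypothetical expressions: one fakes the trap, one does nothing. */
void trapping_expr(void) { raise(SIGFPE); }
void quiet_expr(void) { }
```

With this shape, a trap inside `eval_guarded(trapping_expr)` lands back in the same call that was running, which then reports it, rather than in a recovery point set up before any evaluation started.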

So there are still open questions, like: what does the
stack actually contain at setjmp init? sigsetjmp
takes care of the signals and flags, but it
isn't used here. Is the behaviour of the SIGFPE trigger
indeed a default way of catching the IEEE NaNs?

The most irritating thing about this SIGFPE I actually
found while searching the internet. There was not one good
answer on the whole internet regarding SIGFPE.
(Yes, not even on the big-name sites either.)
And no, I did not want to use siginfo or sigaction.
They all just touched the simple parts of SIGFPE.

It's like reading a bad technical book: all the stuff
you actually want to know is always in the back, in
the appendix or reference chapter, just too short
to learn from.

So yes.. Simple.. I know.. I'm a Wuzzy that can't
program.. a nothing, a rookie, a lame coder..
..yes folks I know I know.. But I got it running :-)

So now it's time for a beer!

-- (define? (Cornflakes))

newdep

#48
Here is a quotation from the EMX OS/2 documentation that the GCC/2 version leans on..

Except for the use of _control87(), which did not do anything in GCC for me.



By default, all floating point exceptions are masked: The coprocessor will perform a
default action (replace the result with a NaN, for instance) and continue without
generating SIGFPE. Use _control87() to enable floating point exceptions. However,
SIGFPE is not reliable
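
A small sketch of what the quoted default looks like from C, assuming IEEE 754 doubles and masked floating-point exceptions (the usual default on most platforms). `masked_defaults_hold` is a hypothetical name, and plain division is used to produce the invalid and divide-by-zero cases so that nothing beyond the `<math.h>` classification macros is needed.

```c
#include <math.h>

/* Returns 1 when the masked-exception defaults described above hold:
   an invalid operation yields NaN and a division by zero yields -Inf,
   and execution simply continues. If the exceptions were unmasked, the
   default SIGFPE action would end the process before this returns. */
int masked_defaults_hold(void)
{
    /* volatile keeps the compiler from folding the divisions away */
    volatile double zero = 0.0;
    volatile double minus_one = -1.0;

    double invalid = zero / zero;            /* invalid operation   */
    double div_by_zero = minus_one / zero;   /* division by zero    */

    return isnan(invalid) && isinf(div_by_zero) && div_by_zero < 0.0;
}
```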
-- (define? (Cornflakes))

newdep

#49
and I even found more insightful IBM info... finally some good documents... Perhaps this SIGFPE stack swapping will be changed... I'm rechecking _control87()
-- (define? (Cornflakes))

newdep

#50
Lutz,



GOOD NEWS!  



I've got it running (tested on 10.1.4), and it will work on a clean 10.1.6 too.

The SIGFPE handler can be removed. I'm unable to test it on a different PC at the moment, btw..





newLISP v.10.1.4 on OS/2 IPv4, execute 'newlisp -h' for more info.

(MAIN)-> (sqrt -1)
nan
(MAIN)-> (log 0)
-inf
(MAIN)-> (log -1)
nan
(MAIN)-> (atan -1)
-0.7853981634
(MAIN)-> (mod -1)
-1
(MAIN)-> (% -1 0)

ERR: division by zero in function %
(MAIN)-> (/ 0 0)

ERR: division by zero in function /
(MAIN)->




This is what I added to newlisp.c (near the top, inside main()).

I did test this section before; I don't know what I did differently, but now it's working..



/* Reset the FPU to its default state (floating-point exceptions masked). */
#ifdef OS2
/* _clear87(); */
 _fpreset();
/* _control87( PC_53, MCW_RC ); */
#endif


initLocale();
initNewlispDir();
IOchannel = stdin;

initialize();
initStacks();
initDefaultInAddr(); /* nl-sock.c */



Now my main question still stands: what's the cause of this behaviour?

(The 87? The OS/2 kernel? libc/GCC? It's in the dark again ;-)

Anyway, it works now over here without stack swapping..





PS: perhaps this should be the default in newLISP for all machines anyway?

As it might be a hardware issue..
-- (define? (Cornflakes))

Lutz

#51
The latest 10.1.6 is in the usual location for you, perhaps you can work the changes into it then send me a zip of newlisp.c  to:  lutz  nuevatec  com



Just search for all: OS2 in that file, I assume that all the old #ifdefs OS2 referring to SIGFPE can be removed?

newdep

#52
I have put the link to the fixed files in a PM to you..

I'll email them too ;-)
-- (define? (Cornflakes))