newLISP Fan Club

Forum => newLISP and the O.S. => Topic started by: newdep on September 22, 2009, 03:17:55 AM

Title: qa-float crash
Post by: newdep on September 22, 2009, 03:17:55 AM
Hi Lutz,



On OS/2 with newLISP 10.1.5, I'm getting a crash from this code inside qa-float:


(set 'result '())
(set 'u 1.0)
;(while (> u 0.0) (set 'u (mul u 0.5)) (push u result))
 (while (> u 0.0) (set 'u (mul u 0.5)) (println u))


Not sure whether it should crash or not; I don't have any other OS at the moment to compare it with.


..
..
..
1.822780505e-304
9.113902524e-305
4.556951262e-305
2.278475631e-305
1.139237816e-305
5.696189078e-306
2.848094539e-306
1.424047269e-306
7.120236347e-307
3.560118174e-307
1.780059087e-307
8.900295434e-308
4.450147717e-308
2.225073859e-308

Killed by SIGFPE
pid=0x0097 ppid=0x0040 tid=0x0001 slot=0x006e pri=0x0200 mc=0x0001
E:PROGNLNEWLISP-10.1.5NEWLISP.EXE
NEWLISP 0:00011eb9
cs:eip=005b:00021eb9      ss:esp=0053:0017faf0      ebp=0017fb28
 ds=0053      es=0053      fs=150b      gs=0000     efl=00002297
eax=005503a0 ebx=005503a0 ecx=00550a40 edx=3fe00000 edi=0017fb08 esi=00000003
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

Title:
Post by: Lutz on September 22, 2009, 04:20:37 AM
This is a rare underflow condition, which OS/2 handles by raising an exception. You could set up a signal handler for SIGFPE in the function setupAllSignals() around line 310 in newlisp.c, or you could set it up in newLISP itself in the startup code. All other OSs handle subnormals by returning the smallest possible FP value.



This has always been in OS/2 but with qa-float we are forcing it to show up for the first time.
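
The loop from the first post can be reproduced in plain C to see where the subnormal range starts. This is only a sketch, assuming IEEE 754 doubles with gradual underflow and the FP underflow exception masked (the default on most platforms, though evidently not in this OS/2 setup); the function name count_subnormal_halvings is made up for illustration.

```c
#include <float.h>

/* Halve 1.0 until it reaches zero and count how many of the
 * intermediate values are subnormal, i.e. positive but below DBL_MIN.
 * Assumes IEEE 754 doubles with gradual underflow. */
int count_subnormal_halvings(void)
{
    volatile double u = 1.0;   /* volatile: force each result to be stored as a double */
    int subnormals = 0;

    while (u > 0.0) {
        u = u * 0.5;
        if (u > 0.0 && u < DBL_MIN)
            subnormals++;
    }
    return subnormals;
}
```

On such a platform the function returns 52, one for each power of two from 2^-1023 down to 2^-1074. The last value printed before the crash above, 2.225073859e-308, is DBL_MIN, so the very next halving produces the first subnormal.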
Title:
Post by: newdep on September 22, 2009, 09:11:49 AM
Aha!





This is what I get now..after the sigfpe fix..



A far better error report ;-)


[E:prognlnewlisp-10.1.5].newlisp qa-float
SYS1808:
The process has stopped.  The software diagnostic
code (exception code) is  0097.




Here is the simple addon for the newlisp.c code at line 310 ->




#ifndef WIN_32

#if defined(SOLARIS) || defined(TRU64) || defined(AIX)
setupSignalHandler(SIGALRM, sigalrm_handler);
setupSignalHandler(SIGVTALRM, sigalrm_handler);
setupSignalHandler(SIGPROF, sigalrm_handler);
setupSignalHandler(SIGPIPE, sigpipe_handler);
setupSignalHandler(SIGCHLD, sigchld_handler);
#else
setupSignalHandler(SIGALRM, signal_handler);
setupSignalHandler(SIGVTALRM, signal_handler);
setupSignalHandler(SIGPROF, signal_handler);
setupSignalHandler(SIGPIPE, signal_handler);
setupSignalHandler(SIGCHLD, signal_handler);
#ifdef OS2
setupSignalHandler(SIGFPE, signal_handler);
#endif
#endif

#endif
}




Btw, I'll put that in for the (import ...) too; that one crashes far too often here ;-)
Title:
Post by: newdep on September 22, 2009, 10:46:42 AM
Mmm, actually something is fishy with the NaNs in the OS/2 code..
Even a simple (sqrt -1) takes ages and then crashes..

I'll have a closer look inside the code on NaNs; I did not check that, actually..

I'm not sure if it's the Pentium I'm working on, or the compiler, or the code ;-)
Title:
Post by: Lutz on September 22, 2009, 12:30:27 PM
I assume it is the FP library in OS/2, but check Windows or Linux on this machine, if you can. Both should be fine on qa-float, unless it's one of those (very old) Pentiums with problems in the FP processing units.
Title:
Post by: newdep on September 22, 2009, 02:26:19 PM
Now this is getting interesting.. It's good to read that a bunch of hardware coders and compiler writers don't even care about precision ;-) But that's a different story..

Searching the internet for a faulty P4, I did indeed run into stories from back in 1995. This P4 is from around 1998, so I expect it to have a bug anyway, because it's one from a low-cost Compaq mainstream line where the "Intel Inside" sticker is bigger than the machine itself..

Looking at the GCC compiler optimizations, I added -march=pentium4 together with -O2. This does indeed help a bit, but not yet fully to cover the DBL_MIN value (DBL_MIN  2.2250738585072014e-308), which is what's bothering me..

So this is what I did on the command line, a max and a min.. The only place where it crashes is the -308, not the max..




> (mul 2e308 2)
inf
> (mul 2e+308 2)
inf
> (div 2e+308 2)
inf
> (div 2e308 2)
inf
> (div 2e-308 2)
SYS1808:
The process has stopped.  The software diagnostic
code (exception code) is  009A.




I'm digging deeper...

PS: A small C program compiled with GCC also crashed.. So it's now time to dig into this gcc port..

PPS: And why is there a difference of 6 all the time between the result and the original??


> (div (mul 4195835e128 3145727e128 ) 3145727e128 )
4.195835e+134

Is 4195835e128 equal to 4.195835e+134? Six zeros more? Is this a rounding error?

> (div (mul 4195835e134 3145727e134 ) 3145727e134 )
4.195835e+140

..

> (div (mul 4195835e140 3145727e140) 3145727e140 )
4.195835e+146

> (div (mul 4195835e146 3145727e146) 3145727e146 )
4.195835e+152

> (div (mul 4195835e152 3145727e152) 3145727e152 )


SYS1808:
The process has stopped.  The software diagnostic
code (exception code) is  0098.




PPPS: I did some tests for the FDIV P4 bug, but that's not my P4.. So let's assume Compaq did sell a working "Intel Inside" for a second...
Title:
Post by: newdep on September 23, 2009, 04:00:33 AM
Ah yes, OK, the 6 zeros are obvious..
(not when you've been staring at it for a few hours already ;-)

I now have 2 possibilities.. Either it's gcc that has the issue, or the Klibc I'm working against...

...still digging...
Title:
Post by: newdep on September 24, 2009, 05:58:09 AM
I dug into the gcc (OS/2 port) and can't make any sense of it.. It's a spaghetti of #defines..

Anyway.. in a simple test in plain C, a (sqrt -1) and a (div 0 0) do trigger the SIGFPE. I'm unable to get those to return NaN..

The (div 1 0) returns inf
The (div 0 0) crashes (or, with the adjustment, traps and then exits)

Where in the newlisp code exactly do I need to make the adjustment to make it return "nan"? I tried several places.. no luck without a trap..

It would be fine for me to get a return of "nan" via the trap, but I would like to keep newlisp running from that point on... not sure if that's possible..



Norman
Title:
Post by: Lutz on September 24, 2009, 06:41:30 AM
Division by zero is caught by newLISP in file nl-math.c. For integers it is the arithmetikOp() function and for floats the floatOp() function.
Title:
Post by: newdep on September 24, 2009, 08:20:07 AM
OK, it's not that simple, and it seems that NaN is simply not defined correctly in the GCC of OS/2.

For now I use the generic error return of ERR_MATH, which is an official error from newlisp and does not alter any code, and MODULO still uses it too ->





> (div 0 0)



ERR: division by zero in function div





Now I still need to fix



(div 0)



(sqrt -1)



and some others..
Title:
Post by: Lutz on September 24, 2009, 08:40:07 AM
I wonder what the following C program produces in OS2:


#include <stdio.h>

#ifdef __BIG_ENDIAN__
#define __nan_bytes     { 0x7f, 0xf8, 0, 0, 0, 0, 0, 0 }
#endif

#ifdef __LITTLE_ENDIAN__
#define __nan_bytes     { 0, 0, 0, 0, 0, 0, 0xf8, 0x7f }
#endif

int main(int argc, char * argv[])
{
double dFloat;
char bytes[8] = __nan_bytes;

dFloat = *(double *)bytes;

printf("NaN = %lf\n", dFloat);
}


produces:


NaN = nan

on Mac OS X



The bit pattern used is derived from this:


> (unpack "bbbbbbbb" (pack "<lf" (sqrt -1)))
(0 0 0 0 0 0 248 255)
> (unpack "bbbbbbbb" (pack ">lf" (sqrt -1)))
(255 248 0 0 0 0 0 0)
>
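
A variant of the C test above that does not depend on the endianness macros being defined. It is a sketch only, assuming a 64-bit IEEE 754 double whose byte order matches that of a 64-bit integer; the helper names nan_from_bits and is_nan_value are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Build a quiet NaN from its IEEE 754 bit pattern without pointer casts.
 * Assumes a 64-bit IEEE 754 double with the same byte order as uint64_t. */
double nan_from_bits(void)
{
    const uint64_t bits = UINT64_C(0x7ff8000000000000);
    double d;

    memcpy(&d, &bits, sizeof d);   /* copy the raw bytes into the double */
    return d;
}

/* NaN is the only value that compares unequal to itself. */
int is_nan_value(double x)
{
    return x != x;
}
```

Printing nan_from_bits() with printf("%f\n", ...) would be the equivalent of the program above; whether the comparison itself traps on a platform with unmasked FP exceptions is a separate question that this sketch does not answer.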
Title:
Post by: newdep on September 24, 2009, 09:09:01 AM
;-) That's what I thought about too last night..
Good that you bring that up, actually.. I'll try that tonight...

Actually the raw (sqrt -1) in C crashes too, so I can't test anything with a (sqrt -1) in it...
But I'll test that Indian code..
Title:
Post by: newdep on September 24, 2009, 09:19:52 AM
I did adjust __xxx_ENDIAN__ to __xxx_ENDIAN;
this is what it returns..


#ifdef __BIG_ENDIAN
#define __nan_bytes     { 0x7f, 0xf8, 0, 0, 0, 0, 0, 0 }
#endif

#ifdef __LITTLE_ENDIAN
#define __nan_bytes     { 0, 0, 0, 0, 0, 0, 0xf8, 0x7f }
#endif

int main(int argc, char * argv[])
{
double dFloat;
char bytes[8] = __nan_bytes;

dFloat = *(double *)bytes;

printf("NaN = %lf\n", dFloat);
}
[E:PROGNLnewlisp-10.1.5]nan
NaN = nan




Using BIG_ENDIAN it returns -> nan

Using LITTLE_ENDIAN it returns -> NaN = 0.000000



Now because you previously mentioned the IEEE 754 it got me thinking...
Title:
Post by: newdep on September 24, 2009, 01:51:28 PM
Summary: "totally, utterly flabbergasted".. I can't find it..
Title:
Post by: newdep on September 26, 2009, 09:05:06 AM
Let's see if I can adjust nl-math.c using

http://www.gnu.org/s/libc/manual/html_node/Infinity-and-NaN.html

and then explicitly test with isfinite() before returning..
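
The kind of check described here could look roughly like the following. This is a hedged sketch, not the change that went into nl-math.c: the enum and the function name classify_result are hypothetical, and it only uses the C99 classification macros from math.h.

```c
#include <math.h>

/* Hypothetical result classes for a floating-point value. */
enum fp_kind { FP_KIND_FINITE, FP_KIND_NAN, FP_KIND_POS_INF, FP_KIND_NEG_INF };

/* Classify a result before handing it back to the caller.
 * isfinite() is false for both NaN and the infinities, so the
 * common case is decided with a single test. */
enum fp_kind classify_result(double x)
{
    if (isfinite(x))
        return FP_KIND_FINITE;
    if (isnan(x))
        return FP_KIND_NAN;
    return x > 0.0 ? FP_KIND_POS_INF : FP_KIND_NEG_INF;
}
```

Such a test only helps once a result exists; if the operation itself raises SIGFPE, as it does on this OS/2 setup, the check is never reached.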
Title:
Post by: newdep on September 26, 2009, 12:56:53 PM
Aah, fixed it!

Lutz, I sent you a PM on this... I'll post the solution in here once you've checked it..







Look Mam... no hands!





> (/ 0 0)



ERR: division by zero in function /

> (div 0)



ERR: division by zero in function div

> (div 0 0)

nan

> (log 0)

-inf

> (sqrt -1)

nan

> (div 0)

inf

>
Title:
Post by: Lutz on September 26, 2009, 01:24:22 PM
yes, seems to be solved. This will avoid the compiler warning:


#ifdef OS2
    case SIGFPE:
        errorProc(ERR_MATH);
        break;
#endif


so it runs qa-float (the one in 10.1.6 checking signed inf) well?



I will either make a development release for 10.1.6, or perhaps wait until the next release update and post just the affected files in the development directory.
Title:
Post by: newdep on September 26, 2009, 01:39:14 PM
It seems that the very first time a NaN or Inf occurs, it returns the "ERR: division by zero" message.. The next time you run the same function again, it returns the NaN or Inf..

So somewhere the ERR: is still in the way...
qa-float now stops at ERR:


* fresh startup *

> (sqrt -1)

ERR: division by zero in function sqrt

> (sqrt -1)
nan



* fresh startup *

> (div 0)

ERR: division by zero in function div
> (div 0)
inf
>

Title:
Post by: Lutz on September 26, 2009, 02:17:26 PM
you have this in line 343 in function setupAllSignals() ?


#ifdef OS2
setupSignalHandler(SIGFPE, signal_handler);
#endif
Title:
Post by: Lutz on September 26, 2009, 02:19:52 PM
... perhaps you just take "errorProc(...)" out and let it catch it doing nothing:


#ifdef OS2
    case SIGFPE:
        break;
#endif


in line 412
Title:
Post by: newdep on September 26, 2009, 02:31:11 PM
No, that doesn't work. I read somewhere that when actually catching the SIGFPE you need a longjmp or to create a function... I think that's what is now happening with the errorProc action..

Directly calling only PrintErrorMessage(...) doesn't work.. leaving it empty with a break causes the real SIGFPE again.. So I need to re-route the signal and clear the ERR before it displays the NaN...

* added *

What does the return(nilCell); do in errorProcAll? I think I need that in the SIGFPE.. Or a signal reset?
Title:
Post by: newdep on September 26, 2009, 02:56:10 PM
Just wanted to see what actually happens with the signals.

The first time after newlisp starts, on seeing a division by zero it reports the ERR: and I get a trap number 8 (which is SIGFPE).. The second time NO signal, but directly the "inf".. Is this just dumb luck? Or is there really something in between? The second time it's not from the SIGFPE, else I would have seen the signal message again... Mmmm





newLisp v 10.1.6 ........



> (div 0)

Signal = 8



ERR: division by zero in function div



> (div 0)

inf

>
Title:
Post by: newdep on September 26, 2009, 03:13:54 PM
I'll try a different signal handler tomorrow..
GNU writes this about SIGFPE..

Quote:

Macro: int SIGFPE



    The SIGFPE signal reports a fatal arithmetic error. Although the name is derived from "floating-point exception", this signal actually covers all arithmetic errors, including division by zero and overflow. If a program stores integer data in a location which is then used in a floating-point operation, this often causes an "invalid operation" exception, because the processor cannot recognize the data as a floating-point number. Actual floating-point exceptions are a complicated subject because there are many types of exceptions with subtly different meanings, and the SIGFPE signal doesn't distinguish between them. The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985 and ANSI/IEEE Std 854-1987) defines various floating-point exceptions and requires conforming computer systems to report their occurrences. However, this standard does not specify how the exceptions are reported, or what kinds of handling and control the operating system can offer to the programmer.



BSD systems provide the SIGFPE handler with an extra argument that distinguishes various causes of the exception. In order to access this argument, you must define the handler to accept two arguments, which means you must cast it to a one-argument function type in order to establish the handler. The GNU library does provide this extra argument, but the value is meaningful only on operating systems that provide the information (BSD systems and GNU systems).



FPE_INTOVF_TRAP

    Integer overflow (impossible in a C program unless you enable overflow trapping in a hardware-specific fashion).

FPE_INTDIV_TRAP

    Integer division by zero.

FPE_SUBRNG_TRAP

    Subscript-range (something that C programs never check for).

FPE_FLTOVF_TRAP

    Floating overflow trap.

FPE_FLTDIV_TRAP

    Floating/decimal division by zero.

FPE_FLTUND_TRAP

    Floating underflow trap. (Trapping on floating underflow is not normally enabled.)

FPE_DECOVF_TRAP

    Decimal overflow trap. (Only a few machines have decimal arithmetic and C never uses it.)
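
The quoted names are the BSD-style ones. On POSIX systems a handler installed with sigaction() and SA_SIGINFO receives a siginfo_t whose si_code field carries a similar distinction under slightly different names. A minimal sketch of mapping those codes to text follows; it assumes a POSIX signal.h (which the OS/2 port discussed here may not provide), and fpe_code_name is a made-up helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Map a SIGFPE si_code (as delivered to an SA_SIGINFO handler)
 * to a short description. The FPE_* constants are the POSIX names. */
const char *fpe_code_name(int si_code)
{
    switch (si_code) {
    case FPE_INTDIV: return "integer divide by zero";
    case FPE_INTOVF: return "integer overflow";
    case FPE_FLTDIV: return "floating-point divide by zero";
    case FPE_FLTOVF: return "floating-point overflow";
    case FPE_FLTUND: return "floating-point underflow";
    case FPE_FLTRES: return "floating-point inexact result";
    case FPE_FLTINV: return "invalid floating-point operation";
    case FPE_FLTSUB: return "subscript out of range";
    default:         return "unknown";
    }
}
```

In this thread the underflow case (FPE_FLTUND) would be the interesting one for the halving loop, and the invalid-operation case (FPE_FLTINV) for (sqrt -1) and (div 0 0).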
Title:
Post by: newdep on September 27, 2009, 12:25:57 PM
Hi Lutz,



This works out of the box on my OS/2 machine, no strange things at all.

In newlisp, the very first time the SIGFPE occurs I still get the ERR:..., and then from the second time on the nan or inf..

Perhaps you know where the "ERR:" mixup could be in newlisp? Because I can't find it.. ;-)






#include <stdio.h>
#include <float.h>
#include <signal.h>
#include <math.h>
#include <setjmp.h>

/* testing NaN and Inf return */


/* store stack */
jmp_buf errorJump;
int errorReg = 0;

void signal_handler(int sig)
{
switch(sig)
    {
    case SIGFPE:
        /* signal(SIGFPE,SIG_DFL); */
        printf("%s", "SIGFPE!\n");
        longjmp(errorJump,errorReg);
        break;
    default: return;
    }

}

int main ()
{

/* save stack */
setjmp(errorJump);

/* nan-inf go through sigfpe */
signal(SIGFPE, signal_handler);

double nfloat;

nfloat = (sqrt (-1));
printf("sqrt=%f\n", nfloat );

nfloat = (log (0));
printf("log=%f\n", nfloat );

nfloat /= 0;
printf("div=%f\n",  nfloat );

}




[E:PROGNLnewlisp-10.1.6]f
SIGFPE!
sqrt=nan
log=-inf
div=-inf






Here I removed the errorProc function and only used the longjmp. That returns nothing the first time and then the nan. That's also not like my example; it seems there is still some code messing around with the results inside newlisp? But I can't find it..
> (sqrt -1)
> (sqrt -1)
nan
>
Title:
Post by: newdep on September 28, 2009, 02:07:21 AM
I found the double entry that causes the problem..

It's inside the errorReg check...
Mmm, actually it's a setjmp/longjmp issue, where the int var is 0 or 1.. because of the number of jmps used now..




if((errorReg = setjmp(errorJump)) != 0)
    {
    printf("ErrorReg2=%d\n", errorReg);

    if(errorReg && (errorEvent != nilSymbol) )
        executeSymbol(errorEvent, NULL, NULL);
    else  exit(-1);

    goto AFTER_ERROR_ENTRY;
    }


At first I thought there might be a difference in gcc or OS/2 regarding the setjmp behaviour, so I tested with this -> http://www-personal.umich.edu/~williams/archive/computation/setjmp-fpmode.html

But that's identical on both my Linux and OS/2..



A closer look shows this flow inside newlisp ->

First setjmp = 0 (= errorReg) in the function above, then the (div 0) appears. The signal_handler for SIGFPE sees errorReg = 0 (initial). Because there is a longjmp, the next setjmp gets a 1 (from the longjmp). So the errorReg check does "goto AFTER_ERROR_ENTRY" with a new setjmp, but that one of course returns 1 (due to the last longjmp).. At this point the signal_handler and the jmp_buf are both 1 and the stack is the same.

OK.. I'm looking inside the code now for a fix, because these "saved stacks" need to be in sync ;-)
Title:
Post by: newdep on September 28, 2009, 04:47:09 AM
I could cheat by putting a signal trigger like sqrt(-1); inside the C code. But that's not what I would like to see, and I'm also not sure whether the stacks are in sync...
Title:
Post by: newdep on September 28, 2009, 05:11:58 AM
I don't currently see any workable way without cheating on the SIGFPE.. perhaps you have an extra clue?

This is how the SIGFPE adjustment to 10.1.6 now looks ->





this is inside setupallsignals ->
#ifdef OS2
setupSignalHandler(SIGFPE, signal_handler);
/* force a SIGFPE trigger when newlisp starts */
/* this is to activate the NaN Inf returns!   */
(sqrt (-1));
/**********************************************/


this is inside the signal_handler
#ifdef OS2
     /* SIGFPE must be forced for a NaN Inf */
/* the longjmp returns 1 to setjmp when set */
case SIGFPE:
longjmp(errorJump,errorReg);
break;
#endif




the output of qa-float is this ->
operation on NaN result in NaN                
-----------------------------------------------
                     (NaN? (mul 1 aNan)) => true
                     (NaN? (div 1 aNan)) => true
                     (NaN? (add 1 aNan)) => true
                     (NaN? (sub 1 aNan)) => true
                       (NaN? (sin aNan)) => true
                       (NaN? (cos aNan)) => true
                       (NaN? (tan aNan)) => true
                      (NaN? (atan aNan)) => true

comparison with NaN is always nil              
-----------------------------------------------
                        (not (< 1 aNan)) => true
                        (not (> 1 aNan)) => true
                       (not (>= 1 aNan)) => true
                       (not (<= 1 aNan)) => true
                     (not (= aNan aNan)) => true

NaN is not equal to itself                    
-----------------------------------------------
                     (not (= aNan aNan)) => true

integer operations assume NaN as 0            
-----------------------------------------------
                        (= (- 1 aNan) 1) => true
                        (= (+ 1 aNan) 1) => true
                        (= (* 1 aNan) 0) => true
         (not (catch (/ 1 aNan) 'error)) => true
                         (= (>> aNan) 0) => true
                         (= (<< aNan) 0) => true

integer operations assume inf as max-int      
-----------------------------------------------
      (= (* 1 aInf) 9223372036854775807) => true
      (= (- aInf 1) 9223372036854775806) => true
     (= (+ aInf 1) -9223372036854775808) => true

FP division by inf results in 0                
-----------------------------------------------
                        (= (/ 1 aInf) 0) => true
                      (= (div 1 aInf) 0) => true

inf specials                                  
-----------------------------------------------
                           (= aInf aInf) => true
                  (NaN? (sub aInf aInf)) => true

retain sign of -0.0                            
-----------------------------------------------
        (= (set 'tiny (div -1 aInf)) -0) => true
                      (= (sqrt tiny) -0) => true

inf is signed too                              
-----------------------------------------------
                  (= aNegInf (div -1 0)) => true
                  (!= aNegInf (div 1 0)) => true

mod with 0 divisor is NaN                      
-----------------------------------------------
                       (NaN? (mod 10 0)) => true

% with 0 divisor throws error                  
-----------------------------------------------
           (not (catch (% 10 0) 'error)) => true

support of subnormals: (0 4.940656458e-324) => (0 4.940656458e-324)
machine epsilon: 1.110223025e-16 => 1.110223025e-16

Title:
Post by: newdep on September 28, 2009, 05:58:06 AM
Forgot the extra setjmp; I added this too.. the extra errorReg = setjmp(errorJump); call.






if((errorReg = setjmp(errorJump)) != 0)
    {
    if(errorReg && (errorEvent != nilSymbol) )
        executeSymbol(errorEvent, NULL, NULL);
    else exit(-1);
    goto AFTER_ERROR_ENTRY;
    }

errorReg = setjmp(errorJump);
setupAllSignals();

Title:
Post by: Lutz on September 28, 2009, 06:48:34 AM
Quote:
#ifdef OS2
     /* SIGFPE must be forced for a NaN Inf */
   /* the longjmp returns 1 to setjmp when set */
   case SIGFPE:
      longjmp(errorJump,errorReg);
      break;
#endif


The setjmp() will only return 1 if errorReg was 1, but on program start and after reset it is set to 0, and I think 0 is what setjmp() returns when doing the longjmp(). If it made setjmp() return a 1, then we would see "Not enough memory" reported as the error, which is defined as 1.



Can you try this?


#ifdef OS2
   case SIGFPE:
      longjmp(errorJump,0);
      break;
#endif


I believe it also will work.
Title:
Post by: newdep on September 28, 2009, 07:16:18 AM
No, that results in the same "double" effect..
Also changing it to 1 is of no use; there is always a mismatch in the jmp_buf content.



How about sigsetjmp, siglongjmp and sigjmp_buf?
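
For reference, this is roughly what the sigsetjmp/siglongjmp variant looks like on a POSIX system. It is a sketch under assumptions: raise(SIGFPE) stands in for a trapping FP operation, the names on_sigfpe and survive_sigfpe are made up, and it says nothing about whether the OS/2 libc behaves the same way.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf fpe_env;

/* Handler: jump back to the point saved by sigsetjmp(). */
static void on_sigfpe(int sig)
{
    (void)sig;
    siglongjmp(fpe_env, 1);
}

/* Returns 1 if SIGFPE was delivered and control came back through
 * siglongjmp, 0 if no signal arrived, -1 if the handler could not
 * be installed. */
int survive_sigfpe(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, NULL) != 0)
        return -1;

    /* Second argument 1: save the signal mask, so SIGFPE is
     * unblocked again after jumping out of the handler. */
    if (sigsetjmp(fpe_env, 1) != 0)
        return 1;            /* arrived here from the handler */

    raise(SIGFPE);           /* stand-in for a trapping FP operation */
    return 0;                /* only reached if nothing was delivered */
}
```

Calling survive_sigfpe() twice in a row returns 1 both times, because the saved signal mask is restored on each jump. A real hardware trap may additionally leave FPU state behind that has to be reset before continuing; that part is not shown here.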



This is how I see the flow in newlisp now with the SIGFPE involved. Correct me if I'm wrong ;-) it only helps finding the itch...


main()
   |
errorReg = 0
setjmp(errorJump)
   |
setupAllsignals init (NOT SIGFPE, because its only triggered on exception)
   |
(sqrt -1)  (on the newlisp console)
   |
SIGFPE trigger with errorReg = 0 (from the first fresh init)
longjmp(errorJump,errorReg)  (initial stack with errorReg = 0)
   |
errorReg = 1 (is always 1 when returns from longjmp!)
setjmp(errorJump) != 0
(no return on console (sqrt -1) because errorReg is now 1 which is a NEW stack)
   |
errorReg = setjmp(errorJump)  (is now 1 because of longjmp)
setupAllSignals (no trigger for SIGFPE)
   |
(sqrt -1)
   |
SIGFPE trigger with errorReg = 1 (new errorReg value from previous setjmp)
  |
(return "nan") (because the jmp_buf stack is now in sync)

Title:
Post by: newdep on September 28, 2009, 07:46:05 AM
This is what I mean, except that in newlisp it stays on 1.. it seems...
Is there extra memory management done somewhere?







#include <stdio.h>
#include <float.h>
#include <signal.h>
#include <math.h>
#include <setjmp.h>

/* testing NaN and Inf return */


/* store stack */
jmp_buf errorJump;

int errorReg = 0;

void signal_handler(int sig)
{
/* init */
signal(SIGFPE, signal_handler);

switch(sig)
    {
    case SIGFPE:
        /* signal(SIGFPE,SIG_DFL); */
        printf("%s", "SIGFPE!\n");
        longjmp(errorJump,errorReg);
        break;
    default: return;
    }

}

int main ()
{
double nfloat;

printf("errorReg=%d\n", errorReg);

errorReg = setjmp(errorJump);
signal(SIGFPE, signal_handler);
printf("errorReg=%d\n", errorReg);

nfloat /= 0;
printf("div=%f\n",  nfloat );

errorReg = setjmp(errorJump);
printf("errorReg=%d\n", errorReg);

nfloat = (sqrt (-1));
printf("sqrt=%f\n", nfloat );

errorReg = setjmp(errorJump);
printf("errorReg=%d\n", errorReg);

nfloat = (log (0));
printf("log=%f\n", nfloat );

}





outputs ->


[E:PROGNLnewlisp-10.1.6]f
errorReg=0
errorReg=0
SIGFPE!
errorReg=1
div=inf
errorReg=0
sqrt=nan
errorReg=0
log=-inf
Title:
Post by: Lutz on September 28, 2009, 07:50:39 AM
this all doesn't make sense to me ;-)



The following



- longjmp() doesn't return anything; it is void longjmp()

- setjmp() returns whatever the second arg of longjmp() was

- errorReg is always set to the return value of setjmp()



I think what you are saying is that SIGFPE, when it occurs, needs a longjmp() to restore the stack environment saved previously with setjmp() in errorJump. That would then be the newlisp-reset-entry-point. So SIGFPE would always cause a reset and take newLISP to the command line.

Just send me your changed newlisp.c, perhaps then it makes more sense to me.

EDIT: didn't see your last post while writing this; now I understand how you think errorReg is set.



But here a totally different approach:

=======================



in setupAllSignals(void), after all the other signals, do:


#ifdef OS2
setupSignalHandler(SIGFPE, SIG_IGN);
#endif


this tells OS/2 to simply ignore this exception. It may not let you do this overwrite, but we can try. In this case you can remove all other OS/2 specific signal code.
Title:
Post by: newdep on September 28, 2009, 07:54:59 AM
Yes, that would fit the GNU systems' behaviour for SIGFPE.... let me try that...
Btw, I did try that previously, but with the SIGFPE still inside the signal_handler.. (aaaggg)

Let's see..
Title:
Post by: newdep on September 28, 2009, 07:59:29 AM
No.. it doesn't eat it...
The SIGFPE needs to be triggered to catch the NaNs and Infs, it seems..
Title:
Post by: newdep on September 28, 2009, 08:05:50 AM
Quote from: "Lutz"


The following



- longjmp() doesn't return anything; it is void longjmp()

- setjmp() returns whatever the second arg of longjmp() was

- errorReg is always set to the return value of setjmp()




I'd make a correction to the above, actually..

It's correct that a longjmp doesn't return, but! ->

setjmp always returns != 0 when there was a previous longjmp, not what longjmp had as its second argument.
Title:
Post by: Lutz on September 28, 2009, 08:31:42 AM
Nope! From the man page of longjmp():



"Returns 0 after saving the stack environment. If setjmp() returns as a result of a longjmp() call, it returns the value argument of longjmp(), or if the value argument of longjmp() is 0, setjmp() returns 1."



That means setjmp() does return the second arg of longjmp(), except when that arg was 0, then setjmp() returns 1.
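
The rule can be checked with a few lines of standard C. This is a small sketch written for illustration; the function name setjmp_return_for is made up, and assigning the result of setjmp() to a variable follows the style used in this thread rather than the strictest reading of the C standard.

```c
#include <setjmp.h>

static jmp_buf env;

/* Save the environment, jump back to it with the given value,
 * and report what setjmp() returned on that second return. */
int setjmp_return_for(int value)
{
    int r = setjmp(env);     /* 0 on the direct call */

    if (r == 0)
        longjmp(env, value); /* never returns; setjmp "returns" again */
    return r;                /* value passed to longjmp, or 1 if it was 0 */
}
```

With this, setjmp_return_for(29) yields 29, while setjmp_return_for(0) and setjmp_return_for(1) both yield 1, which is why a handler that passes errorReg = 0 to longjmp() shows up as error number 1.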



For that reason I wanted you to try longjmp(errorJump, 0) in the error handler earlier.



Also, when you do this:
Quote
if((errorReg = setjmp(errorJump)) != 0)
    {
    if(errorReg && (errorEvent != nilSymbol) )
        executeSymbol(errorEvent, NULL, NULL);
    else exit(-1);
    goto AFTER_ERROR_ENTRY;
    }

errorReg = setjmp(errorJump); <=== will suppress all error messages
setupAllSignals();


You effectively suppress all error messages, because the jump buffer is now set to that point, with no error treatment when an error occurred. It will then just drop into the command line without error messages.



That line has to go, we have to make OS/2 FP work without it.
Title:
Post by: newdep on September 28, 2009, 08:52:25 AM
Aha.. That dusty UNIX programming book of mine will now go into the bin! (it's from 1987... uhum...)
I'll stick with the manpages ;-)

I suspected that indeed about the extra setjmp; I don't like it that way either...
Yes, let's stick with the code as it is now...

The odd thing that keeps me awake is my C example versus the newlisp code.

What I could do is extract the SIGFPE from the generic handler in newlisp? Perhaps that helps.. that's the only thing I did not do yet..
And newlisp has more longjmps and setjmps, and an explicit check on 0 or 1 on the setjmp..
Title:
Post by: Lutz on September 28, 2009, 09:41:18 AM
Quote: What I could do is extract the SIGFPE from the generic handler in newlisp?


Yes, this is done for other signals on Sun OS, Tru64 and IBM Aix. E.g:


#ifdef OS2
void specialOS2_handler(int s)
{
/* Norman's OS/2 stuff */
}

setupSignalHandler(SIGFPE, specialOS2_handler);
#endif


Note that setupSignalHandler() is just setsig() with error checking.



Then there is also this:


setsig(SIGFPE, SIG_DFL)

It sets up some sort of OS-specific default handler.
Title:
Post by: newdep on September 28, 2009, 01:34:23 PM
... I stripped a long story here, and will make it short..

It seems I keep running into a mixup of setjmp/longjmp errorReg values.
(tested this by printing all the errorRegs inside newlisp, see previous posts)

From the SIGFPE handler's point of view there are 2 options:
* Use a longjmp, which works in my C-code example.
* Exit the application with a message; this works in both newlisp and the C-code example..

From the setjmp point of view:
* The very first call to setjmp results in a 0 return.
* All returns through longjmp after the first 'direct' setjmp call will be != 0.

From the longjmp point of view:
* Return to the stack state last saved by setjmp in env.
* Make sure the function that called setjmp isn't finished before jumping.

Conclusion as it now is in newlisp:
* The C-code example works on Linux and on OS/2, gcc compiled.
* NaN and Inf only happen when SIGFPE is set.
* E.g. (div 0) initially does not return anything; errorReg = 0.
* E.g. (div 0) only returns the second time, when it seems the longjmp errorReg = 1.

I tried it all... I give myself a SIGSEGV...
Title:
Post by: newdep on September 29, 2009, 12:33:21 AM
OK, fresh day, fresh SIGHUP..

When starting newlisp freshly and entering (sqrt -1), I get the SIGNAL 8 with errorReg = 0, and it returns to the code below, where only E5=1 is displayed.

I found that the longjmp inside the SIGFPE handler always returns to this point (which is the first setjmp call) ->


/* ======================= main entry on reset ====================== */
printf("E4=%d\n", errorReg);
errorReg = setjmp(errorJump);
printf("E5=%d\n", errorReg);
setupAllSignals();


So again, from this test, the very first time the SIGFPE is raised the jmp_buf content is not the same; only the second time are they in sync. That's why the result isn't displayed, at least that's what it looks like..

I have stripped down the signaling inside newlisp so it only does the C code from my example. There is no signaling left other than SIGFPE.

Because it never displays the E4=, it jumps directly from the longjmp to the first setjmp.

If I enter a (/ 0 0) right after newlisp has started, errorReg = 29 (integer division by zero). Then I enter the (sqrt -1), get the SIGNAL 8 with errorReg = 29, and it returns to the same code (see above) with E5=29. The next (sqrt -1) (no signal trigger, it's running already) then works.

..Duke Nukem would say... "Where is it!..."
Title:
Post by: newdep on September 30, 2009, 03:32:21 AM
This is how it will be in newlisp/2.

#ifdef OS2
    /*
       longjmp(errorJump,errorReg) and the signal handling of SIGFPE
       are unreliable, cause can't be traced. could be Libc063 or gcc or ..??
       (NaN? (..)) (Inf? (..)) don't work, no returns regarding nan inf...
       Therefore ERR: is returned with an exit. not very charming.
    */
    case SIGFPE:
        printErrorMessage(ERR_MATH, NULL, 0);
        exit(-1);
        break;
#endif
Title:
Post by: newdep on October 01, 2009, 06:16:18 AM
NaN and Inf are now returned by SIGFPE. Got it working. Finally.
Lutz, I have posted you the code.
Title:
Post by: TedWalther on October 01, 2009, 09:59:25 AM
Well, don't leave me hanging!  Can you paste the diff in here?
Title:
Post by: Lutz on October 01, 2009, 10:04:06 AM
I uploaded the current newlisp-10.1.6.tgz to your place.
Title:
Post by: newdep on October 01, 2009, 10:05:48 AM
What kind of diff would you like? Just the changes? Because I did it on the 10.1.6 release, which isn't released yet..



But are you also building, or do you only put it in a tree? Then you need the official diff from 10.1.5 to 10.1.6, which I don't have...
Title:
Post by: Lutz on October 01, 2009, 10:06:31 AM
The version I uploaded to Ted contains all of Norman's modifications.
Title:
Post by: TedWalther on October 01, 2009, 11:32:18 AM
Thanks Lutz.  I was just sort of excited by the way I saw the bug being chased down.  I thought it would be a good climax to the detective novel to see the solution viz the diff.  I'll check it out in the 10.1.6 tarball.  Thank you!



Ted
Title:
Post by: newdep on October 01, 2009, 12:21:43 PM
To make a long story even longer..



It might make sense, it might not.. I'm sure I forgot most parts of the

story, but here is the ending.. at least the end of the ending.. not

the full ending.. I mean.. what's the start of it anyway..





Last week I had to come out and tell everyone on the internet:
"I have a problem, I want NaNs on my newLISP prompt in OS/2."

I use the OS/2 GCC port and the Klibc build and want NaNs.

I actually had 3 problems: (1) Why is there no default IEEE FP
return like NaN and Inf with GCC for OS/2? (2) Why does
newLISP need a double input to display the result of an
FP exception? And (3) why do I need to dig into my
long-gone C knowledge to get this working..

First.. well, the simplest way is to point your finger.
I learned very quickly that this finger pointing is a
nice start to a 2 week struggle; actually I got very
frustrated with the whole gcc port in OS/2 trying to get this
little irritating thing working.

I'm using a P4 system here that runs OS/2.. Fast enough
for coding and very nice for speed comparisons.

So why did newLISP not return the NaN or Inf? Could it
be GCC, could it be the P4? So I had to rule things out.
First the P4.. Early bugs from Intel? I tested for these.
No problems on my P4. That left me with GCC and the
precompiled Libc. So I dug into the C code of gcc and
Libc. Well, it was not what I hoped to find; speaking
of programming, I have never seen so many exceptions
in a bunch of code as here. Tracking was the only option.

I came across the solution by pure luck, actually by
experimenting with floating point in C and newLISP.
From what I read, it seems the original gcc on
GNU systems has this IEEE NaN/Inf return behaviour
by default and doesn't use SIGFPE for this.

As OS/2 isn't a pure GNU system but uses a gcc to
compile GNU applications, the results and methods are
actually in a dark area; it's not 'yet' well documented.
Also, I could not find a good description anywhere of
what a SIGFPE actually does in this OS/2 port, so I
assumed it all depends on the OS... period.

I could not find anywhere in the bug tracking system
of OS/2 gcc a problem with SIGFPE related to
triggering NaN results.. So I had to simulate it as a
proof of concept.

The behaviour of SIGFPE on OS/2 gcc is:
- Set up the signal handler for SIGFPE
- Trigger it
- Restore the stack, et voila.. your application has IEEE behaviour.

To get back from SIGFPE to normal operation inside your
application, the only safe way is the longjmp way.
(Thanks, Ritchie, for the only decent C book.)

From my tests I could see they all worked out of the
box, but not so inside newlisp.

(2) Why did they work in my C code and not in newLISP?

I now have a good idea of when the NaN is triggered,
but why doesn't it work in newLISP? newLISP uses some
very simple but effective stack returns to recover from
errors.

I adjusted the newLISP code over and over, and when you
look at code long enough you finally don't see the
solution anymore. Finally I was on the right track:
a sync problem with the stacks saved in jmp_buf.

newLISP uses an initial stack at startup. The SIGFPE handler, which
can run at any time and is only triggered when the
exception happens, does a longjmp to that stack to
recover from the exception the FP operation caused, and returns
after the longjmp to the first setjmp. (Officially there
is no return from longjmp, but I just call it that.)
Now these stacks were not in sync. I did the longjmp
all the time to the default stack.
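
One way to picture "stacks in sync" is to refresh the jump target before every evaluation, so the handler returns to the current prompt instead of the startup context. The sketch below is an assumption about the general technique, not the change that went into newlisp.c: eval_point and count_recovered are invented names, raise(SIGFPE) stands in for the failing FP operation, and it uses sigsetjmp/siglongjmp (which newLISP does not) so the signal mask is restored as well.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf eval_point;   /* hypothetical: refreshed before each evaluation */

static void on_sigfpe(int sig)
{
    (void)sig;
    siglongjmp(eval_point, 1);  /* back to the most recent sigsetjmp() */
}

/* Runs `rounds` fake evaluations that each raise SIGFPE and returns
   how many were recovered, or -1 if the handler cannot be installed. */
int count_recovered(int rounds)
{
    volatile int recovered = 0;  /* volatile: must survive the jump */
    volatile int i;

    assert(rounds >= 0);

    for (i = 0; i < rounds; i++) {
        /* Re-install every round: some signal() implementations reset
           the handler to the default action after one delivery. */
        if (signal(SIGFPE, on_sigfpe) == SIG_ERR)
            return -1;

        /* Re-arm the jump target for this round; the second argument
           asks sigsetjmp() to save the signal mask too. */
        if (sigsetjmp(eval_point, 1) != 0) {
            recovered++;         /* arrived here from the handler */
            continue;
        }

        raise(SIGFPE);           /* stand-in for the failing FP operation */
    }
    return recovered;
}
```

With this layout count_recovered(3) returns 3: every fault comes back to the round that caused it, and the first fault is handled the same way as the later ones.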

I was even down at flag and coprocessor level at one point,
far too deep! And I did not want to go there either.

I had to figure out the sequence newLISP follows in
error handling, using the stack saved with jmp_buf.
There was a problem at initialization which I had already
found very quickly in the beginning, but I did not
realize I was so close to the solution at that time.

Also, I still had to find a way to trigger that SIGFPE,
which wasn't that easy because newLISP has tight
error checking. One wrongly placed fake sqrt(-1)
SIGFPE trigger inside the newLISP code and newLISP
would quit on me. I chose to fake the SIGFPE; there
was no other way that worked.

As the SIGFPE handler does a longjmp and is initialized when the
exception occurs, it will always longjmp to the stack
newLISP started with. The problem only happens
the first time the SIGFPE happens. This stack does
not contain any SIGFPE signals/results/flags, so when
the SIGFPE happened the return to the newLISP console was
empty, which is correct because newLISP did just that:
restore the original stack. So I had to make sure that
the SIGFPE handler restored the correct stack when the exception
happened, but now including its own handler in the stack.
And that was the solution.

So there are still open questions, like: what does the
stack actually contain at setjmp init? sigsetjmp
takes care of the signals and flags, but it
isn't used here. Is the behaviour of the SIGFPE trigger
indeed a default way of catching the IEEE NaNs?

The most irritating thing about this SIGFPE I actually found
while searching the internet. There was not one good
answer on the whole internet regarding SIGFPE.
(Yes, not even on the big-name sites either..)
And 'no', I did not want to use siginfo or sigaction.
They all just touched the simple parts of SIGFPE.

It's like reading a bad technical book: all the stuff
you actually want to know is always in the back, in
the appendix or reference chapter, and just too short
to learn from.

So yes.. Simple.. I know.. I'm a Wuzzy that can't
program.. a nothing, a rookie, a lame coder..
..yes folks, I know, I know.. But I got it running :-)

So now its time for a beer!

Title:
Post by: newdep on October 06, 2009, 08:03:37 AM
Here is a quotation from the EMX OS/2 documentation that the GCC/2 version leans on..

Except for the use of _control87(), which did not do anything in GCC for me.



By default, all floating point exceptions are masked: The coprocessor will perform a
default action (replace the result with a NaN, for instance) and continue without
generating SIGFPE. Use _control87() to enable floating point exceptions. However,
SIGFPE is not reliable
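
What "masked" means in practice can be checked with a few lines of standard C. This is a sketch under the assumption that the FPU is in its default state, as the quotation describes; classify_quotient is a made-up helper, and plain divisions are used in place of sqrt and log so no math library needs to be linked.

```c
#include <assert.h>
#include <math.h>

/* Classifies the default result of x / y: 'n' for NaN, 'i' for an
   infinity, 'f' for an ordinary finite value. */
char classify_quotient(double x, double y)
{
    volatile double q = x / y;   /* volatile: do the division at run time */

    if (isnan(q))
        return 'n';
    if (isinf(q))
        return 'i';
    return 'f';
}

/* Self-check: with exceptions masked, none of these calls raises SIGFPE. */
void check_default_results(void)
{
    assert(classify_quotient(0.0, 0.0) == 'n');   /* invalid operation */
    assert(classify_quotient(-1.0, 0.0) == 'i');  /* division by zero  */
    assert(classify_quotient(1.0, 2.0) == 'f');   /* nothing special   */
}
```

If check_default_results() dies with SIGFPE instead of returning, floating-point exceptions are unmasked on that system, which matches what the thread saw on OS/2 before the fix.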
Title:
Post by: newdep on October 06, 2009, 08:09:15 AM
And I even found more insightful IBM info... finally some good documents... Perhaps

this SIGFPE stack swapping will be changed... I'm rechecking _control87()
Title:
Post by: newdep on October 06, 2009, 08:59:49 AM
Lutz,



GOOD NEWS!  



I've got it running (tested on 10.1.4, but it will work on a clean 10.1.6 too).

The SIGFPE handler can be removed. I'm unable to test it on a different PC, by the way,

at the moment..





newLISP v.10.1.4 on OS/2 IPv4, execute 'newlisp -h' for more info.

(MAIN)-> (sqrt -1)
nan
(MAIN)-> (log 0)
-inf
(MAIN)-> (log -1)
nan
(MAIN)-> (atan -1)
-0.7853981634
(MAIN)-> (mod -1)
-1
(MAIN)-> (% -1 0)

ERR: division by zero in function %
(MAIN)-> (/ 0 0)

ERR: division by zero in function /
(MAIN)->
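
In the session above the float functions now return nan and -inf, while the integer operators % and / still report "division by zero". Integers have no NaN to fall back on, so the usual approach is to test the divisor before dividing. The following is a sketch of that idea only; the enum and checked_div are hypothetical names and not taken from newlisp.c.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical status codes, for illustration only. */
enum div_status { DIV_OK = 0, DIV_BY_ZERO, DIV_OVERFLOW };

/* Divides a by b without ever executing a faulting CPU instruction.
   On success the quotient is stored in *out; otherwise *out is untouched. */
enum div_status checked_div(long long a, long long b, long long *out)
{
    assert(out != NULL);

    if (b == 0)
        return DIV_BY_ZERO;      /* report instead of trapping */
    if (a == LLONG_MIN && b == -1)
        return DIV_OVERFLOW;     /* quotient does not fit in long long */

    *out = a / b;
    return DIV_OK;
}
```

A caller would turn DIV_BY_ZERO into an error message at the prompt, so no hardware exception and no SIGFPE is involved on that path.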




This is what I added to newlisp.c (at the top, inside main()).

I did test this section below before; I don't know what I did differently,

but now it's working..



/* Reset the FPU to its default state (all floating-point exceptions masked). */
#ifdef OS2
/* _clear87(); */
 _fpreset();
/* _control87( PC_53, MCW_RC ); */
#endif


initLocale();
initNewlispDir();
IOchannel = stdin;

initialize();
initStacks();
initDefaultInAddr(); /* nl-sock.c */



Now my main question still stands.. what's the cause of this behaviour..

(the 87? the OS/2 kernel? the Libc/gcc? It's in the dark again ;-)



Anyway, it works over here now without stack swapping..





PS: perhaps this should be the default anyway in newLISP for all machines?

As it might be a hardware issue..
Title:
Post by: Lutz on October 06, 2009, 09:39:45 AM
The latest 10.1.6 is in the usual location for you, perhaps you can work the changes into it then send me a zip of newlisp.c  to:  lutz  nuevatec  com



Just search for all: OS2 in that file, I assume that all the old #ifdefs OS2 referring to SIGFPE can be removed?
Title:
Post by: newdep on November 03, 2010, 02:15:03 PM
I have put the link to the fixed files in a PM to you..

I'll email them too ;-)