newLISP Fan Club

Forum => newLISP in the real world => Topic started by: kanen on May 02, 2010, 07:57:13 PM

Title: Fast find, between two numbers, in a list
Post by: kanen on May 02, 2010, 07:57:13 PM
I have a very large list, over 100,000 entries.



It's in the form: ("name" number1 number2)



I'd like to take a value and find the "name" that occurs between the two numbers.



For example;



("kanen" 100 200)



Such that, the number 150 would fall within the above range and return "kanen"



I've tried several different approaches, but non of them are fast enough.



Any ideas?



[edited to add helpful example]
Title: Re: Fast find, between two numbers, in a list
Post by: cormullion on May 03, 2010, 10:22:17 AM
Do you mean:


(("n1" 100 200)
 ("n2" 201 300)
 ("n3" 301 400)
)


and then looking for what matches, say, 360?



I remember asking a similar question (//http).



OK, it wasn't very similar... :)
Title: Re: Fast find, between two numbers, in a list
Post by: itistoday on May 03, 2010, 11:05:58 AM
One of newLISP's weaknesses is that its memory management model (ORO), which makes it difficult to implement advanced, efficient data-structures in the language itself. I would use a tree-like structure to solve this problem, but newLISP doesn't have one. What you can do is write one, or use an existing one, that's written in C, and then call that code from newLISP.



Use a BST on the first number (the lower bound), then when you find the node you're looking for, make it so that the result of that is another BST that does searching based on the second number (the upper bound), with the result being the string. I think that would be a quick way of getting what you want.
Title: Re: Fast find, between two numbers, in a list
Post by: Lutz on May 03, 2010, 12:41:25 PM
If data are as described by Cormullion and are sorted, you could use 'find' with a compare function, looking for the first entry where x is bigger than the first number and smaller than the second.



If this is to slow, you could make an array from the data list:
(set 'data-array (array (length data) 3 (flat data)))

Then use a binary search on the array:



http://en.wikipedia.org/wiki/Binary_search_algorithm



and if intervals are contiguous, you could throw out the second number.



If you tell us more about the nature of the data, perhaps that would lead to a solution in a whole different direction?
Title: Re: Fast find, between two numbers, in a list
Post by: kanen on May 03, 2010, 10:53:34 PM
Yes, the data looks like Cormullion suggested.



Here's an example;
(set 'loc '(
  ("41.0.0.0" "2097152" 687865856 689963008 "ZA")
  ("41.32.0.0" "1048576" 689963008 691011584 "EG")
  ("41.48.0.0" "524288" 691011584 691535872 "ZA")
  ("41.56.0.0" "65536" 691535872 691601408 "ZA")
  ("41.57.0.0" "65536" 691601408 691666944 "ZA")
  ("41.58.0.0" "65536" 691666944 691732480 "NG")
  ("41.59.0.0" "65536" 691732480 691798016 "TZ")
  ("41.64.0.0" "131072" 692060160 692191232 "EG")
  ("41.66.0.0" "16384" 692191232 692207616 "CI") ))


I am finding numbers between S and E and return R, as represented here: '(x x S E R)



So, 689963099 would return "EG"



How about a quick, example? My attempts are quite slow and painful.


Quote from: "Lutz"If data are as described by Cormullion and are sorted, you could use 'find' with a compare function, looking for the first entry where x is bigger than the first number and smaller than the second.



...
Title: Re: Fast find, between two numbers, in a list
Post by: Kazimir Majorinc on May 04, 2010, 04:13:04 AM
Something like this one?


(set 'loc '(
  ("41.0.0.0" "2097152" 687865856 689963008 "ZA")
  ("41.32.0.0" "1048576" 689963008 691011584 "EG")
  ("41.48.0.0" "524288" 691011584 691535872 "ZA")
  ("41.56.0.0" "65536" 691535872 691601408 "ZA")
  ("41.57.0.0" "65536" 691601408 691666944 "ZA")
  ("41.58.0.0" "65536" 691666944 691732480 "NG")
  ("41.59.0.0" "65536" 691732480 691798016 "TZ")
  ("41.64.0.0" "131072" 692060160 692191232 "EG")
  ("41.66.0.0" "16384" 692191232 692207616 "CI") ))

(set 'loca (array (length loc) loc))
(println loca)

(while true
   (setf found nil miny 0 maxy (- (length loca) 1))
   (set 'n (int (read-line)))
   (until found
      (setf t (div (add miny maxy) 2) rt (round t))
      (println "miny=" miny ", maxy=" maxy ", t=" t ", rt=" rt "-->" (loca rt))
      (if (and (<= ((loca rt) 2) n)
               (<= n ((loca rt) 3)))
          (begin (println "found!")
                 (set 'found true))
          (if (< n ((loca rt) 2))
              (setf maxy t)
              (setf miny t)))))
(exit)


(("41.0.0.0" "2097152" 687865856 689963008 "ZA") ("41.32.0.0" "1048576" 689963008

  691011584 "EG")

 ("41.48.0.0" "524288" 691011584 691535872 "ZA")

 ("41.56.0.0" "65536" 691535872 691601408 "ZA")

 ("41.57.0.0" "65536" 691601408 691666944 "ZA")

 ("41.58.0.0" "65536" 691666944 691732480 "NG")

 ("41.59.0.0" "65536" 691732480 691798016 "TZ")

 ("41.64.0.0" "131072" 692060160 692191232 "EG")

 ("41.66.0.0" "16384" 692191232 692207616 "CI"))

687865856

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=0, maxy=4, t=2, rt=2-->("41.48.0.0" "524288" 691011584 691535872 "ZA")

miny=0, maxy=2, t=1, rt=1-->("41.32.0.0" "1048576" 689963008 691011584 "EG")

miny=0, maxy=1, t=0.5, rt=0-->("41.0.0.0" "2097152" 687865856 689963008 "ZA")

found!

689963008

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=0, maxy=4, t=2, rt=2-->("41.48.0.0" "524288" 691011584 691535872 "ZA")

miny=0, maxy=2, t=1, rt=1-->("41.32.0.0" "1048576" 689963008 691011584 "EG")

found!

692191232

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=4, maxy=8, t=6, rt=6-->("41.59.0.0" "65536" 691732480 691798016 "TZ")

miny=6, maxy=8, t=7, rt=7-->("41.64.0.0" "131072" 692060160 692191232 "EG")

found!

692207616

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=4, maxy=8, t=6, rt=6-->("41.59.0.0" "65536" 691732480 691798016 "TZ")

miny=6, maxy=8, t=7, rt=7-->("41.64.0.0" "131072" 692060160 692191232 "EG")

miny=7, maxy=8, t=7.5, rt=8-->("41.66.0.0" "16384" 692191232 692207616 "CI")

found!

691601409

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

found!

692060169

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=4, maxy=8, t=6, rt=6-->("41.59.0.0" "65536" 691732480 691798016 "TZ")

miny=6, maxy=8, t=7, rt=7-->("41.64.0.0" "131072" 692060160 692191232 "EG")

found!
Title: Re: Fast find, between two numbers, in a list
Post by: Lutz on May 04, 2010, 05:16:00 AM
Fast, binary search! but be careful with rounding, you could get stuck in a loop. It is better to either use 'floor' or use integer arithmetic, which will floor to the next lower integer automatically.



Another hazard are holes in the data, e.g. between these entries:


("41.59.0.0" "65536" 691732480 691798016 "TZ")
("41.64.0.0" "131072" 692060160 692191232 "EG")


... a natural for IP geo databases, but could get the binary search stuck in a loop when a number falls into such a hole, e.g.: 69180123.



You could either fill these holes with "N/A" entries or make the algorithm recursive (see my earlier WikiPedia link), and trap recursive depth. After 20 - for a milion size table - use a: (throw "N/A"). This saves you the work of filling out the holes in your IP database :-). A well written function should be able to serve thousands of IP-to-location lookups per second.
Title: Re: Fast find, between two numbers, in a list
Post by: Kazimir Majorinc on May 05, 2010, 02:54:25 PM
When we're at that - bug discovered:



newLISP v.10.2.5 on Win32 IPv4, execute 'newlisp -h' for more info.

> (round 0.5)

0

> (round 1.5)

2

>
Title: Re: Fast find, between two numbers, in a list
Post by: Lutz on May 05, 2010, 07:27:24 PM
Not a bug, but  a different rounding scheme for floating point numbers:



Floating point numbers are nor real decimals. Rounding depends on the bit pattern used to represent the floating point number as a binary number. Depending on the whole bit pattern the decimal it represents, may be bigger or smaller, which then affects rounding. In newLISP, as in C, rounding policy is always round-to-nearest.



For a more detailed explanation see also here:



http://en.wikipedia.org/wiki/Rounding#Round_half_to_even
Title: Re: Fast find, between two numbers, in a list
Post by: itistoday on May 05, 2010, 07:29:25 PM
What's the function to round 0.5 to 1?
Title: Re: Fast find, between two numbers, in a list
Post by: Lutz on May 05, 2010, 07:34:23 PM
... that would be:



http://en.wikipedia.org/wiki/Rounding#Round_half_away_from_zero


(floor (add x 0.5)) ; for positive x
Title: Re: Fast find, between two numbers, in a list
Post by: itistoday on May 05, 2010, 07:53:49 PM
So I suppose this will always behave the way Kazimir was expecting (for any x)?


(sub (round (add 1 x)) 1)
Title: Re: Fast find, between two numbers, in a list
Post by: Lutz on May 05, 2010, 08:11:17 PM
It is hard to make an obvious rule for the behavior of 'round' without going into the binary representation of the numbers when using round-to-nearest. Consider this:


> (round 0.05 -1)
0.1
> (round 0.15 -1)
0.1
> (round 0.25 -1)
0.2
> (round 0.35 -1)
0.3
> (round 0.45 -1)
0.5
>
Title: Re: Fast find, between two numbers, in a list
Post by: itistoday on May 05, 2010, 09:46:49 PM
Interesting, thanks for pointing that out. For the ones place it appears consistent though, I think (haven't tested all possibilities, obviously).
Title: Re: Fast find, between two numbers, in a list
Post by: Kazimir Majorinc on May 06, 2010, 06:09:07 AM
How about this one?


(set '[println.supressed] true)
 (load "http://www.instprog.com/Instprog.default-library.lsp") ; only for println=
 (set '[println.supressed] (not true))
 (println "---")
 
(define (round0 x)
  (if (< (mod x 1) 0.5) (floor x) (ceil x)))
 
(define (round2 x (i 0))
  (if (< x 0)
      (sub (round2 (sub x) i))
      (div (round0 (mul x (pow 10 (sub i))))
           (pow 10 (sub i)))))

(println= (round2 0.5))
(println= (round2 1.5))

(dolist (x '(0.05 0.15 0.25 0.35 0.45 0.95))
   (println= x (round2 x -1)))
   
(println= (round2 -1.49) (round -1.49))
(println= (round2 -1.5) (round -1.5))

(println= (round2 3.1415926 -2.5)); whatever it is

(exit)


---

(round2 0.5)=1;

(round2 1.5)=2;

x=0.05; (round2 x -1)=0.1;

x=0.15; (round2 x -1)=0.2;

x=0.25; (round2 x -1)=0.3;

x=0.35; (round2 x -1)=0.4;

x=0.45; (round2 x -1)=0.5;

x=0.95; (round2 x -1)=1;

(round2 -1.49)=-1; (round -1.49)=-1;

(round2 -1.5)=-2; (round -1.5)=-2;

(round2 3.1415926 -2.5)=3.140141717;
Title: Re: Fast find, between two numbers, in a list
Post by: Lutz on May 06, 2010, 07:51:20 AM
... but be aware that this round-half-away-from-zero rounding mode should only be used in a limited type of financial applications, not for financial modeling or for general mathematics type of problems.



In fact for financial accounting floating point numbers shouldn't be used at all but only decimal mode calculations, e.g. using the BCD mode in the i386 family of processors. This is also why Java has a BigDecimal class.



Rounding with above method introduces a systematic bias when adding many rounded numbers in only one, the negative or positive number area.



This bias is not present in newLISP or 'C' and other languages using round-to-nearest type, unbiased rounding as the individual bias evens itself out including when only rounding either positive or negative numbers.



See the Wikipedia article: http://en.wikipedia.org/wiki/Rounding