Fast find, between two numbers, in a list

Started by kanen, May 02, 2010, 07:57:13 PM

Previous topic - Next topic

kanen

I have a very large list, over 100,000 entries.



It's in the form: ("name" number1 number2)



I'd like to take a value and find the "name" that occurs between the two numbers.



For example;



("kanen" 100 200)



Such that, the number 150 would fall within the above range and return "kanen"



I've tried several different approaches, but non of them are fast enough.



Any ideas?



[edited to add helpful example]
. Kanen Flowers http://kanen.me[/url] .

cormullion

#1
Do you mean:


(("n1" 100 200)
 ("n2" 201 300)
 ("n3" 301 400)
)


and then looking for what matches, say, 360?



I remember http://newlispfanclub.ryon.webfactional.com/forum/viewtopic.php?f=5&t=2825&start=0&hilit=apply&sid=b47420d98545767710f92bc0252bdb32">asking a similar question.



OK, it wasn't very similar... :)

itistoday

#2
One of newLISP's weaknesses is that its memory management model (ORO), which makes it difficult to implement advanced, efficient data-structures in the language itself. I would use a tree-like structure to solve this problem, but newLISP doesn't have one. What you can do is write one, or use an existing one, that's written in C, and then call that code from newLISP.



Use a BST on the first number (the lower bound), then when you find the node you're looking for, make it so that the result of that is another BST that does searching based on the second number (the upper bound), with the result being the string. I think that would be a quick way of getting what you want.
Get your Objective newLISP groove on.

Lutz

#3
If data are as described by Cormullion and are sorted, you could use 'find' with a compare function, looking for the first entry where x is bigger than the first number and smaller than the second.



If this is to slow, you could make an array from the data list:
(set 'data-array (array (length data) 3 (flat data)))

Then use a binary search on the array:



http://en.wikipedia.org/wiki/Binary_search_algorithm">http://en.wikipedia.org/wiki/Binary_search_algorithm



and if intervals are contiguous, you could throw out the second number.



If you tell us more about the nature of the data, perhaps that would lead to a solution in a whole different direction?

kanen

#4
Yes, the data looks like Cormullion suggested.



Here's an example;
(set 'loc '(
  ("41.0.0.0" "2097152" 687865856 689963008 "ZA")
  ("41.32.0.0" "1048576" 689963008 691011584 "EG")
  ("41.48.0.0" "524288" 691011584 691535872 "ZA")
  ("41.56.0.0" "65536" 691535872 691601408 "ZA")
  ("41.57.0.0" "65536" 691601408 691666944 "ZA")
  ("41.58.0.0" "65536" 691666944 691732480 "NG")
  ("41.59.0.0" "65536" 691732480 691798016 "TZ")
  ("41.64.0.0" "131072" 692060160 692191232 "EG")
  ("41.66.0.0" "16384" 692191232 692207616 "CI") ))


I am finding numbers between S and E and return R, as represented here: '(x x S E R)



So, 689963099 would return "EG"



How about a quick, example? My attempts are quite slow and painful.


Quote from: "Lutz"If data are as described by Cormullion and are sorted, you could use 'find' with a compare function, looking for the first entry where x is bigger than the first number and smaller than the second.



...
. Kanen Flowers http://kanen.me[/url] .

Kazimir Majorinc

#5
Something like this one?


(set 'loc '(
  ("41.0.0.0" "2097152" 687865856 689963008 "ZA")
  ("41.32.0.0" "1048576" 689963008 691011584 "EG")
  ("41.48.0.0" "524288" 691011584 691535872 "ZA")
  ("41.56.0.0" "65536" 691535872 691601408 "ZA")
  ("41.57.0.0" "65536" 691601408 691666944 "ZA")
  ("41.58.0.0" "65536" 691666944 691732480 "NG")
  ("41.59.0.0" "65536" 691732480 691798016 "TZ")
  ("41.64.0.0" "131072" 692060160 692191232 "EG")
  ("41.66.0.0" "16384" 692191232 692207616 "CI") ))

(set 'loca (array (length loc) loc))
(println loca)

(while true
   (setf found nil miny 0 maxy (- (length loca) 1))
   (set 'n (int (read-line)))
   (until found
      (setf t (div (add miny maxy) 2) rt (round t))
      (println "miny=" miny ", maxy=" maxy ", t=" t ", rt=" rt "-->" (loca rt))
      (if (and (<= ((loca rt) 2) n)
               (<= n ((loca rt) 3)))
          (begin (println "found!")
                 (set 'found true))
          (if (< n ((loca rt) 2))
              (setf maxy t)
              (setf miny t)))))
(exit)


(("41.0.0.0" "2097152" 687865856 689963008 "ZA") ("41.32.0.0" "1048576" 689963008

  691011584 "EG")

 ("41.48.0.0" "524288" 691011584 691535872 "ZA")

 ("41.56.0.0" "65536" 691535872 691601408 "ZA")

 ("41.57.0.0" "65536" 691601408 691666944 "ZA")

 ("41.58.0.0" "65536" 691666944 691732480 "NG")

 ("41.59.0.0" "65536" 691732480 691798016 "TZ")

 ("41.64.0.0" "131072" 692060160 692191232 "EG")

 ("41.66.0.0" "16384" 692191232 692207616 "CI"))

687865856

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=0, maxy=4, t=2, rt=2-->("41.48.0.0" "524288" 691011584 691535872 "ZA")

miny=0, maxy=2, t=1, rt=1-->("41.32.0.0" "1048576" 689963008 691011584 "EG")

miny=0, maxy=1, t=0.5, rt=0-->("41.0.0.0" "2097152" 687865856 689963008 "ZA")

found!

689963008

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=0, maxy=4, t=2, rt=2-->("41.48.0.0" "524288" 691011584 691535872 "ZA")

miny=0, maxy=2, t=1, rt=1-->("41.32.0.0" "1048576" 689963008 691011584 "EG")

found!

692191232

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=4, maxy=8, t=6, rt=6-->("41.59.0.0" "65536" 691732480 691798016 "TZ")

miny=6, maxy=8, t=7, rt=7-->("41.64.0.0" "131072" 692060160 692191232 "EG")

found!

692207616

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=4, maxy=8, t=6, rt=6-->("41.59.0.0" "65536" 691732480 691798016 "TZ")

miny=6, maxy=8, t=7, rt=7-->("41.64.0.0" "131072" 692060160 692191232 "EG")

miny=7, maxy=8, t=7.5, rt=8-->("41.66.0.0" "16384" 692191232 692207616 "CI")

found!

691601409

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

found!

692060169

miny=0, maxy=8, t=4, rt=4-->("41.57.0.0" "65536" 691601408 691666944 "ZA")

miny=4, maxy=8, t=6, rt=6-->("41.59.0.0" "65536" 691732480 691798016 "TZ")

miny=6, maxy=8, t=7, rt=7-->("41.64.0.0" "131072" 692060160 692191232 "EG")

found!
http://kazimirmajorinc.com/\">WWW site; http://kazimirmajorinc.blogspot.com\">blog.

Lutz

#6
Fast, binary search! but be careful with rounding, you could get stuck in a loop. It is better to either use 'floor' or use integer arithmetic, which will floor to the next lower integer automatically.



Another hazard are holes in the data, e.g. between these entries:


("41.59.0.0" "65536" 691732480 691798016 "TZ")
("41.64.0.0" "131072" 692060160 692191232 "EG")


... a natural for IP geo databases, but could get the binary search stuck in a loop when a number falls into such a hole, e.g.: 69180123.



You could either fill these holes with "N/A" entries or make the algorithm recursive (see my earlier WikiPedia link), and trap recursive depth. After 20 - for a milion size table - use a: (throw "N/A"). This saves you the work of filling out the holes in your IP database :-). A well written function should be able to serve thousands of IP-to-location lookups per second.

Kazimir Majorinc

#7
When we're at that - bug discovered:



newLISP v.10.2.5 on Win32 IPv4, execute 'newlisp -h' for more info.

> (round 0.5)

0

> (round 1.5)

2

>
http://kazimirmajorinc.com/\">WWW site; http://kazimirmajorinc.blogspot.com\">blog.

Lutz

#8
Not a bug, but  a different rounding scheme for floating point numbers:



Floating point numbers are nor real decimals. Rounding depends on the bit pattern used to represent the floating point number as a binary number. Depending on the whole bit pattern the decimal it represents, may be bigger or smaller, which then affects rounding. In newLISP, as in C, rounding policy is always round-to-nearest.



For a more detailed explanation see also here:



http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">http://en.wikipedia.org/wiki/Rounding#R ... lf_to_even">http://en.wikipedia.org/wiki/Rounding#Round_half_to_even

itistoday

#9
What's the function to round 0.5 to 1?
Get your Objective newLISP groove on.

Lutz

#10
... that would be:



http://en.wikipedia.org/wiki/Rounding#Round_half_away_from_zero">http://en.wikipedia.org/wiki/Rounding#R ... _from_zero">http://en.wikipedia.org/wiki/Rounding#Round_half_away_from_zero


(floor (add x 0.5)) ; for positive x

itistoday

#11
So I suppose this will always behave the way Kazimir was expecting (for any x)?


(sub (round (add 1 x)) 1)
Get your Objective newLISP groove on.

Lutz

#12
It is hard to make an obvious rule for the behavior of 'round' without going into the binary representation of the numbers when using round-to-nearest. Consider this:


> (round 0.05 -1)
0.1
> (round 0.15 -1)
0.1
> (round 0.25 -1)
0.2
> (round 0.35 -1)
0.3
> (round 0.45 -1)
0.5
>

itistoday

#13
Interesting, thanks for pointing that out. For the ones place it appears consistent though, I think (haven't tested all possibilities, obviously).
Get your Objective newLISP groove on.

Kazimir Majorinc

#14
How about this one?


(set '[println.supressed] true)
 (load "http://www.instprog.com/Instprog.default-library.lsp") ; only for println=
 (set '[println.supressed] (not true))
 (println "---")
 
(define (round0 x)
  (if (< (mod x 1) 0.5) (floor x) (ceil x)))
 
(define (round2 x (i 0))
  (if (< x 0)
      (sub (round2 (sub x) i))
      (div (round0 (mul x (pow 10 (sub i))))
           (pow 10 (sub i)))))

(println= (round2 0.5))
(println= (round2 1.5))

(dolist (x '(0.05 0.15 0.25 0.35 0.45 0.95))
   (println= x (round2 x -1)))
   
(println= (round2 -1.49) (round -1.49))
(println= (round2 -1.5) (round -1.5))

(println= (round2 3.1415926 -2.5)); whatever it is

(exit)


---

(round2 0.5)=1;

(round2 1.5)=2;

x=0.05; (round2 x -1)=0.1;

x=0.15; (round2 x -1)=0.2;

x=0.25; (round2 x -1)=0.3;

x=0.35; (round2 x -1)=0.4;

x=0.45; (round2 x -1)=0.5;

x=0.95; (round2 x -1)=1;

(round2 -1.49)=-1; (round -1.49)=-1;

(round2 -1.5)=-2; (round -1.5)=-2;

(round2 3.1415926 -2.5)=3.140141717;
http://kazimirmajorinc.com/\">WWW site; http://kazimirmajorinc.blogspot.com\">blog.