Duplicate files on Your disk.

Started by alex, November 21, 2005, 11:43:16 AM

Previous topic - Next topic

alex

I wrote function "dup-files" for searching duplicate file names in directory,

including subdirectories.

;; dup-files
;;
;; Searching duplicate file names in directory, including subdirectories.
;;
;; syntax: (dup-files dir)
;; syntax: (dup-files dir str-pattern)
;;
;; In the first form returns a list of ALL pairs (filename directory)
;; where filename has duplicated file names in directory "dir"
;;
;; In the second form returns a list of pairs (filename directory)
;; where filename satisfy regular expression "str-pattern" and
;; has duplicated file names in directory "dir"
;;
;; ATTETION! Works only with newlisp-8.7.2 and later
;;
;; Example1:
;;  (load "dup-files.lsp")
;;  (println (dup-files "c:"))
;;
;; Example2:
;;  (load "dup-files.lsp")
;;  (setq li (dup-files "c:" ".jpg$"))
;;  (println (format "%20s  %20s  %10s  %s" "Filename" "Modification time" "Size" "Directory"))
;;  (dolist (x li)
;;    (setq tv (select (file-info (append (nth 1 x) (nth 0 x))) 0 6))
;;    (println
;;      (format "%20s  %20s  %10d  %s"
;;          (nth 0 x) (date (nth 1 tv) 0 "%Y-%m-%d %H:%M:%S") (nth 0 tv) (nth 1 x) ) ) )
;;
(context 'dup-files)
(setq file-mask nil)
(setq list1 '())

(define (dup-files:dup-files d f-mask)
  (if f-mask (setq file-mask f-mask) (setq file-mask ".*"))
  (replace "\" d "/")
  (unless (ends-with d "/") (setq d (append d "/" )))
  (file-tree d)
  (let (li (map (lambda (x) (nth 0 x)) list1))
    (setq li (map (lambda (x y) (if (!= y 1) x 0)) list1 (count li li) ))
    (sort (replace 0 li)) ) )

(define (file-tree d)
  (dolist (f (replace "." (replace ".." (directory d))))
    (if (directory? (append d f))
      (file-tree (append d f "/"))
      (if (regex file-mask f 1) (push (list f d) list1)) ) ) )

(context 'MAIN)

Have anybody remarks for purposes of improve the function?



Excuse me for my bad English.

Dmi

#1
(define (dup-files d file-mask)
  (replace "\" d "/")
  (unless (ends-with d "/") (setq d (append d "/" )))
  (let (lo '() grp-list '())
    (dolist (l (sort (file-tree d file-mask)))
      (if (= (lo 0) (l 0))
        (push (l 1) lo 1 -1)
        (begin
          (push lo grp-list -1)
          (set 'lo (list (l 0) (list (l 1)))))))
    (push lo grp-list -1)
    grp-list))

;; file-tree not using global variables and returns list of results
(define (file-tree d file-mask)
  (unless file-mask (setq file-mask ".*"))
  (let (list1 '())
    (dolist (f (replace "." (replace ".." (directory d))))
      (if (directory? (append d f))
        (set 'list1 (append list1 (file-tree (append d f "/"))))
        (if (regex file-mask f 1) (push (list f d) list1))))
    list1))

;; return name and list of places for duplicated entries
;(filter (fn (x) (> (length (x 1)) 1)) (dup-files "/home/dmi"))
;
;; return name and count for duplicate entries
;(filter (fn (x) (> (x 1) 1))
;  (map (fn (x) (list (x 0) (length (x 1)))) (dup-files "/home/dmi")))

Some notes.

- 'file-tree' doesn't really need in global variables.

- 'file-tree' returns a list of results, so we can later include it in nested  functional sequence.

- 'file-tree' has no side effects. It's good because we will not care about them now.

- there is no really reason to have distinct 'f-mask' and 'file-mask'.

- I slightly change duping algorithm, so it's not using 8.7.2 features and also using only one instance of a 'file-tree' result.

- 'dup-files' returns more complex information. newlisp has a perfect engine for easy transforming resulting data to any form we want. Examples are at the bottom of code.



Next improvements? :-)
WBR, Dmi

newdep

#2
Nice program!



1 remark in general, dont use resursive directory search on a Unix machine

with 'directory? because it will seek into Symlinks and seek and seek and seek...

I did not test that with your tool (i dont have windows) ...



Regards, Norman.
-- (define? (Cornflakes))

Dmi

#3
Indeed... just realize how 'count' works here... (I haven't 8.7.2 yet).

Nice! Better than mine in overall! :-))
WBR, Dmi

Lutz

#4
Instead of: (count li li) you just do: (count (unique li) li) on pre 8.7.2 and this way also save the (replace 0 li).



Lutz

alex

#5
Thanks all!



For some reason I have on my disk d: now 172756 files

and 30494 of them has 129349 duplicate names.

So d: is very good space for testing function dup-files :)

But not all is very good...



Preliminary testing:



My code (adapted)

(context 'dup-files)
(setq file-mask nil)
(setq list1 '())

(define (dup-files:dup-files d f-mask)
  (if f-mask (setq file-mask f-mask) (setq file-mask ".*"))
  (replace "\" d "/")
  (unless (ends-with d "/") (setq d (append d "/" )))

(println "time1="(time
  (file-tree d)
))

  (let (li nil)
(println "time2="(time  (begin
    (setq li (map (lambda (x) (nth 0 x)) list1))
    (setq li (map (lambda (x y) (if (!= y 1) x 0)) list1 (count li li) ))
    (setq li (sort (replace 0 li)))
)))
    li
  )
)

(define (file-tree d)
  (dolist (f (replace "." (replace ".." (directory d))))
    (if (directory? (append d f))
      (file-tree (append d f "/"))
      (if (regex file-mask f 1) (push (list f d) list1)) ) ) )

(context 'MAIN)


DMI code (adapted)

(define (dup-files d file-mask)
  (replace "\" d "/")
  (unless (ends-with d "/") (setq d (append d "/" )))
  (let (lo '() grp-list '() li nil)

(println "time1="(time
    (setq li (file-tree d file-mask))
))

(println "time2=" (time (begin
    (dolist (l (sort li))
      (if (= (lo 0) (l 0))
        (push (l 1) lo 1 -1)
        (begin
          (push lo grp-list -1)
          (set 'lo (list (l 0) (list (l 1)) )))))
    (push lo grp-list -1)
)))
    grp-list
  )
)

;; file-tree not using global variables and returns list of results
(define (file-tree d file-mask)
  (unless file-mask (setq file-mask ".*"))
  (let (list1 '())
    (dolist (f (replace "." (replace ".." (directory d))))
      (if (directory? (append d f))
        (set 'list1 (append list1 (file-tree (append d f "/") file-mask)))
        (if (regex file-mask f 1) (push (list f d) list1))))
    list1))


Test program

;; You can switch tests
(load "variant1.lsp")   ;; my first variant
#(load "variant2.lsp")   ;; DMI variant
#(load "variant3.lsp")   ;; new variant

(setq li (dup-files "d:"))
(exit)


My variant output:

time1=14375
time2=12000


DMI variant output:

time1=146109
time2=9000


We can see, that one part of DMI algorithm - "file-tree not using

global variables" - is very slowly, but another part is more common and fast

and, moreover, can be improved (see below). So I decide to stop on

"middle" variant now:



(context 'dup-files)
(setq file-list '())
(setq file-mask nil)

(define (dup-files:dup-files dir file-mask)
  (unless file-mask (setq file-mask ".*"))
  (replace "\" dir "/")
  (unless (ends-with dir "/") (setq dir (append dir "/" )))

(println "time1="(time
  (file-tree dir)
))

  (let (result '() tv nil)

(println "time2=" (time
    (dolist (x (sort file-list))
      (if (= tv (x 0))
        (push (x 1) result 0 1 0)
        (begin
          (setq tv (x 0))
          (push (list tv (list (x 1)) ) result) ) ) )
))
    (setq file-list '())   ;; restore original values for the
    (setq file-mask nil)   ;; purpose of "reduce" side effects :)
    result) )

(define (file-tree d)
  (dolist (f (replace "." (replace ".." (directory d))))
    (if (directory? (append d f))
      (file-tree (append d f "/"))
      (if (regex file-mask f 1) (push (list f d) file-list)) ) ) )

(context 'MAIN)

"middle" variant output:

time1=14469
time2=1906

More remarks?



PS. Final variant will be later.

alex

#6
Linux users can replace file-tree by something near

(ATTENTION! I hav no Linux, and can't test the code,

but idea must be clear)

(define (file-tree d file-mask)
  (let (list1 '() tv nil tv1 nil)
    (if file-mask
      (setq tv (append " -iname " file-mask) )
      (setq tv "") )
    (setq tv (exec (append "find.exe " d " -type f " tv)))
    (dolist (f tv)
      (setq tv1 (parse f "/"))
      (push (list (last tv1) (join (chop tv1) "/")) list1))
    list1) )

Dmi

#7
Nice timings. Nice improvements!



Some notes:

1. (setq file-mask nil) is completely useless because it is dynamically redefined as a 'dup-files function argument. Subsequent 'file-tree call will either use internal value of 'dup-files - not global one!



2. (setq list1 '()) can better be placed directly into 'dup-files function:

(define (dup-files:dup-files dir file-mask . list1)

  (setq list1 '())

So, no side effects either and this will automatically destroy unneeded data after end of executing.

And then the final part you  marked ";; restore original values" can be omited.



3. About "UNIX" variant: I don't think that 'find' in such wrap will be much faster. Just more unreadable ;-) Your 'native' 'file-tree code works fine with UNIX too.

Note:

In pure *NIX I will do something like:

$ find /dir -type f -name pattern >file-list

$ awk -F/ '{print $NF}' file-list|sort|uniq -cd >dup-filenames

... and then will grep or awk "file-list" file against entries in "dup-filenames" file by many available ways.

Pure standard *nix shell - no newlisp/basic/java/C at all.

*nix methods are wonderful ;-)



4. IMHO, it will be cool to have separate 'file-tree function under the hand in the MAIN context. I'll include one in my "funlib.lsp" till community will have establish a common library standards.
WBR, Dmi

alex

#8
Problem about dup-files transforms to two functions: list-assoc and file-tree

(define (list-assoc li)
;; version 2005.11.28
;; Transform list of lists to association list
;; with unique key (more precise - see code and examles)
;;
;; syntax: (dup-files list-of-lists)
;;
  (let (result '() tv nil);; syntax: (dup-files dir)
    (dolist (x (sort li))
      (if (= tv (x 0))
        (dolist (y (rest x)) (push y result 0 1))
        (begin
          (setq tv (x 0))
          (push x result)
        )
      )
    )
    result
  )
)


(define (file-tree dir mask)
;; version 2005.11.28
;; Returns a list of pairs (filename directory)
;; where filename satisfy regular expression "str-pattern"
;;(more precise - see code and examles)
;;
;; syntax: (file-tree dir [str-pattern])
;; syntax: (file-tree dir-list [str-pattern])
;;
  (unless mask (setq mask ".*"))
  (unless (list? dir) (setq dir (list dir)))
  (letn
    (
      result '()
      file-tree-utility (lambda (d)
       (dolist (f (replace "." (replace ".." (directory d))))
         (if (directory? (append d f))
          (file-tree-utility (append d f "/"))
          (if (regex mask f 1) (push (list f d) result)) ) ) )
    )  
    (dolist (d dir)
      (replace "\" d "/")
      (unless (ends-with d "/") (setq d (append d "/" )))
      (file-tree-utility d)    
    )
    result
  )
)

list-assoc and file-tree enough common, fast and useful (IMHO)

Now I can easy get list of files with duplicate names

(setq dupfiles (filter (lambda (x) (> (length x) 2)) (list-assoc (file-tree (list "c:" "d:" "e:")))))

alex

#9
It is interesting, that test-code

(define (main)
(println "time1=" (time
  (setq li (file-tree "d:"))
))
(println "time2=" (time
  (setq li (list-assoc li))
))

(println "time3=" (time
  (setq li1 (list-assoc (file-tree "d:")))
))
)

(setq t (time (main)))
(println "full-time=" t)
(exit)

produce output:



time1=13421

time2=5391

time3=31577

full-time=50405



Why time3 is so big?

Where are time3 - time1 - time2 = 12.745 s ?

Dmi

#10
You said, you have more than 170000 files, so about 350000 string cells for file-tree's result plus some memory for copying lists, plus overhead for unfreed-unused memory - say, big lists can leave big memory holes after freeing (newlisp can reuse them later, but can't free under windows, afaik).



On my linux,

(define (q1) (dup '("qwerty") 350000))

(set 'qq1 (q1))

used about 53m of vram and 49m of resident ram.

so, after time1 and time2 are passed, I think, more than 70 - 100m of ram is used. Such size may impact the OS, the filesystem cache, the fragmentation of something or so...



Try to check time3 separately from time1&time2



Just a guess... ;-)
WBR, Dmi

alex

#11
To DMI:

 yes, indeed.



Thanks all!

alex

#12
More powerful version of file-tree

(define-macro (file-tree _dir _filter)
;; by alex
;; version 2005.12.02
;; Returns a list of pairs (filename directory)
;; where filename satisfy to filter-function
;;(more precise - see code and examles)
;;
;; syntax: (file-tree dir [filter-function])
;; syntax: (file-tree dir-list [filter-function])
;;
;; Examples:
;;
;;  (setq list1 (file-tree "." (lambda (x) (starts-with x "b" nil))))
;;
;;  (setq list2 (file-tree "." ))
;;
;;  (define (filt x) (ends-with x ".htm" nil))
;;  (setq list3 (file-tree (list "c:" "d:") filt))
;;
  (if (symbol? _filter) (setq _filter (eval _filter)))
  (letn
    ( _result '()
      _file-tree-utility
        (lambda (_d)
          (dolist (_f (replace "." (replace ".." (directory _d))))
            (if (directory? (append _d _f))
              (_file-tree-utility (append _d _f "/"))
              (if (or (not _filter) (and _filter (_filter _f)))
                (push (list _f _d) _result)
              )  
            )
          )
        )
    )  

    (unless (list? _dir) (setq _dir (list _dir)))
    (dolist (_d _dir)
      (replace "\" _d "/")
      (unless (ends-with _d "/") (setq _d (append _d "/" )))
      (_file-tree-utility _d)    
    )
    _result
  )
)