date + parse problems

Started by kanen, March 25, 2011, 01:03:01 PM


kanen

I use parse all over the place in my code. I recently ran into a very strange problem with parse.



> (set 'x 1301073325)
1301073325
> (date x)
"Fri Mar 25 10:15:25 2011"
> (parse (date x))
("Fri" "Mar" "25" "10" ":" "15" ":" "25" "2011")


Looks good. Parses correctly, but when the time changes...



> (set 'x 1301071976)
1301071976
> (date x)
"Fri Mar 25 09:52:56 2011"
> (parse (date x))
("Fri" "Mar" "25" "0" "9" ":" "52" ":" "56" "2011")


Notice that "09:52" is parsed as "0" "9" ":" "52" instead of "09" ":" "52".



I am running newLISP 10.3.0.
. Kanen Flowers http://kanen.me .

Sammo

#1
You'll find that parse correctly handles "Fri Mar 25 07:52:56 2011" and other times with hours before 8 am, but not "Fri Mar 25 08:52:56 2011". For an extreme case, try "Fri Mar 25 08:08:08 2011". I'll bet this is caused by newLISP's number parser, which treats digit strings beginning with "0" as octal and breaks the token at non-octal digits such as "8" and "9".

newdep

#2
Quote from: "Sammo": You'll find that parse correctly handles "Fri Mar 25 07:52:56 2011" and other times with hours before 8 am, but not "Fri Mar 25 08:52:56 2011". For an extreme case, try "Fri Mar 25 08:08:08 2011". I'll bet this is caused by newLISP's number parser, which treats digit strings beginning with "0" as octal and breaks the token at non-octal digits such as "8" and "9".




Does this help?



(parse "Fri Mar 30 20:08:01 2011" {:\s*|\s+} 0 )



("Fri" "Mar" "30" "20" "08" "01" "2011")
-- (define? (Cornflakes))

newdep

#3
I think you found an issue there ;-)

It's indeed odd. I tried several options, but this one is an odd one indeed:



 ("Fri" "Mar" "30" "0" "9" ":" "52" ":" "56" "2011")



Nice spotting!

newdep

#4
I think it's a HEX issue...



It only happens from 08 to 0F.



> (parse "0F:02:56" )

("0" "F" "02" ":" "56")



> (parse "0F:08:56" )

("0" "F" "0" "8" ":" "56")

> (parse "0F:A0:56" )

("0" "F" "A0" "56")

> (parse "0F:0F:56" )

("0" "F" "0" "F" "56")



and this is even odder..



> (parse ":0F:0F:56" )

(":" "0" "F" "0" "F" "56")

Lutz

#5
If you use 'parse' without the second parameter, 'parse' uses the algorithm to parse newLISP source. In your example, I would simply add the ":" as separator string and you get the expected results:



> (parse "0F:02:56" ":" )
("0F" "02" "56")
>


or use regular expressions like Sammo is suggesting for Kanen's example.



Also look into this:


(date-list (date-parse "2010.10.18 7:00" "%Y.%m.%d %H:%M"))
→ (2010 10 18 7 0 0 290 1)


to split specific date formats into components.
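
Since the original example starts from a timestamp, one option is to skip string parsing entirely. A minimal sketch, assuming x holds the epoch seconds from the first post and that date-parse is available on the platform (it is not on Windows). The format string is an assumption chosen to match the layout that (date x) printed above. Note that date-list reports UTC, so the hour can differ from the local time that date shows.

```newlisp
; x is the epoch timestamp from the first post
(set 'x 1301071976)

; break the timestamp into numeric components directly, no string parsing
(date-list x)

; or convert a date string back to seconds first, then split it;
; the format string is an assumption matching the output of (date x)
(date-list (date-parse "Fri Mar 25 09:52:56 2011" "%a %b %d %H:%M:%S %Y"))
```

Either way the hour and minute come back as integers, so the leading-zero question never arises.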

kanen

#6
Regular expressions are the answer, however...



I consider this a bug. I do not think (parse) should treat anything from "08" to "0F" as a HEX string to be split, unless you specifically ask for the string to be split as hex. This also breaks all over the place in my system because I'm actually parsing HEX in a lot of places (for TrustPipe).



Your example below doesn't actually solve the problem I'm having because I have both spaces and colons in my example.



It just seems that, if I have the following example strings (below), the results are unexpected.


> (parse "0F000801")
("0" "F000801")
> (parse "0F 00 08 01")
("0" "F" "00" "0" "8" "01")
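
For space-separated hex bytes like the second string, a minimal sketch of a workaround (not a change to the default behavior): pass an explicit break string so the newLISP source tokenizer is never involved, then convert each token with int and base 16. The default value 0 passed to int is an arbitrary choice for illustration.

```newlisp
; an explicit break string bypasses the newLISP source tokenizer
(parse "0F 00 08 01" " ")
; → ("0F" "00" "08" "01")

; convert each token to a number, reading it as base 16
; (0 is the value returned when a token cannot be converted)
(map (fn (h) (int h 0 16)) (parse "0F 00 08 01" " "))
; → (15 0 8 1)
```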


Quote from: "Lutz"If you use 'parse' without the second parameter, 'parse' uses the algorithm to parse newLISP source. In your example, I would simply add the ":" as separator string and you get the expected results:



> (parse "0F:02:56" ":" )
("0F" "02" "56")
>


or use regular expressions like Sammo is suggesting for Kane(n)'s example.

cormullion

#7
Ha, I've done this too. It's a well-known pitfall - I even wrote about it, since I spent such a long time looking for a bug that appeared after the code ran perfectly for 10 months ... http://newlisper.wordpress.com/2006/09/18/my-mistake-2/



It's sensible to make the default action for parse use newLISP syntax - what would be a better choice? Spaces, with or without tabs and/or returns and/or linefeeds? What about hyphens and quotes? So it makes sense to leave the precise specification of the parsing to the programmer. Be warned that using the default parse also eliminates semicolon-headed strings as well...



BTW - I once tried to write an 'intelligent' date-parser:


> (ParseTime "Fri Mar 25 10:15:25 2011")
((2011 3 25 10 15 25))
> (ParseTime "Fri Mar 25 09:52:56 2011")
((2011 3 25 9 52 56))
> (ParseTime "Fri Mar 25 9:52:56 2011")
((2011 3 25 9 52 56))
> (ParseTime "Fri March 08 09:08:56 2011")
((2011 3 8 9 8 56))
 > (ParseTime "Tuesday, March 08, 2011 3:51 PM")
((2011 3 8 15 51 0))
> (ParseTime "Tuesday, March 08, 2011 09:08:56")
((2011 3 8 21 8 56))
> (ParseTime "Wednesday, March 08, 2011 09:08:56")
((2011 3 8 21 8 56))
> (ParseTime "Wednesday, March 08, 2011 9:08:56 PM")
((2011 3 8 21 8 56))


but as you see, it's a hard problem... too hard for me, anyway :)

newdep

#8
I do think it's inconsistent when using a regular parse. It's the 08 to 0F range that makes it odd.

I do understand the octal logic here, but if you don't know about it, the result is not what you expect.

And the parse description in the manual does not say anything about this either.

Lutz

#9
Under the premise that 'parse' without the second parameter behaves like the newLISP source parser, there is nothing unexpected in the behavior of 'parse'. The confusion arises when octal numbers are discovered. See the same examples changing to octal:



> (parse "0F000801")
("0" "F000801")

> (parse "06000801") ; change F to a valid octal
("06000" "801")

> (parse "06000701") ; change 8 to valid octal
("06000701")
>

> (parse "0F 00 08 01")
("0" "F" "00" "0" "8" "01")

> (parse "06 00 07 01") ; all string are valid octal
("06" "00" "07" "01")
>


This octal confusion is a well-known phenomenon in any programming language, because virtually all of them follow the same rules when parsing numbers: numbers have certain valid start characters, and the parser ends the number and restarts when it finds a character that is invalid for that specific number format. Octal numbers start with a '0'.



As Cormullion mentions: "what would be a better choice?" (for the default parse behavior). There are just too many possibilities, therefore I think that newLISP-source parsing as a default is a sensible choice. Yes, perhaps adding something in the manual will alleviate the confusion. Currently 'parse' mentions "newLISP parsing rules" in the description. Perhaps a chapter about "newLISP parsing rules" has to be added and the description of 'parse' would link to it.



Last but not least: in many cases where parse is used, 'find-all' would be the better choice. While 'parse' takes break strings or regular expressions in the optional second parameter, 'find-all' describes the tokens themselves:


> (find-all  "[^:]+" "0F:02:56")
("0F" "02" "56")
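
The same idea extends to the date string from the first post, which mixes spaces and colons. A minimal sketch: the character classes below are assumptions about which separators and token shapes matter. The first describes a token as any run of characters that is neither a space nor a colon, and the second describes a token as exactly two hex digits.

```newlisp
; tokens are runs of characters that are neither space nor colon
(find-all "[^ :]+" "Fri Mar 25 09:52:56 2011")
; → ("Fri" "Mar" "25" "09" "52" "56" "2011")

; fixed-width hex bytes with no separators: each token is two hex digits
(find-all "[0-9A-Fa-f][0-9A-Fa-f]" "0F000801")
; → ("0F" "00" "08" "01")
```

Because the pattern describes the tokens rather than the separators, the leading zeros stay attached and no number parsing takes place.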

kanen

#10
A better choice would be to parse as a string, including all the normal items (space, colon, etc.) and not do the hex/octal parsing by default.



I do like the find-all example below and believe you are absolutely right and I should change my habits to use find-all with regex.


Quote from: "Lutz": As Cormullion mentions: "what would be a better choice?" (for the default parse behavior). There are just too many possibilities, therefore I think that newLISP-source parsing as a default is a sensible choice. Yes, perhaps adding something in the manual will alleviate the confusion. Currently 'parse' mentions "newLISP parsing rules" in the description. Perhaps a chapter about "newLISP parsing rules" has to be added and the description of 'parse' would link to it.



Last but not least: in many cases where parse is used, 'find-all' would be the better choice. While 'parse' takes break strings or regular expressions in the optional second parameter, 'find-all' describes the tokens themselves:


> (find-all  "[^:]+" "0F:02:56")
("0F" "02" "56")