Bayes query results just 0.5 or 1

Started by hilti, February 01, 2014, 02:55:11 PM


hilti

I don't get it ... I've been trying for two hours to train on some website data, but I don't get the expected results.

(bayes-query) only ever returns 0.5 or 1.



I'm using the complex pattern in (parse) so that it splits up German texts, too.



Here's my code:


(setq textdata "Now think about your brain. It's a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep.")

(setq text (parse (lower-case textdata) "[^a-z0-9äöüß]+" 0))
(bayes-train text 'DICT)
(bayes-query (parse (lower-case "dsd skjsd ksdjkds sdkj") "[^a-z0-9äöüß]+" 0) 'DICT)


And the result


"Now think about your brain. It's a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep."
("now" "think" "about" "your" "brain" "it" "s" "a" "long" "running" "program" "running"
 "on" "very" "complex" "and" "error" "prone" "hardware" "how" "does" "your" "brain"
 "keep" "itself" "sane" "over" "time" "the" "answer" "may" "be" "found" "in" "something"
 "we" "spend" "a" "third" "of" "our" "lives" "doing" "sleep" "")
(45)
(0.5)


I'm expecting 0, because this phrase doesn't exist in my training data.



Am I missing some switch?
--()o Dragonfly web framework for newLISP

http://dragonfly.apptruck.de

Lutz

#1
You should have at least two groups in your dictionary to get meaningful data. Both groups should have approximately the same number of tokens, and many more tokens than we have in this example. The numbers are not exact probabilities; interpret them by comparing the values for the different groups with each other. Also experiment with the 'true' flag, which selects the chain-based Bayesian method instead of Fisher's Chi2 method of calculating probabilities.



(setq textdata "Now think about your brain. It's a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep.")

(setq textdata2 "Now think about your next vacation. It's a long time that you didn't have one. How does your brain stay healthy without vacation. There are so many things, you can do in a summer vacation. It's better then to sleep our lives away.")

(setq text (parse (lower-case textdata) "[^a-z0-9äöüß]+" 0))
(setq text2 (parse (lower-case textdata2) "[^a-z0-9äöüß]+" 0))
(bayes-train text text2 'DICT)

; probs from Fisher's Chi2
(bayes-query (parse (lower-case "How does your brain keep itself sane")) DICT)
(bayes-query (parse (lower-case "How does your brain stay healthy")) DICT)
(bayes-query (parse (lower-case "How does your brain")) DICT)
; now the same with chain Bayesian
(bayes-query (parse (lower-case "How does your brain keep itself sane")) DICT true)
(bayes-query (parse (lower-case "How does your brain stay healthy")) DICT true)
(bayes-query (parse (lower-case "How does your brain")) DICT true)

gives you this:

(45 47) <- number of tokens

(0.994129159386854 0.00587084061314597) <- phrase comes from 1st group
(0.0569603080654478 0.943039691934552) <- phrase comes from 2nd group
(0.595589294104956 0.404410705895044) <- phrase occurs in both groups

(1 0) <- chain Bayesian reacts strongly to non-existing tokens
(0 1) <- chain Bayesian reacts strongly to non-existing tokens
(0.695000518791985 0.304999481208016) <- sharper distinction in chain Bayesian method


For the query "How does your brain", the numbers for the first group are slightly higher, although the same phrase occurs in both groups. This happens because certain words, e.g. "brain", occur more frequently in group one.
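
To see where such a difference comes from, you can look at the per-group counts that bayes-train stores in the dictionary context. The following is only a minimal sketch: the two short training strings and the context name DEMO are made up for illustration and are not the texts from this thread. It relies on bayes-train prepending an underscore to string tokens before turning them into symbols, and on the total symbol it creates in the context.

```newlisp
; Hypothetical miniature training data, not the texts from this thread.
(setq group1 (parse "your brain keeps your brain sane" " "))
(setq group2 (parse "your brain needs a vacation" " "))

; Train both groups into the (hypothetical) context DEMO.
(bayes-train group1 group2 'DEMO)

; String tokens are stored as symbols with an underscore prepended,
; each holding a list of its frequency in every group.
(println DEMO:_brain)   ; "brain": twice in group1, once in group2 -> (2 1)
(println DEMO:total)    ; total number of tokens per group -> (6 5)
```

A token whose counts are lopsided between the groups, like "brain" here, pulls the query result towards the group where it is more frequent. Checking the total list is also a quick way to see whether the groups are roughly balanced in size.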