kmeans and nan

Started by rrq, July 19, 2014, 06:28:54 AM


rrq

I started investigating using kmeans for vector classification, and have run into the problem of getting more 'nan' values than I'm happy with. I hope there's a simple answer to what I'm doing wrong:



E.g. I run kmeans-train to form 5 clusters for a collection of ~50 vectors of 100 elements. Then some centroids consist entirely of -nan values. What does that mean?



Next, I run a sequence of kmeans-query calls using these centroids (including the -nan ones), and then often, but not always, some of the measures are 'nan'. What does that mean?



Alternatively, would it be sensible to revise the centroids, e.g. map -nan to something small or large, to perhaps avoid the nan classifications?

Lutz

#1
First a general observation:

Do you have 50 data records (rows) in 100 dimensions (columns), or 100 data records in 50 dimensions? In either case the number of dimensions (columns) seems very big relative to the number of data records. But that in itself is not responsible for the nans; it is just a general observation.



But now about the nans (or NaNs depending on the platform). In the kmeans-train syntax:


(kmeans-train <matrix-data> <int-k> <context> [<matrix-centroids>])
int-k is the number of clusters to generate. This is a tentative number. If the number is too big, kmeans-train will leave some clusters unpopulated, with a frequency of 0 data records in them and all values in the centroid set to nan. When these invalid centroids are used, distance vectors calculated from nans will be nans again. In other words: the distance of a data point (record or row) from a nan centroid is nan.
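

To see the effect, something along these lines may well leave some clusters unpopulated (a made-up example with random data; K is just the name I chose for the results context):

; 50 random records of 100 columns, clustered with a deliberately over-large k
(set 'data (map (fn (i) (random 0.0 1.0 100)) (sequence 1 50)))
(kmeans-train data 20 'K)    ; k = 20 is far too big for 50 records
; unpopulated clusters show up as all-nan rows in K:centroids
(dolist (c K:centroids) (println c))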



For 50 data records, I would probably start out with an int-k of no more than 5, or for 100 records no more than 10. If some of the clusters have disproportionately large memberships, I would increase that number, trying to split up those relatively large clusters.



You could repeat the calculation with a smaller int-k, trying to eliminate nan centroids. Look also in K:labels for centroids with very little membership. Do these centroids describe real data outliers? Or are they just a sign that your int-k number is still a bit too high?



When looking into K:labels (the cluster membership list, where K is the context), you will see that nan centroids are ignored. The expression:
(count (unique K:labels) K:labels)
can give you a count of data records in each cluster.
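

If you want to drop the invalid centroids before querying, a small sketch (using the NaN? predicate; K:centroids is the centroid matrix left by kmeans-train):

; keep only centroids whose values are not nan
(set 'valid-centroids
  (filter (fn (c) (not (NaN? (first c)))) K:centroids))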

rrq

#2
Great. Thanks. That makes good sense, and is very helpful.



Now, a follow-on question, which requires me to elaborate the scenario a bit: basically I'm developing a rolling classification scheme for a growing time series of data, where each time point has a 100-element characterisation vector. The purpose is to classify new data points, as they come in, against the past time series of data in a timely manner.



The past time series is quite large, but the timeliness requirement prohibits running clustering on all of it for each input. At the same time, the fundamental proposal is that there is a reasonably small number of clusters, which occur repetitively (though irregularly) and distinctly (i.e., each is distinctly either present or absent in shorter time periods). Apparently, 50-point periods usually have ~3 clusters.



So my present thinking is to repeatedly use the last (say) ~50 points every so often (say, every ~30 points), obtain centroids for them, and collate these centroids into the "symbol set" for the rolling classification.
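

In rough newLISP terms, I imagine one step of this looking something like the following (names and numbers are placeholders, nothing tested yet):

; cluster the last ~50 points and add the populated centroids to the symbol set
(define (roll-step history symbol-set)
  (let (window (slice history (max 0 (- (length history) 50))))
    (kmeans-train window 5 'W)
    (append symbol-set
      (filter (fn (c) (not (NaN? (first c)))) W:centroids))))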



Would you have any comment on this approach?



Maybe the centroids should be collated by means of their own clustering scheme? Or perhaps it is significantly better to form larger cluster sets by training with more data points? (As you might notice, I'm not very well read on the underlying theory.)

rrq

#3
I'm still fascinated with newLISP. It lets me formulate, test and discard a stack of stupid ideas in a single sitting, without forcing piles and piles of infrastructural coding. Now I just wish I had better ideas :-)

Lutz

#4
It's hard to make any recommendation without knowing more about the data and the purpose of classifying them. But here are some general thoughts.



Dealing with time series, perhaps you want to find classes of behavior over time. Currently your rows in the matrix seem to represent time points and the columns some other elements changing over time. Transpose the matrix! Now you have rows of elements, each changing over time (along the columns). Clustering now will give you types of time-movement behavior patterns. Those could also be useful. Again, it all depends on where your data come from.
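

For example (made-up data, with 50 time points as rows and 100 measured elements as columns):

(set 'data (map (fn (i) (random 0.0 1.0 100)) (sequence 1 50)))
(set 'tdata (transpose data))   ; now 100 rows, each one element's course over time
(kmeans-train tdata 5 'KT)      ; clusters are now types of movement over time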



What is your classification for? Classifying new time points? In that case, develop some kind of quantifiable validation when applying the results of a previous cluster analysis to new data.
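

Classifying a new time point against centroids from an earlier kmeans-train run could look like this (a sketch; new-point is hypothetical, K:centroids comes from the earlier training):

(set 'dists (kmeans-query new-point K:centroids))   ; distances to all centroids
(set 'cluster (find (apply min dists) dists))       ; index of the nearest centroid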


Quote from: "rrq" test and discard a stack of stupid ideas in a single sitting, without forcing piles and piles of infrastructural coding


This is the way many people, especially creative ones, are using newLISP.

rrq

#5
You just opened a new door for me! No, you gave me a whole new game level!



The transposition idea is indeed an interesting and useful perspective, provided the measurement dimensions are "compatible", or can be duly normalized. I think it applies to my purpose (which I have to stay apologetically secretive about).



In particular, I can see its cluster memberships being a useful indication of which dimensions "move together in a similar way", which is quite an interesting/important aspect, as well as the point of labelling which kinds of motions there are. And I see how this perspective easily extends to an abstracted, long-term motion analysis, simply by stepping away from the time series' unit time step and considering (transposed) sub-series of every nth point.
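

For instance, taking every nth time point before transposing might look like this (n chosen arbitrarily here, nothing tested):

(set 'n 5)
(set 'coarse (select data (sequence 0 (- (length data) 1) n)))  ; every nth row
(set 'tcoarse (transpose coarse))   ; rows are again one element's (coarser) course over time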



... this will keep me off the streets a fair while :-)

rickyboy

#6
Quote from: "ralph.ronnquist"... this will keep me off the streets a fair while :-)

New tagline for newLISP: "Use of newLISP may lower vagrancy in your hometown." :)
(λx. x x) (λx. x x)