Here's a fun pet project I've been working on: udreamed.com/. It is a dream analytics app. Here is the RU-vid channel where we post a new video almost three times per week: ru-vid.com/show-UCiujxblFduQz8V4xHjMzyzQ Also available on iOS: apps.apple.com/us/app/udreamed/id1054428074 And Android: play.google.com/store/apps/details?id=com.unconsciouscognitioninc.unconsciouscognition&hl=en Check it out! Thanks!
Thank you! I enjoyed the back-and-forth of your troubleshooting at the start over which variables to use. It made it more real and gave some context from a theory perspective.
Great presentation! Moreover, the "not suitable" variables you chose in the beginning really helped me understand cluster analysis better. Thanks
If I may, at 9:01 I would like to correct your reference to the boxplot: the middle line does indeed represent the median, but the left and right edges of the box lie at the first and third quartiles respectively. So, rather than representing one standard deviation below and above the mean, the box represents the middle 50% of the observations. Thank you very much for the video, very lucid explanation of swamping variables, still very useful in 2019!
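To make the boxplot point concrete, here is a minimal sketch that compares the box edges (quartiles) with the mean plus or minus one standard deviation. The sample values are made up for illustration, and plotting software may use a slightly different quartile method than `statistics.quantiles`.

```python
import statistics

# Hypothetical sample with one large value; any numeric list works.
values = [2, 4, 4, 5, 6, 7, 8, 9, 12, 30]

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(values, n=4)

mean = statistics.mean(values)
sd = statistics.stdev(values)

# The box spans Q1..Q3 (the middle 50% of observations),
# which is generally not the same interval as mean +/- 1 SD.
print("box edges (Q1, Q3):", q1, q3)
print("median line:", median)
print("mean +/- 1 SD:", mean - sd, mean + sd)
```

With skewed data like this, the two intervals differ noticeably, which is why the box should not be read as a standard deviation band.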
You're a great guy!! I study SPSS in college at three levels: Introduction to Data Analysis, Univariate Data Analysis, and Multivariate Data Analysis for the 3rd level. At the moment I'm on the 3rd and this process is really useful! Thank you!!
I do not have questions, but I found your video extremely helpful, with a very good explanation, so I only wanted to say thank you. Your video was a great help. =)
Funny you should ask! I was just considering doing this yesterday. I will probably do a K-means cluster, and also show how to segment the data and explore clusters for sub-populations. This is definitely on my to do list.
Not a stupid question because I had to look up the answer :) The SPSS help manual says that the two-step cluster analysis assumes normally distributed data for all continuous variables, but that tests have shown it to be robust enough to handle non-normal data fairly well.
Great simple video on 2-step clustering (great for categorical or binary variables) with some continuous variables. But I like 2-step since it creates its own clusters, so I don't have to specify the number (unlike in K-means).
Look at the sig value. If it is less than 0.05, then the groups are significantly different on that variable of comparison. If it is poor quality, then you might try a three factor model. Not sure you can rely on the cluster groups when they are poor. This means that the membership assignment was inconsistent based on the indicators used for the clustering, e.g., sometimes males went into cluster 1, sometimes into cluster 2.
You can certainly try k-means. It just depends on what your research intentions are. I actually prefer k-means over two-step. I just learned two-step first, so that's what I made the video for. I should probably make one for k-means sometime...
That's great indeed! Well, I also have some ideas on how you could make it better from a learner's point of view. 1. Explaining why to use a certain/specific methodology for clustering 2. Building it up from basic to advanced methodology 3. Probably using data across industries/sectors. I don't know how much time you have to spend on these or whether you would want to; however, I can provide you data which will enhance the quality of your analysis (and of course your self-marketing value).
1. No references come to mind. When you run comparisons later on between clusters, if one cluster is much larger than another, then this will affect the critical ratio (t, F, or z statistic) since critical ratios are sensitive to sample size. Thus, working with similar sizes is ideal when making comparisons. 2. SPSS makes n+1 groups, where the extra 1 is those who did not fit in anywhere else. To figure out which clusters are which, look at the cluster output number in the output window.
Thanks for the ideas. I just do these when the need arises or when I have the time. I'll probably have some time to do a couple next week. I have some data that has grouping variables, so no need to send me yours. Thank you though.
Hello James, can Two-step Cluster Analysis handle mixed variable type? Eg. some variables that are output of factor analysis (that will have negative values too), and some binary variables?
Yes. The two-step method can handle all types of variables. The only thing you need to watch out for is highly skewed or kurtotic variables, or discrete (categorical/nominal) variables without adequate representation from each group/category.
That is what I meant, but those are undesirable sample sizes. You might also look at indicator importance to see if one variable is swamping out the others. If so, you might consider removing it. Or you can try K-means clustering... I haven't made any video for that yet...
Very helpful, Prof. I did clustering for 2 continuous variables and 4 clusters, but how can I represent them in a high-high, high-low, low-high, and low-low matrix? Also, the two variables are highly correlated, will it be bad for clustering? Thanks.
1. How to represent them: If you click on one of the button options in the table of clusters, you will see their distributions along the scale of measurement (low/high). The button looks like a distribution. This should help you represent them. 2. If they are highly correlated, then it might just be difficult to find low-high and high-low since they are probably mostly low-low and high-high.
bayan khalifa Just look at the distributions it shows you. If the bulk of the distribution is on the right, then it is high (assuming your scale went from low to high), if it is on the left, then it is low.
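Outside of the SPSS model viewer, one way to label clusters as high-high, high-low, and so on is to compare each cluster's mean with the overall mean on each variable. This is only a sketch under that assumption: the rows and cluster numbers below are invented, and the overall mean is just one possible cutoff.

```python
import statistics

# Hypothetical rows: (cluster_id, variable_a, variable_b)
rows = [
    (1, 1.0, 2.0), (1, 2.0, 1.5),
    (2, 8.0, 9.0), (2, 9.0, 7.5),
    (3, 2.5, 8.0), (3, 1.5, 9.0),
    (4, 7.0, 1.0), (4, 8.5, 2.0),
]

# Overall means act as the (assumed) dividing line between low and high.
overall_a = statistics.mean(a for _, a, _ in rows)
overall_b = statistics.mean(b for _, _, b in rows)


def level(cluster_mean, overall_mean):
    """Return 'high' if the cluster mean exceeds the overall mean, else 'low'."""
    return "high" if cluster_mean > overall_mean else "low"


labels = {}
for cid in sorted({c for c, _, _ in rows}):
    mean_a = statistics.mean(a for c, a, _ in rows if c == cid)
    mean_b = statistics.mean(b for c, _, b in rows if c == cid)
    labels[cid] = f"{level(mean_a, overall_a)}-{level(mean_b, overall_b)}"

print(labels)
```

If the two variables are highly correlated, as in the question, expect most cases to land in the low-low and high-high cells.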
I have a dumb problem and I wonder if someone could help me. The SPSS shows the cluster comparisons only for the inputs, but NOT for the descriptive variables. It just shows a message: "the cluster comparison view encountered a problem and cannot display correctly" or something like that. Why? I can't figure out.
spss-for-research I'm not sure. It may have something to do with the variables included. Try removing one variable at a time to see if you can identify which one is causing the problem. If it isn't that, then it may be a conflict in one of the libraries being utilized to run the analysis. If that is the case, then you might need to reinstall SPSS, or you might need to update your java or .NET version (not sure which one SPSS uses).
Hi James How can I get the Cubic Criterion Values at different number of clusters under consideration?? I think it's also a good way to justify why X number of clusters instead of Y, right??
Hello again James, can you explain how the analysis actually creates the clusters? I've tried using it for categorical variables and I'm not fully understanding just how it determines the clusters. Thank you
Here are some resources to help you understand 2 step cluster analysis better: 1. www.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/idh_twostep_main.htm 2. www.spss.ch/upload/1122644952_The%20SPSS%20TwoStep%20Cluster%20Component.pdf 3. www.ryerson.ca/~rmichon/mkt700/SPSS/TwoStep%20Cluster%20Analysis.htm 4. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-2Lz2bU-sBGA.html
Dear James, What references have you used on this occasion? Besides, which would be most appropriate: K-means or Two-step? In the paper I am working on, I have used both sets of analysis and, even if the number of clusters remains the same, the number of respondents in each cluster differs quite significantly depending on which technique I use. Any tips?
I'm not much of an expert on cluster analysis. I've just used the Hair et al 2010 book. As for which approach to use, I think two-step is considered the most useful and valid, since it combines hierarchical and non-hierarchical methods.
I'm not sure what you mean by model index. Do you mean you are not getting the silhouette index? I'm not sure what might be causing that either way though... Sorry about that.
Hi James, did you say "swarming variable" or "swapping variable"? I couldn't figure it out, and I have tried looking for definitions for both, only found "swapping variable" for computer science, were you talking about the same ?
Is it possible to analyse cluster NOT around central concepts like intelligence or years on the job but upon family relationship (binary relationship closeness in a network with the absence of commonalities, as is the case in real families).
That's an interesting idea, but I don't know how to do it in a two-step. You might be able to do it with multiple alignment algorithms, but I'm not sure if SPSS has those...
Thank you very much indeed. I have found a partial solution in the software here socnetv.org/downloads which has a network analysis network community detection algorithm which can be used on the correlation matrix produced by SPSS factor analysis. Others have had the idea before journals.plos.org/plosone/article?id=10.1371/journal.pone.0051558 using a different community detection algorithm Full statement of problem and partial solution www.talkstats.com/showthread.php/69145-Family-Relationship-version-of-Factor-analysis-for-Japanese-Groups?p=199672&highlight=#post199672
Great video. I just want to check: when you include both continuous and categorical variables, do you standardize them? By standardize I mean z-scoring the variables, since you are putting scale, binary, and categorical variables together.
+Rajesh Pandit SPSS automatically standardizes all continuous variables when doing a 2-step cluster analysis. You can see this in the options area when doing the 2-step.
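For anyone wondering what standardizing a continuous variable means in practice, here is a minimal z-score sketch. The variable names and values are hypothetical, and using the sample standard deviation is an assumption; this is not a description of SPSS's internal routine.

```python
import statistics


def z_scores(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD is an assumption here
    return [(v - mean) / sd for v in values]


# Hypothetical continuous variables on very different scales.
ages = [22, 35, 41, 58, 64]
incomes = [18000, 42000, 39000, 75000, 120000]

z_age = z_scores(ages)
z_income = z_scores(incomes)

# After rescaling, both variables are on a comparable scale,
# so neither dominates a distance calculation just because of its units.
print([round(z, 2) for z in z_age])
print([round(z, 2) for z in z_income])
```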
Hi James, thanks for your valuable sharing. However, is there any source for the acceptable size of smallest cluster and threshold of ratio of sizes? Thanks in advance.
I'm not sure. I'm really not an expert on cluster analysis. Those numbers just "feel" right, which I realize is not very scientific of me. I guess they feel right because they are practically useful - i.e., clusters of those sizes are usable in subsequent analyses and cluster ratios of that proportion break the data up into roughly equivalent groups.
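The two rules of thumb from the video (smallest cluster above 30 cases, largest-to-smallest ratio of roughly 3 or less) are easy to check from a saved membership column. A minimal sketch, with a made-up membership list:

```python
from collections import Counter

# Hypothetical saved cluster membership, one entry per case.
membership = [1] * 45 + [2] * 60 + [3] * 110

sizes = Counter(membership)
smallest = min(sizes.values())
largest = max(sizes.values())
ratio = largest / smallest

# Rule-of-thumb checks from the video; these are heuristics, not formal tests.
print("cluster sizes:", dict(sizes))
print("smallest cluster > 30:", smallest > 30)
print("ratio of sizes:", round(ratio, 2), "(about 3 or less is desirable)")
```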
I can't double-click since the model viewer doesn't show up at all. It writes the clusters in the column but that's it, even though I activated the option... Any ideas what could be wrong? Thanks a lot in advance!
Hi James - This video is very helpful, thank you! Within the model viewer, I can see the average silhouette statistic for the cluster result. My understanding is this number is the average fit across item in the cluster. Is there a way to find the silhouette data for each item separately? For context, I'm using cluster analysis to identify exemplar scenarios for different types of behavior. I'm clustering scenarios based on participant ratings (e.g., this scenario represents X behavior, yes/no). I'd like to compare fit across a few different types of participant groups using an ANOVA of the silhouettes for each item. Thanks in advance!
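For reference, a per-item silhouette can be computed outside SPSS from the data and the saved cluster membership. The sketch below uses the textbook definition with Euclidean distance, which is an assumption; the silhouette shown in the SPSS model viewer may be calculated differently. The points and labels are invented, and the function expects at least two clusters.

```python
import math
from collections import defaultdict


def silhouette_per_item(points, labels):
    """Return one silhouette value per point, using Euclidean distance."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical items (two ratings each) and their cluster labels.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (2, 2)]
labels = [1, 1, 2, 2, 1]
scores = silhouette_per_item(points, labels)
print([round(s, 3) for s in scores])
```

The resulting list has one value per item, so it could be saved as a new column and used as the dependent variable in a follow-up comparison.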
+Thomas Chan Evaluation fields are used to see differences in evaluation variables based on cluster membership. It is sort of like doing an ANOVA on those variables, using the cluster membership as the factoring variable. The evaluation variables will not be used to determine cluster membership.
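The idea behind evaluation fields can be sketched as a simple group-by: take a variable that was not used for clustering and summarize it per cluster. The records below are hypothetical, and this only computes means; it does not run the significance test that an ANOVA would.

```python
import statistics
from collections import defaultdict

# Hypothetical cases: saved cluster membership plus an evaluation variable (age).
records = [
    {"cluster": 1, "age": 23}, {"cluster": 1, "age": 27},
    {"cluster": 2, "age": 41}, {"cluster": 2, "age": 45},
    {"cluster": 3, "age": 60}, {"cluster": 3, "age": 66},
]

# Group the evaluation variable by cluster membership.
by_cluster = defaultdict(list)
for rec in records:
    by_cluster[rec["cluster"]].append(rec["age"])

# Mean age per cluster; age played no role in forming the clusters.
mean_age = {cid: statistics.mean(ages) for cid, ages in sorted(by_cluster.items())}
print(mean_age)
```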
Hi James! Very helpful video - you saved me a lot of time. :-) Unfortunately, I have two additional questions, and it would be great if you could help me. I am sure, you are the expert who can help me! 1) Lets assume SPSS program proposes 3 clusters based on a set of variables. What statistical tests are used for the selection of 3 clusters instead of 2 or 4 in the background? I read in some papers that e.g., likelihood-ratio (L2) and its p-value, the Bayesian Information Criterion (BIC) and the number of parameters (Npar) could be examples for these statistical tests (there are for sure others)? And if some of these tests are conducted by SPSS in the background, is there a way how I can create an output-chart of these statistical parameters in SPSS? In other words, since SPSS tells me 3 clusters, I would like to show why 3 clusters and not 4 based on a few statistical tests. 2) Lets assume we still have these 3 clusters from question 1 which were created based on a set of variables. But I have another variable (e.g., age) which I did not use for the cluster analysis. How (if there is any option in SPSS) can I calculate the mean of variable age for each of the 3 identified cluster and show it in an output table (best case for more than 1 additional variable). I hope you understand my questions. I would appreciate your help and guidance!! Thanks a lot in advance! Regards, Alfons
1. SPSS lets you choose the AIC or the BIC as the clustering criterion, or you can use the silhouette measure that shows in the output. The silhouette is considered fairly robust. You can force it to 2 or 4 clusters as well to see what the silhouette score is for those. 2. Watch this video at the 2:16 mark. It will show how to do this using the Output button.
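To illustrate how an information criterion can justify 3 clusters over 2 or 4, here is a sketch using the general textbook definitions of AIC and BIC. The log-likelihoods and parameter counts are invented, and SPSS's two-step procedure uses its own formulation internally, so treat this only as an illustration of the comparison logic (lower is better).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 lnL + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_cases):
    """Bayesian information criterion: -2 lnL + k ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_cases)


# Hypothetical fits: number of clusters -> (log-likelihood, free parameters)
fits = {2: (-520.0, 8), 3: (-480.0, 12), 4: (-478.0, 16)}
n_cases = 200

for k, (ll, params) in fits.items():
    print(k, "clusters: AIC =", round(aic(ll, params), 1),
          "BIC =", round(bic(ll, params, n_cases), 1))

# The preferred solution is the one with the lowest criterion value.
best_by_aic = min(fits, key=lambda k: aic(*fits[k]))
best_by_bic = min(fits, key=lambda k: bic(*fits[k], n_cases))
print("best by AIC:", best_by_aic, "| best by BIC:", best_by_bic)
```

In this made-up example the fourth cluster improves the fit too little to pay for its extra parameters, so both criteria point to 3 clusters.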
Can you please explain or suggest which cluster analysis should be applied to Likert-scale ordinal data? Is it K-Means, Hierarchical, or Two-step? Is it necessary to conduct CATPCA (categorical principal component analysis) prior to starting the cluster analysis, and can you please tell me how to proceed with the cluster analysis after CATPCA? I have four exogenous variables which contain 20 items.
Usually we would use factor analysis for this kind of data. However, if you want to do a cluster, then I would do the EFA first and generate factor scores for each construct. Then use these factor scores in a cluster analysis. 2-step or k-means each offer slightly different features and analyses, so you could try both.
Thanks for the videos. Can you please tell me if a set of data can be clustered by only one variable? And if yes, is two-step clustering or k-means clustering more appropriate? I want to categorize a set of data based on one variable into three groups and I don't know how to define the cut-off or range for each category. I would be glad if you can help me.
Nassim Fard If it is just one variable, then clustering algorithms won't help. If the variable is categorical, then just group them based on the category values. For example, if the variable is religion, then group them by which religion they affiliate with. If the variable is continuous or ordinal, then make logical cutoff points into low, med, high.
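One possible way to pick cutoff points for a single continuous variable is to split at the terciles, so each group holds roughly a third of the cases. This is only one assumed choice; theory-driven cutoffs may be more appropriate for a given study. The scores below are made up.

```python
import statistics

# Hypothetical scores on the single variable of interest.
scores = [3, 7, 8, 12, 13, 14, 18, 21, 25]

# With n=3, statistics.quantiles returns the two tercile cut points.
low_cut, high_cut = statistics.quantiles(scores, n=3)


def band(x):
    """Assign a score to low / med / high based on the tercile cutoffs."""
    if x <= low_cut:
        return "low"
    if x <= high_cut:
        return "med"
    return "high"


groups = [band(s) for s in scores]
print("cutoffs:", round(low_cut, 2), round(high_cut, 2))
print(list(zip(scores, groups)))
```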
Thank you for the quick tutorial. I am performing two-step clustering on data from a recent study, but I want to somehow fit this new data into the clusters generated from past data. Kind of like supervised learning, but unfortunately neither the coefficients of the model of the past data nor the data itself are available. Is there a way to solve this or is this case hopeless? P.S. To get the project done in time, without access to any tools, I tried to put the new records in clusters manually, respecting the features/characteristics of the previously generated clusters. Since time is my major constraint and the data is just 40 new entries, I have already done it (could you give me some idea about my options to justify doing the job this way?). But I am just curious to know the right way.
If the new data is using the exact same variables as the original data, then you can simply add the new rows to the dataset and re-run the cluster analysis. That is the easiest way. If the new data is not using the same variables, then there is no statistical way to cluster them along the same lines.
I don't think I understand your question. Do you mean you want to do datamining on a dataset that has multiple IDs that are the same? If so, then no, you should combine those rows that have the same ID (if they are actually the same case and not a different one with just a duplicate ID), or create unique IDs for unique cases.
Kayode Kumapayi If you have issues you simply cannot resolve, I might be able to guide you a bit. I receive dozens of requests per day though, so please only email me if you are really stuck. Thanks!
You mention that when having SPSS determine clusters automatically, Euclidean distance measurement is more appropriate but when specifying the number of clusters, Log-likelihood is preferred. Could you perhaps elaborate on why this is the case? Would you know any papers that go into a bit of detail about this?
oooh, this has been a while. The literature I read at the time suggested these things, but I can't remember which articles and books I read, or what they had to say about it. Sorry about that. If cluster analysis was something I did more often, I would have a better answer for you. But I haven't done a cluster analysis again since making this video...
Hi James, Thank you for that video. It was very helpful. Do you know what actually happens "inside" SPSS when you run this "Two-Step Cluster"? Which forms of clustering are used? Single linkage and hierarchical cluster analysis?
Can I use cluster analysis in stepwise classification, like first classifying into asymptomatic and symptomatic, then within asymptomatic classifying in terms of symptoms?
I think it should be possible. You could do the classification and save the cluster membership number. Then, filter the dataset so that only the rows belonging to the asymptomatic clusters remain. Then, cluster again to see if they cluster by symptom. Another route would be to just use evaluation variables in the two-step clustering. These variables aren't used to determine membership in clusters, but each cluster is evaluated post-hoc by symptoms.
Marcelo Gabriel I was not aware you could generate AIC and BIC in SPSS during a 2-step cluster analysis. I've gone back to it to fiddle with it, but I can't figure out if it is possible.
James, thanks for your reply. At least on versions 20 and 22, you must check the "Clustering Criterion" by choosing BIC or AIC. I'm more inclined to consider AIC than BIC due to its characteristics. Your comment would be nice. Regards
Marcelo Gabriel Thanks for pointing me to that. I played with it and looked into it and it appears that the results are often the same (with my data), but that in general, AIC is preferred to BIC. Here is an informative explanation of why as well as some useful references: en.wikipedia.org/wiki/Akaike_information_criterion#Comparison_with_BIC
Thank you for this video. I have done 4 different k-means clusterings and I need a method that chooses the best cluster analysis. Can I do it with two-step or another method?
What if one of the items, after applying post hoc tests, shows a non-significant p value? E.g., you differentiate clusters on a variable, and then find that two of the clusters do not significantly differ on one item.
Hi! How can I choose which variables are significant enough to use in it? Is there a statistical test to help? I have a lot of variables and I want to know how I should choose them, and whether there are criteria.
Thank you for responding! I have several variables to draw a social and demographic profile of my population. Theoretically all these variables are important, but when I do the analysis with all of them, the results are not good. In other versions of SPSS there was a cut in those variables, a critical value, but I do not know how to identify this in SPSS 22. Can you help me, please?
Jéssica Rodrigues you can look at the cluster quality or at the variable importance graph. These will give you indications of the overall value of the variables for clustering into groups.
Is here a video that provides more detail on interpreting the clusters themselves? It would be helpful to understand how the clusters are being selected and how the clusters are developed.
The only other two-step cluster analysis video that I have is part of the Rosen College SEM Boot Camp: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-2Lz2bU-sBGA.html
Sir, Please upload detail lectures on Optimal scaling in SPSS (i.e. MCA, CATPCA and non-linear canonical correlation). These lectures are not available on RU-vid. I searched in your channel , with the hope ... , but unfortunately ....
I have never done those, so I cannot make videos on them. Any time I learn a new analysis, I make a video for it. If I ever have occasion to do these, I'll make videos for them. Best of luck to you.
Can these profiles really be used as a moderator in SEM analysis? Because I thought SEM only uses continuous variables since it analyzes relationship between multiple variables through regression analysis. For a while, I thought you were referring to Hierarchical Regression Analysis. Thank you!
Thank you for your video. I have a problem with my statistics. I ran two factor analyses for two questions in my questionnaire. After that I ran a cluster analysis with the factor scores of the second factor analysis. The first time I got 2 clusters, but I saw that in the cluster membership column, which is automatically created by the system, there was (-1) as a cluster membership. I didn't understand why. I ran the cluster analysis another time, but this time I deleted the scores from the first factor analysis and kept just the factor scores from the second analysis, which I needed to run the cluster analysis. This time I got 3 clusters. My question: is there any relation between the two factor analyses I did before? In my cluster analysis I just use the scores from one analysis. Why did I get different results if the scores were just variables with no interaction between them?
James Gaskin Thank you for your answer. For the other questions, I found why I had different results. I want to know if it's possible to explain more about clusters generated by using factor scores rather than the variables from our variable list. Thank you
Hello, if I have a ratio of sizes of 3.05, can I keep the 3 clusters I got, or are the cluster sizes not well adjusted because this ratio is greater than 3? Thank you
Louize Kahina That ratio is fine. Also, as for using factor scores in cluster analysis, this is fine because the factor scores are just weighted averages based on the factor loadings. So, this is totally fine and requires no special interpretation.
James Gaskin Thank you. I asked about the interpretation of the clusters with regard to the two factor scores I used. I didn't understand what exactly the medians for each factor mean. Are these the clusters' centers?
Hi James, I'm doing a two-step cluster analysis and my ratio of sizes was nearly 18.0. Is there something in the literature that talks about this? Thank you so much!
James Gaskin I have 20 observations and there is one cluster with 19 and another with just 1. Okay, it is China in this one, and my theme is international competitiveness. I think it's fine. Is there some test that I can do to make sure that the clustering is good? Thank you again!
If you only have 20 responses, and all but 1 are part of a single cluster, then this is not a good cluster solution. You might try removing the nationality variable to see if that fixes the clustering.
13? That is very old. SPSS is now on version 24. My version 24 runs the two step just fine. I don't have an installer though, as I'm not a licensed distributor.
Awesome! So clear and informational :) James, what would be the major differences between cluster analysis and factor analysis? Is it the profiling aspect? Can CA do things that FA cannot? Thanks again!
If you mean to assign them to an existing cluster, yes. You can do this with Multiple Discriminant Analysis. Here is a video on it: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-70vGOdEvYaM.html
I think I learned it from Hair et al 2010 Multivariate Data Analysis, but I'm not sure. It is not my primary methodology, so I'm not too familiar with the literature around it.
James, Very informative. You mention the need for over 30 in the smallest cluster and between 2-3 for the largest:smallest ratio. I am doing a PhD and wondered where these numbers came from. Do you have an academic reference(s) I could cite? Also, at the end of the video you ran an ANOVA from the newly formed variables in SPSS. I ran different analyses and never had more than 4 clusters, but there were 5 new variables, all with uninformative names. How do I know which ones to use?
Hi, sir. I hope you are doing well and have a wonderful holiday and merry Christmas, sir! I want to ask a question related to the steps of doing a two-step cluster: do we have to use the CF Tree first for the PRECLUSTERING phase before doing the final clustering using BIC/AIC? I really hope you can answer me this time, sir :') Thank you so much
@@Gaskination Thank you! Can I ask what factors determine how much a variable contributes to creating the clusters? I'm talking about the purple table in predictor importance.
@@larasgilangrahmany289 It is determined by the shared variance among all variables, and can be influenced strongly by discrete values, such as binary (e.g., single/married) or multinomial (e.g., age group: child, young adult, adult, middle-aged, post-middle-aged).
And it is also determined by how varied the responses are across the options, right? I observe that if all of the clusters choose almost the same option (e.g., woman), it is less than 0.5, but if each cluster chooses different options (e.g., woman and man), the value will be more than 0.5. Is that right?
Thanks for the great video - very useful! I was just wondering if you could explain (in a nutshell) the difference between this Two-Step cluster analysis and k-means? Thanks
The main difference is that two-step allows you to distinguish between categorical and continuous variables, and it processes them differently. Whereas k-means just treats them all the same. So, if you have categorical variables, two-step would be a more accurate clustering.
Thanks for your reply. So with continuous data like domestic energy use, would k means be more appropriate? And is it right to say that k means treats each variable as independent to the next, which in the case of domestic energy use is not quite the case? Many thanks again!
Nicholas Samson Unfortunately, I'm not an expert in cluster analyses. So your question surpasses my immediate knowledge. I would just have to look it up. I know that there are some good documents and articles that discuss the differences between two-step and k-means. I just googled it. Best of luck to you.
Very informative video and extremely helpful as usual. My only concern is that when I did it the first time it gave me 3 groups, and when I ran it again it gave me 2 groups. I did it many times and I noticed that the results are not stable! How come the same steps and the same algorithm gave different results? Did anyone else face this issue with the two-step cluster analysis? Thanks.
By the way, I was performing cluster analysis based on your video. However, I have few questions to ask you 1. Is it possible to assign weightage to individual record while performing segmentation? 2. If there is already weightage available for individual record (based on other criterion) how to make use of that in the segmentation process?