### Introduction

In this report, an effort was made to cluster US universities according to several characteristic measures, such as graduate income and annual tuition revenue per student. The analysis is based on the ‘College Scorecard’ dataset, which is published by the US Department of Education and was accessed via the Kaggle website for the purposes of this analysis[1]. Only the data for 2009 were used, as they contain values for the graduate earnings fields, which are unknown for many other years. The one exception is the geographical coordinates of the institutions: no such data exist for 2009, so the 2013 values were used to determine institution locations.

The main goal of this report was to generate a grouping of all active universities in the US in 2009, based on the available data fields. The dataset is multidimensional, containing 2019 rows and 1729 columns, so a major problem arose apart from the missing data: it would be extremely difficult to iterate through the variables manually and decide which ones are significant in setting universities apart. Therefore, an unsupervised method of analysis was used instead, treating the dataset as a single entity and casting aside any uninteresting fields by computational means rather than manually. The goal was to extract information from the dataset without needing to examine the available fields one by one to form a particular hypothesis. Exploring and discussing different analysis strategies for achieving a sensible grouping was an additional objective in the later stages of this study.
More specifically, the analysis began with pre-processing the data to create a data frame ready for further analysis. The main task was then to create a label for each institution using the K-Means clustering algorithm, which was considered a simple and straightforward option for clustering. For this, all numerical fields were taken into account unless they were completely meaningless (e.g. postal code numbers). This clustering was expected to create a basic grouping of US higher education institutions. To represent the grouping visually and assess the results of the selected algorithm, dimension reduction was also required. All factors were reduced to a two-dimensional space using Principal Component Analysis (PCA).

After the first clustering, a second way of applying the same algorithm was explored, this time after the PCA had been carried out. Two separate K-Means models were created, one for each of the first 2 PCA components. Each model took into account the most ‘significant’ factors for its component, as defined by their respective factor loadings. The result was expected to be a pair of cluster labels for each institution, so this approach ultimately produces more clusters than the initial clustering.

Finally, these results were also represented visually, and the two techniques were compared. Although there is no explicit way of evaluating the clustering models that were applied, the report discusses which of the two strategies performed better. Conclusions on this comparison are drawn from common knowledge about higher education and are supported by exploratory analysis of the clustering results.
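The two strategies described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code used for the report: the number of clusters `K`, the cut-off `N_TOP` for ‘significant’ features, and the use of the rows of scikit-learn's `components_` as a stand-in for factor loadings are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the cleaned numeric data (rows = institutions).
X = rng.normal(size=(200, 12))
K = 3      # hypothetical number of clusters per model
N_TOP = 4  # hypothetical cut-off for "significant" features

# Strategy 1: K-Means on all numeric features; PCA is used only to
# obtain two coordinates per institution for plotting.
labels_all = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
pca = PCA(n_components=2).fit(X)
coords_2d = pca.transform(X)


def top_features(component, n_top):
    """Indices of the n_top features with the largest absolute weight."""
    return np.argsort(np.abs(component))[::-1][:n_top]


# Strategy 2: one K-Means model per principal component, each fitted
# only on the features that weigh most heavily on that component.
pair_labels = np.empty((X.shape[0], 2), dtype=int)
for i, component in enumerate(pca.components_):
    cols = top_features(component, N_TOP)
    model = KMeans(n_clusters=K, n_init=10, random_state=0)
    pair_labels[:, i] = model.fit_predict(X[:, cols])

# Each institution now has a (label_pc1, label_pc2) pair, so at most
# K*K distinct combined groups can occur.
combined = pair_labels[:, 0] * K + pair_labels[:, 1]
```

Encoding the pair of labels as a single integer is only a convenience for counting and colouring groups; the pair itself carries the same information.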
The main question to be answered is whether the second, more complicated post-PCA clustering produced at least sensible results, or even a more thorough clustering than the initial, simpler application of K-Means. This is discussed at the end of the report, with the hope that at least some conclusions can be reached.

Three main difficulties were faced throughout this analysis. First, missing values within the dataset created problems on different levels: data were either simply unavailable for certain institutions or missing from one or more rows or columns. In the first case the missing values could easily be replaced, whereas in the second case such rows or columns had to be removed so as not to interfere with the analysis process. The most important problem was that missing values, even when replaced, would interfere with the algorithms’ results. In the case of clustering, false categorization may have resulted from such missing values.

A second issue, as mentioned previously, is the size of the dataset. Such a large data collection allows two options for analysis: either answering a very specific question using factors known to be relevant to it, or taking a very broad approach. Here the latter was chosen, to gain a global picture of the data and to explore the information in an unbiased way. Rather than testing intuitive relationships between factors, the aim was to capture relationships that the user could not easily hypothesize, because no direct logical link suggests them.

Lastly, the absence of any prior indication of groupings within the dataset did not allow for training and testing. As already noted, the final evaluation of the two approaches was therefore manual, given the lack of such prior clusters.
Ideally, to evaluate the two strategies properly, groupings would have been set in advance and the K-Means models trained and tested on different slices of the dataset. Evaluating the performance of the algorithm in each case would then be much easier and more robust. However, as this option was not available at the time of the analysis, an effort was made to evaluate the two strategies intuitively, based on an understanding of higher education and knowledge of natural clusters that already exist within the higher education landscape, such as Ivy League universities, medical schools etc.

### Data Preparation

Several steps were taken to prepare the data for the analysis. First, the data were filtered to keep only institutions of higher education active in 2009, and the result was stored as a new data frame. This improved performance by dropping entries that would not be considered in the report anyway, and was achieved by indexing pandas data frames.

It was also noticed that in many instances certain pieces of information had been withheld from publication; these cells contained the string “PrivacySuppressed” instead of a numerical or missing value. Any cells containing this string were converted to missing values using the numpy module, so that column types could be correctly identified as integers or floats rather than strings.

As previously mentioned, data from 2009 were used for the main analysis. However, this year's entries were missing all values for geographical coordinates. These fields were therefore taken from the 2013 data, which were merged with the 2009 data frame on institution name. The longitude and latitude columns were thus incorporated into the 2009 data used throughout this report.

One of the main problems for the analysis was missing data, as stated before.
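Before the handling of missing data is described, the three preparation steps above (filtering by year, replacing the suppression marker, and merging in the 2013 coordinates) can be sketched as follows. The toy table and the column names `Year`, `INSTNM`, `md_earn`, `LATITUDE` and `LONGITUDE` are assumptions made for illustration; the actual names in the Kaggle release may differ.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the multi-year table; all column names are assumed.
raw = pd.DataFrame({
    "Year": [2009, 2009, 2013, 2013],
    "INSTNM": ["Alpha College", "Beta University",
               "Alpha College", "Beta University"],
    "md_earn": ["41000", "PrivacySuppressed", np.nan, np.nan],
    "LATITUDE": [np.nan, np.nan, 40.1, 34.5],
    "LONGITUDE": [np.nan, np.nan, -75.2, -118.3],
})

# Keep only the 2009 rows as a new data frame (boolean indexing).
df_2009 = raw[raw["Year"] == 2009].copy()

# Turn the suppression marker into a proper missing value, then
# convert the affected column from strings to numbers.
df_2009 = df_2009.replace("PrivacySuppressed", np.nan)
df_2009["md_earn"] = pd.to_numeric(df_2009["md_earn"])

# Take the coordinates from the 2013 rows and merge them onto the
# 2009 frame by institution name.
coords_2013 = raw.loc[raw["Year"] == 2013,
                      ["INSTNM", "LATITUDE", "LONGITUDE"]]
df_2009 = (
    df_2009.drop(columns=["LATITUDE", "LONGITUDE"])
    .merge(coords_2013, on="INSTNM", how="left")
)
```

A left merge keeps every 2009 institution even when no 2013 match exists. Institution names are not guaranteed to be unique, so merging on an identifier column, where one is available, may be more robust than merging on names.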
Therefore, columns missing all their values were omitted from the data frame entirely. However, the resulting dataset was still sparse. To tackle this, the missing values in each column were replaced with the mean of that column. This should be kept in mind when interpreting the conclusions drawn from this analysis, as replacing missing values can seriously distort the relationships among data points. Even so, the step was considered necessary because of the large number of columns in the dataset: if rows containing missing values were simply removed, most rows would have to be deleted from the data frame. The decision was therefore made to introduce some bias into the analysis rather than accept such a major loss of information.

Additionally, any infinite values that may have existed in the data were removed, so as not to cause problems for the models used later on. For the same reason, all column values apart from the institution name were converted to floats; columns for which this was not possible were excluded from the data frame. Finally, some columns that would obviously be of no value to the study, such as IDs of different types, were manually excluded from the data frame by indexing. These had remained in the data frame until this last step of data cleansing because their values were encoded as numbers.

### Results & Discussion

Universities in the US were successfully clustered into 4 groups, according to the principal components that explain more than 80% of the variation that characterises these universities. This gave a compact yet comprehensive view of the data, and its implementation was effective. Moreover, the problem of visualising the data for exploratory analysis was largely overcome once the dimensionality of the data had been reduced by PCA.
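As a rough illustration of the cleaning steps and of the first result, the sketch below mean-imputes a synthetic table, keeps the principal components needed to reach 80% explained variance, and clusters them into 4 groups. It rests on several assumptions: the data are synthetic, infinite values are treated as missing and imputed (the report only says they were removed), and no feature scaling is applied because the report does not mention any.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic stand-in: 100 institutions, 8 numeric fields.
data = pd.DataFrame(rng.normal(size=(100, 8)),
                    columns=[f"f{i}" for i in range(8)])
data.iloc[::7, 2] = np.nan   # scattered missing values
data.iloc[5, 3] = np.inf     # one infinite value
data["empty"] = np.nan       # a column with no values at all

# Drop columns that are entirely empty, treat infinities as missing
# (an assumption), then fill the gaps with each column's mean.
clean = data.dropna(axis=1, how="all")
clean = clean.replace([np.inf, -np.inf], np.nan)
clean = clean.fillna(clean.mean())

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction.
pca = PCA(n_components=0.8, svd_solver="full")
components = pca.fit_transform(clean.to_numpy())

# Cluster the institutions into 4 groups in the reduced space.
labels = KMeans(n_clusters=4, n_init=10,
                random_state=0).fit_predict(components)
```

Inspecting `pca.explained_variance_ratio_` shows how the variance is split across the retained components, which is the quantity the discussion of the second strategy depends on.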
The second strategy was not fruitful in capturing additional value in terms of the clusters produced. This may have been due to the particular dataset, as most of the variability was contained in the first component of the PCA, whereas the second and all subsequent components were negligible. In any case, the study demonstrated that the second strategy is feasible and produces K*K clusters aligned on a 2-dimensional grid, even if in this case it did not prove beneficial.

The segmentation of the institutions is naturally helpful to prospective students searching for the most promising option for their higher education. Beyond that, institutions themselves can use this information to understand who their closest competitors currently are. The clustering can help institutions position themselves in the market and target the right groups of candidates, according to what they are in a position to offer and what the candidates are seeking. The components, especially the first one in this case, can be used as an indicator of the features that are important to the clustering and to setting institutions apart. By identifying the features on which clusters differ to a statistically significant degree, institutions can choose which features to invest in and improve. In conclusion, institutions can rely on the derived clustering to make strategic decisions about the future, by imitating more successful competitors.

In the future, it would be interesting to evaluate the second strategy on other datasets. This would enable a more comprehensive conclusion about the strategy's results and is necessary to establish whether or not the strategy is beneficial as a methodology for multivariate analysis in general.
This particular objective was not met by the current study, owing to the complexity of the data and to the many unknowns, such as the limited understanding of the inherent clusters that formed and of why they were derived as such. Further validation of the method is therefore deemed necessary.
ilektram/EducationDataUS