Classifying stars based on raw spectra is virtually impossible: each spectrum contains intensity values for thousands of data points, making it difficult to analyze. To combat this problem, we apply dimensionality reduction algorithms to the spectra to extract a small number of values that accurately describe each spectrum. Stars can then be classified based on these values. Because the algorithm used to reduce the dimensionality of the data can have a large effect on the classification results, we compared the classification accuracy of three different dimensionality reduction methods.
What is SDSS?
The Sloan Digital Sky Survey (SDSS) is a large-scale sky survey that provides imaging and spectral data for more than three million astronomical objects. The survey covers approximately one third of the sky, greater coverage than any previous survey. All data produced by SDSS are publicly available for download on the SDSS website.

Our Data
We used stellar spectra selected from SDSS according to the following specifications (following those used in Bu et al.):
We then shifted each spectrum to rest-frame wavelengths (to account for redshift) and normalized the total flux. We randomly selected four samples of 1500 spectra from the original data, each sample containing 500 K1, 300 K3, 400 K5, and 300 K7 stars. Half of each sample was used as a training set for the classifier, and half was used as a testing set.
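The preprocessing steps above can be sketched as follows. This is a minimal illustration on a made-up spectrum, not our actual pipeline; the wavelength values, flux values, and redshift are all invented for the example, and the normalization shown (dividing by the total flux) is one simple convention.

```python
import numpy as np

# Illustrative spectrum: observed wavelengths (Angstroms) and flux values.
observed_wavelengths = np.array([6600.0, 6700.0, 6800.0])
flux = np.array([2.0, 5.0, 3.0])
z = 0.01  # redshift reported by the survey (hypothetical value)

# Shift to rest-frame wavelengths: lambda_rest = lambda_obs / (1 + z)
rest_wavelengths = observed_wavelengths / (1.0 + z)

# Normalize so the total flux sums to 1
normalized_flux = flux / flux.sum()
```

After this step every spectrum is on a common wavelength scale and flux scale, so differences between spectra reflect shape rather than brightness or redshift.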
Dimensionality reduction algorithms embed high-dimensional data into lower-dimensional spaces while preserving the inherent structure of the data. We examined three such algorithms in our experiment: Principal Component Analysis (PCA), Isometric Mapping (Isomap), and Locally Linear Embedding (LLE). We implemented all three using the scikit-learn module in Python.
Principal Component Analysis
PCA is a linear dimensionality reduction algorithm that projects data onto its principal components, the eigenvectors of the data's covariance matrix. The principal components that preserve a specified amount of the original variance are selected to make up the lower-dimensional space.
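A minimal sketch of PCA with scikit-learn on synthetic data (the array sizes and component count here are illustrative, not the settings from our experiment):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for spectra: 200 "spectra" with 1000 flux values each.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(200, 1000))

# Project onto the first 5 principal components.
pca = PCA(n_components=5)
reduced = pca.fit_transform(spectra)

# Fraction of the original variance the kept components preserve.
variance_kept = pca.explained_variance_ratio_.sum()
```

In practice `n_components` can also be set to a fraction (e.g. `PCA(n_components=0.95)`), which keeps however many components are needed to preserve that share of the variance.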
Locally Linear Embedding
LLE is a nonlinear embedding algorithm that assumes the original data lies on a locally linear manifold. The algorithm calculates embedding vectors corresponding to this manifold, and then projects the original data onto these embedding vectors.
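The corresponding scikit-learn call is `LocallyLinearEmbedding`. A minimal sketch on synthetic data (the neighbor count and data shapes are illustrative assumptions, not the values used in our experiment):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic high-dimensional data standing in for spectra.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))

# Each point is reconstructed from its n_neighbors nearest neighbors,
# and a 3-dimensional embedding preserving those relations is computed.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3)
embedded = lle.fit_transform(data)
```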
Isometric Mapping
Isomap is also a nonlinear embedding algorithm. It builds a neighborhood graph over the data and uses shortest paths through that graph to estimate geodesic distances between points, preserving the structure of the data both locally and globally.
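Isomap has the same interface in scikit-learn. Again a minimal sketch on synthetic data, with illustrative parameter values rather than the ones from our experiment:

```python
import numpy as np
from sklearn.manifold import Isomap

# Synthetic high-dimensional data standing in for spectra.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))

# Isomap connects each point to its nearest neighbors, then embeds the
# data so graph (geodesic) distances are preserved in 3 dimensions.
iso = Isomap(n_neighbors=10, n_components=3)
embedded = iso.fit_transform(data)
```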
We then projected each of the four samples into 1-10 dimensional spaces using the PCA, LLE, and Isomap algorithms. We plotted 2D projections of all of the original data for each algorithm in order to visually assess how well each algorithm separated spectra by subclass. We found that each algorithm noticeably separated the subclasses, but significant overlap remained between them.
After we reduced the dimensionality of our spectral data, we used a Support Vector Machine (SVM) to classify the spectra. We trained our classifiers on the training sets and their corresponding subclasses, and then predicted the subclasses of the spectra in the testing sets.
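The reduce-train-predict sequence can be sketched end to end. Everything here is synthetic and illustrative (random "spectra", arbitrary labels, default SVM settings); it shows the shape of the workflow, not our actual configuration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic data: 300 "spectra" with 100 flux values and 4 "subclasses".
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
y = rng.integers(0, 4, size=300)

# Reduce dimensionality, then split half/half into training and testing sets.
X_reduced = PCA(n_components=5).fit_transform(X)
X_train, X_test = X_reduced[:150], X_reduced[150:]
y_train, y_test = y[:150], y[150:]

# Train the SVM on the training set, predict subclasses for the testing set.
clf = SVC()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

# Compare predicted subclasses to the known labels.
accuracy = accuracy_score(y_test, predicted)
```

With real spectra, the labels would be the SDSS subclasses rather than random integers, and the accuracy would be computed against them exactly as in the last line.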
We then determined the accuracy of the classifiers by comparing the subclasses assigned to the testing set with the subclasses given by SDSS. Our results differed significantly from those found in Bu et al. Based on their results, we expected the accuracy of our classifications to peak at around 3 dimensions for Isomap and 4 or 5 dimensions for PCA, and then decrease as the number of output dimensions increases. Additionally, they found Isomap to be more accurate than PCA at every dimensionality.
Our results do not reflect this, which could be for many reasons. We found PCA to be more accurate than Isomap and LLE (which had similar accuracies), and accuracy did not decrease as the number of output dimensions increased. Even if spectra are nonlinear overall, the distinction between K subclasses could be due to purely linear characteristics, which might explain the high accuracy of PCA. However, this still would not explain why our results are so different from those found by Bu et al., so more investigation is needed.
Our classification results differed fairly significantly from those obtained by Bu et al., so we would like to do more work to investigate the accuracy of our results. There are many possible extensions, all of which could give more insight into the results we found. We could perform this same experiment on different types of stars, such as M stars, or on larger data sets consisting of multiple types of stars, to see how effective our algorithms are in these cases. It would also be interesting to vary the sizes of our training and testing sets and determine how this affects the accuracy of our algorithms. There are many classification algorithms other than SVM, so experimenting with different classification algorithms would be interesting as well. The LLE and Isomap modules that we used also have training methods that can compute internal parameters before embedding the entire data set. Using these training methods would increase the computational speed, so it would be interesting to see how this might affect our results.
|
I am a math and physics major entering my senior year at Gettysburg College, a liberal arts college in Pennsylvania. I hope to attend graduate school next year in applied mathematics, and I am interested in many of the interdisciplinary aspects of math and physics. This summer I participated in an NSF-funded REU program through the Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA) at Northwestern University. I am working on an interdisciplinary astronomy project under the supervision of Dr. Ying Wu, a professor in the Electrical Engineering and Computer Science (EECS) department. Contact: thorte01 {at} gettysburg.edu
This material is based upon work supported by the National Science Foundation under Grant No. AST-1359462, a Research Experiences for Undergraduates (REU) grant awarded to CIERA at Northwestern University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.