Statistical Optimization of Training Data for Semi-Supervised Text Document Clustering
dc.contributor.advisor | Phillips, Joshua | |
dc.contributor.author | Newbold, Cody Renae | |
dc.contributor.committeemember | Pettey, Chrisila | |
dc.contributor.committeemember | Li, Cen | |
dc.contributor.department | Computer Science | en_US |
dc.date.accessioned | 2017-10-04T20:13:31Z | |
dc.date.available | 2017-10-04T20:13:31Z | |
dc.date.issued | 2017-06-22 | |
dc.description.abstract | Unsupervised machine learning algorithms suffer from uncertainty about whether their results are accurate or useful. In particular, text document clustering algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) give no guarantee that documents are clustered the way human readers would cluster them. Using a semi-supervised approach to text document clustering, we show that the selection of training data can be statistically optimized using LDA and LSA. In this method, a human reader categorizes a percentage of the data as an analysis step, then feeds the partially labeled data into bootstrap training and testing steps. After using mutual information to discover which documents were better for training, the algorithm performs a post-processing step with the optimized training set. The results show that mutual information values are higher when the statistically optimized training set is used, and indicate that human-like performance is better achieved with optimized training data. | |
dc.description.degree | M.S. | |
dc.identifier.uri | http://jewlscholar.mtsu.edu/xmlui/handle/mtsu/5393 | |
dc.publisher | Middle Tennessee State University | |
dc.subject | Afghan war diary | |
dc.subject | Latent dirichlet allocation | |
dc.subject | Latent semantic analysis | |
dc.subject | Topic modeling | |
dc.subject.umi | Computer science | |
dc.thesis.degreegrantor | Middle Tennessee State University | |
dc.thesis.degreelevel | Masters | |
dc.title | Statistical Optimization of Training Data for Semi-Supervised Text Document Clustering | |
dc.type | Thesis |
Files
Original bundle
- Name: Newbold_mtsu_0170N_10848.pdf
- Size: 958.54 KB
- Format: Adobe Portable Document Format