Dirichlet Compound Multinomials Statistical Models

Paola Cerchiello; Paolo Giudici

doi:10.4236/am.2012.312A288

Applied Mathematics > Vol.3 No.12A, December 2012

Department of Economics and Management, University of Pavia, Pavia, Italy.

Abstract

This contribution deals with a generative approach to the analysis of textual data. Instead of creating heuristic rules for the representation of documents and word counts, we employ a distribution able to model word occurrences across texts while accounting for different topics. In this regard, following Minka's proposal (2003), we implement a Dirichlet Compound Multinomial (DCM) distribution; we then propose an extension, called sbDCM, that explicitly takes into account the different latent topics that make up a document. We follow two alternative approaches: on the one hand, the topics can be unknown and thus estimated from the data; on the other hand, the topics can be determined in advance on the basis of a predefined ontological schema. The two approaches are assessed on real data.
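The abstract does not spell out the DCM probability mass function, so the following is only a minimal sketch of the standard form (the Pólya distribution, as used for instance in reference [11]): for a word-count vector x with total n and parameters α with sum A, P(x | α) = n!/∏ x_w! · Γ(A)/Γ(n + A) · ∏ Γ(x_w + α_w)/Γ(α_w). The function name `dcm_log_pmf` and the toy parameter values are illustrative assumptions, not taken from the paper, and the sbDCM extension is not covered.

```python
from math import lgamma, exp


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a DCM (Polya) distribution.

    counts: non-negative integer counts, one per vocabulary word.
    alpha:  positive Dirichlet parameters, one per vocabulary word.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)      # document length
    a = sum(alpha)       # sum of the Dirichlet parameters
    # Multinomial coefficient n! / prod(x_w!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Gamma(A) / Gamma(n + A)
    log_norm = lgamma(a) - lgamma(n + a)
    # prod over words of Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_coef + log_norm + log_terms


# Toy comparison (hypothetical values): with small alpha, a document that
# repeats one word is scored against one that spreads its three tokens evenly.
small_alpha = [0.1, 0.1, 0.1]
print("repeated word:", exp(dcm_log_pmf([3, 0, 0], small_alpha)))
print("spread words: ", exp(dcm_log_pmf([1, 1, 1], small_alpha)))
```

Working in log space with `lgamma` avoids overflow in the gamma functions for realistic document lengths; exponentiate only when an actual probability is needed.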

Keywords

Textual Data Analysis; Mixture Models; Ontology Schema; Reputational Risk

Share and Cite:

P. Cerchiello and P. Giudici, "Dirichlet Compound Multinomials Statistical Models," Applied Mathematics, Vol. 3 No. 12A, 2012, pp. 2089-2097. doi: 10.4236/am.2012.312A288.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1]	S. Deerwester, S. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, Vol. 41, No. 6, 1990, pp. 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[2]	T. Hofmann, “Probabilistic Latent Semantic Indexing,” Proceedings of Special Interest Group on Information Retrieval, New York, 1999, pp. 50-57.
[3]	D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, Vol. 3, 2003, pp. 993-1022.
[4]	M. Girolami and A. Kaban, “On an Equivalence between PLSI and LDA,” Proceedings of Special Interest Group on Information Retrieval, New York, 2003, pp. 433-434.
[5]	D. M. Blei and J. D. Lafferty, “Correlated Topic Models,” Advances in Neural Information Processing Systems, Vol. 18, 2006, pp. 1-47.
[6]	D. Putthividhya, H. T. Attias and S. S. Nagarajan, “Independent Factor Topic Models,” Proceedings of International Conference on Machine Learning, New York, 2009, pp. 833-840.
[7]	J. E. Mosimann, “On the Compound Multinomial Distribution, the Multivariate B-Distribution, and Correlations among Proportions,” Biometrika, Vol. 49, No. 1-2, 1962, pp. 65-82.
[8]	K. Sjölander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I. S. Mian and D. Haussler, “Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology,” Computer Applications in the Biosciences, Vol. 12, No. 4, 1996, pp. 327-345.
[9]	D. J. C. MacKay and L. Peto, “A Hierarchical Dirichlet Language Model,” Natural Language Engineering, Vol. 1, No. 3, 1994, pp. 1-19.
[10]	T. Minka, “Estimating a Dirichlet Distribution,” Unpublished Paper, 2003. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/
[11]	R. E. Madsen, D. Kauchak and C. Elkan, “Modeling Word Burstiness Using the Dirichlet Distribution,” Proceedings of the 22nd International Conference on Machine Learning, New York, 2005, pp. 545-552.
[12]	G. Doyle and C. Elkan, “Accounting for Burstiness in Topic Models,” Proceedings of International Conference on Machine Learning, New York, 2009, pp. 281-288.
[13]	J. D. M. Rennie, L. Shih, J. Teevan and D. R. Karger, “Tackling the Poor Assumptions of Naive Bayes Text Classifiers,” Proceedings of the 20th International Conference on Machine Learning, Washington DC, 2003, 6 p.
[14]	A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, 1977, pp. 1-38.
[15]	D. Böhning, “The EM Algorithm with Gradient Function Update for Discrete Mixtures with Known (Fixed) Number of Components,” Statistics and Computing, Vol. 13, No. 3, 2003, pp. 257-265. doi:10.1023/A:1024222817645
[16]	S. Staab and R. Studer, “Handbook on Ontologies, International Handbooks on Information Systems,” 2nd Edition, Springer, Berlin, 2009.
[17]	P. Cerchiello, “Statistical Models to Measure Corporate Reputation,” Journal of Applied Quantitative Methods, Vol. 6, No. 4, 2011, pp. 58-71.
