Dirichlet Compound Multinomials Statistical Models

Abstract

This contribution deals with a generative approach to the analysis of textual data. Instead of creating heuristic rules for the representation of documents and word counts, we employ a distribution able to model word occurrences across texts while accounting for different topics. Following Minka's proposal (2003), we implement a Dirichlet Compound Multinomial (DCM) distribution and then propose an extension, called sbDCM, that explicitly takes into account the different latent topics that make up a document. We follow two alternative approaches: in the first, the topics are unknown and must be estimated from the data; in the second, the topics are fixed in advance on the basis of a predefined ontological schema. Both approaches are assessed on real data.
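
As a minimal illustration of the DCM distribution named in the abstract, the following Python sketch evaluates the standard DCM (Pólya) log-probability of a word-count vector. It is not the authors' implementation and does not cover the sbDCM extension; the function name `dcm_log_pmf` and the example parameter values are assumptions chosen for illustration.

```python
from math import exp, lgamma


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a DCM distribution.

    counts: non-negative integer counts, one per vocabulary word.
    alpha:  positive Dirichlet parameters, one per vocabulary word.
    (Both names are illustrative, not taken from the paper.)
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha values must be positive")

    n = sum(counts)        # document length
    a_sum = sum(alpha)     # total concentration

    # log of the multinomial coefficient n! / prod(x_w!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # log of Gamma(a_sum) / Gamma(n + a_sum)
    log_norm = lgamma(a_sum) - lgamma(n + a_sum)
    # log of prod Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coef + log_norm + log_terms


# Hypothetical two-word vocabulary with small, equal alpha values.
alpha = [0.1, 0.1]
print("bursty document [4, 0]:", exp(dcm_log_pmf([4, 0], alpha)))
print("even document   [2, 2]:", exp(dcm_log_pmf([2, 2], alpha)))
```

With small alpha values the DCM assigns more probability to the document that repeats one word four times than to the evenly split one, whereas a multinomial with equal word probabilities would favor the even split. This is the burstiness effect discussed in references [11] and [12].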

Share and Cite:

P. Cerchiello and P. Giudici, "Dirichlet Compound Multinomials Statistical Models," Applied Mathematics, Vol. 3 No. 12A, 2012, pp. 2089-2097. doi: 10.4236/am.2012.312A288.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] S. Deerwester, S. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, Vol. 41, No. 6, 1990, pp. 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[2] T. Hofmann, “Probabilistic Latent Semantic Indexing,” Proceedings of Special Interest Group on Information Retrieval, New York, 1999, pp. 50-57.
[3] D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, Vol. 3, 2003, pp. 993-1022.
[4] M. Girolami and A. Kaban, “On an Equivalence between PLSI and LDA,” Proceedings of Special Interest Group on Information Retrieval, New York, 2003, pp. 433-434.
[5] D. M. Blei and J. D. Lafferty, “Correlated Topic Models,” Advances in Neural Information Processing Systems, Vol. 18, 2006, pp. 1-47.
[6] D. Putthividhya, H. T. Attias and S. S. Nagarajan, “Independent Factor Topic Models,” Proceedings of the International Conference on Machine Learning, New York, 2009, pp. 833-840.
[7] J. E. Mosimann, “On the Compound Multinomial Distribution, the Multivariate β-Distribution, and Correlations among Proportions,” Biometrika, Vol. 49, No. 1-2, 1962, pp. 65-82.
[8] K. Sjölander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I. S. Mian and D. Haussler, “Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology,” Computer Applications in the Biosciences, Vol. 12, No. 4, 1996, pp. 327-345.
[9] D. J. C. MacKay and L. Peto, “A Hierarchical Dirichlet Language Model,” Natural Language Engineering, Vol. 1, No. 3, 1994, pp. 1-19.
[10] T. Minka, “Estimating a Dirichlet Distribution,” Unpublished Paper, 2003. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/
[11] R. E. Madsen, D. Kauchak and C. Elkan, “Modeling Word Burstiness Using the Dirichlet Distribution,” Proceedings of the 22nd International Conference on Machine Learning, New York, 2005, pp. 545-552.
[12] G. Doyle and C. Elkan, “Accounting for Burstiness in Topic Models,” Proceedings of the International Conference on Machine Learning, New York, 2009, pp. 281-288.
[13] J. D. M. Rennie, L. Shih, J. Teevan and D. R. Karger, “Tackling the Poor Assumptions of Naive Bayes Text Classifiers,” Proceedings of the 20th International Conference on Machine Learning, Washington DC, 2003, 6 p.
[14] A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, 1977, pp. 1-38.
[15] D. Böhning, “The EM Algorithm with Gradient Function Update for Discrete Mixtures with Known (Fixed) Number of Components,” Statistics and Computing, Vol. 13, No. 3, 2003, pp. 257-265. doi:10.1023/A:1024222817645
[16] S. Staab and R. Studer, “Handbook on Ontologies, International Handbooks on Information Systems,” 2nd Edition, Springer, Berlin, 2009.
[17] P. Cerchiello, “Statistical Models to Measure Corporate Reputation,” Journal of Applied Quantitative Methods, Vol. 6, No. 4, 2011, pp. 58-71.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.