TITLE:
Dirichlet Compound Multinomials Statistical Models
AUTHORS:
Paola Cerchiello, Paolo Giudici
KEYWORDS:
Textual Data Analysis; Mixture Models; Ontology Schema; Reputational Risk
JOURNAL NAME:
Applied Mathematics, Vol.3 No.12A, December 31, 2012
ABSTRACT:
This contribution deals with a generative approach for the analysis of textual data.
Instead of creating heuristic rules for the representation
of documents and word counts, we employ a distribution able to model word occurrences
across texts while accounting for different topics. In this regard, following Minka's
proposal (2003), we implement a Dirichlet Compound Multinomial (DCM) distribution and then propose an
extension, called sbDCM, that explicitly takes
into account the different latent topics that make up the document.
We follow two alternative approaches: on the one hand, the topics can be unknown
and thus estimated from the data; on the other hand, the topics are
determined in advance on the basis of a predefined ontological schema. The two
approaches are assessed on real data.
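
As a rough illustration of the DCM distribution named in the abstract, the sketch below evaluates the log-probability of a word-count vector under the standard Dirichlet-multinomial form. It is not the authors' implementation and does not cover the sbDCM extension; the function name `dcm_log_pmf` and the example parameter values are assumptions made for illustration only.

```python
from math import exp, lgamma


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a standard DCM.

    counts: non-negative integer counts, one per vocabulary word.
    alpha:  positive Dirichlet parameters, one per vocabulary word.
    (Hypothetical helper; the names are not taken from the paper.)
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    if any(x < 0 for x in counts) or any(a <= 0 for a in alpha):
        raise ValueError("counts must be >= 0 and alpha must be > 0")

    n = sum(counts)        # document length
    a_sum = sum(alpha)     # sum of the Dirichlet parameters

    # Multinomial coefficient: n! / prod_w(x_w!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Normalising ratio: Gamma(a_sum) / Gamma(n + a_sum)
    log_norm = lgamma(a_sum) - lgamma(n + a_sum)
    # Per-word terms: Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))

    return log_coef + log_norm + log_terms


# Example with a two-word vocabulary and illustrative parameters.
example_probability = exp(dcm_log_pmf([1, 1], [1.0, 1.0]))
```

Working in log space with `lgamma` avoids overflow for long documents, which is why the sketch returns a log-probability and exponentiates only in the small example.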