May 23, 2015

MARGENTO @ FLAIRS-28: Multilabel Subject-based Classification of Poetry

Multilabel Subject-based Classification of Poetry
Andres Lou, Diana Inkpen and Chris Tanasescu (MARGENTO)

[This paper is part of the larger ongoing MARGENTO project "The Graph Poem"]

Oftentimes, the question “what is this poem about?” has no
trivial answer, regardless of length, style, author, or context
in which the poem is found. We propose a simple system
of multi-label classification of poems based on their subjects
following the categories and subcategories as laid out by the
Poetry Foundation. We make use of a model that combines
the methodologies of tf-idf and Latent Dirichlet Allocation
for feature extraction, and a Support Vector Machine model
for the classification task. We determine how likely it is for
our models to correctly classify each poem they read into one
or more main categories and subcategories. Our contribution
is, thus, a new method to automatically classify poetry given
a set and various subsets of categories.

Classifying Poetry
In this work, we focus on how the vocabulary of a poem
determines its subject. While this notion seems intuitive,
acting on it is a much more difficult task than it appears
at first glance. As an example, let us consider the
following excerpt from "The Love Song of J. Alfred Prufrock,"
by T. S. Eliot:

Let us go then, you and I,
When the evening is spread out against the sky
Like a patient etherized upon a table;
Let us go, through certain half-deserted streets,
The muttering retreats
Of restless nights in one-night cheap hotels
And sawdust restaurants with oyster-shells:
Streets that follow like a tedious argument
Of insidious intent
To lead you to an overwhelming question ...
Oh, do not ask, “What is it?”
Let us go and make our visit.

As is the case with many modern and contemporary poems,
the subject of this celebrated high modernist piece is
problematic, elusive, and multilayered. The question of
what category this poem belongs to has a nebulous answer.
The title, while indicative, cannot be used to readily classify
it as a “Love” poem. Furthermore, the fact that it belongs
to a certain category such as “Love” does not imply that it
does not belong to a different category as well, such as “Living”,
nor does it imply whether it belongs to a subcategory
thereof, specifically, the subcategory of “Marriage & Companionship”
(indeed, as we will see, unequivocal single categorization
is rare). Furthermore, is the speaker’s insistent
urge to travel and discover (new?) places actually a facetious
one, as some of his diction strongly suggests, and then
what is the target of his irony? Are possibly capital existential
questions, such as the one in the penultimate line, muffled by
the modern condition of pointless rambling, undiscriminating
consumerism, and chronic disorientation? And where is
the announced love in the “tedious argument” of the alienating
placeless cityscape? The task of determining whether
a poem belongs to any given number of categories and subcategories,
by means of analyzing its lexical content, is the
objective of our work.

Our methodology involves three distinct phases: 1) Determining
the number of categories and subcategories, and their
nature, in which to place each poem; 2) Determining a method
to extract relevant features from each document; and 3) Selecting
an appropriate classifying algorithm.
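
To make the three phases concrete, here is a minimal sketch of how they could fit together. It is an illustration under stated assumptions, not the authors' implementation: the use of scikit-learn, the toy poems and labels, the number of LDA topics, and the one-vs-rest linear SVM setup are all assumptions that do not come from the paper.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-ins for poems (assumption): NOT the Poetry Foundation corpus.
poems = [
    "my heart longs for your tender kiss",
    "we wed in spring and grew old together",
    "the river runs cold through silent pines",
    "grief walks beside me every morning",
    "your love is a garden in full bloom",
    "snow settles on the mountain at dusk",
]

# Phase 1: every poem carries one or more category labels (hypothetical labels).
labels = [
    {"Love"},
    {"Love", "Living"},
    {"Nature"},
    {"Living"},
    {"Love", "Nature"},
    {"Nature"},
]
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)  # one 0/1 column per category

# Phase 2: tf-idf weights plus LDA topic proportions, concatenated column-wise.
X_tfidf = TfidfVectorizer().fit_transform(poems)
word_counts = CountVectorizer().fit_transform(poems)  # LDA works on raw counts
lda = LatentDirichletAllocation(n_components=3, random_state=0)  # 3 topics: arbitrary
X_topics = lda.fit_transform(word_counts)
X = hstack([X_tfidf, csr_matrix(X_topics)]).tocsr()

# Phase 3: one binary linear SVM per category (one-vs-rest is an assumption).
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(X, Y)

# Predicting on the training poems only shows the output format, not accuracy.
predicted = classifier.predict(X)
predicted_labels = binarizer.inverse_transform(predicted)
```

Each entry of `predicted_labels` is a tuple of category names, possibly empty, which is how a multi-label classifier can place a single poem in several categories at once.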

Feature Extraction
The content-based nature of the classification task makes
two models well suited to extracting features from our corpus:
Term Frequency-Inverse Document Frequency (tf-idf)
as applied to a Bag-of-Words model, and Latent Dirichlet
Allocation (LDA).
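
As a rough illustration of the first of these models, the sketch below computes tf-idf weights over a Bag-of-Words representation using only the Python standard library. The weighting variant (relative term frequency times the natural log of inverse document frequency), the simple tokenizer, and the helper names `tokenize` and `tf_idf` are assumptions made for this example; the text above does not say which variant is used. LDA is omitted here.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in words if word]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * ln(N / document frequency)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Three lines from the excerpt above, each treated as a tiny "document".
lines = [
    "Let us go then, you and I,",
    "Let us go, through certain half-deserted streets,",
    "Let us go and make our visit.",
]
vectors = tf_idf(lines)
```

Under this weighting, words shared by every line ("let", "us", "go") receive a weight of zero, while words unique to one line (such as "streets") receive a positive weight. That is the intuition behind using tf-idf to surface vocabulary that distinguishes one poem from another.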

Check out the full Program of FLAIRS-28 here
