Anison classification from unsupervised lexical clustering

Ohisashiburi desu, etc. There are probably many more important things to talk about first, but I've been having fun playing with some tools this weekend, so here goes.

If you've ever paid attention to anime song lyrics, you've probably noticed that the same words tend to come up over and over again. And more precisely, that the same words tend to be used in songs with the same mood or belonging to the same genre. So I figured we should be able to establish a classification of anime songs simply by looking at their lyrics, one that might teach us something about them or even about the shows they are used in.

And the results are indeed relatively interesting.

But before I describe them, let me first present the way they were obtained. You can think of this post as advertising the application of natural language processing (NLP) techniques to the analysis of anime-related texts. I think they have the potential of being quite fruitful.

Methodology

“Classifying anime songs by looking at the lyrics” can be viewed as a common problem in NLP called text clustering: you have a large number of texts and you want to divide them into several groups according to some criterion like subject matter. In some cases, you know in advance which subject matters will be covered, and you can then try to search the texts for relevant keywords. But in this case, I wanted to approach the problem with no a priori idea about what the classification would look like, and this can be done with something called unsupervised clustering. Basically, you define some notion of proximity between different texts, and you try to automatically divide them into smaller collections, such that the texts in a collection are close to one another and far from the other collections.

This sounds complicated, but fortunately, there are readily available tools that will do most of the job for you. Note that I'm a complete NLP layman myself, but these tools are reasonably user-friendly. Here's what the process looks like.

1. Corpus collection

Gather the texts you are going to analyze. In this case, I used the large database of anison lyrics from this site, whose webmasters call themselves Kokoro yasashiki senshitachi. So I guess they won't mind despite the JASRAC warnings, uhm.

This provided me with over 12,500 anime songs to play with! This database is not completely exhaustive even for recent anime (for example, anison which come from games, like Tori no uta and Last regrets, seem not to be included), but it's still pretty solid and reasonably easy to process.

2. Word segmentation

Break up each of your texts (in this case about 12,500 anime songs) into separate words, and count the number of times these words appear in each text.

For texts in a language like English, this is a relatively easy task, because word breaks are usually marked by spaces. All you have to do is remove punctuation, possibly use some sort of “stemming” to ensure that singular and plural or conjugated verbs are considered the same word, and you're done. But Japanese doesn't use spaces, which makes the problem quite a bit more complicated.

Fortunately, it's also a very extensively studied problem for which a lot of software exists. Examples include mecab and JUMAN (if you're a Linux user, they are probably already packaged by your distribution; otherwise, just visit the webpage and click on the download link). These are in fact morphological analyzers of Japanese, meaning that they will not only break up the texts into separate words, but also tell you that this word is a noun and that word is an adjective and 見られませんでした is a conjugated form of the verb 見る. A nice additional feature is canonicalization: as you may know, a given Japanese word can be written in many different ways (e.g. 寂しい, 淋しい, さみしい, さびしい are all different ways of writing the word samishii). What canonicalization provides is one standardized orthography for each word, so that we can treat all the different variants as really the same word. We may be missing a slight nuance that way, but in this case I'm always treating two words as either exactly the same or completely different, so it definitely makes sense to consider them the same (even in borderline cases like 思う vs. 想う).

So run mecab on your 12,500 songs, keep only canonicalized words (which conveniently eliminates particles, copulas and other very common elements which you don't want to clutter up your files), count occurrences, and you have all the data you need to run the clustering tool.
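
For the curious, here is roughly what the counting step could look like in Python. It is only a sketch, not the script used for this post: it assumes the segmenter has already turned each song into a list of canonical words, and the function name and toy data are made up for illustration.

```python
from collections import Counter

def count_words(songs):
    """Return (vocabulary, per-song counts) for pre-segmented songs.

    `songs` is a list of token lists, one per song; each token is assumed
    to be the canonical form already produced by the segmenter.
    """
    vocabulary = {}   # word -> column index (0-based)
    per_song = []     # one Counter {column index: count} per song
    for tokens in songs:
        counts = Counter()
        for word in tokens:
            # Give each new word the next free column index.
            column = vocabulary.setdefault(word, len(vocabulary))
            counts[column] += 1
        per_song.append(counts)
    return vocabulary, per_song

# Toy input standing in for the segmenter's output (hypothetical data).
songs = [
    ["君", "夢", "君", "空"],
    ["あなた", "夢", "恋"],
]
vocabulary, per_song = count_words(songs)
```

Each song ends up as a small table of "word number, number of occurrences" pairs, which is all the clustering step needs.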

At this point, you can also answer simpler but still interesting questions, like “What are the most common words appearing in anison?” The top 20 is as follows (the number in parentheses is the total number of times it appears throughout the whole corpus, multiple occurrences in one song included):

  1. kimi (10527)
  2. watashi (9106)
  3. anata (8934)
  4. yume (8806)
  5. kokoro (7412)
  6. sora (5904)
  7. te (5707)
  8. ashita/asu (5258)
  9. omou (5186)
  10. iku (5087)
  11. ai (4869)
  12. kaze (4818)
  13. mune (4763)
  14. boku (4668)
  15. ii (4584)
  16. kitto (4472)
  17. mou (4158)
  18. miru (4137)
  19. hi (3964)
  20. mirai (3810)

I bet you're not really surprised! Note that you can also learn things about specific subcorpora of interest; for example, the “most common words” question can be applied to something like songs by Mizuki Ichirou (the top 5 is ore, tatakau, kimi, kokoro and kurikaesu) or to songs from the two seasons of K-ON! (now it's watashi, ai, ii, yume, minna).
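
Ranking words like this takes only a few lines once the songs are segmented. Here's a minimal sketch, assuming each song is a list of canonical words; the song titles and lyrics below are placeholders, not data from the corpus.

```python
from collections import Counter

def most_common_words(songs, n=5):
    """Rank words by total occurrences across all the given songs."""
    totals = Counter()
    for tokens in songs:
        totals.update(tokens)   # repeated words within a song all count
    return totals.most_common(n)

# Placeholder songs keyed by a made-up title.
songs_by_title = {
    "song A": ["君", "君", "君", "夢"],
    "song B": ["夢", "君", "夢"],
    "song C": ["俺", "戦う", "俺"],
}

# Whole corpus, then a one-song "subcorpus".
whole = most_common_words(songs_by_title.values(), n=2)
subcorpus = most_common_words([songs_by_title["song C"]], n=1)
```

Restricting the question to one singer or one show is just a matter of passing a smaller list of songs.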

3. Unsupervised clustering

Now that we have compiled a list of words and know how many times they appear in each song, all we have to do is feed all this data into a clustering tool to get back our classification automagically. Several such clustering tools can be downloaded freely from their authors' webpages, including F. Dellaert's Matlab Clustering, G. Karypis's CLUTO and A. McCallum's Bow toolkit. Interested readers might also want to look at this nice post (in English) on a random Chinese blog, full of NLP resources.

In this case I only tried CLUTO, and more precisely the enclosed program vcluster, which I found quite convenient and easy to use. It simply takes two arguments: one file describing your data (a big array of numbers, with each line corresponding to a document, listing the number of occurrences of each word in it) and one parameter specifying the number of clusters you want it to divide your data into. Plus many optional parameters to tweak everything to your needs, but not being a specialist, I mostly stuck with the defaults, which are well-suited for document analysis (the technical terms are cosine distance, idf scaling and repeated bisections). To ask for, say, 4 clusters, you invoke the program with a line like:

vcluster  -clabelfile=anison.clabel -showfeatures anison.mat 4

where that .clabel file is the list of actual words in your documents, used for pretty printing; the program itself only works with the .mat file (the big array of numbers). It takes a couple of seconds to complete its computations and replies with:

Matrix Information
-----------------------------------------------------------
  Name: anison.mat, #Rows: 12533, #Columns: 16659, #NonZeros: 698007

Solution
------------------------------------------------------------------------
4-way clustering: [I2=2.28e+03] [12531 of 12533]
------------------------------------------------------------------------
cid  Size  ISim  ISdev   ESim  ESdev  | 
------------------------------------------------------------------------
  0  2151 +0.059 +0.020 +0.025 +0.007 | 
  1  2525 +0.046 +0.016 +0.024 +0.007 | 
  2  4391 +0.030 +0.010 +0.021 +0.008 | 
  3  3464 +0.018 +0.007 +0.016 +0.008 | 
------------------------------------------------------------------------
4-way clustering solution - Descriptive & Discriminating Features...
--------------------------------------------------------------------------------
Cluster   0, Size:  2151, ISim: 0.059, ESim: 0.025
      Descriptive:  君 36.3%, 僕 11.3%, 僕等  0.8%, 思う  0.7%, ずっと  0.6% 
   Discriminating:  君 51.0%, 僕 15.0%, あなた  7.1%, 私  2.8%, 御  1.7%

Cluster   1, Size:  2525, ISim: 0.046, ESim: 0.024
      Descriptive:  あなた 27.1%, 私  8.0%, 恋  1.8%, ずっと  1.2%, 愛  1.1% 
   Discriminating:  あなた 43.3%, 君  9.9%, 私  7.8%, 僕  3.4%, 俺  2.0%

Cluster   2, Size:  4391, ISim: 0.030, ESim: 0.021
      Descriptive:  俺  3.3%, 空  1.3%, 夢  1.2%, ゆく  1.2%, 愛  1.2% 
   Discriminating:  君 10.1%, あなた  8.6%, 俺  5.9%, 私  3.8%, 御  3.5%

Cluster   3, Size:  3464, ISim: 0.018, ESim: 0.016
      Descriptive:  御 10.1%, 皆  3.2%, 私  1.7%, 名  1.7%, いい  1.4% 
   Discriminating:  御 13.3%, 君  5.6%, あなた  3.8%, 皆  3.0%, 僕  1.2%
--------------------------------------------------------------------------------

This means I have 12,533 songs with 16,659 unique words, and 698,007 nonzero entries in the matrix, i.e. distinct (song, word) pairs in which the word occurs at least once. The program can divide all of the songs into 4 clusters except for two of them that won't really fit. The first cluster is characterized by words like kimi, boku, bokura, omou, zutto; the second by anata, watashi, koi, zutto, ai; the third by ore, sora, yume, yuku, ai; and the fourth by o, minna, watashi, na, ii. The ISim and ESim numbers are measures of the quality of your clustering: they indicate how similar the texts within a cluster are to each other, and how similar they are to the texts in other clusters, respectively (so you want a high ISim and a low ESim).
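
To make "cosine", "idf scaling" and the two quality numbers a bit more concrete, here's a toy Python sketch. It is not CLUTO's implementation. It uses the textbook idf weight log(N/df), and it assumes ISim is the average similarity over distinct pairs inside a cluster and ESim the average similarity between members and non-members; CLUTO's exact formulas may differ. All names and counts are invented.

```python
import math
from itertools import combinations

def idf_scale(rows):
    """Multiply each count by log(N / df), a common textbook idf weight."""
    n = len(rows)
    df = {}
    for row in rows:
        for word in row:
            df[word] = df.get(word, 0) + 1
    return [{w: c * math.log(n / df[w]) for w, c in row.items()} for row in rows]

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def isim(cluster):
    """Average similarity over all distinct pairs inside one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def esim(cluster, others):
    """Average similarity between cluster members and all other texts."""
    if not cluster or not others:
        return 0.0
    total = sum(cosine(u, v) for u in cluster for v in others)
    return total / (len(cluster) * len(others))

# Four toy songs (hypothetical counts), split by hand into two clusters.
rows = idf_scale([
    {"kimi": 2, "boku": 1},
    {"kimi": 1, "boku": 1},
    {"ore": 2, "tatakau": 1},
    {"ore": 1, "kimi": 1},
])
cluster_a, cluster_b = rows[:2], rows[2:]
inside = isim(cluster_a)              # similarity within the first cluster
outside = esim(cluster_a, cluster_b)  # similarity to the other songs
```

On this toy data the first cluster is much more similar to itself than to the rest, which is the pattern a good clustering should show.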

In addition, of course, there is a file written out that indicates for each song which cluster it belongs to, so analyzing the result amounts to browsing that data.
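
If you want to reproduce the file plumbing around vcluster, something like the following Python sketch might do. It rests on two assumptions worth checking against the CLUTO manual: that the sparse .mat format is a "rows columns nonzeros" header followed by one line of 1-based "column value" pairs per song, and that the clustering file holds one cluster id per line, with a negative id for songs left unclustered. The function names are made up.

```python
from collections import Counter

def to_cluto_sparse(per_song, n_columns):
    """Serialize per-song {column index: count} dicts as sparse-matrix text.

    Assumed layout: a header line "rows columns nonzeros", then one line per
    song made of "column value" pairs, with columns numbered from 1.
    """
    nonzeros = sum(len(counts) for counts in per_song)
    lines = [f"{len(per_song)} {n_columns} {nonzeros}"]
    for counts in per_song:
        pairs = (f"{col + 1} {counts[col]}" for col in sorted(counts))
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"

def cluster_sizes(clustering_text):
    """Count songs per cluster from one-cluster-id-per-line text.

    Assumes negative ids mark songs left unclustered; they are skipped.
    """
    ids = [int(token) for token in clustering_text.split()]
    return Counter(i for i in ids if i >= 0)

per_song = [{0: 2, 1: 1}, {1: 1, 2: 3}]   # toy counts, 0-based columns
mat_text = to_cluto_sparse(per_song, n_columns=3)
sizes = cluster_sizes("0\n1\n1\n-1\n")
```

The .clabel file would then simply list the words, one per line, in column order.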

Limitations and caveats

There are several limitations to the approach I have taken here, that could probably be overcome by more careful work. As we usually say in academic papers, this is left for further research. Here's a short list of problems:

  • You have to choose the number of clusters that CLUTO produces yourself, so it's a bit arbitrary. You can always divide your clusters into even smaller sets (that's what repeated bisection does: take one of the existing clusters and divide it further). An ideal clustering would have good quality scores and meaningful, easy-to-describe results. That's what I tried to achieve, but I can't say with confidence that I have succeeded.

  • CLUTO has certain data visualization features. It can produce graphical representations of how often the important keywords it singles out appear in all the texts, or on average in the clusters, etc. Unfortunately, it doesn't work with Japanese characters at all (and our corpus here is much too large for many of the visualization tools to produce manageable output anyway). There's a companion program called gCLUTO that is supposed to have a GUI and do 3D graphs and other nifty stuff, and might solve some of those issues, but it's a pretty old piece of software that doesn't run on my computer (requires a 32-bit userland), so I haven't had a chance to try it out yet.

  • Songs are small texts, with not so many words in them, so there's an intrinsic difficulty involved in comparing them word for word. We may be missing similarities between songs which express the same concept in not quite the same words. One could argue that it's a feature, not a bug, since we wanted to draw comparisons based on just words to begin with; but applied to small texts it's not that robust (you could have three love songs 1, 2, 3 with 1 and 2 using mostly the same words, 2 and 3 also, but not 1 and 3, for example; by default CLUTO helps with such problems to some extent, but if you want it to handle them on a large scale, it becomes considerably slower).

    One alternate approach that is sometimes used is to consider that certain words have similar meanings and adjust the distance between texts by taking that into account. It can be done more or less automatically as well, but it's very time-consuming for large datasets (probably impossible for a corpus of 12,000 texts, though I admittedly haven't tried).

  • You may be wondering about the 名, or even the 御, in the description of cluster 3 above, as one wouldn't expect those characters to be particularly common or descriptive of a style of anison. For the most part, they come from erroneous parsing in the segmentation stage.

    The main problem is that it's relatively common for anison lyrics to use a somewhat fancy writing style which can confuse automatic analyzers. For example, if the word 最強 appears as さいきょう in hiragana, it is recognized correctly, but さいきょー will probably give completely spurious results (mecab analyzes that as 差, 依拠, ー, and juman as 差異, 挙, ー, both of which are total nonsense). There's not much you can do about it short of delving into the code of those programs (no thanks), but one can reasonably hope that those misreadings are rare enough not to matter too much. As a simple trick to keep the most common of such misreadings from influencing the results, I simply chose to use a stoplist, i.e. remove them altogether after segmentation.

    Another approach I did try was to restrict the parts-of-speech I was taking into account, to only verbs and adjectives, say; this effectively eliminates small spurious words automatically, but doesn't give very convincing results because it makes the “small texts” problem much worse.
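
The stoplist and part-of-speech tricks from the last item are both simple filters applied after segmentation. Here's a sketch in Python, with the caveat that the stoplist entries and the part-of-speech labels are illustrative guesses: the post doesn't list the actual stoplist, and the tags depend on the analyzer and dictionary you use.

```python
# Hypothetical stoplist: the post does not list the entries actually removed.
STOPLIST = {"御", "名", "ー"}

# Part-of-speech labels are an assumption about the analyzer's output.
CONTENT_POS = {"動詞", "形容詞"}   # verb, adjective

def filter_tokens(tagged_tokens, stoplist=STOPLIST, keep_pos=None):
    """Drop stoplisted words and, optionally, keep only some parts of speech.

    `tagged_tokens` is a list of (canonical word, part of speech) pairs.
    """
    kept = []
    for word, pos in tagged_tokens:
        if word in stoplist:
            continue
        if keep_pos is not None and pos not in keep_pos:
            continue
        kept.append(word)
    return kept

# A made-up tagged song fragment.
song = [("御", "接頭詞"), ("君", "名詞"), ("思う", "動詞"), ("強い", "形容詞")]
with_stoplist = filter_tokens(song)
verbs_and_adjectives = filter_tokens(song, keep_pos=CONTENT_POS)
```

The second call shows why the part-of-speech restriction hurts on short texts: an already small song loses most of its words.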

So yeah, take what follows with a grain of salt. I still think it's interesting.

Results

In the end, I arrived at a reasonably robust classification of anison into five categories. For the purpose of easy reference, I suggest the following, hopefully descriptive names:

  1. Boy dreams girl
  2. Romance
  3. Nekketsu
  4. Light and shadows
  5. Happy-go-lucky

I'll try to describe each of them in some detail, but before that, let's draw some charts and have a look at how songs overall and interesting keywords are distributed among these categories.

[Figure: Five categories of anison. Distribution of anime songs among the five categories.]
[Figure: Pronouns. First- and second-person pronouns in the five categories (colors as above).]
[Figure: Chrononyms. Common time references in the five categories.]
[Figure: Feelings. Feeling words in the five categories.]
[Figure: Verbs. Different verbs in the five categories.]

OK, so let's proceed with the categories themselves. For each category, I'll discuss the following points.

Descriptive word cloud:
A word cloud of the most descriptive words in the category. These aren't the most common words in it, but the words that characterize it best (in the sense that they are significantly more common within than without).
Cluster size:
The number and percentage of songs in the category.
Quality score:
The ISim and ESim values returned by CLUTO for that category. A high ISim value means that the songs in the category are very similar to each other. A low ESim value means that the songs in the category are very different from the ones in other categories.
Descriptive words:
Some of the descriptive words singled out.
Representative singers:
Some regular anison singers whose work is particularly focused on that category. Each example comes with an “anomality score” indicating the degree to which their songs fall more often in that category than one would expect considering the size of the category. For example, if a category contains 20% of all songs but 40%, resp. 80% of the songs by that singer, the singer has an anomality score of +1 (twice as many), resp. +2 (4 times as many) with respect to that category.
Representative shows:
Some anime series whose songs fall heavily in that category (together with an anomality score as above). For the result to be meaningful, we have only considered anime series with many songs (at least about 20), which skews the examples towards either long-running franchises (e.g. Gundam) or series that put out a huge number of CDs (e.g. K-On!).
Interpretation:
Attempt at describing the category overall.
One typical example:
That I will choose arbitrarily, with a link to animelyrics.com in case a hypothetical reader with less-than-functional Japanese wants a translated example of what I'm talking about.
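
The anomality score described in the list above isn't given as a formula, but both worked examples (twice as many gives +1, four times as many gives +2) fit a base-2 logarithm of the ratio of shares. Assuming that reading, it can be sketched in Python as follows; the function name is made up.

```python
import math

def anomality(share_in_subset, share_overall):
    """Base-2 log of how over-represented a category is in a subset.

    `share_in_subset` is the fraction of the singer's (or show's) songs in
    the category; `share_overall` is the category's share of all songs.
    This log2 form is inferred from the examples, not stated in the post.
    """
    return math.log2(share_in_subset / share_overall)

twice = anomality(0.40, 0.20)        # twice as many as expected
four_times = anomality(0.80, 0.20)   # four times as many as expected
```

The same reading matches the 食 figure quoted further down: 75% of occurrences in a category holding 30% of songs gives log2(2.5), i.e. about +1.32. A score of 0 means the subset matches the overall distribution, and negative scores mean under-representation.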

1. Boy dreams girl

Descriptive word cloud for “Boy dreams girl”.
Cluster size:
2170/12530 (17%).
Quality score:
ISim = 0.059; ESim = 0.025 (very clear-cut).
Descriptive words:
Pronouns: kimi, boku, bokura.
Time references: hi, ashita, itsuka, zutto, mirai.
Verbs: yuku, omou, miru, aruku.
Nouns: sora, kaze, yume; te, kokoro, mune, namida.
Adjectives: tsuyoi.
Adverbs: kitto.
Representative singers:
Anison units like marble (anomality +1.71), Kalafina (+1.68), GARNET CROW (+1.60) or ZARD (+1.53) fall heavily in that category, as do singers like Hayashimoto Mayumi (+1.53) and Sakamoto Maaya (+1.19).
Representative shows:
Many shounen shows with a young boy protagonist and/or a relatively cross-gender appeal fall in this category. This includes Touch (+1.66), the Meitantei Conan series (+1.60), Kateikyoushi hitman REBORN (+1.44), Gintama (+1.35) or even Naruto (+1.21). Dramas for an older audience can also have many songs in this category, e.g. NHK ni youkoso (+1.61) or the Da Capo series (+1.06).
Interpretation:
Songs in this category have a first-person protagonist saying boku (or bokura), often addressing a more or less abstract feminine figure as kimi. There is a lot of dreaming and looking-at-the-sky, walking-in-the-wind and growing and contemplating-the-future. There may also be some love involved, but only the platonic ai: you don't have koi here, or any touchy-feely vocabulary referring to actual physical contact; the protagonist is either reaching for that kimi figure in his dreams or considering how they might be together in a distant tomorrow. Hence “Boy dreams girl”. “Coming of age” might also have been a largely thematically appropriate label.
One typical example:
The first opening of Hikaru no go, Get over.

2. Romance

Descriptive word cloud for “Romance”.
Cluster size:
2420/12530 (19%).
Quality score:
ISim = 0.048; ESim = 0.024 (very clear-cut).
Descriptive words:
Pronouns: anata, watashi.
Time references: mou, zutto, hi.
Verbs: au, aeru, dakishimeru, mitsumeru, ai suru, miru.
Nouns: yume, koi, ai, kimochi, soba; mune, kokoro, hitomi, hito.
Adjectives: suki da, ii, hoshii, setsunai.
Adverbs: sotto, kitto.
Representative singers:
Seiyuu like Kouda Mariko (+1.89), Kasahara Hiroko (+1.23), Kuribayashi Minami (+1.18) and to a lesser extent Kawasumi Ayako (+0.91) are particularly notable. Anison by Ishida Youko (+1.37) also largely fall in this category.
Representative shows:
Primarily romance shows of all stripes, old and new, for boys and girls: Amagami (+1.68), Marmalade boy (+1.46), Sister Princess (+1.37), Aa megami-sama (+1.25), etc. More action-oriented shoujo shows can also have numerous songs in this category, like Mahou kishi Rayearth (+1.41) and Card captor Sakura (+0.91). In a different style, an interesting example is the original Macross series, which, together with the movies and OVAs up to Macross Plus, features heavily in this category (+1.60). This is in sharp contrast with the subsequent Macross 7.
Interpretation:
In this category, the protagonist is a feminine character saying watashi, and addressing her lover as anata. Compared to the previous category, the time markers are more concrete (it's mou, not itsuka), as is the vocabulary in general. The stronger sort of love, koi, is a central theme, and the protagonist is not limited to abstract activities like omou; there is also au and soba ni iru and dakishimeru and gentle touching (sotto). There can also be a setsunai reality as a counterpoint to the hoped-for eternity of zutto. So yeah, the actual, typical romance.
One typical example:
The ending of Gundam dai-08 MS shoutai, 10 years after.

3. Nekketsu

Descriptive word cloud for “Nekketsu”.
Cluster size:
1098/12530 (9%).
Quality score:
ISim = 0.036; ESim = 0.015 (quite clear-cut).
Descriptive words:
Pronouns: ore, warera, omae.
Time references: ashita.
Verbs: tatakau, moeru, ikiru, mamoru, makeru, kurikaesu, yaru.
Nouns: senshi, seigi, honoo, chikara, uchuu, chikyuu, chi, teki, aku, inochi, tamashii, yatsu, otoko.
Adjectives: atsui, heiwa da.
Representative singers:
Not surprisingly, almost all songs by Sasaki “daiou” Isao (+2.86) and Mizuki “aniki” Ichirou (+2.82) fall in this category, as do those by Kageyama Hironobu (+2.21) and JAM Project in general (+2.48). Other strong contenders in this battle of testosterone include Ishihara Shinichi (+2.51) and the fictional band FIRE BOMBER from Macross 7 (+2.32).
Representative shows:
Those about manly men fighting for JUSTICE and FRIENDSHIP. Like, erm, Kinnikuman (+2.88)? Or kids and sissy boys pretending to be manly men: Saint Seiya (+2.02), Yuu yuu hakusho (+1.85), One Piece (+1.37), the whole Dragon Ball franchise (+1.27), etc.
Interpretation:
The distinctive words say it all: this category is about being a real otoko, a senshi for seigi who tatakau against aku to mamoru the chikyuu or sometimes the entire uchuu. It's about moeru atsui honoo in your chi and tamashii. NEKKETSU! (Ok now, someone rap this w).
One typical example:
Mazinger Z.

4. Light and shadows

Descriptive word cloud for “Light and shadows”.
Cluster size:
3136/12530 (25%).
Quality score:
ISim = 0.035; ESim = 0.023 (reasonably clear-cut).
Descriptive words:
Pronouns: bokura, jibun.
Time references: mirai, ashita, itsuka, hi.
Verbs: kagayaku, shinjiru, omou, yuku, ikiru.
Nouns: sora, yume, kaze, hoshi, hikari, yami, michi, basho, sekai, subete; te, kokoro, mune.
Adjectives: tooi, tsuyoi.
Adverbs: kitto.
Representative singers:
KOTOKO (+1.23), angela (+1.00), eufonius (+0.92) have most of their songs in this category. Similarly for Ishikawa Chiaki (+0.74), and also Okui Masami (+0.71) if we take into account her work as a lyricist as well as a singer. Popular seiyuu singers, from Hayashibara Megumi (+0.66) to Mizuki Nana (+0.18), tend to tilt towards this category too.
Representative shows:
Girl fighting shows for guys, like Slayers (+1.63), Nanoha (+1.06), Mai Otome (+0.88), or Kiddy Grade and its spin-off (+0.82), tend to feature many songs in this category. Conversely, the same is true of boy fighting shows for girls, like Saiyuuki (+1.51), Tennis no oujisama (+0.78) and the whole non-U.C. Gundam chronology (+1.24).
Interpretation:
Songs in this category are usually the combination of a fantasy, aerial setting (sora, kaze, hoshi, sekai, tooi, etc.), a strife between light and darkness (kagayaku hikari vs. yami), and a protagonist of not necessarily asserted gender (jibun) that lives through this strife while keeping the belief that good shall someday prevail (shinjiru, itsuka, kitto). It might not be this exact structure all the time, but these thematic elements are certainly ubiquitous in this category (which seems to include, in particular, most contemporary late-night anime that are not primarily comedic or romantic).
One typical example:
The opening to the first season of Slayers, Get along.

5. Happy-go-lucky

Descriptive word cloud for “Happy-go-lucky”.
Cluster size:
3706/12530 (30%).
Quality score:
ISim = 0.017; ESim = 0.016 (not very clear-cut).
Descriptive words:
Pronouns: watashi, minna.
Time references: kyou, ashita.
Verbs: iu, ageru, warau, iku, kurikaesu, yaru.
Nouns: yume, tomodachi, ko, ki, koi, tokoro; egao, haato (heart), te.
Adjectives: daisuki da, genki da, issho da, tanoshii, ii.
Adverbs: chotto, motto.
Interjections: hora.
Representative singers:
Classical female anison singers like Oosugi Kumiko (+0.80), Hashimoto Ushio (+1.17) or Yokoyama Chisa (+0.96) often fall in this category. More recently, Gojou Mayumi (+1.16) of Doremi and Precure fame is also quite emblematic. However, the act most strongly skewed towards this category seems to be K-On!'s Houkago Tea-time (+1.56).
Representative shows:
Basically, comedies. Family-targeted shows are especially numerous in this category: Doraemon (+1.39), Crayon Shin-chan (+1.34), Chibi Maruko-chan (+1.15) are notable examples. Similarly for shows aimed at younger boys and girls: Hamtarou (+1.65), Ojamajo Doremi (+1.48), Dr. Slump (+1.33), etc. But late-night otaku comedies are also high on the list: K-On! (+1.54), Lucky Star (+1.34), Azumanga daioh (+1.25), Hayate no gotoku (+1.14; and yes, Hayate is a late-night show, it aired on Saturday nights at 34:00) among many others.
Interpretation:
Having fun with friends, laughing together, being happy and upbeat... Note how this category has a lot of daisuki da, which roughly corresponds to all the non-romantic uses of the word “love” in English. Note also how it's the only one that prominently features the word kyou, when all others are about mirai, itsuka and zutto. This is in fact the category of everyday life and how to enjoy it. For example, about 75% of the 542 occurrences of the character 食 (“eating”) in the corpus appear in this category (anomality +1.32). Happy-go-lucky!
One typical example:
Obviously Ichigo complete.

Future work

It is customary in works with a title like “Anison classification from unsupervised lexical clustering” to suggest some open problems and research directions at the end, so let me do that too. I think there's actually a lot that one could do with extensions of this general approach. Here are a few examples.

  • Can we assess the quality of that classification by better indicators than simple statistical values? Do other clustering algorithms (agglomerative clustering, scatter/gather, etc.) produce similar, or better, results? Can we determine what the “right” number of clusters is in an automatic way?
  • What can we say about the evolution of anison among those categories through the past 40 or so years? Are some categories more common today? What about the distinction between openings, endings, insert songs, character songs, etc.: is it possible to detect it lexically? [These questions would require a slightly better annotated corpus].
  • If we look at the broader spectrum of Japanese pop songs, how do anime songs fit? Can we perhaps recognize them by their lyrics, or do the categories distinguished here actually reflect larger families within Japanese pop at large?
  • Could we constitute corpora to apply those methods to entirely different texts than anison lyrics, like manga and anime scripts? The corpus constitution sounds a bit daunting, but there is probably a lot to be learned from similar NLP techniques in such other contexts. Also, if we just look at anison, clustering can probably also be applied to music scores (insofar as they can be segmented and compared in a fashion similar to textual documents, which I think is possible): would that give related or completely different results?

Yay. If you've read until this point, congratulations on your tl;dr resistance, and see you next time (hopefully not in half a year again)!

UPDATE (2011-03-23): since one reader asked, I am attaching the CLUTO files (mat, clabel, rlabel) used in this experiment. They might help if you're interested in looking into some of the questions at the end. As explained previously, the mat file consists of one line per song containing a list of numbers describing the words in that song. The correspondence between those numbers and actual words is given in the clabel file, while the rlabel file indicates which song is on each line (they're represented by a code referring to their URL on the source website). Have fun playing with this.

Attachment             Size
anison-clean.clabel    150.28 KB
anison-clean.rlabel    219.78 KB
anison-clean.mat       4.17 MB

24 comments for ‘Anison classification from unsupervised lexical clustering’.

Nice job!
Well it seems that they are less about love than your everyday pop song. But "Light and shadows" or "Happy-go-lucky" songs can be indirectly about it, so it's hard to tell.
Also I would have chosen another name for the "Nekketsu" category, in order to include those nihilistic robot anison like VOTOMS theme songs or the good ol' Gundam Hymns, like the eternal Ai Senshi or 0083's Men of Destiny, that go like "I've seen Hell, now I'm trying to come back from it". Their semantic field is clearly "Nekketsu" but their meaning is quite the opposite.

(And I'm sad to see that you forgot Okazaki Ritsuko in the "Happy-go-lucky" category. Nobody did it better than her because she added a little touch of melancholy to nuance it, and it made all the difference. But that's just me being a fanboy.)

Actually, if you look at the lyrics of Okazaki Ritsuko anison, they fall much more often in the Romance category than in Happy-go-lucky. She sang many from Fruits Basket and even more from, erm, SisPri RePure.

As for the distinction between real burning nekketsu and the more nihilistic sort, maybe paying attention to verb tense, again, could help, as I guess the latter use the past tense much more often.

Man, I love this sooooooooooooo much. It's totally the sort of thing that seems "obvious" but is awesome to actually see quantified.

Do you have uh, stereotype-adherence rankings of the songs? I'm morbidly curious how random songs that I think of in each category rank, like how my first thought for the dreaming one was totally Touch, which looks like it scored pretty highly.

There are several ways to measure that, but yes, it can be done, and indeed songs from Touch tend to rank pretty high. A bit surprisingly, if I use the simplest ordering, the highest ranking is Are kara, kimi wa from the Miss Lonely Yesterday special, at rank 25 out of 2170. Even though it's more about the past than the future (but then again, verb tense is lost in the processing).

Also I totally rapped the nekketsu section.

Best fucking post ever

39!

Man, I hope everything we'll talk about at karaoke will produce posts like this =')

Just plain brilliant.

This is pretty great, though I'm going to nitpick a fairly irrelevant detail.

If Hayate aired on Saturdays at 34:00... doesn't it mean it airs on Sunday mornings at 10am? At that point, I'm not sure if you could still call it late night... you're already past the morning news block of the day. Though I understand the implication of using the 24+ hour time notation.

That aside, I think it would be interesting if this data were used for an app that could, say, fetch anime songs off your last.fm or something and tell you which cluster you listen out of most, or something (though I realize that'd be complicated by the fact that last.fm doesn't strictly do ani-songs, nor are song names/artist names used for played music files consistent).

Indeed, late-night 34:00 is a way to jokingly refer to those Sunday morning TV Tokyo shows that have a primarily otaku following (mainly Hayate, but also Zettai Karen Children or Otogi juushi Akazukin).

As for last.fm, I don't use it myself. But if someone needs to build a database with this data, no problem of course.

It would be nice to compare this with other genres (j-pop, Vocaloid, regular English songs if we can have matching words) but definitely a great job this. Anisongs have IMO simpler lyrics than jpop, this is based on my unscientific knowledge of Japanese songs and my music playlist in general. I have limited Japanese knowledge, and I tend to understand the lyrics of a song easier generally if it were an anisong.

And about Ritsuko Okazaki, she was a very notable lyricist for me, even her stuff in Love Hina are not of lyrically generic anisong quality.

Completely agree that it would be nice to compare with other genres. As I said, I'm not really sure what the outcome would be. And it's actually easy to do if we can assemble a large collection of song lyrics (that's the tricky part).

Regarding English words in roman letters (or in fact all words not in Japanese scripts), I should have noted that they are completely ignored in my analysis. In particular, songs in a foreign language are thrown out of the corpus. If we consider that to be a problem, it's possible to fix, but I don't see occasional English words playing a significant role in the classification, and songs entirely in English would just form a cluster of their own.

A lot of the technical stuff went over my head, but this is a very interesting break down that I don't think anyone has done before!

Basically this. I have noticed that the same themes in anison seem to keep cropping up over and over again, so this is a great way to quantify that. Were there other websites that you were considering using for this kind of analysis of anison lyrics? In Japanese/English/any other language?

Other websites? Do you mean, as a source of lyrics?

There are quite a few websites out there with Japanese song lyrics, but many are (purposefully) somewhat inconvenient to fetch those lyrics from. One huge site is Uta-Net, with over 100,000 songs across many genres, including anison. But they use Flash to display the lyrics and prevent you from even copy-and-pasting them, let alone expose them to a web crawler. Another is Kasi time, with a smaller database but quite a few recent seiyuu/anison/game song lyrics. But again they try to prevent copying in weird ways.

So it's probably possible to collect a larger corpus, but in most cases it looks like it would involve bypassing some paranoid copy-protection scheme. One would probably need to do that to compare anison to other flavors of J-pop, though.

I hadn't considered using English-language fan sites as a source, though, as I suspect they're both less complete and less reliable (though I haven't actually checked that).

Thanks! Of course it's fine to skip the nitty-gritty technical details.

Mind...smoke...

Damn, nice work!

Here's an idea from IRC:

the obvious next step is to use those clusterings to generate markov chains and write the 5 perfect anisons in each category

Funny idea. You would need to write something like a generative grammar of anime songs first, though (and then use the clustering scores to assign transition probabilities and get a Markov chain). That's probably doable (especially if we're fine with generating only very generic-sounding songs, which I guess is fine), but not by me.

Most trivially it could be done by using the bigram probabilities for transition probabilities ("What's the probability that this token comes after that token?"). I have some code lying around. I'd give it a shot if I knew Japanese.

I love the post, by the way. I've been wanting to do some text mining along these lines, but never got the chance.
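For the curious, the bigram approach suggested above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the post or from the commenter: the function names `bigram_counts` and `generate`, the `<s>`/`</s>` boundary markers, and the romanized toy "cluster" are all made up for the example. With real data, each song would presumably be a list of tokens produced by a Japanese morphological analyzer, and one model would be trained per cluster.

```python
import random
from collections import Counter, defaultdict


def bigram_counts(songs):
    """Count how often each token follows each other token.

    `songs` is an iterable of token lists. "<s>" and "</s>" are
    hypothetical markers for the start and end of a line.
    """
    counts = defaultdict(Counter)
    for tokens in songs:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_len=20):
    """Random walk over the chain, starting from the "<s>" marker."""
    token, out = "<s>", []
    while len(out) < max_len:
        followers = counts[token]
        # Sample the next token in proportion to its bigram count.
        token = rng.choices(list(followers), weights=list(followers.values()))[0]
        if token == "</s>":
            break
        out.append(token)
    return out


# Made-up toy data standing in for the tokenized lyrics of one cluster.
toy_cluster = [
    ["kimi", "no", "yume", "wo", "shinjite"],
    ["kimi", "no", "mirai", "wo", "shinjite"],
    ["boku", "no", "yume", "wo", "dakishimete"],
]

counts = bigram_counts(toy_cluster)
line = generate(counts, random.Random(0))
print(" ".join(line))
```

Dividing each count by the total for its row gives the transition probability ("what's the probability that this token comes after that token?"); `random.choices` does that normalization implicitly through its `weights` argument. A model this simple only remembers one token of context, so the output will be locally plausible at best.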

That was awesome. I'm happy my little over a year of Japanese proved useful in comprehending most of the Japanese.

Now, if we tried Touhou songs next... (can't think of a similar database off the top of my head for Touhou lyrics though)

http://www.youtube.com/watch?v=VOqn28R_BoQ

I heard this song for the first time very recently, and it made me think of this post.

I’m really interested in your NLP research on anison. I want to replicate your method and use it on a large database of Japanese anime subtitles. Now I've run into a problem with the clustering: I’d like to know the format of the raw data before it is converted into the .mat file. Basically, what I’m trying to do is convert a CSV containing a word and its occurrence frequency on each line, but I have had no luck clustering the result.
