STATISTILISED MEETODID MURDEKORPUSE ÜHENDVERBIDE TUVASTAMISEL. (Estonian)
In: Eesti Rakenduslingvistika Ühingu Aastaraamat, Vol. 6 (2010), pp. 307-326
academicJournal
The aim of this study was to assess different statistical methods for the automatic extraction of collocations from a corpus. To extract the collocations, association measures (AMs) were applied and association scores (ASs) were calculated for the collocation candidates found in the corpus. An AS indicates the collocational strength between two words. An advantage of AMs is that they take into account not only the co-occurrence frequency but also the marginal frequencies of the collocating words. To calculate an AS, the following data are needed: the co-occurrence frequency, the marginal frequencies of the collocating words, the expected frequency and the sample size. There are different approaches to applying AMs: words can be considered collocational only if they appear within the same collocational span, within one text unit (clause, sentence, utterance), or if they jointly carry some syntactic function. This paper attempts to apply AMs to phrasal verb detection in the Corpus of Estonian Dialects (CED). The CED texts were morphologically tagged and parsed. Combinations of adverbs and verbs were extracted, and an AS was calculated for every collocation candidate. Experiments were run on three different dialect groups, applying four different association measures: t-score, Mutual Information (MI), the chi-squared test and log-likelihood. The results indicate that log-likelihood and t-score outperform MI and the chi-squared test. The outcomes of the different measures vary the most in the Northern dialect group. The best measure for dialect data in general is log-likelihood. However, MI and the chi-squared test work well with low-frequency data. In the Northern dialect group the best AM for low-frequency phrasal verb detection is MI, whereas in the North-Eastern and Southern groups the chi-squared test works well for the same purpose. To achieve better results, different scores should be combined. [ABSTRACT FROM AUTHOR]
Statistics that measure the strength of association between words are used in computational linguistics to detect fixed expressions. These statistics make it possible to calculate, for two words in a corpus, a value for the strength of the association between them, on the basis of which one can decide whether or not the pair is a fixed expression. The advantage of using such statistics is that they take into account not only the co-occurrence frequency of the words but also the frequencies with which the words forming the expression occur separately. In this article I attempt to apply these statistics to the automatic detection of two-part phrasal verbs in the Corpus of Estonian Dialects. Four statistics were tested separately on three dialect groups: the t-score, the mutual information value MI, the chi-squared statistic and the log-likelihood function. [ABSTRACT FROM AUTHOR]
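The four association measures named in the abstract can be illustrated with a minimal sketch. It assumes the common textbook definitions based on a 2x2 contingency table (Pearson chi-squared without continuity correction, log-likelihood as the G² statistic); the abstract does not give the exact formulas used in the article, so these may differ in detail. The function name `association_scores` and its parameters are hypothetical, chosen only for this illustration.

```python
import math


def association_scores(o11, f1, f2, n):
    """Score one adverb-verb candidate pair (illustrative, not the article's code).

    o11: co-occurrence frequency of the pair
    f1, f2: marginal frequencies of the two words
    n: sample size (total number of candidate pairs)
    """
    if not (0 < o11 <= min(f1, f2) and max(f1, f2) < n):
        raise ValueError("need 0 < o11 <= min(f1, f2) and max(f1, f2) < n")
    if n - f1 - f2 + o11 < 0:
        raise ValueError("frequencies are inconsistent with the sample size")

    # Observed 2x2 contingency table derived from the four input frequencies.
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    # Expected cell frequencies under independence of the two words.
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    e11 = expected[0]

    t_score = (o11 - e11) / math.sqrt(o11)
    mi = math.log2(o11 / e11)
    chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Cells with zero observations contribute nothing to the sum.
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )
    return {
        "t_score": t_score,
        "mi": mi,
        "chi_squared": chi_squared,
        "log_likelihood": log_likelihood,
    }


# Hypothetical counts: the pair occurs 10 times, its words 20 and 50 times,
# in a sample of 1000 candidate pairs.
scores = association_scores(o11=10, f1=20, f2=50, n=1000)
```

Ranking all candidates by one of these scores and inspecting the top of the list is the usual way such measures are compared. Note that MI depends only on the ratio of observed to expected co-occurrence, which is one reason it tends to favour low-frequency pairs.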
Title: | STATISTILISED MEETODID MURDEKORPUSE ÜHENDVERBIDE TUVASTAMISEL. (Estonian) |
---|---|
Author / Contributor: | Uiboaed, Kristel |
Journal: | Eesti Rakenduslingvistika Ühingu Aastaraamat, Vol. 6 (2010), pp. 307-326 |
Publication: | 2010 |
Media type: | academicJournal |
ISSN: | 1736-2563 (print) |
DOI: | 10.5128/ERYa6.19 |