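Taiwanese Text Corpus Collection and Corpus-Based Syllable and Word Frequency Statistics for Written Taiwanese (台語文語料庫蒐集及語料庫為本台語書面語音節詞頻統計)
     Last modified: 2011/2/18      Last viewed: 2017/11/22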

Taiwanese Abstract (台文摘要)
     Chū-chiông 2001 nî Kok-bîn tiong-sió-ha̍k chiong hiong-thó͘ gú-bûn khò-têng lia̍t chòe chèng-sek khò-têng, Tâi-gú-bûn ê tiōng-iàu-sèng tōa-tōa thê-seng. M̄-koh, Tâi-gú-bûn ê thui-tián, bo̍k-chiân bīn-tùi nn̄g-ê chú-iàu būn-tôe: phiau-im hē-thóng ê sóan-tēng kah iōng-jī būn-tôe.
     Pún kè-ōe siūⁿ-beh thàu-kè Tâi-gú-bûn im-chiat kah gú-sû pîn-lu̍t thóng-kè, tùi Tâi-gú-bûn iōng-jī būn-tôe thê-chhut kiàn-gī.
     Pún kè-ōe àn-sǹg beh iōng cheng-sîn so͘-chi̍p 300 bān im-chiat í-siōng ê Tâi-gú-bûn bûn-pún, lī-iōng chin-si̍t gú-liāu, têng-hiān Tâi-gú-bûn su-siá ê hiān-si̍t.
     Kè-ōe so͘-chi̍p--tio̍h ê gú-liāu, tû-liáu tī pún kè-ōe kè-sǹg im-chiat kah gú-sû pîn-lu̍t, thê-kiong Tâi-gú-bûn iōng-jī ê chham-khó í-gōa, mā hi-bāng chiâ ê gú-liāu ē-tàng chiâⁿ-chòe ji̍t-āu Tâi-gú-bûn siong-koan chū-jiân gú-giân gián-kiù ê ki-chhó͘, chhin-chhiūⁿ kàu-châi pian-siá, sû-tián pian-chi̍p, chū-tōng sû-sèng piau-sī, gú-sû kiám-sek, gú-sû tah-phòe, bûn-kù hun-thiah, chū-tōng kàu-chèng, su-ji̍p-hoat, chū-tōng bûn-kiāⁿ tiah-iàu, ... téng.
     自從2001年國民中小學將鄉土語文課程列做正式課程,台語文ê重要性大大提昇。M̄-koh,台語文ê推展,目前面對兩個主要問題:標音系統ê選定kah用字問題。
     本計畫想beh透過台語文音節kah語詞頻率統計,對台語文用字問題提出建議。
     本計畫按算beh用精神蒐集300萬音節以上ê台語文文本,利用真實語料,呈現出台語文書寫ê現實。
     計畫蒐集著ê語料,除了tī本計畫計算音節kah語詞頻率,提供台語文用字ê參考以外,希望chiâ ê語料會tàng成做日後台語文相關自然語言研究ê基礎,親像教材編寫、辭典編輯、自動詞性標示、語詞檢索、語詞搭配、文句分拆、自動校正、輸入法、自動文件摘要、...等。
Koan-kiàn-sû: gú-liāu-khò͘, Tâi-gú-bûn, im-chiat pîn-lu̍t, sû-pîn
關鍵詞:語料庫,台語文,音節頻率,詞頻
English Abstract (英文摘要)
     Since 2001, when native-language courses became a formal part of the curriculum in Taiwan's elementary and junior high schools, the importance of written Taiwanese has risen considerably. However, the promotion of written Taiwanese faces two main problems: the choice of a phonetic transcription system and the choice of characters. This project uses syllable and word frequency statistics from a Taiwanese corpus to make recommendations on character usage in written Taiwanese.
     A main effort of this project is to collect a corpus of written Taiwanese texts totalling at least 3,000,000 syllables, so that real language data can show how Taiwanese is actually written.
     We use this corpus to compute syllable and word frequencies as a reference for character usage in written Taiwanese. We also hope it will serve as a foundation for later natural language processing research on Taiwanese, such as teaching-material compilation, lexicography, automatic part-of-speech tagging, concordancing, collocation analysis, sentence parsing, automatic correction, input methods, and automatic document summarization.
Keywords: corpus, written Taiwanese, syllable frequency, word frequency
Mandarin Abstract (華文摘要)
     自從2001年國民中小學將鄉土語文課程列為正式課程,台語文的重要性大幅提昇。然而台語文的推展,目前面對兩個主要問題:標音系統的選定及用字問題。本計畫將透過台語文音節及語詞頻率統計,對台語文用字問題提出建議。
     本計畫要花功夫蒐集300萬音節以上的台語文文本,利用真實語料,呈現出台語文書寫的現實。
     計畫蒐集到的語料,除了在本計畫計算音節及語詞頻率,提供台語文用字的參考外,也希望這些語料成為往後台語文相關自然語言研究的基礎,如教材編寫、辭典編纂、自動詞性標示、語詞檢索、語詞搭配、文句剖析、自動校正、輸入法、自動文件摘要、...等。
關鍵詞: 語料庫,台語文,音節頻率,詞頻
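
The syllable and word frequency counts described in the abstracts can be illustrated with a minimal Python sketch. It is not the project's actual tooling, and it rests on assumptions not stated in the source: the input is fully romanized (POJ) text, words are separated by whitespace, syllables within a word are joined by one or more hyphens, and words are lowercased before counting. The function names and the sample sentence are hypothetical. Han-Lo mixed text would need a separate segmentation step that is not shown here.

```python
import re
from collections import Counter

# Punctuation stripped from the edges of each whitespace-separated token.
# Assumption: hyphens only ever join syllables inside a word.
EDGE_PUNCT = ".,:;!?()\"'"


def tokenize_words(text):
    """Split romanized text into lowercased words, dropping edge punctuation."""
    words = []
    for raw in text.split():
        word = raw.strip(EDGE_PUNCT).lower()
        if word:
            words.append(word)
    return words


def split_syllables(word):
    """Split a word on single or double hyphens into its syllables."""
    return [s for s in re.split(r"-+", word) if s]


def count_frequencies(text):
    """Return (word_counts, syllable_counts) as Counter objects."""
    word_counts = Counter()
    syllable_counts = Counter()
    for word in tokenize_words(text):
        word_counts[word] += 1
        syllable_counts.update(split_syllables(word))
    return word_counts, syllable_counts


# Hypothetical sample sentence, used only for illustration.
sample = "Tâi-gú-bûn ê gú-liāu, Tâi-gú-bûn ê gú-sû."
words, syllables = count_frequencies(sample)
print(words.most_common(3))
print(syllables.most_common(3))
```

Lowercasing merges sentence-initial and proper-noun capitalization with ordinary words, which suits raw frequency tables but would have to be revisited if capitalization carries meaning in the corpus.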