Speech synthesis

Stephen Hawking is one of the best-known users of speech synthesis

Speech synthesis is the production of human speech without directly using a human voice.

In general, a speech synthesizer is software or hardware capable of producing artificial speech.

Speech synthesis systems are often called text-to-speech (TTS) systems, in reference to their ability to convert text into speech. However, there are also systems that render only a symbolic linguistic representation, such as a phonetic transcription, into speech.

Overview of speech synthesis technology

Overview of a typical TTS system

A text-to-speech system (or engine) is composed of two parts: a front end and a back end. Broadly, the front end takes input in the form of text and outputs a symbolic linguistic representation. The back end takes the symbolic linguistic representation as input and outputs the synthesized speech waveform. The naturalness of a speech synthesizer usually refers to how closely the output sounds like the speech of a real person.

The front end has two major tasks. First, it takes the raw text and converts parts of it, such as numbers and abbreviations, into their written-out word equivalents. This process is known as text normalization, preprocessing, or tokenization. It then assigns a phonetic transcription to each word, and divides and marks the text into prosodic units such as phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme (TTP) or grapheme-to-phoneme (GTP) conversion. The combination of phonetic transcriptions and prosody information makes up the symbolic linguistic representation output by the front end.

The other part, the back end, takes the symbolic linguistic representation and converts it into actual sound output. The back end is often referred to as the synthesizer. The different techniques synthesizers use are discussed below.
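The two front-end tasks described above can be sketched as follows. This is only an illustration, not the design of any particular engine: the `EXPANSIONS` and `LEXICON` tables, the phoneme symbols, and the function name `front_end` are all hypothetical, and phrase boundaries are guessed from trailing punctuation alone.

```python
# Hypothetical toy tables; a real system would use far larger resources.
EXPANSIONS = {"dr.": "doctor", "2": "two"}   # text normalization
LEXICON = {                                   # text-to-phoneme lookup
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}


def front_end(text):
    """Return a list of phrases; each phrase is a list of (word, phonemes)."""
    phrases, current = [], []
    for token in text.lower().split():
        if token in EXPANSIONS:
            # A known abbreviation: its period is not a phrase boundary.
            word, boundary = EXPANSIONS[token], False
        else:
            word = token.rstrip(".,;:!?")
            boundary = word != token          # trailing punctuation ends a phrase
            word = EXPANSIONS.get(word, word)
        if word:
            # Words missing from the lexicon get None instead of phonemes.
            current.append((word, LEXICON.get(word)))
        if boundary and current:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases
```

Calling `front_end("Dr. Smith has 2 cats, Dr. Smith.")` yields two phrases. The nested list of phrases and (word, phonemes) pairs plays the role of the symbolic linguistic representation that a back end would consume.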

History

Long before modern electronic signal processing was invented, speech researchers tried to build machines that could produce human speech. Early examples of 'speaking heads' were made by Gerbert of Aurillac (d. 1003), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294).

In 1779, Christian Kratzenstein of St. Petersburg built models of the human vocal tract that could produce the five long vowel sounds (a, e, i, o and u). This was followed by the bellows-operated 'Acoustic-Mechanical Speech Machine' of Wolfgang von Kempelen of Vienna, Austria, described in his 1791 paper Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine (J.B. Degen, Wien). This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837 Charles Wheatstone produced a 'speaking machine' based on von Kempelen's design, and in 1857 M. Faber built the 'Euphonia'. Wheatstone's design was resurrected in 1923 by Paget.

Voder operator
Bell Labs VODER

In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair.

Early speech synthesizers sounded robotic and were often barely intelligible. Output from current TTS systems is sometimes hard to distinguish from actual human speech.

Despite the success of electronic speech synthesis, research is still being conducted into mechanical speech synthesizers for use in humanoid robots. Even a perfect electronic synthesizer is limited by the quality of the transducer (usually a loudspeaker) that produces the sound, so in a robot a mechanical system may be able to produce a more natural sound than a small loudspeaker.

The first computer-based electronic speech synthesizers were created in the late 1950s, and the first complete text-to-speech system was finished in 1968. Since then, there have been many advances in the technologies used to synthesize speech, and modern text-to-speech systems can often produce speech that is hard to distinguish from real human speech. See Examples of current systems below for state-of-the-art commercial and freely available text-to-speech systems.

Synthesizer technologies

There are two main technologies used to generate synthetic speech waveforms: concatenative synthesis and formant synthesis.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis gives the most natural-sounding synthesized speech. However, natural variation in speech and the automated techniques used to segment the waveforms sometimes result in audible glitches in the output. There are three main subtypes of concatenative synthesis.

  • Unit selection synthesis uses large speech databases (more than one hour of recorded speech). During database creation, each recorded utterance is segmented into some or all of the following: individual phones, syllables, morphemes, words, phrases, and sentences. The division into segments can be done using a number of techniques, such as clustering, using a specially modified speech recognizer, or by hand, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and on acoustic parameters such as the fundamental frequency (pitch). At run time, the desired target utterance is created by determining the best chain of candidate units from the database. This technique gives the greatest naturalness because it does not apply digital signal processing techniques to the recorded speech, which often make recorded speech sound less natural. In fact, output from the best unit selection systems is hard to distinguish from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness often requires a very large speech database, in some systems ranging into gigabytes of recorded data and many hours of recorded speech.
  • Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a given language. The number of diphones depends on the phonotactics of the language: Spanish has about 800 diphones, while German has about 2,500. In diphone synthesis, only one example of each diphone is contained in the speech database. At run time, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. The quality of the resulting speech is generally not as good as that of unit selection, but it sounds more natural than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available implementations.
  • Domain-specific synthesis concatenates pre-recorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. This technology is very simple to implement and has been in commercial use for a long time: it is the technology used by devices such as talking clocks and calculators. The naturalness of these systems can potentially be very high because the variety of sentence types is limited and closely matches the prosody and intonation of the original recordings. However, because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases they have been pre-programmed with.
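The run-time step of unit selection, finding the best chain of candidate units, can be sketched as a dynamic-programming search. The cost model here is an assumption made for illustration: each unit is described by a single pitch value, the "target cost" is the distance from the desired pitch, and the "join cost" is the pitch jump between neighbouring units. Real systems use richer acoustic features; the function name `select_units` and the data layout are hypothetical.

```python
def select_units(targets, database):
    """Pick one recorded unit per target so that total cost is minimal.

    targets:  list of (phone, desired_pitch_hz)
    database: dict mapping phone -> list of (unit_id, pitch_hz)
    Returns the list of chosen unit_ids.
    """
    if not targets:
        return []
    # best[i][j] = (cheapest cost of a chain ending in candidate j, backpointer)
    best = []
    for i, (phone, wanted) in enumerate(targets):
        row = []
        for _unit_id, pitch in database[phone]:
            target_cost = abs(pitch - wanted)
            if i == 0:
                row.append((target_cost, None))
                continue
            previous = database[targets[i - 1][0]]
            # Join cost (assumed): pitch discontinuity between adjacent units.
            cost, back = min(
                (best[i - 1][k][0] + abs(prev_pitch - pitch), k)
                for k, (_prev_id, prev_pitch) in enumerate(previous)
            )
            row.append((cost + target_cost, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(database[targets[i][0]][j][0])
        j = best[i][j][1]
    return chain[::-1]
```

With a database such as `{"h": [("h1", 100), ("h2", 140)], "i": [("i1", 105), ("i2", 200)]}` and targets `[("h", 120), ("i", 120)]`, the search prefers the pair of units whose pitches are both close to the target and close to each other.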

Formant synthesis

Formant synthesis does not use any human speech samples at run time. Instead, the synthesized speech output is created using an acoustic model. Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rule-based synthesis, but some argue that the term is not specific enough, because many concatenative systems also use rule-based components for some parts of the system, such as the front end.
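As a rough illustration of creating speech from an acoustic model rather than from recordings, the sketch below passes an impulse train (a crude stand-in for voicing) through a cascade of second-order digital resonators, one per formant. This is one common textbook formulation, not the method of any specific product. The function names are hypothetical, the parameters are held constant instead of varying over time, and the formant values in the usage note are approximate figures chosen only for illustration.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c                      # gives unity gain at 0 Hz
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Return normalized samples of a steady vowel-like sound.

    formants: list of (center_frequency_hz, bandwidth_hz) pairs.
    """
    n = int(duration_s * sample_rate)
    period = max(1, round(sample_rate / f0_hz))
    # Voiced source: one impulse per pitch period.
    wave = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bandwidth in formants:
        wave = resonator(wave, freq, bandwidth, sample_rate)
    peak = max(abs(s) for s in wave) or 1.0
    return [s / peak for s in wave]
```

For example, `synthesize_vowel(120, [(730, 90), (1090, 110), (2440, 170)])` produces a buzzy "ah"-like tone. No database of recordings is involved, which is why such synthesizers can be very small.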

Many systems based on formant synthesis technology generate artificial, robotic-sounding speech, and the output would never be mistaken for the speech of a real person. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have some advantages over concatenative systems.

First, formant-synthesized speech can be reliably intelligible even at very high speeds, and it avoids the acoustic glitches that often plague concatenative systems. High-speed synthesized speech is often used by visually impaired people to navigate computers with the help of a screen reader. Second, formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded computing, where processor power and memory are limited. Finally, because formant-based systems have control over all aspects of the output speech, a wide variety of prosody or intonation can be produced, conveying not just questions and statements but also a range of emotions and tones of voice.

Other synthesis methods

  • Articulatory synthesis is a synthesis method mostly of academic interest at the moment. It is based on computational models of the human vocal tract and the articulation processes occurring there. These models are currently not sufficiently advanced or computationally efficient to be used in commercial speech synthesis systems.
  • Hybrid synthesis marries aspects of formant and concatenative synthesis to minimize the acoustic glitches when speech segments are concatenated.

Front-end challenges

Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of homographs, numbers and abbreviations that all ultimately require expansion into a phonetic representation.

There are many words in English which are pronounced differently based on context. Some examples:

  • project: My latest project is to learn how to better project my voice.
  • bow: The girl with the bow in her hair was told to bow deeply when greeting her superiors.

Most TTS systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well-understood, or computationally effective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like looking at neighboring words and using statistics about frequency of occurrence.
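A neighbouring-word heuristic of the kind mentioned above might look like the following sketch. The cue-word lists, the two-word look-back window, the noun-by-default fallback, and the informal respellings are all assumptions made for illustration; they are not taken from any real TTS system.

```python
# Hypothetical data: informal respellings, not a standard phonetic alphabet.
HOMOGRAPHS = {
    "project": {"noun": "PROJ-ekt", "verb": "pro-JEKT"},
    "bow": {"noun": "boh", "verb": "bau"},
}
NOUN_CUES = {"a", "an", "the", "my", "his", "her", "this", "latest"}
VERB_CUES = {"to", "will", "can", "must", "should"}


def disambiguate(sentence):
    """Return (word, guessed_sense, pronunciation) for each known homograph."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    result = []
    for i, word in enumerate(words):
        if word not in HOMOGRAPHS:
            continue
        sense = "noun"                       # assumed default when no cue is found
        # Look back at up to two preceding words, nearest first.
        for previous in reversed(words[max(0, i - 2):i]):
            if previous in VERB_CUES:
                sense = "verb"
                break
            if previous in NOUN_CUES:
                sense = "noun"
                break
        result.append((word, sense, HOMOGRAPHS[word][sense]))
    return result
```

On the two example sentences this guesses noun then verb for both "project" and "bow". Because it is only a guess from local context, it will fail on sentences where the cue words are absent or misleading.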

Deciding how to convert numbers is another problem TTS systems have to address. It is a fairly simple programming challenge to convert a number into words, like 1325 becoming "one thousand three hundred twenty-five". However, numbers occur in many different contexts in texts, and 1325 should probably be read as "thirteen twenty-five" when part of an address (1325 Main St.) and as "one three two five" if it is the last four digits of a social security number. Often a TTS system can infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the systems provide a way to specify the type of context if it is ambiguous.
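The three readings of 1325 discussed above can be reproduced with a small sketch. The context labels `"cardinal"`, `"address"`, and `"digits"` and the function names are hypothetical; a real front end would have to infer the context from the surrounding text, which is the hard part and is not attempted here. The cardinal reading is limited to 0 through 9999.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out an integer from 0 to 9999 as a quantity."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def expand_number(digits, context="cardinal"):
    """Expand a digit string according to a caller-supplied context label."""
    if context == "digits":                  # e.g. part of an ID number
        return " ".join(ONES[int(d)] for d in digits)
    if context == "address" and len(digits) == 4:
        high, low = int(digits[:2]), int(digits[2:])
        if low == 0:
            return two_digits(high) + " hundred"
        if low < 10:
            return two_digits(high) + " oh " + ONES[low]
        return two_digits(high) + " " + two_digits(low)
    return cardinal(int(digits))
```

`expand_number("1325")`, `expand_number("1325", "address")`, and `expand_number("1325", "digits")` give the three readings from the paragraph above.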

Similarly, abbreviations like "etc." are easily rendered as "et cetera", but abbreviations are often ambiguous. For example, "in." can mean "inches" or simply be the word "in" at the end of a sentence: "Yesterday it rained 3 in. Take 1 out, then put 3 in." "St." can also be ambiguous: "St. John St." TTS systems with intelligent front ends can make educated guesses about how to deal with ambiguous abbreviations, while others do the same thing in all cases, resulting in nonsensical but sometimes comical outputs: "Yesterday it rained three in." or "Take one out, then put three inches."
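An "educated guess" for the "St." case can be as simple as looking at the next word. The rule below (a capitalized word after "St." means "Saint", anything else means "Street") and the function name `expand_st` are assumptions for illustration only.

```python
def expand_st(text):
    """Expand "St." to "Saint" or "Street" based on the following word."""
    tokens = text.split()
    out = []
    for i, token in enumerate(tokens):
        if token == "St.":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Assumed rule: "St." directly before a capitalized name is "Saint".
            out.append("Saint" if following[:1].isupper() else "Street")
        else:
            out.append(token)
    return " ".join(out)
```

This turns "St. John St." into "Saint John Street", but it guesses wrongly when "St." ends a sentence and the next sentence starts with a capital letter, which shows why such heuristics remain guesses.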

Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the pronunciation of a word from its spelling. This process is often called text-to-phoneme or grapheme-to-phoneme conversion, phoneme being the term linguists use for the distinctive sounds of a language.

The simplest approach to text-to-phoneme conversion is the dictionary-based approach. In this approach, a large dictionary containing all the words of a language and their correct pronunciation is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary.

The other approach used for text-to-phoneme conversion is the rule-based approach. In this approach, rules for the pronunciations of words are applied to words to work out their pronunciations based on their spellings. This is similar to the "sounding out" approach to learning reading.

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but it fails completely if it is given a word that is not in its dictionary, and as the dictionary grows, so do the memory requirements of the synthesis system. The rule-based approach, on the other hand, works on any input, but the complexity of the rules grows substantially as it takes irregular spellings or pronunciations into account. As a result, nearly all speech synthesis systems use a combination of both approaches.

Some languages, like Spanish, have a very regular writing system, and predicting the pronunciation of words from their spelling works correctly in nearly all instances. Speech synthesis systems for such languages often use the rule-based approach as the core of text-to-phoneme conversion, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciation is not obvious from the spelling. Speech synthesis systems for languages like English, which have extremely irregular spelling systems, often rely mostly on dictionaries and use rule-based approaches only for unusual words or names that are not in the dictionary.
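The combined approach can be sketched as a dictionary lookup with a rule-based fallback. The tiny `EXCEPTIONS` dictionary, the letter-to-sound table, and the phoneme symbols are hypothetical and far too crude for real English; they are meant only to show the control flow.

```python
# Hypothetical exceptions dictionary for irregularly spelled words.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}

# Hypothetical letter-to-sound rules; two-letter rules are tried first.
LETTER_SOUNDS = {
    "sh": "SH", "ch": "CH", "th": "TH",
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def letter_to_sound(word):
    """Rule-based conversion: greedy longest match over the spelling."""
    phonemes, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in LETTER_SOUNDS:
                phonemes.extend(LETTER_SOUNDS[chunk].split())
                i += len(chunk)
                break
        else:
            i += 1                      # skip characters that have no rule
    return phonemes


def to_phonemes(word):
    """Dictionary first; fall back to rules for out-of-dictionary words."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return letter_to_sound(word)
```

Here `to_phonemes("one")` uses the dictionary entry, whereas applying the rules alone with `letter_to_sound("one")` would give a wrong result, and `to_phonemes("ship")` falls through to the rules because the word is not listed.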

Examples of current systems

Some freely available text-to-speech systems:

  • Festival is a freely available complete diphone concatenation and unit selection TTS system.
  • Flite (Festival-lite) is a smaller, faster alternative version of Festival designed for embedded systems and high volume servers.
  • MBROLA is a freely available diphone concatenation system (back end).
  • Gnuspeech is an extensible, text-to-speech package, based on real-time, articulatory, speech-synthesis-by-rules.

ASY is an articulatory synthesis program developed at Haskins Laboratories.

The Klatt Synthesizer, developed in 1980 by Dennis Klatt, is a cascade/parallel formant synthesizer whose basic approach still serves as the waveform synthesizer of many formant synthesis systems.

Speech synthesis markup languages

A number of markup languages have been established for rendering text as speech in an XML-compliant format, most recently SSML, proposed by the W3C (still in draft status at the time of this writing). Older speech synthesis markup languages include SABLE and JSML. Although each of these was proposed as a new standard, none has yet been widely adopted.
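To give a feel for such markup, the sketch below builds a small SSML-style fragment with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`speak`, `say-as` with `interpret-as`, `break` with `time`) are assumed to follow the published SSML 1.0 Recommendation and may differ from the draft this article refers to; which `interpret-as` values an engine accepts varies, and the helper name `address_ssml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def address_ssml(number, street):
    """Build an SSML-style string asking for a number to be read digit by digit."""
    speak = ET.Element("speak", {"version": "1.0"})
    speak.text = "Deliver to "
    # "digits" is an assumed interpret-as value; engines differ in what they accept.
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": "digits"})
    say_as.text = number
    say_as.tail = " " + street
    # Request a short pause after the address.
    ET.SubElement(speak, "break", {"time": "500ms"})
    return ET.tostring(speak, encoding="unicode")
```

The markup lets the author state the intended reading of "1325" explicitly instead of leaving the front end to guess it from context.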

A subset of the Cascading Style Sheets 2 specification includes Aural Cascading Style Sheets.

Speech synthesis markup languages should be distinguished from dialogue markup languages such as VoiceXML, which includes, in addition to text-to-speech markup, tags related to speech recognition, dialogue management and touchtone dialing.

External links

  • Examples of commercial TTS systems.
  • Free speech synthesis system for people with speech difficulties, with links to other speech-related technologies and resources for PALS.
  • Text to Speech in Bahasa