Problem Description
This program processes input English words and displays their syllabization (words split into syllables).
Background & Techniques
Like many of the projects developed here at DFF, this one started as the result of a user inquiry, this time asking how to count the number of syllables in a text document. I didn't even ask why this might be important; I just wondered whether the problem could be solved. It turns out to be a hard problem because speech is a human endeavor. The phoneticists frequently disagree with print publishers. The phonics guys would prefer to divide "harder" as "har-der" based on pronunciation, but the publishers prefer "hard-er" because it is easier to understand if the word gets split across two lines. So "hard-er" wins in that case, but "win-dow" beats "wind-ow" every time, even though windows do keep out the wind :-). The bottom line is that a pure "rules" approach to syllabizing is complex, and the many exceptions in English hurt its accuracy, so I chose a data-based approach augmented with internal rules for words not found in the words file. There is a large, publicly available file of about 180,000 syllabized, mostly English words by Grady Ward (mhyph.txt). It is available at http://www.gutenberg.org/ebooks/3204 and is the file from which I derived my database.
For various reasons, I decided to "prune" the master file to include only those
words which appear in Full.dic, the largest dictionary of my DFF
dictionary processing unit, which contains about 63,000 words. About 41,000 of
those can be resolved using the mhyph.txt master file and are saved as
file Syllables.txt. Another 22,000 are resolved using internal rules. The
final 500 or so are resolved by manually created entries in file
SyllablesUpdate.txt. These lines are merged with the Syllables.txt file to produce the final
SyllablesList.txt file used by the program in its syllabizing. Both files,
Syllables and SyllablesUpdate, have similar formatting: one line
per word, with the word followed by an equal sign, followed by the syllabized version. In Syllables the syllable separators are "center dot"
characters (decimal 183 or hexadecimal B7). For ease of manual editing, "space" characters separate syllables in
the SyllablesUpdate file. These are
replaced by the center dot character in the final SyllableList file.
Vocabulary
The Files
Usage
Both the source and the executable downloads include the Syllables and SyllablesUpdate files. The several lookup buttons on the "Testing" page recognize when the required SyllableList file is missing or out of date and will automatically rebuild it before attempting a search. There is a button on the "Settings" page which rebuilds the Syllables file by matching words from the Full.dic dictionary file against the mhyph.txt file. The other button forces a merge of the Syllables and SyllablesUpdate files as described above.
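The merge step can be illustrated with a short sketch. This is Python rather than the program's Delphi source, and it is not the author's code: the function names `parse_line` and `merge_syllable_lists` are hypothetical, and it assumes that a manually created entry replaces a Syllables entry for the same word, which the text above does not state. It only relies on the file format described earlier: one `word=syllabized` line per word, center dots in Syllables, spaces in SyllablesUpdate.

```python
MIDDLE_DOT = "\u00b7"  # center dot, decimal 183 / hexadecimal B7


def parse_line(line, separator):
    """Split 'word=syl<sep>la<sep>ble' into (word, [syllables]).

    Returns None for blank or malformed lines.
    """
    word, equals, syllabized = line.strip().partition("=")
    word = word.strip()
    syllables = [s for s in syllabized.strip().split(separator) if s]
    if not equals or not word or not syllables:
        return None
    return word, syllables


def merge_syllable_lists(base_lines, update_lines):
    """Merge Syllables-style and SyllablesUpdate-style lines.

    Base lines use the center dot as separator; update lines use spaces.
    Assumption (not confirmed by the article): an update entry overrides
    a base entry for the same word.
    """
    merged = {}
    for line in base_lines:
        parsed = parse_line(line, MIDDLE_DOT)
        if parsed:
            merged[parsed[0]] = parsed[1]
    for line in update_lines:
        parsed = parse_line(line, " ")
        if parsed:
            merged[parsed[0]] = parsed[1]
    # Emit one line per word, sorted by the un-syllabized word,
    # with center dots as the only separator.
    return [f"{word}={MIDDLE_DOT.join(merged[word])}" for word in sorted(merged)]
```

Calling `merge_syllable_lists` with the lines of both files yields the lines of the combined list; writing them out one per line would give a file in the same format as Syllables.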
Programmer's Notes:
What was supposed to be a one-week project turned into two weeks and about 1000 lines of code. But they were two satisfying weeks, spent overcoming lots of smallish problems, with only a few dead ends that required backing up to find a different route to the solution. I ended up creating a TSyllables class in unit USyllables to handle the file updating and searching. To keep the list sorted by the un-syllabized word, it was necessary to separate the search word from the syllabized version. I created a TStrings object class that serves to hold the syllabized word in the stringlist's Objects fields. A significant part of the lookup problem was creating and applying rules to assist when the word is not in the Syllable List. I suspect that consonants can be categorized as "hard" or "soft", controlling whether they create additional syllables when suffixes are added. My approach was empirical: scan a dictionary and examine the results for words not syllabized or syllabized incorrectly. Rules are generalized and defined by 3 strings:
Running/Exploring the Program
With 60,000 words syllabized, including several hundred manually created entries and 20,000 processed by rules, there are probably dozens, perhaps hundreds, of syllabization errors. I will be happy to receive corrections and suggestions for additions from users. Note: Dictionaries are not required to check individual words or to scan text files, but they are required to run the "Scan dictionary" option used for testing. Also, the Full.dic dictionary is used to select the words to be retained from the mhyph.txt file in Syllables.txt, so any significant change to the dictionary will require rebuilding Syllables.txt to reflect what mhyph.txt says about the new or changed words.
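For readers who want to experiment with the original question, counting the syllables in a piece of text, here is a minimal Python sketch built on the list format described above. It is not the program's Delphi code. The names `load_syllable_list` and `count_syllables` are hypothetical, lookups are assumed to be case-insensitive, and the program's internal rules are not reproduced: words missing from the list are simply reported rather than guessed.

```python
import re

MIDDLE_DOT = "\u00b7"  # center dot separating syllables in the list


def load_syllable_list(lines):
    """Build a {word: syllabized} table from 'word=syl·la·ble' lines."""
    table = {}
    for line in lines:
        word, equals, syllabized = line.strip().partition("=")
        if equals and word and syllabized:
            table[word.lower()] = syllabized  # assumption: case-insensitive lookup
    return table


def count_syllables(text, table):
    """Return (syllable total for known words, list of words not found)."""
    total = 0
    unknown = []
    for word in re.findall(r"[A-Za-z']+", text):
        entry = table.get(word.lower())
        if entry is None:
            unknown.append(word)  # no rule-based fallback in this sketch
        else:
            # n separators mean n + 1 syllables
            total += entry.count(MIDDLE_DOT) + 1
    return total, unknown
```

Returning the unknown words separately keeps the count honest: it covers only words the list actually resolves, and the leftovers show where rules or manual entries would be needed.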
Suggestions for Further Explorations
Copyright © 2000-2018, Gary Darby. All rights reserved.