Syllables

Problem Description

This program processes input English words and displays their syllabization (the words split into syllables).

Background & Techniques

Like many of the projects developed here at DFF, this one started as the result of a user inquiry, this one asking how to count the number of syllables in a text document. I didn't even ask why this might be important; I just wondered whether the problem could be solved. It turns out to be a hard problem because speech is a human endeavor. The phoneticists frequently disagree with print publishers. The phonics guys would prefer to divide "harder" as "har-der" based on pronunciation, but the publishers prefer "hard-er" because it is easier to understand if the word gets split across two lines. So "hard-er" wins in that case, but "win-dow" beats "wind-ow" every time, even though windows do keep out the wind :-).

The bottom line is that a pure "rules" approach to syllabizing is complex, and the many exceptions in English hurt its accuracy, so I chose a data-based approach augmented with internal rules for words not found in the words file. There is a large, publicly available file of about 180,000 syllabized, mostly English words compiled by Grady Ward (mhyph.txt). It is available at http://www.gutenberg.org/ebooks/3204 and is the file from which I derived my database.

For various reasons, I decided to "prune" the master file to include only those words which appear in Full.dic, the largest dictionary of my DFF dictionary processing unit, which contains about 63,000 words. About 41,000 of those can be resolved using the mhyph.txt master file and are saved as file Syllables.txt. Another 22,000 are resolved using internal rules. The final 500 or so are resolved by manually created entries in file SyllablesUpdate.txt. These lines are merged with the Syllables.txt file to produce the final SyllablesList.txt file used by the program in its syllabizing. Both files, Syllables.txt and SyllablesUpdate.txt, have similar formatting: one line per word, with the word followed by an equal sign followed by the syllabized version. In Syllables.txt the syllable separators are "center dot" characters (decimal 183 or hexadecimal B7). For ease of manual editing, "space" characters separate syllables in the SyllablesUpdate.txt file. These are replaced by the center dot character in the final SyllablesList.txt file.
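The merge step can be sketched roughly as follows. This is a minimal Python illustration, not the program's Delphi code. The function names `parse_line` and `merge_lists` are hypothetical, and letting a manual entry override a Syllables.txt entry for the same word is an assumption the text does not confirm.

```python
CENTER_DOT = "\u00b7"  # the center dot character, decimal 183 / hex B7


def parse_line(line, separator):
    """Split 'word=syllabized' into (word, syllabized form using center dots)."""
    word, _, syllabized = line.strip().partition("=")
    parts = syllabized.strip().split(separator)
    return word.strip(), CENTER_DOT.join(p for p in parts if p)


def merge_lists(syllables_lines, update_lines):
    """Merge both inputs into sorted 'word=syl·la·bized' lines."""
    merged = {}
    for line in syllables_lines:
        if "=" in line:
            word, syllabized = parse_line(line, CENTER_DOT)
            merged[word] = syllabized
    for line in update_lines:
        if "=" in line:
            # Assumption: a manual entry overrides an entry from Syllables.txt.
            word, syllabized = parse_line(line, " ")
            merged[word] = syllabized
    return [f"{word}={merged[word]}" for word in sorted(merged)]
```

For example, merging the line `window=win·dow` with a manual update line `zebra=ze bra` would give two output lines, with the spaces in the manual entry converted to center dots.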
 

Vocabulary

  • Nouns for the process: "syllabication" or "syllabification"
  • Verbs for the act: "syllabify", "syllabicate", or "syllabize"
  • Adjectives for words after undergoing syllabication: "syllabified", "syllabicated", or "syllabized"

The Files

  • Mhyph.txt - Grady Ward's file of 180,000 syllabized words and phrases from Gutenberg.org.
  • Syllables.txt contains the words from mhyph.txt which, with the "soft hyphens" removed, match words in our 62,000 word Full.dic dictionary file.  The format for each line is "original word = syllabized word".
  • SyllablesUpdate.txt contains local additions for words which do not exist in Syllables.txt and cannot be syllabized by internal program rules.
  • SyllablesList.txt is the merged version of Syllables.txt and SyllablesUpdate.txt. It is copied to a sorted internal "stringlist" structure within the program, which supports fast binary searching to find input words.
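The sorted-list lookup described for SyllablesList.txt can be illustrated with a short Python sketch. It is not the program's Delphi stringlist code: `build_index` and `lookup` are hypothetical names, the two parallel lists only stand in for a sorted string list with an attached value per entry, and storing the words in lowercase is an assumption.

```python
import bisect


def build_index(list_lines):
    """Return parallel lists (words, syllabized forms), sorted by plain word."""
    pairs = sorted(line.strip().split("=", 1) for line in list_lines if "=" in line)
    words = [word for word, _ in pairs]
    syllabized = [form for _, form in pairs]
    return words, syllabized


def lookup(words, syllabized, word):
    """Binary-search the sorted word list; return the syllabized form or None."""
    key = word.lower()  # assumption: the list holds lowercase words
    i = bisect.bisect_left(words, key)
    if i < len(words) and words[i] == key:
        return syllabized[i]
    return None
```

Because the list is kept sorted by the plain word, each lookup takes a number of comparisons proportional to the logarithm of the list size rather than a scan of all entries.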

Usage

Both the source and the executable downloads include the Syllables.txt and SyllablesUpdate.txt files. The several lookup buttons on the "Testing" page recognize when the required SyllablesList.txt file is missing or out of date and will automatically rebuild it before attempting a search.  One button on the "Settings" page rebuilds the Syllables.txt file by matching words from the Full.dic dictionary file against the Mhyph.txt file.  The other button forces a merge of the Syllables.txt and SyllablesUpdate.txt files as described above.
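How the program decides that the list file is "out of date" is not spelled out here. One plausible check, sketched in Python under the assumption that file modification times are compared, looks like this (`needs_rebuild` is a hypothetical name):

```python
import os


def needs_rebuild(target, sources):
    """True if target is missing or older than any existing source file."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # Assumption: "out of date" means some source was modified after the target.
    return any(
        os.path.exists(src) and os.path.getmtime(src) > target_time
        for src in sources
    )
```

A caller would pass the SyllablesList.txt path as `target` and the Syllables.txt and SyllablesUpdate.txt paths as `sources`, and run the merge first whenever the function returns True.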


Non-programmers are welcome to read on, but may want to jump to the bottom of this page to download the executable program now.

Programmer's Notes:

What was supposed to be a one-week project turned into two weeks and about 1000 lines of code.  But they were two satisfying weeks, spent overcoming lots of smallish problems, with only a few dead ends that required backing up to find a different route to the solution.  I ended up creating a TSyllables class in unit USyllables to handle the file updating and searching.

To keep the list sorted by the un-syllabized word, it was necessary to separate the search word from the syllabized version.  I created a TStrings object class that serves to hold the syllabized word in the stringlist's Objects field.

A significant part of the lookup problem was creating and applying rules to assist when the word is not in the SyllablesList file.  I suspect that consonants can be categorized as "hard" or "soft", controlling whether they create additional syllables when suffixes are added.  My approach was empirical: scan a dictionary and examine the results for words not syllabized or syllabized incorrectly.  Rules are generalized and defined by 3 strings:

  1. the letters to remove,
  2. the letters which replace the removed letters before a lookup against SyllablesList is attempted, and
  3. the letters to reinsert if the word is found.  The reinserted letters are always the same as those removed initially, with the addition of syllable separator characters as required.
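A three-string rule of this kind might be applied as in the Python sketch below. It is only an illustration under several assumptions: the rules are treated as suffix rules, the two entries in `RULES` are made-up examples rather than rules taken from the program, `known` is a plain dictionary standing in for the syllable list, and the replacement letters are assumed to be dropped again before the removed letters are reinserted.

```python
CENTER_DOT = "\u00b7"

# Hypothetical example rules: (letters to remove, replacement, letters to reinsert).
RULES = [
    ("ing", "", CENTER_DOT + "ing"),
    ("ies", "y", "ies"),
]


def syllabize_with_rules(word, known, rules=RULES):
    """Try each suffix rule in turn; return a syllabized form or None."""
    for remove, replace, reinsert in rules:
        if not word.endswith(remove) or len(word) <= len(remove):
            continue
        # Steps 1 and 2: strip the suffix and substitute the replacement letters.
        candidate = word[: -len(remove)] + replace
        found = known.get(candidate)
        if found is None:
            continue
        # Assumption: drop the temporary replacement letters again.
        stem = found[: len(found) - len(replace)]
        # Step 3: reinsert the removed letters, with separators as required.
        return stem + reinsert
    return None
```

With `known = {"walk": "walk", "party": "par·ty"}`, the first rule turns "walking" into "walk·ing" and the second turns "parties" into "par·ties", while a word that matches no rule returns None.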

Running/Exploring the Program 

With 60,000 words syllabized, including several hundred manually created entries and 20,000 processed by rules, there are probably dozens, perhaps hundreds, of syllabization errors.  I will be happy to receive corrections and suggestions for additions from users.

Note: Dictionaries are not required to check individual words or to scan text files, but they are required to run the "Scan dictionary" option used for testing.  Also, the Full.dic dictionary is used to select the words to be retained from the mhyph.txt file in Syllables.txt, so any significant changes to the dictionary will require rebuilding Syllables.txt to reflect what mhyph.txt says about the new or changed words.

 

Suggestions for Further Explorations

I didn't get around to counting syllables, but it should be easy to implement.  In each word, the number of syllables will be 1 greater than the number of soft hyphens (syllable separators) in the syllabized word.
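As a minimal Python sketch of that idea, assuming the syllabized form uses the center dot separator of the final list file (`count_syllables` is a hypothetical name, not part of the program):

```python
CENTER_DOT = "\u00b7"  # separator used in the final list file


def count_syllables(syllabized):
    """Number of syllables = number of separators + 1."""
    return syllabized.count(CENTER_DOT) + 1
```

Summing this count over the syllabized form of every word in a document would give the total the original inquiry asked for. Note that the function reports 1 for any string without separators, so empty input would need to be filtered out first.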
   
   

 

Original:  July 27, 2012

Modified:  October 23, 2016

 
Copyright 2000-2016, Gary Darby    All rights reserved.