Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3), 379-440.

Research Question

How do we account for the acceptability (well-formedness) of non-existing words?

e.g., why is blick more well-formed than bnick?

Acquisition model (Chomsky & Halle)

Learning data → AM (acquisition model) → Grammar

This paper proposes a model of the AM

Goal of the Learner

  1. Expressiveness
  2. Inductive Baseline

Definition of constraints is based on theoretical concepts (underspecification)

Linear: feature bundles, e.g. [+consonantal, +approximant], which picks out the liquids /l/ and /r/
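As a rough illustration of how an underspecified feature bundle picks out a natural class, the sketch below matches a bundle against a tiny feature table. The table, the function name natural_class, and the choice of segments are assumptions made for this example; it is not the paper's full feature system.

```python
# Hypothetical mini feature table (only two features, five segments).
FEATURES = {
    "l": {"consonantal": "+", "approximant": "+"},
    "r": {"consonantal": "+", "approximant": "+"},
    "w": {"consonantal": "-", "approximant": "+"},
    "p": {"consonantal": "+", "approximant": "-"},
    "n": {"consonantal": "+", "approximant": "-"},
}


def natural_class(bundle, features=FEATURES):
    """Return the segments whose values agree with every feature in the bundle.

    Features that the bundle does not mention are ignored (underspecification),
    so a shorter bundle describes a larger, more general class.
    """
    return sorted(
        seg
        for seg, values in features.items()
        if all(values.get(name) == value for name, value in bundle.items())
    )


liquids = natural_class({"consonantal": "+", "approximant": "+"})
approximants = natural_class({"approximant": "+"})
print(liquids)        # ['l', 'r']
print(approximants)   # ['l', 'r', 'w']
```

Dropping a feature from the bundle enlarges the class, which is why shorter featural expressions count as more general.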

Add more layers later

1) Autosegmental tiers: main tier, vowel tier,

2) Metrical grid: stress tier

The constraints defined here are not the same as those defined in the OT framework

  3. Gradient phonotactics

Categorical: okay vs. not okay

Gradient: okay → mostly okay → sometimes okay → maybe okay → not okay

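  4. Maxent

The maxent value of a form x determines its probability

Maxent(x) = exp(-h(x))

where h(x), the score of x, is the weighted sum of its constraint violations: h(x) = Σ_i w_i · C_i(x), with w_i the weight of constraint i and C_i(x) the number of times x violates it

The probability of x is its maxent value divided by the sum of the maxent values of all possible forms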

See Table 1 in the article
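A minimal sketch of the calculation, to make the definition concrete. The constraint names and weights below are invented for illustration and are not the values in the paper's Table 1.

```python
import math

# Hypothetical constraints and weights (NOT taken from the paper).
WEIGHTS = {"*bn": 3.0, "*[+nasal][+nasal]": 1.5}


def score(violations, weights=WEIGHTS):
    """h(x): sum over constraints of weight times number of violations."""
    return sum(weights[name] * count for name, count in violations.items())


def maxent_value(violations, weights=WEIGHTS):
    """Maxent(x) = exp(-h(x))."""
    return math.exp(-score(violations, weights))


# Violation counts for two nonexistent words (assumed for the example).
blick = {"*bn": 0, "*[+nasal][+nasal]": 0}
bnick = {"*bn": 1, "*[+nasal][+nasal]": 0}

print(maxent_value(blick))  # 1.0
print(maxent_value(bnick))  # exp(-3), about 0.0498
```

Neither word occurs in the lexicon, yet the form with the larger weighted violation total gets the smaller maxent value.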

My understanding

If an output (x) violates a lot of constraints, then the probability for this output to exist is low.

If an output (x) violates important constraints, then the probability for this output to exist is low. (Importance is expressed as weight.)

In other words,

If, under a set of constraints, an output (x) has a lower probability of occurring, then x is ill-formed.

If, under a set of constraints, an output (x) has a higher probability of occurring, then x is well-formed.

Therefore, well-formedness corresponds to probability under the grammar: the higher the maxent value, the more well-formed the form.

Question 1: Why not use “maximum probability”?

Maxent, although defined as probability, does not mean frequency or occurrence.

e.g., both “blick” and “bnick” have zero occurrences, but one has a higher maxent value than the other.

The value of Maxent depends on the value of h(x).

The value of h(x) depends on 1) how many times a constraint is violated and 2) how important this constraint is (its weight).

Therefore, to calculate maxent, we need to know

1) what the constraints are

2) the weight of each constraint.

  1. Weight: importance

Weights are found by an iterated hill-climbing search that compares the observed constraint violations in the learning data with the violations expected under the current weights
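The idea can be sketched on a toy problem. The sketch below swaps in plain gradient ascent for the paper's search procedure, uses a three-form candidate space so that the normalizing sum is trivial, and clamps weights at zero; the forms, constraints, data counts, step size, and the clamp are all assumptions made for the example.

```python
import math

# Toy candidate space: form -> violations of two hypothetical constraints
# (index 0: "*bn", index 1: "*labial+w").
CANDIDATES = {"bl": (0, 0), "bn": (1, 0), "pw": (0, 1)}

# Hypothetical learning data: "bl" is common, "pw" is rare, "bn" is unattested.
DATA = ["bl"] * 8 + ["pw"] * 2


def probabilities(weights):
    """P(x) = exp(-h(x)) / Z over the toy candidate space."""
    values = {
        form: math.exp(-sum(w * v for w, v in zip(weights, viols)))
        for form, viols in CANDIDATES.items()
    }
    z = sum(values.values())
    return {form: value / z for form, value in values.items()}


def learn(steps=2000, rate=0.1):
    """Climb the log-likelihood: the slope for each weight is E - O."""
    weights = [0.0, 0.0]
    observed = [
        sum(CANDIDATES[form][i] for form in DATA) / len(DATA) for i in range(2)
    ]
    for _ in range(steps):
        probs = probabilities(weights)
        expected = [
            sum(probs[form] * CANDIDATES[form][i] for form in CANDIDATES)
            for i in range(2)
        ]
        # Raise a weight when the grammar expects more violations than observed.
        weights = [
            max(0.0, w + rate * (e - o))
            for w, e, o in zip(weights, expected, observed)
        ]
    return weights


weights = learn()
probs = probabilities(weights)
print(weights)
print(probs)
```

In this toy setting the constraint against the unattested form ends up with the larger weight, while the constraint violated by the rare form settles at a moderate weight that leaves the rare form some probability.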

  1. Searching the space of possible constraints

Question: What is the selection problem? P.390

Space: natural classes determined by UG features

Number of constraints determined by the number of natural classes and features

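Accuracy: O/E, the number of observed violations divided by the number of expected violations (a lower ratio means a more accurate constraint)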


  1. search for shorter constraints first
  2. among constraints of the same length, search for more general featural expressions

The algorithm

Input: a set of segments classified by sets of features. Every class is a constraint.

  1. calculate the accuracy of each constraint
  2. divide constraints by accuracy level
  3. select the most general constraints within an accuracy level, train for its weight (maximize probability)
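The selection steps above can be sketched as follows. This is a simplified reading, not the paper's procedure: it uses the raw O/E ratio, a made-up schedule of accuracy thresholds, and a made-up numeric measure of generality. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    observed: float   # O: violations observed in the learning data
    expected: float   # E: violations expected under the current grammar (> 0)
    length: int       # number of feature bundles in the constraint
    generality: int   # hypothetical measure: how many segment strings it matches

    @property
    def accuracy(self):
        """Raw O/E ratio; lower means more accurate."""
        return self.observed / self.expected


def select(candidates, thresholds=(0.001, 0.01, 0.1, 0.3)):
    """Return the shortest, then most general, candidate in the strictest
    accuracy level that is non-empty, or None if no candidate qualifies."""
    for threshold in thresholds:
        pool = [c for c in candidates if c.accuracy < threshold]
        if pool:
            return min(pool, key=lambda c: (c.length, -c.generality))
    return None


candidates = [
    Candidate("A", observed=0, expected=50, length=2, generality=10),
    Candidate("B", observed=0, expected=40, length=1, generality=5),
    Candidate("C", observed=5, expected=20, length=1, generality=100),
]
print(select(candidates).name)  # B
```

After a constraint is selected, its weight would be trained together with the existing weights, and the expected counts would be recomputed before the next selection.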


  1. English onset

Download files here: http://www.linguistics.ucla.edu/people/hayes/Phonotactics/

Example demo with the UCLA Phonotactic Learner

Question: What are the constraints under this scheme?

Results generally confirm the findings of Scholes (1966)

The model performs better than the alternative models it is compared with

  1. Nonlocal phonotactics

A purely linear model doesn’t work → add a projection

Question: what is this projection?

I think it is just another layer. That is, there is a layer for the whole word (e.g. mVmV) and a layer for the vowels (e.g. V..V)
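Under that reading, a projection can be sketched as filtering the word down to the segments on one tier, so that vowels separated by consonants become adjacent. The vowel inventory, the function names, and the banned vowel pair are assumptions made for the example.

```python
# Hypothetical five-vowel inventory.
VOWELS = frozenset("aeiou")


def project(word, tier=VOWELS):
    """Keep only the segments of the word that belong to the tier."""
    return [seg for seg in word if seg in tier]


def tier_violations(word, banned_pair, tier=VOWELS):
    """Count adjacent pairs on the projected tier that match banned_pair."""
    projected = project(word, tier)
    return sum(
        1 for pair in zip(projected, projected[1:]) if pair == banned_pair
    )


print(project("mamo"))                      # ['a', 'o']
print(tier_violations("mamo", ("a", "o")))  # 1
print(tier_violations("mama", ("a", "o")))  # 0
```

On the full segment string the two vowels are never adjacent, so a constraint on vowel sequences can only be stated locally once the consonants are projected away.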

  1. Metrical Grid & 4. Whole language analysis


Differences between maxent approach and OT framework

  1. constraints are not universal
  2. constraints are weighted, not ranked.

Question: can we say a constraint with a higher weight ranks higher?

  3. there’s no input-output relationship


  1. a well-established mathematical foundation (i.e. the maxent model)
  2. it is flexible and sensitive to the range of frequencies in the learning data

e.g., the OT framework has trouble dealing with rarely occurring onsets such as /pw/. A word like “Puerto Rico” would force one to assign a lower ranking to [no labial+w] under the OT framework. However, [no labial+w] holds for almost all other English onset consonant clusters, which means it should rank higher. A weight-based system is therefore better suited to account for the rare occurrence of /pw/.

Further Discussions

  1. alternations
  2. typology
  3. hidden structure


  1. features
  2. accuracy and generality
  3. projections