SCA++ Syntax
This page documents the syntax used by SCA++.
Overview
SCA++ uses three input boxes, one for phoneme classes, one for rules (= sound changes), and one for the words that the changes should be applied to.
General Remarks
Whitespace in class definitions and rules is ignored. For example, the following two rules are equivalent:
a/b/_
a / b/ _
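As a rough illustration of the whitespace rule, the following Python sketch compares two rules after stripping all whitespace. The helper name normalize is hypothetical and not part of SCA++, and the sketch assumes the rule text carries no trailing comment.

```python
def normalize(rule: str) -> str:
    """Remove every whitespace character from a rule string."""
    return "".join(rule.split())


# Both spellings collapse to the same text, so they describe the same rule.
print(normalize("a/b/_") == normalize("a / b/ _"))  # prints True
```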
In the examples in this guide, remarks enclosed in brackets (), unless otherwise noted, are comments and serve only documentation purposes. They are not part of the syntax of SCA++.
Cheat Sheet
Basic Usage
t / d / _ (ta > da)
p, t, kʷ / b, d, gʷ / _ (pa, ta, kʷa > ba, da, gʷa)
p, t, kʷ / b / _ (pa, ta, kʷa > ba, ba, ba)
Environment
t / d / _a (ta, te > da, te)
a / i / t_ (ta, da > ti, da)
Word boundaries
d / t / _# (da, ad > da, at)
d / t / #_ (da, ad > ta, ad)
Optional and negated elements
d / t / _(a)r (der, dar, dr > der, tar, tr)
d / t / _~[a] (da, de > da, te)
Wildcards
d / t / _*r (dr, dar, d桜r > dr, tar, t桜r)
Rules
There are five types of rules, each of which uses a slightly different syntax.
Each rule must be on a single line.
Substitution Rules
Substitution rules have an input and output. Their syntax is:
input/output/context
NOTE: The first / can also be a > if that’s what you prefer. Semantically, there is no difference between the two.
Examples:
p / b / vowel_vowel (p > b between vowels)
pp, tt, kk / p, t, k /_# (word-final voiceless stops degeminate)
Input
The input of a substitution rule consists of comma-separated characters, classes, and combinations thereof.
Examples:
a (matches the character `a`)
abc (matches the characters `abc` in a row)
a, b, c (matches either `a`, or `b`, or `c`)
ad, bd, cd (matches either `ad`, or `bd`, or `cd`)
{ a, b, c }d (first matches either `a`, `b`, or `c`, and then `d`)
{ a, b, c }d, f (matches either the same as the previous line, or a single `f`)
Note that lines 4 and 5 are equivalent: they match the same thing and are just written differently.
The wildcard operator * may be used in place of a character and matches any one character.
Output
The output of a substitution rule is just like the input, except that:
The output must contain the same number of elements as the input:
a, b, c / e, f, g / _ (OK)
{ a, b, { c }} / e, { f, g } / _ (OK, classes are flattened)
{a, b, c}d / abcd, fegh, i / _ (OK, how complex each element is doesn't matter)
a, b / e, f, g / _ (Error: too many output elements)
a, b, c / e, f / _ (Error: not enough output elements)
The only exception to this is when you have exactly one output element, in which case you can have as many inputs as you like:
a, b, c / d / _ (OK, `a`, `b`, and `c` all become `d`)
The wildcard operator may not be used in the output (because what would that even mean?).
Percentages can be used to introduce irregularity: an element may be replaced with a class that contains percentage-qualified elements:
a, b, c / e, f, %{ 20%g, 40%h, r }
More on percentages later on.
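The element-count rule for outputs described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name pair_elements is hypothetical, inputs and outputs are assumed to be already flattened into plain lists of strings, and percentages are not handled.

```python
def pair_elements(inputs: list[str], outputs: list[str]) -> list[tuple[str, str]]:
    """Pair each input element with the output element that replaces it."""
    # Exactly one output element: every input maps to that same output.
    if len(outputs) == 1:
        return [(element, outputs[0]) for element in inputs]
    # Otherwise the counts must match exactly.
    if len(outputs) > len(inputs):
        raise ValueError("too many output elements")
    if len(outputs) < len(inputs):
        raise ValueError("not enough output elements")
    return list(zip(inputs, outputs))


print(pair_elements(["a", "b", "c"], ["e", "f", "g"]))
print(pair_elements(["a", "b", "c"], ["d"]))
```

Calling pair_elements(["a", "b"], ["e", "f", "g"]) raises a ValueError in this sketch, mirroring the "too many output elements" error case.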
Context
The context is the same for all rules and determines what must precede or follow the input for a rule to apply to it. The syntax of the context is as follows:
The context must contain exactly one underscore:
a / b / _ (OK)
a / b / ___ (OK, multiple consecutive underscores are treated as one)
a / b / (Error: no underscore in context)
a / b / _c_ (Error: multiple non-consecutive underscores in context)
A context containing only an underscore means that the input will be replaced with the output wherever it occurs. For example, the rule
a / b / _
means ‘every occurrence of a is replaced with b’.
The underscore may be preceded or followed by characters and classes. These indicate that the input must be preceded or followed by those characters and classes for a substitution to take place. The characters and classes that are part of the context themselves are not replaced. For example, the rule
a / b / c_d
means ‘a becomes b between c and d, but c and d remain unchanged’.
A # sign at the very beginning or end of whatever comes before or after the underscore indicates a word boundary. For example:
a / b / #_ (OK, a > b at the beginning of a word)
a / b / _# (OK, a > b at the end of a word)
a / b / #_# (OK, a > b if the word is ‘a’)
a / b / _c# (OK, a > b if followed by ‘c’ at the end of a word)
a / b / _#c (Error: characters after the end of a word are not allowed)
a / b / c#_ (Error: characters before the beginning of a word are not allowed)
The ~[] operator negates an element. This means a rule only applies if that element is not present at that position.
a / b / ~[c]_ (OK, a > b unless preceded by ‘c’)
a / b / _~[c] (OK, a > b unless followed by ‘c’)
a / b / _~[{c, d}]e (OK, a > b unless followed by ‘ce’, or ‘de’)
a / b / ~[]_ (Error, ‘~[]’ must contain an element)
Brackets () may be used to indicate optional elements, which may, but need not, be present:
a / b / _(c)e (OK, a > b before ‘e’ or ‘ce’)
a / b / _({c, d})e (OK, a > b before ‘e’, or ‘ce’, or ‘de’)
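To make the context semantics above more concrete, here is a minimal Python sketch that applies a single substitution with a literal context. It is not how SCA++ itself is implemented; the function name apply_rule is hypothetical, and the sketch assumes a non-empty literal input, literal context characters, and optional # boundaries only (no classes, wildcards, optional or negated elements).

```python
def apply_rule(word: str, src: str, dst: str, context: str) -> str:
    """Replace src with dst in word wherever the literal context matches."""
    if not src:
        raise ValueError("this sketch needs a non-empty input")
    # Split the context at the underscore; consecutive underscores count as one.
    left, _, right = context.replace(" ", "").partition("_")
    right = right.lstrip("_")
    # A leading '#' on the left or a trailing '#' on the right is a word boundary.
    at_start, at_end = left.startswith("#"), right.endswith("#")
    left, right = left.lstrip("#"), right.rstrip("#")

    out, i = [], 0
    while i < len(word):
        end = i + len(src)
        matches = (
            word.startswith(src, i)
            and word[:i].endswith(left)       # context is checked on the original word
            and word.startswith(right, end)
            and (not at_start or i - len(left) == 0)
            and (not at_end or end + len(right) == len(word))
        )
        if matches:
            out.append(dst)  # only the input is replaced, never the context
            i = end
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


print(apply_rule("cad", "a", "b", "c_d"))  # prints cbd
print(apply_rule("ad", "d", "t", "_#"))    # prints at
```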
Epenthesis Rules
Epenthesis rules are just like substitution rules, except that they have no input, and their output may contain only one element:
/ a / b_ (OK, insert ‘a’ after every ‘b’)
/ e / #_s (OK, insert ‘e’ before word-initial ‘s’)
/ a / _ (OK, insert ‘a’ absolutely everywhere (not recommended))
/ a, b / _ (Error: the output of an epenthesis rule may contain only one element)
/ {a, b}c / _ (Error: same as previous line, since this expands to `ac, bc`)
Deletion Rules
Deletion rules are the opposite of epenthesis rules: they have no output. However, their input may consist of more than one element:
a // b_ (OK, yeet ‘a’ after ‘b’)
e // _# (OK, yeet word-final ‘e’)
a // _ (OK, yeet ‘a’ everywhere)
a, e // _ (OK, yeet ‘e’ and ‘a’ everywhere)
// _ (Error, empty deletion rule)
Again, whitespace doesn’t matter, so whether you use // or / / here is up to you.
Metathesis Rules
Metathesis rules are identified by their ‘output’ consisting of &. A metathesis rule reverses each input element. Diacritics remain attached to the preceding character:
st / & / _ (OK, ‘st’ becomes ‘ts’)
st, zd / & / _# (OK, ‘st’ and ‘zd’ become ‘ts’ and ‘dz’ word-finally)
ɑ̃n̩e / & / s_ (OK, ‘ɑ̃n̩e’ becomes ‘en̩ɑ̃’ after ‘s’)
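The reversal with attached diacritics can be sketched in Python roughly as follows. The function name metathesize is hypothetical, and the sketch assumes that ‘diacritic’ means a Unicode combining mark; modifier letters such as ʷ are not combining marks and would be treated as separate characters here, which may differ from what SCA++ does.

```python
import unicodedata


def metathesize(element: str) -> str:
    """Reverse an element while keeping combining marks on their base character."""
    clusters: list[str] = []
    for ch in element:
        if unicodedata.combining(ch) and clusters:
            # Combining mark: attach it to the preceding character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


print(metathesize("st"))  # prints ts
# 'ɑ' + combining tilde, 'n' + combining vertical line below, then 'e'
print(metathesize("\u0251\u0303n\u0329e") == "en\u0329\u0251\u0303")  # prints True
```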
Reduplication Rules
Reduplication rules are identified by their ‘output’ consisting of one or more + signs. The input elements are repeated n times, where n is the number of + signs:
p, t, k / + / #_ (OK, geminate word-initial ‘p’, ‘t’, ‘k’)
s / ++++ / _ (OK, ‘s’ becomes ‘sssss’)
st / + / _ (OK, ‘st’ becomes ‘stst’)
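Going by the examples above, each + adds one extra copy of the matched element on top of the original. A small Python sketch of that reading (the function name reduplicate is hypothetical):

```python
def reduplicate(element: str, pluses: str) -> str:
    """Return the element followed by one extra copy per '+' sign."""
    if not pluses or set(pluses) != {"+"}:
        raise ValueError("expected one or more '+' signs")
    return element * (len(pluses) + 1)


print(reduplicate("s", "++++"))  # prints sssss
print(reduplicate("st", "+"))    # prints stst
```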
Classes
Classes can be defined in the ‘Classes’ input box, in which case they are assigned a name and can be referred to by that name in rules and following definitions.
The syntax for a class definition is as follows:
class-name = { characters }
The class name consists of one or multiple characters and may contain any character that doesn’t have a special meaning (as # or / do).
The characters inside the class definition are sequences of characters that are separated by commas. You can also define classes in terms of other classes:
front = { i, e }
back = { u, o }
vowels = { front, back }
In the example above, the classes front and back in the definition of vowels are expanded right then and there, yielding { i, e, u, o }.
Using Classes in Rules
Classes denote alternatives and normally simply expand to their containing elements. The following are all equivalent:
{a, b, c}
{{a, b, c}}
{{a, b}, c}
{a, {b, c}}
{{a}, b, c}
{a, {b, {c}}}
{ {{a}}, {{{{ b, {{c}} }}}} }
As we have done multiple times already, we can also use classes directly in a rule without assigning them a name first. For example, assuming vowels is defined as above, the rules below are equivalent:
ai, ae, au, ao / a, b, c, d / _
a{vowels} / a, b, c, d / _
a{i, e, u, o} / a, b, c, d / _
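The expansion of a class inside an element, as in the equivalent rules above, can be pictured as a product over alternatives. The following Python sketch assumes the element has already been parsed into a list of parts, where each part is either a literal string or a list of alternatives; the function name expand and this representation are hypothetical.

```python
from itertools import product


def expand(parts: list) -> list[str]:
    """Expand a sequence of literals and classes into all matching strings."""
    # Treat a literal as a class with a single alternative.
    choices = [part if isinstance(part, list) else [part] for part in parts]
    return ["".join(combo) for combo in product(*choices)]


vowels = ["i", "e", "u", "o"]
print(expand(["a", vowels]))           # prints ['ai', 'ae', 'au', 'ao']
print(expand([["a", "b", "c"], "d"]))  # prints ['ad', 'bd', 'cd']
```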
IMPORTANT: Class names must be separated from surrounding ordinary characters (that is, characters without a special meaning, unlike { or /) by an extra pair of {}.
If you were to write avowels rather than a{vowels}, SCA++ would interpret avowels either as the name of a class, or, since we haven’t defined any class with that name, as the character sequence a v o w e l s.
For example, assuming we have the following class definitions:
FS = { a, b }
SR = { b, c }
FSR = { o, p, q }
We can use them as follows:
FSR (equivalent to ‘{ o, p, q }’)
{FS}R (equivalent to ‘{ a, b }R’)
F{SR} (equivalent to ‘F{ b, c }’)
Definition Order
Class definitions are processed top to bottom. The following is valid, but does not do what you might think it does:
vowels = { front, back }
front = { i, e }
back = { u, o }
vowels-or-q = { vowels, q }
In this case, vowels is defined in terms of front and back, but front and back are not defined yet and are just treated as the character sequences f r o n t and b a c k. The vowels class is thus equivalent to
{f, r, o, n, t, b, a, c, k}
This is because a class definition is expanded as soon as it is encountered. Here’s another example. Consider the definition of vowels-or-q above. In it, we’re using the vowels class, which we defined in terms of front and back.
However, front and back in the definition of vowels will always have the meaning that they had at the time vowels was defined. This means that vowels-or-q is NOT defined as { i, e, u, o, q }, but rather as
{f, r, o, n, t, b, a, c, k, q}
This behaviour is necessary, because otherwise, the following might lead to complications:
a = { b }
b = { a }
If forward references to classes were allowed, this would lead to problems: in the example above, we would be defining a in terms of b, which we’re defining in terms of a, which we’re defining in terms of b, and so on. It would never stop.
This is why class definitions are processed in order. Doing so solves this problem: in the example above, the class a is defined as a class containing only the character b, and the class b is then defined to be the same as the class a.
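The top-to-bottom processing described above can be sketched in Python as follows. This is a simplified illustration, not the actual SCA++ implementation: the function name define_classes is hypothetical, nested braces and operators are not handled, and a name that has not been defined yet is simply kept as literal text.

```python
def define_classes(lines: list[str]) -> dict[str, list[str]]:
    """Process class definitions in order, expanding known names immediately."""
    classes: dict[str, list[str]] = {}
    for line in lines:
        name, body = line.split("=", 1)
        items = [item.strip() for item in body.strip().strip("{}").split(",")]
        expanded: list[str] = []
        for item in items:
            # A class defined earlier is expanded right away; anything else
            # (including a class defined later) is treated as plain characters.
            expanded.extend(classes.get(item, [item]))
        classes[name.strip()] = expanded
    return classes


print(define_classes(["a = { b }", "b = { a }"]))
# prints {'a': ['b'], 'b': ['b']}
print(define_classes(["front = { i, e }", "back = { u, o }", "vowels = { front, back }"])["vowels"])
# prints ['i', 'e', 'u', 'o']
```

Because each definition is fully expanded before the next one is read, the a/b example terminates: by the time b is defined, a is already just the character b.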
Operators
Because classes are very similar to sets, we can apply set-theoretical operations to them to construct new classes.
The Difference Operator
The binary ~ operator is used to construct new classes by removing characters from a class. Its left-hand side should be a class, but its right-hand side may be either a class or simply a character. Assuming FS, SR, and FSR are defined like so:
FS = { a, b }
SR = { b, c }
FSR = { o, p, q }
We then get:
FSR~o (Equivalent to ‘{ p, q }’)
FSR~d (No effect since ‘FSR’ doesn't contain ‘d’; same as ‘FSR’)
FSR~{o, p} (Equivalent to ‘{q}’)
FS~SR (Equivalent to ‘{a}’)
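The results above correspond to ordinary set difference. A minimal Python sketch, representing classes as lists of strings (the function name difference and this representation are hypothetical):

```python
def difference(cls: list[str], rhs) -> list[str]:
    """Remove from cls every element found in rhs (a class or a single character)."""
    to_remove = {rhs} if isinstance(rhs, str) else set(rhs)
    return [element for element in cls if element not in to_remove]


FS, SR, FSR = ["a", "b"], ["b", "c"], ["o", "p", "q"]
print(difference(FSR, "o"))         # prints ['p', 'q']
print(difference(FSR, "d"))         # prints ['o', 'p', 'q']
print(difference(FSR, ["o", "p"]))  # prints ['q']
print(difference(FS, SR))           # prints ['a']
```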
This is called the ‘difference’ operator because it computes the set difference between two classes.
Other operators
A detailed explanation of all of these will be provided in the near future.
The * operator computes the cartesian product of two classes.
The + operator concatenates classes element by element.
The | operator computes the union of two classes.
The & operator computes the intersection of two classes.
Grammar Specification
This section is intended as a formal specification of the syntax of SCA++. You probably want to skip it if you’re not a programmer.
Terminals are in all-caps and are not further elaborated on in here.
See lib/parser.hh for a list of all tokens, which more or less correspond to the terminals.
<rule> ::= <substitution-rule>
| <epenthesis-rule>
| <deletion-rule>
| <metathesis-rule>
| <reduplication-rule>
<class-def> ::= TEXT [ "=" ] <simple-el>
<substitution-rule> ::= <input> SEPARATOR <output> <context>
<epenthesis-rule> ::= SEPARATOR <output> <context>
<deletion-rule> ::= <input> SEPARATOR <context>
<metathesis-rule> ::= <input> SEPARATOR "&" <context>
<reduplication-rule> ::= <input> SEPARATOR "+" { "+" } <context>
<input> ::= <input-els> { "," <input-els> }
<input-els> ::= { <input-el> }+
<input-el> ::= <simple-el> | "*"
<output> ::= <output-els> { "," <output-els> }
<output-els> ::= { <output-el> }+
<output-el> ::= <percent-alternatives> | <simple-el>
<context> ::= SEPARATOR [ <ctx-els> ] { USCORE }+ [ <ctx-els> ] EOL
<ctx-els> ::= <decorated-els>
<decorated-els> ::= { <decorated-el> }+
<decorated-el> ::= <simple-el>
| <boundaries>
| "(" <decorated-els> ")"
| "~" "[" <decorated-els> "]"
<boundaries> ::= { "#" | "$" }
<percent-alternatives> ::= PERCENTAGE <percent-class>
<percent-class> ::= "{" <percent-list> "}"
<percent-list> ::= <percent-els> { "," <percent-els> }
<percent-els> ::= [ PERCENTAGE ] { <percent-el> }+
<percent-el> ::= ( TEXT | <percent-class> )
<simple-el> ::= TEXT | <simple-el-class>
<simple-el-class> ::= <simple-el-class-lit> { <set-op> <simple-el-rhs> }
<simple-el-rhs> ::= <simple-el-class-lit> | TEXT
<simple-el-class-lit> ::= CLASS-NAME | "{" <simple-el-list> "}"
<simple-el-list> ::= <simple-els> { "," <simple-els> }
<simple-els> ::= { <simple-el> }+
<set-op> ::= "~" | "&" | "*" | "|"