Algorithm to figure out how many "base words" you can create given a small and configurable set of "suffixes" (and given a finite set of sounds)?


Background

This is a question about how to create a system for an agglutinative language (a language with an effectively unlimited number of words, each formed from a base verb/noun plus several suffixes, like Turkish, Finnish, Navajo, Swahili, etc.). Right now I'm facing a conundrum, and so I previously asked about how to resolve ambiguity in agglutinative languages when the suffix might be included in the base word.

For example, say we have the base words "dar" (tree) and "darsi" (store), and the suffix "si" (to go to). Then "darsisi" means "to go to the store", and "darsi" means "to go to the tree"? Nope, because "darsi" means "store" from our previous definition! So there is a constraint: the suffixes cannot collide with the base words in any way, shape, or form. ChatGPT says that Turkish doesn't ever run into this problem (I don't know Turkish), so somehow they have solved it.

I would like to treat it like an algorithm problem here to see if we can make more sense of it, what is and is not possible.

Situation: Constructing Suffixes

So say we have these consonant sounds (C = consonant):

  • b d f g k l m n p r s t v z (14 sounds, the easy to recognize letter-sounds, assume they are always pronounced in only one way).

And we have these 5 vowel sounds (V = vowels)

  • i ("ee") e ("ay") a ("ah") o ("oh") u ("oo")

Then assume suffixes are always built from CV syllables, so each suffix is either CV (1 consonant, 1 vowel; 1 syllable) or CVCV (2 syllables). Here are a few example suffixes, but say we created 100 of them in total:

  • do re mi fa so la ti do

Since there are 14 consonants and 5 vowels, there are 14 × 5 = 70 possible CV suffixes and 70^2 = 4,900 possible CVCV suffixes, so 4,970 combinations in total. So choosing 100 seems fine.
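As a sanity check on that count, here is a minimal sketch that enumerates the CV syllables from the 14 consonants and 5 vowels listed above. The variable names are only for illustration.

```javascript
// The 14 consonants and 5 vowels from the lists above.
const consonants = 'bdfgklmnprstvz'.split('')
const vowelList = 'ieaou'.split('')

// Every CV syllable: one consonant followed by one vowel.
const syllables = []
for (const c of consonants) {
  for (const v of vowelList) {
    syllables.push(c + v)
  }
}

const cvCount = syllables.length // 14 * 5 = 70
const cvcvCount = cvCount ** 2 // any CV followed by any CV = 4900
console.log(cvCount, cvcvCount, cvCount + cvcvCount) // 70 4900 4970
```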

Situation: Constructing Base Verbs/Nouns

Given we have our ~100 suffixes, which will be things like tense, mood, aspect, person/number, etc., in grammar, we need to come up with our base nouns and verbs to attach our suffixes to!

But the catch is, you should try to avoid using the suffix in the base word!

Say our base words can be 2-3 syllables, CVCV, or CVCVCV.

  • bora
  • duri
  • sopa
  • gumo

No conflict with the suffixes I listed above yet! But if we have 100 suffixes, we are going to encounter conflicts:

  • bodo (conflict = do)
  • dati (ti)
  • etc.

Problem

So given that, how can you construct the possible lists of suffixes and base words programmatically? You can choose to limit the sound structure to CV and CVCV like I did, or you can make it more generic with shapes like CCV (bro), CVCC (bark), etc., using the 14 consonants and 5 vowels outlined above.

  • There should be more than 40 suffixes, and there is no need for more than 120 for this experiment.
  • The base words need to number in the 4,000-10,000 range, and should be no longer than 2-3 syllables (max 4 syllables, on rare occasion, if necessary).

The main reason for asking this question is, if we use up all the "good" simple CV sequences for our suffixes, and we should avoid using those sequences in a base word, will we have enough base words to total 4k-10k base words? Will it be hard to get to that number, or really easy? I can't see it at this point...
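To get a rough feel for the numbers, here is a back-of-the-envelope sketch. It assumes the strictest possible rule, which is my own simplification: some number of the 70 CV syllables are reserved for suffixes and banned from every position in a base word, and bases are CVCV or CVCVCV only. The function name countStrictBases is hypothetical.

```javascript
// Assumption: `reservedSyllables` of the 70 CV syllables are used as suffixes
// and may not appear anywhere inside a base word (strictest possible rule).
function countStrictBases(reservedSyllables, totalSyllables = 70) {
  const free = totalSyllables - reservedSyllables
  // CVCV bases (2 free syllables) plus CVCVCV bases (3 free syllables).
  return free ** 2 + free ** 3
}

for (const k of [10, 40, 54, 55, 60]) {
  console.log(k, countStrictBases(k))
}
// 10 219600
// 40 27900
// 54 4352
// 55 3600
// 60 1100
```

Under that assumption, reserving 40 syllables still leaves 27,900 candidate bases, and the count only drops below 4,000 once 55 or more syllables are reserved. A looser rule (for example, only banning suffix syllables at the end of a base) would leave even more.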

Some other notes:

  • You don't have to be extra rigid other than using the 14 + 5 sounds/letters. The syllable structure can be varied, and perhaps, if you can explain how it's not a problem, you can reuse the suffixes in the base words in some fashion (so it's not as if you just do if base.indexOf(suffix) === -1 then ok).
  • There can be a large number of suffixes (see the longest Turkish word "muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine", 70 letters, I don't know how many suffixes but probably over 7). So since you can chain suffixes together, you might run into problems with the suffix sequence being in the base word. But know that when you hear the words with suffixes, it is as if you are streaming in one affix at a time, so you parse it out like that, not looking backward at the whole string.

Ideally this could be done in JavaScript, which is the language I am most familiar with, so I can mentally parse it and perhaps run it.

Notes on Turkish

Turkish gets hardcore with it. Here is a list of all Turkish suffixes (800+ I think). Some are iffy, maybe not a perfect list. But some suffixes are 1 vowel, like -i, how do they manage that!? I don't get it yet....

Starting attempt

Here is my start on the suffixes (note that it uses onset clusters and a few extra letters beyond the 14 consonants above). With only CV there are only ~200 possibilities, and adding CVCV where both vowels are the same pushes it into the thousands. I would like a sweet spot of like 400-800 suffixes TBH, so I can use only a fraction of them, but hey.

const vowels = `i
e
a
o
u`.split(/\n+/)

const starts = `b
bl
br
c
d
dr
f
fl
fr
g
gl
gr
h
k
kl
kr
kw
l
m
n
p
pl
pr
r
s
sk
sl
sp
spr
st
str
t
tr
tw
v
w
z
`.trim().split(/\n+/)

const mids = `b
c
rc
d
rd
ld
nd
f
lf
g
lg
k
lk
sk
l
m
rm
n
rn
p
rp
r
s
rs
ps
ts
t
st
nt
v
rv
z
rz
lz
`.trim().split(/\n+/)

const suffixEnds = `b
c
d
f
g
k
l
m
n
p
r
s
t
v
z
`.trim().split(/\n+/)

const baseEnds = `b
c
d
f
g
k
l
m
n
p
r
s
t
v
z
`.trim().split(/\n+/)

// Build suffix candidates: CV, plus CVCV where both vowels are the same.
const suffixes = []
for (const c of starts) {
  for (const v of vowels) {
    suffixes.push(`${c}${v}`)
    for (const e of suffixEnds) {
      const k = `${c}${v}${e}${v}`
      // Skip sequences that are awkward to pronounce.
      if (k.match(/yi|w[uo]/)) {
        continue
      }
      suffixes.push(k)
    }
  }
}
console.log(suffixes.join('\n'))
console.log(suffixes.length)

// Build base-word candidates: CVC, plus CVCVC with a middle consonant (cluster).
const bases = []
for (const c of starts) {
  for (const v of vowels) {
    // One syllable: start + vowel + final consonant.
    for (const e of baseEnds) {
      const k = `${c}${v}${e}`
      if (!k.match(/yi|w[uo]/)) {
        bases.push(k)
      }
    }

    // Two syllables: start + vowel + middle + vowel + final consonant.
    for (const m of mids) {
      for (const v2 of vowels) {
        for (const e of baseEnds) {
          const k = `${c}${v}${m}${v2}${e}`
          if (!k.match(/yi|w[uo]/)) {
            bases.push(k)
          }
        }
      }
    }
  }
}
// console.log(bases.join('\n'))
console.log(bases.length)

When I add the attempt at base words, I easily get hundreds of thousands. But I don't know how to tell whether there are going to be conflicts with the suffix chains. How can you figure that out? I think partly we need a clearer definition of what it means to be unambiguous, because that in itself is ambiguous to me, haha. But I'm currently lost on it.
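One possible way to pin down "unambiguous" is to call a word ambiguous when it can be split into one base plus a chain of suffixes in more than one way. Here is a minimal sketch under that assumed definition. The function name countParses is hypothetical, the base and suffix lists are assumed to contain no duplicates or empty strings, and sound changes at the boundaries are ignored.

```javascript
// Counts the ways `word` splits into one base followed by zero or more suffixes.
function countParses(word, bases, suffixes) {
  // ways[i] = number of ways word.slice(i) splits into a sequence of suffixes.
  const ways = new Array(word.length + 1).fill(0)
  ways[word.length] = 1 // the empty remainder has exactly one (empty) split
  for (let i = word.length - 1; i >= 0; i--) {
    for (const s of suffixes) {
      if (word.startsWith(s, i)) {
        ways[i] += ways[i + s.length]
      }
    }
  }
  // Try every base as the prefix and add up the splits of what follows it.
  let total = 0
  for (const b of bases) {
    if (word.startsWith(b)) {
      total += ways[b.length]
    }
  }
  return total
}

console.log(countParses('darsi', ['dar', 'darsi'], ['si'])) // 2 (dar-si, or darsi alone)
console.log(countParses('bora', ['bora', 'duri'], ['do', 'ti'])) // 1
```

A result of 0 means the word cannot be parsed at all, 1 means it is unambiguous, and anything higher means there is a collision like the "darsi" example from the background section.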

Some more constraints

To make it possibly easier to solve, what if we made it so:

  1. All suffixes end in a vowel. 1-2 syllables.
  2. All base words end in a consonant. 1-2 syllables.

Does that make it any easier?

For example:

  • bases: barat
  • suffixes: ba, ra, ta

Then we might have:

baratba
baratbara
baratbarata

I still can't see if this would run into any problems in the future... I'm looking to figure out how to tell if it will run into problems. (And I'm amazed that natural agglutinative languages have roughly avoided this problem, so crazy!)

Actually, this will run into a problem if we have a suffix which is two syllables, like having all these:

  • ba, ra, ta, bara, rata
  • base: barat

Then we can get barat-ba-ra or barat-bara, so somehow you have to cut those out too I think.
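A first filter for this particular case could look for suffixes that can themselves be spelled as a chain of two or more other suffixes. This is only a sketch, and the function name findCompositeSuffixes is hypothetical.

```javascript
// Returns the suffixes that can be written as a chain of 2+ suffixes from the same set.
function findCompositeSuffixes(suffixes) {
  const set = new Set(suffixes)
  // True if `rest` can be fully consumed by suffixes, using at least 2 pieces overall.
  const canSplit = (rest, piecesSoFar) => {
    if (rest === '') return piecesSoFar >= 2
    for (const piece of set) {
      if (piece !== '' && rest.startsWith(piece)) {
        if (canSplit(rest.slice(piece.length), piecesSoFar + 1)) return true
      }
    }
    return false
  }
  return [...set].filter((s) => canSplit(s, 0))
}

console.log(findCompositeSuffixes(['ba', 'ra', 'ta', 'bara', 'rata']))
// [ 'bara', 'rata' ]
```

Removing the reported suffixes is necessary but, as far as I can tell, not sufficient. For example, the set ba, ta, bara, rata has no composite suffix, yet "barata" can still be read as ba-rata or bara-ta, so overlapping suffixes need a separate check.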
