Heaps Law with Notes-Durations, Part IV

This histogram shows that the shuffled Heaps β values are not merely “high on average,” but statistically stable and approximately normally distributed around a central value near 0.41. The distribution is narrow: most runs cluster tightly between 0.395 and 0.425, the range marked on the plot as the 95 percent confidence interval. That visually confirms the low standard deviation of 0.013. The shuffled β value is not fragile or highly dependent on a particular random permutation, which indicates that the vocabulary-growth behavior is intrinsic to the corpus itself.

Scientifically, this means the result is reproducible under random reorderings.
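For readers who want to try this kind of shuffle test themselves, here is a minimal sketch. It rests on several assumptions that are not taken from the Skiptune code: each tune is a list of hashable note tokens such as (pitch, duration) tuples, shuffling is done at the level of whole tunes, and β is the least-squares slope of log vocabulary size against log token count. The function names heaps_beta, shuffled_betas, and summarize are hypothetical.

```python
import math
import random
import statistics


def heaps_beta(tunes):
    """Least-squares slope of log(vocabulary size) against log(tokens read).

    Assumes `tunes` is a list of tunes, each a list of hashable tokens,
    and that the corpus holds at least two tokens in total.
    """
    seen = set()
    xs, ys = [], []
    count = 0
    for tune in tunes:
        for token in tune:
            count += 1
            seen.add(token)
            xs.append(math.log(count))
            ys.append(math.log(len(seen)))
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def shuffled_betas(tunes, runs=1000, seed=0):
    """Refit beta after shuffling the order of whole tunes, `runs` times."""
    rng = random.Random(seed)
    betas = []
    for _ in range(runs):
        order = list(tunes)
        rng.shuffle(order)
        betas.append(heaps_beta(order))
    return betas


def summarize(betas):
    """Return mean, sample standard deviation, and the central 95 percent range."""
    cuts = statistics.quantiles(betas, n=40)  # cut points every 2.5 percent
    return statistics.fmean(betas), statistics.stdev(betas), cuts[0], cuts[-1]
```

Calling summarize(shuffled_betas(tunes)) on a real corpus would give the kind of numbers discussed here: a center, a spread, and the range holding the middle 95 percent of runs. Fixing the seed makes the set of permutations repeatable from one session to the next.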

The distribution is roughly bell-shaped, suggesting no evidence of multiple competing vocabulary-growth regimes. There’s no indication that some shuffles produce radically different statistical structures. In practical terms, the corpus behaves like a statistically homogeneous symbolic system once ordering effects are removed.

If the histogram had multiple peaks, that would suggest an unstable vocabulary structure, multiple hidden corpus populations, or sensitivity to ordering.

Notice, however, that the right tail is slightly longer than the left, which makes the distribution appear mildly right-skewed. There are a few runs approaching 0.45 to 0.47, while very few fall below 0.38. That may simply reflect the finite vocabulary ceiling that comes from using a note (a pitch-duration pair) as a vocabulary word.
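The visual impression of a longer right tail can be checked with a number. The sketch below computes the standard moment-based skewness of a list of β values; the function name sample_skewness is hypothetical, and the input is assumed to be the shuffled β values from the runs. A value near zero means a symmetric distribution, and a positive value means the right tail is the longer one.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: the mean cubed z-score of the values.

    Positive means a longer right tail, negative a longer left tail.
    Assumes the values are not all identical (nonzero standard deviation).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])
```

A mildly right-skewed histogram like the one described here would be expected to give a small positive value.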

The histogram visually reinforces the importance of the original β of 0.3185, which lies far outside the center of this distribution and even well outside the 95 percent confidence interval. We can’t even mark it on the plot because the horizontal axis does not extend that far to the left.

The difference is not subtle. The shuffled cloud centers around a β of 0.41, while the original ordering is around 0.32, a difference of 0.09. With a standard deviation of 0.013, the original β lies roughly seven standard deviations below the shuffled mean. Statistically, that is enormous and strongly implies that the original corpus ordering contains genuine structural clustering rather than random sequencing effects.
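The arithmetic behind the "seven standard deviations" figure is short enough to spell out. This sketch uses only numbers quoted in this post: the original β of 0.3185, the shuffled mean of 0.4096, and the shuffled standard deviation of 0.013. The variable names are ours.

```python
original_beta = 0.3185   # beta for the corpus in its as-entered order
shuffled_mean = 0.4096   # mean beta over the 1,000 shuffled runs
shuffled_sd = 0.013      # standard deviation of the shuffled betas

difference = shuffled_mean - original_beta
z_score = (original_beta - shuffled_mean) / shuffled_sd

print(f"difference = {difference:.4f}")  # difference = 0.0911
print(f"z-score = {z_score:.2f}")        # z-score = -7.01
```

The negative sign simply records that the original β sits below the shuffled mean.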

The low β value for the “as entered” database means that melodies near each other in the corpus tend to share vocabulary, that stylistic neighborhoods exist, and that historical or genre adjacency suppresses vocabulary discovery. That feature closely parallels natural-language corpora. For example, reading one Dickens novel after another introduces fewer new words than randomly mixing scientific papers, novels, recipes, and legal documents. Topic clustering suppresses Heaps growth. The Skiptune corpus appears to exhibit the same phenomenon musically.

All this makes sense when we remember that we fell into a rhythm of entering 100 tunes in a row of the same genre. They might be all show tunes, or all Irish music, or all dance music, or all Beethoven compositions. Because tunes were entered in blocks of related material, each hundred tunes would tend to reuse the same local vocabulary: the same idiom, the same interval patterns, the same rhythmic figures, the same key habits, and the same genre conventions. So within each block, the database would discover far fewer new note-duration tokens after the first few tunes in that block.

The effect on the Heaps curve is that it becomes “stair-stepped” or locally flattened. It rises when a new stylistic block begins, then flattens as that block repeats already-seen tokens. On a log-log fit, that suppresses the estimated β, which is consistent with our results. The original β was 0.3185, while the 1,000 shuffled runs centered around 0.4096. Shuffling destroys those hundred-tune genre/composer/source clusters, so rare or style-specific tokens get spread more evenly through the corpus, and the fitted β rises.
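One way to look for this stair-step pattern directly is to count how many never-before-seen tokens each tune contributes, then total those counts per block of tunes. The sketch below is illustrative only. It assumes each tune is a list of hashable tokens such as (pitch, duration) tuples, and the function names and the toy data are hypothetical, not drawn from the Skiptune database.

```python
def new_tokens_per_tune(tunes):
    """For each tune, in order, count its tokens that no earlier tune used."""
    seen = set()
    counts = []
    for tune in tunes:
        fresh = set(tune) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def new_tokens_per_block(tunes, block_size=100):
    """Total the new-token counts over consecutive blocks of `block_size` tunes."""
    per_tune = new_tokens_per_tune(tunes)
    return [sum(per_tune[i:i + block_size])
            for i in range(0, len(per_tune), block_size)]


# Toy corpus: two tunes of one style followed by two tunes of another.
toy = [
    [("C4", 1.0), ("D4", 1.0)],
    [("D4", 1.0), ("C4", 1.0)],
    [("G4", 0.5), ("A4", 0.5)],
    [("A4", 0.5), ("G4", 0.5)],
]
print(new_tokens_per_tune(toy))  # [2, 0, 2, 0]
```

In the toy corpus, new tokens arrive only at the first tune of each style, which is the rise-then-flatten shape described above. Running the same count on a shuffled copy of a real corpus should show new tokens arriving more evenly.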

So the right interpretation is that our order of entry suppressed the fitted β, because we entered tunes in blocks of the same kind. The shuffled β around 0.41 is probably a better estimate of the corpus’s underlying vocabulary-growth behavior when ordering bias is removed.

The original β is still musically meaningful because it shows that stylistic neighborhoods in melody are real: Irish tunes, Beethoven themes, musical-theater songs, and dance tunes each reuse characteristic pitch-duration vocabulary.

Finally, the histogram says something encouraging about AI training. Because the shuffled β values concentrate tightly around a stable mean, the corpus has a predictable statistical structure, token recurrence rates are highly regular, and a model trained on it should encounter stable long-run frequency distributions.
