The Persephone API

In this section we discuss the application programming interface (API) exposed by Persephone. We begin with descriptions of the fundamental classes included in the tool. Model training pipelines are described by instantiating these classes. Consider the following example for a preliminary look at how this works:

# Create a corpus from data that has already been preprocessed.
# Among other things, this will divide the corpus into training,
# validation and test sets.
from persephone.corpus import Corpus
corpus = Corpus(feat_type="fbank",
                label_type="phonemes",
                tgt_dir="/path/to/preprocessed/data")  # placeholder path

# Create an object that reads the corpus data in batches.
from persephone.corpus_reader import CorpusReader
corpus_reader = CorpusReader(corpus, batch_size=64)

# Create a neural network model (LSTM/CTC model) and train
# it on the corpus.
from persephone.rnn_ctc import Model
model = Model("/path/to/experiment/directory", corpus_reader)
model.train()

This will train and evaluate a model, storing information related to the specific experiment in /path/to/experiment/directory.

In the next section we take a closer look at the classes that make up this example and cover additional functionality, such as loading speech and transcriptions from ELAN files and specifying how the raw transcription text is preprocessed.

On the horizon, but still to be implemented, is a way of describing these pipelines and the interactions between classes that is compatible with the YAML files of the eXtensible Neural Machine Translation toolkit (XNMT).

Fundamental classes

The four key classes are the Utterance, Corpus, CorpusReader, and Model classes. Utterance instances make up Corpus instances, which are read by CorpusReader instances and fed into Model instances.

class persephone.utterance.Utterance

An immutable object that represents a single utterance.

Utterance instances capture key data about short segments of speech in the corpus. Their most important role is in representing transcriptions in various states of preprocessing. For instance, Utterance instances may be created when reading from a linguist's transcription files, in which case their text attribute is a raw, unpreprocessed transcription. These Utterance instances may then be fed to a function that preprocesses the text, returning new Utterance instances with, say, phonemes delimited by spaces so that they are in an appropriate format for model training.

Note that Utterance instances are not required as arguments to Corpus constructors. They exist to aid in preprocessing.


A pathlib.Path to the original source audio that contains the utterance (which may comprise many utterances).


A pathlib.Path to the source of the transcription of the utterance (which may comprise many utterances in the case of, say, ELAN files).


A string identifier for the utterance which is used to prefix the target wav and transcription files, which are called <prefix>.wav, <prefix>.phonemes, etc.


An integer denoting the offset, in milliseconds, of the utterance in the original media file found in org_media_path.


An integer denoting the endpoint, in milliseconds, of the utterance in the original media file found in org_media_path.


A string representation of the transcription.


A string representation of the speaker of the utterance.
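
The idea of an immutable record that preprocessing steps copy rather than modify can be sketched with typing.NamedTuple. This is an illustration only, not Persephone's actual definition: the class name SketchUtterance, the toy lowercase_text step, the sample values, and every field name other than org_media_path and text are assumptions made for the example.

```python
from pathlib import Path
from typing import NamedTuple


class SketchUtterance(NamedTuple):
    """Illustrative stand-in for an utterance record; most field names are assumed."""
    org_media_path: Path          # source audio, which may hold many utterances
    org_transcription_path: Path  # source transcription file (assumed name)
    prefix: str                   # identifier used to name output files (assumed name)
    start_time: int               # offset in milliseconds (assumed name)
    end_time: int                 # endpoint in milliseconds (assumed name)
    text: str                     # transcription in its current state
    speaker: str                  # speaker of the utterance (assumed name)


def lowercase_text(utter: SketchUtterance) -> SketchUtterance:
    """Toy preprocessing step: returns a new record and leaves the input untouched."""
    return utter._replace(text=utter.text.lower())


raw = SketchUtterance(
    org_media_path=Path("recordings/session1.wav"),
    org_transcription_path=Path("recordings/session1.eaf"),
    prefix="session1.0",
    start_time=1200,
    end_time=3450,
    text="Hello World",
    speaker="Mark",
)
clean = lowercase_text(raw)
```

Because the record is a tuple, assigning to a field raises AttributeError, so each preprocessing stage has to produce a fresh instance.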

class persephone.corpus.Corpus(feat_type, label_type, tgt_dir, labels=None, max_samples=1000, speakers=None)

Represents a preprocessed corpus that is ready to be used in model training.

Constructing a Corpus instance involves preprocessing the data if it has not already been preprocessed. The extent of the preprocessing depends on which constructor is used. If the default constructor, __init__(), is used, transcriptions are assumed to be preprocessed already and only speech feature extraction from WAV files is performed. Other constructors, such as from_elan(), also preprocess the transcriptions. See the documentation of the relevant constructors for more information.

Once a Corpus object is created it should be considered immutable. At this point feature extraction from WAVs will have been performed, with feature files in tgt_dir/feat/. Transcriptions will have been segmented into appropriate tokens (labels) and will be stored in tgt_dir/label/.

__init__(feat_type, label_type, tgt_dir, labels=None, max_samples=1000, speakers=None)

Construct a Corpus instance from preprocessed data.

Assumes that the corpus data has been preprocessed and is structured as follows: (1) WAVs for each utterance are found in <tgt_dir>/wav/ with the filename <prefix>.wav, where prefix is some string uniquely identifying the utterance; (2) For each WAV file, there is a corresponding transcription found in <tgt_dir>/label/ with the filename <prefix>.<label_type>, where label_type is some string describing the type of label used (for example, “phonemes” or “tones”).
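
The layout described above can be checked before constructing a Corpus. The following is a minimal sketch using only the standard library; the helper name prefixes_missing_labels and the sample file names are hypothetical and are not part of Persephone.

```python
import tempfile
from pathlib import Path
from typing import List


def prefixes_missing_labels(tgt_dir: Path, label_type: str) -> List[str]:
    """Return prefixes that have a WAV in <tgt_dir>/wav/ but no
    <prefix>.<label_type> file in <tgt_dir>/label/."""
    wav_dir = tgt_dir / "wav"
    label_dir = tgt_dir / "label"
    missing = []
    for wav_path in sorted(wav_dir.glob("*.wav")):
        prefix = wav_path.stem  # filename without the ".wav" suffix
        if not (label_dir / f"{prefix}.{label_type}").is_file():
            missing.append(prefix)
    return missing


# Demonstrate on a throwaway directory with one labelled and one unlabelled WAV.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "wav").mkdir()
    (root / "label").mkdir()
    (root / "wav" / "utt1.wav").touch()
    (root / "wav" / "utt2.wav").touch()
    (root / "label" / "utt1.phonemes").touch()
    missing = prefixes_missing_labels(root, "phonemes")
```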

If the data is found in this format, WAV normalization and speech feature extraction will be performed during Corpus construction, and the utterances will be randomly divided into training, validation and test sets. If you would like to define these datasets yourself, include files named train_prefixes.txt, valid_prefixes.txt and test_prefixes.txt in <tgt_dir>. Each file should be a list of prefixes (utterance IDs), one per line. If these files are found during Corpus construction, those sets will be used instead.
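
One way to produce the three prefix files is sketched below. The function write_prefix_files, its parameters, and the split sizes are assumptions for illustration; only the file names and the one-prefix-per-line format come from the description above.

```python
import random
import tempfile
from pathlib import Path
from typing import Dict, List


def write_prefix_files(tgt_dir: Path, prefixes: List[str],
                       num_valid: int, num_test: int,
                       seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle prefixes reproducibly and write train/valid/test prefix files."""
    if num_valid + num_test >= len(prefixes):
        raise ValueError("No utterances would be left for training.")
    shuffled = sorted(prefixes)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    splits = {
        "valid": shuffled[:num_valid],
        "test": shuffled[num_valid:num_valid + num_test],
        "train": shuffled[num_valid + num_test:],
    }
    for name, members in splits.items():
        # One prefix (utterance ID) per line.
        (tgt_dir / f"{name}_prefixes.txt").write_text("\n".join(members) + "\n")
    return splits


with tempfile.TemporaryDirectory() as tmp:
    tgt_dir = Path(tmp)
    all_prefixes = [f"utt{i}" for i in range(10)]
    splits = write_prefix_files(tgt_dir, all_prefixes, num_valid=2, num_test=2)
    train_lines = (tgt_dir / "train_prefixes.txt").read_text().split()
```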

  • feat_type (str) – A string describing the input speech features. For example, “fbank” for log Mel filterbank features.
  • label_type (str) – A string describing the transcription labels. For example, “phonemes” or “tones”.
  • labels (Optional[Set[str]]) – A set of strings representing labels (tokens) used in transcription. For example: {“a”, “o”, “th”, …}. If this parameter is not provided the experiment directory is scanned for labels present in the transcription files.
  • max_samples (int) – The maximum number of samples an utterance in the corpus may have. If an utterance is longer than this, it is not included in the corpus.
Return type:


classmethod from_elan(org_dir, tgt_dir, feat_type='fbank', label_type='phonemes', utterance_filter=None, label_segmenter=None, speakers=None, lazy=True, tier_prefixes=('xv', 'rf'))

Construct a Corpus from ELAN files.

  • org_dir (Path) – A path to the directory containing the unpreprocessed data.
  • tgt_dir (Path) – A path to the directory where the preprocessed data will be stored.
  • feat_type (str) – A string describing the input speech features. For example, “fbank” for log Mel filterbank features.
  • label_type (str) – A string describing the transcription labels. For example, “phonemes” or “tones”.
  • utterance_filter (Optional[Callable[[Utterance], bool]]) – A function that returns False if an utterance should not be included in the corpus and True otherwise. This can be used to remove undesirable utterances for training, such as codeswitched utterances.
  • label_segmenter (Optional[LabelSegmenter]) – An object that has an attribute segment_labels, which creates new Utterance instances from old ones by segmenting the tokens in their text attribute. Note: LabelSegmenter might be better as a function; the only issue is that it needs to carry a list of labels with it. This could potentially be a function attribute.
  • speakers (Optional[List[str]]) – A list of speakers to filter for. If None, utterances from all speakers are included.
  • tier_prefixes (Tuple[str, …]) – A collection of strings that prefix ELAN tiers to filter for. For example, if this is (“xv”, “rf”), then tiers named “xv”, “xv@Mark”, “rf@Rose” would be extracted if they existed.
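
As a rough illustration of the utterance_filter contract (a callable that takes an utterance and returns True to keep it), the sketch below rejects utterances whose text contains a marker string. The "<cs>" codeswitching marker, the SketchUtterance stand-in, and the make_marker_filter helper are all hypothetical; a real corpus would need its own criterion.

```python
from typing import Callable, NamedTuple


class SketchUtterance(NamedTuple):
    """Minimal stand-in with only the fields this example needs."""
    prefix: str
    text: str


def make_marker_filter(marker: str) -> Callable[[SketchUtterance], bool]:
    """Build a filter that rejects utterances whose text contains `marker`."""
    def keep(utter: SketchUtterance) -> bool:
        # True means "include in the corpus", False means "drop".
        return marker not in utter.text
    return keep


utterances = [
    SketchUtterance("utt1", "ta ki mo"),
    SketchUtterance("utt2", "ta <cs> hello"),  # flagged as codeswitched
]
not_codeswitched = make_marker_filter("<cs>")
kept = [u for u in utterances if not_codeswitched(u)]
```

A function of this shape is what the utterance_filter parameter expects, assuming the real Utterance exposes its transcription through a text attribute as described earlier.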
Return type:


class persephone.corpus_reader.CorpusReader(corpus, num_train=None, batch_size=None, max_samples=None, rand_seed=0)

Interfaces to the preprocessed corpora to read in train, valid, and test set features and transcriptions. This interface is common to all corpora. It is the responsibility of <corpora-name>.py to preprocess the data into a valid structure of <corpus-name>/[mam-train|mam-valid<seed>|mam-test].

__init__(corpus, num_train=None, batch_size=None, max_samples=None, rand_seed=0)

Construct a new CorpusReader instance.

  • corpus – The Corpus object that interfaces with a given corpus.
  • num_train – The number of training instances from the corpus used.
  • batch_size – The size of the batches to yield. If None, then it is num_train / 32.0.
  • max_samples – The maximum length of utterances measured in samples. Longer utterances are filtered out.
  • rand_seed – The seed for the random number generator. If None, then no randomization is used.
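
The batching behaviour described by these parameters might look roughly like the sketch below. It is not CorpusReader's implementation: the function names are hypothetical, and truncating num_train / 32.0 to an integer with a floor of 1 is an assumption of this example.

```python
import random
from typing import Iterator, List, Optional


def default_batch_size(num_train: int) -> int:
    # The documentation gives num_train / 32.0; truncating to an int and
    # enforcing a minimum of 1 are assumptions of this sketch.
    return max(1, int(num_train / 32.0))


def iter_batches(prefixes: List[str], batch_size: Optional[int] = None,
                 rand_seed: Optional[int] = 0) -> Iterator[List[str]]:
    """Yield lists of prefixes containing at most batch_size items each."""
    if batch_size is None:
        batch_size = default_batch_size(len(prefixes))
    order = list(prefixes)
    if rand_seed is not None:  # None means "no randomization"
        random.Random(rand_seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


train_prefixes = [f"utt{i}" for i in range(128)]
batches = list(iter_batches(train_prefixes))
```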
class persephone.model.Model(exp_dir, corpus_reader)

Generic model for our ASR tasks.


Path where the experiment directory is located.


CorpusReader object that provides access to the corpus this model is being trained on.


log softmax function


A batch of input features. (“x” is the typical notation in ML papers on this topic denoting model input)


The length of each utterance. This is used by TensorFlow to know how much to pad utterances that are shorter than this length.


Reference labels for a batch (“y” is the typical notation in ML papers on this topic denoting training labels)


The gradient descent method being used. (Typically we use Adam because it has provided good results but any stochastic gradient descent method could be substituted here)


Label error rate.


Dense representation of the model transcription output.


Dense representation of the reference transcription.


Path to where the TensorFlow model is being saved on disk.

__init__(exp_dir, corpus_reader)

Initialize self. See help(type(self)) for accurate signature.

Return type: None
train(early_stopping_steps=10, min_epochs=30, max_valid_ler=1.0, max_train_ler=0.3, max_epochs=100, restore_model_path=None, epoch_callback=None)

Train the model.

  • early_stopping_steps – Stop training after this number of steps if no LER improvement has been made.
  • min_epochs – Minimum number of epochs to run training for.
  • max_valid_ler – Maximum LER for the validation set. Training will continue until this is met or another stopping condition occurs.
  • max_train_ler – Maximum LER for the training set. Training will continue until this is met or another stopping condition occurs.
  • max_epochs – Maximum number of epochs to run training for.
  • restore_model_path – The path to restore a model from.
  • epoch_callback – A callback that is called at the end of each training epoch. The parameters passed to the callable will be the epoch number, the current training LER and the current validation LER. This can be useful for progress reporting.
Return type: None
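
A callable suitable for epoch_callback could be sketched as follows. The LerHistory class is hypothetical, and the loop at the end only simulates the three values that a training loop would pass in (epoch number, training LER, validation LER); no model is trained here.

```python
from typing import List, Tuple


class LerHistory:
    """Callable that records (epoch, train LER, valid LER) after each epoch."""

    def __init__(self) -> None:
        self.records: List[Tuple[int, float, float]] = []

    def __call__(self, epoch: int, train_ler: float, valid_ler: float) -> None:
        self.records.append((epoch, train_ler, valid_ler))
        print(f"epoch {epoch}: train LER {train_ler:.3f}, valid LER {valid_ler:.3f}")

    def best_valid(self) -> Tuple[int, float]:
        """Return (epoch, valid LER) for the lowest validation LER seen so far."""
        epoch, _, valid_ler = min(self.records, key=lambda rec: rec[2])
        return epoch, valid_ler


history = LerHistory()
# Simulated values standing in for what a training loop would report.
simulated = [(0.9, 0.95), (0.5, 0.7), (0.3, 0.75)]
for epoch, (train_ler, valid_ler) in enumerate(simulated, start=1):
    history(epoch, train_ler, valid_ler)
```

An instance like this would presumably be passed as epoch_callback=history when calling train(), after which history.records holds the per-epoch error rates.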

Transcribes an untranscribed dataset. Similar to eval(), except that no reference transcription is assumed and therefore no LER is calculated.

Return type:None


persephone.preprocess.elan.utterances_from_dir(eaf_dir, tier_prefixes)

Returns the utterances found in ELAN files in a directory.

Recursively explores the directory, gathering ELAN files and extracting utterances from them for tiers that start with the specified prefixes.

  • eaf_dir (Path) – A path to the directory to be searched
  • tier_prefixes (Tuple[str, …]) – Strings matching the start of ELAN tier names that are to be extracted. For example, if you want to extract from tiers “xv-Jane” and “xv-Mark”, then tier_prefixes = [“xv”] would do the job.
Return type:



A list of Utterance objects.
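
The prefix matching on tier names can be illustrated with plain string operations. This sketch does not read ELAN files; the select_tiers helper and the sample tier names are made up for the example.

```python
from typing import Iterable, List, Tuple


def select_tiers(tier_names: Iterable[str], tier_prefixes: Tuple[str, ...]) -> List[str]:
    """Keep the tier names that start with any of the given prefixes."""
    # str.startswith accepts a tuple and returns True if any element matches.
    return [name for name in tier_names if name.startswith(tuple(tier_prefixes))]


tiers = ["xv", "xv@Mark", "rf@Rose", "notes", "axv"]
selected = select_tiers(tiers, ("xv", "rf"))
```

Note that matching is anchored at the start of the name, so a tier such as "axv" is not selected even though it contains "xv".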

class persephone.preprocess.labels.LabelSegmenter

An immutable object that segments the phonemes of an utterance. This could probably actually have a __call__ implementation. That won’t work because namedtuples can’t have special methods. Perhaps it could instead just be a function which we give a labels attribute. Perhaps that obfuscates things a bit, but it could be okay.


A function that takes an Utterance and returns another Utterance in which the text field has been phonemically segmented, using spaces as delimiters. E.g. “this is” -> “th i s i s”.


A set of labels (e.g. phonemes or tones) relevant for segmenting.
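
A segmenter of this shape can be sketched with a greedy longest-match strategy. The strategy is an assumption chosen for illustration and may differ from what Persephone does; SketchUtterance, SketchLabelSegmenter, and make_segmenter are hypothetical names.

```python
from typing import Callable, NamedTuple, Set


class SketchUtterance(NamedTuple):
    """Minimal stand-in with only the fields this example needs."""
    prefix: str
    text: str


def make_segmenter(labels: Set[str]) -> Callable[[SketchUtterance], SketchUtterance]:
    """Build a function that splits text into known labels, longest match first."""
    # Ignore empty strings so that every match consumes at least one character.
    by_length = sorted((lab for lab in labels if lab), key=len, reverse=True)

    def segment_labels(utter: SketchUtterance) -> SketchUtterance:
        text = utter.text
        tokens = []
        i = 0
        while i < len(text):
            if text[i].isspace():
                i += 1  # whitespace in the raw text carries no label
                continue
            for label in by_length:
                if text.startswith(label, i):
                    tokens.append(label)
                    i += len(label)
                    break
            else:
                raise ValueError(f"No label matches {text[i:]!r} in {utter.prefix}")
        return utter._replace(text=" ".join(tokens))

    return segment_labels


class SketchLabelSegmenter(NamedTuple):
    """Pairs the segmenting function with the label set it was built from."""
    segment_labels: Callable[[SketchUtterance], SketchUtterance]
    labels: Set[str]


labels = {"th", "t", "h", "i", "s"}
segmenter = SketchLabelSegmenter(make_segmenter(labels), labels)
segmented = segmenter.segment_labels(SketchUtterance("utt1", "this is"))
```

Trying longer labels first is what makes "th" win over "t" followed by "h"; a text containing a character outside the label set raises ValueError rather than being silently dropped.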

persephone.preprocess.wav.extract_wavs(utterances, tgt_dir, lazy)

Extracts WAVs from the media files associated with a list of Utterance objects and stores them in a target directory.

  • utterances (List[Utterance]) – A list of Utterance objects, which include information about the source media file and the offset of the utterance in that media file.
  • tgt_dir (Path) – The directory in which to write the output WAVs.
  • lazy (bool) – If True, then existing WAVs will not be overwritten if they have the same name.
Return type:



class persephone.rnn_ctc.Model(exp_dir, corpus_reader, num_layers=3, hidden_size=250, beam_width=100, decoding_merge_repeated=True)

An acoustic model with an LSTM/CTC architecture.


Writes a description of the model to the exp_dir.

Return type:None

Distance measurements

persephone.distance.min_edit_distance(source, target, ins_cost=<function <lambda>>, del_cost=<function <lambda>>, sub_cost=<function <lambda>>)

Calculates the minimum edit distance between two sequences.

Uses the Levenshtein weighting as a default, but offers keyword arguments to supply functions to measure the costs for editing with different elements.

  • ins_cost (Callable[…, int]) – A function describing the cost of inserting a given char
  • del_cost (Callable[…, int]) – A function describing the cost of deleting a given char
  • sub_cost (Callable[…, int]) – A function describing the cost of substituting one char for another
Return type:



The edit distance between the two input sequences.
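
For readers unfamiliar with the computation, the following is a minimal dynamic-programming sketch of an edit distance with pluggable cost functions. It is not the library's code: the name sketch_min_edit_distance is hypothetical, and the default costs (1 per insertion or deletion, 0 or 1 per substitution) are the standard Levenshtein weighting, assumed here to match the defaults mentioned above.

```python
from typing import Callable, Sequence


def sketch_min_edit_distance(
    source: Sequence,
    target: Sequence,
    ins_cost: Callable[..., int] = lambda _x: 1,
    del_cost: Callable[..., int] = lambda _x: 1,
    sub_cost: Callable[..., int] = lambda x, y: 0 if x == y else 1,
) -> int:
    """Dynamic-programming edit distance with pluggable cost functions."""
    n, m = len(source), len(target)
    # dist[i][j] is the cheapest way to turn source[:i] into target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost(source[i - 1])
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost(target[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + del_cost(source[i - 1]),
                dist[i][j - 1] + ins_cost(target[j - 1]),
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[n][m]


unit_cost = sketch_min_edit_distance("kitten", "sitting")
# Making substitutions cost 2 means a substitution is never cheaper than
# a deletion followed by an insertion.
costly_sub = sketch_min_edit_distance(
    "intention", "execution", sub_cost=lambda x, y: 0 if x == y else 2
)
```

The function works on any sequences with comparable elements, so lists of phoneme labels can be compared in the same way as strings.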

persephone.distance.min_edit_distance_align(source, target, ins_cost=<function <lambda>>, del_cost=<function <lambda>>, sub_cost=<function <lambda>>)

Finds a minimum cost alignment between two strings.

Uses the Levenshtein weighting as a default, but offers keyword arguments to supply functions to measure the costs for editing with different characters. Note that the alignment may not be unique.

  • ins_cost – A function describing the cost of inserting a given char
  • del_cost – A function describing the cost of deleting a given char
  • sub_cost – A function describing the cost of substituting one char for another

A sequence of tuples representing character level alignments between the source and target strings.
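
An alignment can be recovered by filling the same cost table as for the edit distance and then walking back from the final cell. The sketch below is an illustration under assumptions, not the library's implementation: sketch_align is a hypothetical name, and representing a gap with an empty string is a convention chosen for this example.

```python
from typing import Callable, List, Sequence, Tuple


def sketch_align(
    source: Sequence[str],
    target: Sequence[str],
    ins_cost: Callable[..., int] = lambda _x: 1,
    del_cost: Callable[..., int] = lambda _x: 1,
    sub_cost: Callable[..., int] = lambda x, y: 0 if x == y else 1,
) -> List[Tuple[str, str]]:
    """Return one minimum-cost alignment as (source item, target item) pairs."""
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost(source[i - 1])
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost(target[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + del_cost(source[i - 1]),
                dist[i][j - 1] + ins_cost(target[j - 1]),
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    # Walk back from the bottom-right cell, taking any move that accounts
    # for the cell's cost. Ties are broken in favour of substitution, which
    # is why the result is only one of possibly several optimal alignments.
    alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1])):
            alignment.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + del_cost(source[i - 1]):
            alignment.append((source[i - 1], ""))  # deletion: gap on the target side
            i -= 1
        else:
            alignment.append(("", target[j - 1]))  # insertion: gap on the source side
            j -= 1
    alignment.reverse()
    return alignment


pairs = sketch_align("kitten", "sitting")
```

Joining the first elements of the pairs reproduces the source and joining the second elements reproduces the target, which is a convenient sanity check on any alignment.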

persephone.distance.word_error_rate(ref, hyp)

Calculate the word error rate of a sequence against a reference.

  • ref (Sequence[~T]) – The gold-standard reference sequence
  • hyp (Sequence[~T]) – The hypothesis to be evaluated against the reference.
Return type:



The word error rate of the supplied hypothesis with respect to the reference string.


Raises: persephone.exceptions.EmptyReferenceException – If the length of the reference sequence is 0.
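
The conventional definition of word error rate is the edit distance between hypothesis and reference divided by the reference length, which is why an empty reference has to be rejected. The sketch below follows that definition and returns a fraction; it is an assumption that the library computes the same quantity, and it may scale the value differently (for example, to a percentage). SketchEmptyReferenceError and sketch_word_error_rate are hypothetical names.

```python
from typing import Sequence


class SketchEmptyReferenceError(Exception):
    """Stand-in for an empty-reference exception; not the library's class."""


def _edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference side
                curr[j - 1] + 1,         # insert h from the hypothesis side
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def sketch_word_error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance between hyp and ref, normalised by the reference length."""
    if len(ref) == 0:
        raise SketchEmptyReferenceError("The reference must contain at least one token.")
    return _edit_distance(ref, hyp) / len(ref)


ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
wer = sketch_word_error_rate(ref, hyp)  # one substitution and one deletion over six words
```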


exception persephone.exceptions.PersephoneException

Base class for all exceptions raised by the Persephone library.

exception persephone.exceptions.NoPrefixFileException

Thrown if files such as train_prefixes.txt or test_prefixes.txt can’t be found.

exception persephone.exceptions.DirtyRepoException

An exception that is raised if the current working directory is in a dirty state according to Git.

exception persephone.exceptions.EmptyReferenceException

When calculating word error rates, the reference string must be of length >= 1. Otherwise, this exception will be thrown.