Writing my PhD using groff

PhD researcher, University of Glasgow.

‘A Modern Study of Thirteenth–Century Organa and Motet: The Clausula as
Fundamental Unit’


Thu Jul 21 2022 16:50:00 GMT+0100 (British Summer Time)

In this blog post: thoughts on the tools available to a PhD researcher,
benchmarking Markdown to PDF using Pandoc, and thoughts on using groff for
academic work.

I’ll begin with the obvious: a PhD is a difficult document to write. It is
similar in size and scope to a book, with the added constraint of being a
monograph: for the most part, you are effectively writing on your own. Academic
publishing in the twenty–first century, however, is becoming more collaborative.
Perhaps it is the kind of books that I have been reading recently but I find
that an academic book published within the last decade is more likely to have an
editor, or set of contributing editors, than a single author. It is likely in
part also due to the fact that, in the postmodern, poststructuralist humanities,
readers are affronted by any book that communicates only one point of view.
Indeed, most disciplines have their examples of this.

As an example, a student of musicology is generally asked at some early point in
their academic career to write an essay comparing the single–author Oxford
History of Western Music by the late Richard Taruskin (which characterises
Western music as a single narrative from the dark ages to the present) with the
multi-editor, multi-author Cambridge History of Western Music series, which has
a different editor for each volume, a different author for each chapter, and
only the broadest sense of narrative possible. I have seen almost this exact
question on first–year undergraduate curricula at three different institutions.

It is a good question: the student is pointed towards the resulting discourse in
the academic literature (beginning with Taruskin’s review of the Cambridge
twentieth century history volume, and Cambridge editor Nicholas Cook’s
response, Harry White’s review of the Oxford history, and Taruskin’s
eventual defence of his position as music historian). If I may be allowed to
condense a far–reaching discussion into one sentence, the most common takeaway a
reader gets from all this is that Taruskin’s endeavour is laudable and
incredibly useful, especially for pedagogical purposes, but his history is
necessarily narrative and as a result biased to his personal views; the
differing and at times contradictory chapters in a Cambridge volume may offer
fresh insights and the idea that history does not have to be a story, or even
complete.

For better or worse, a PhD is more like the former than the latter (and there
are of course arguments that can be made for how a PhD’s focus on monograph
excludes skill–based and collaborative research), yet it is generally required
that a successful PhD candidate will write a book–length study on the research
that they have completed over the last few years. The writing of that study — or
at the very least the conversion of draft material into final prose — usually
takes place in the final year of study, and this is where I am about to find
myself.

It would be easy for an outside observer to conclude that “writing up” usually
takes place mostly in the final year simply due to time constraints with
research and disorganisation on the part of the PhD researcher (and there are
without doubt many examples of rushed write–ups!) but what is more common is
that writing made in the first year of a PhD becomes essentially useless for the
final product. Very rarely does a PhD begin and end with the same goals in mind,
and in this way a PhD is a journey: you learn things and find things out during
research that alter your conception of a topic and therefore its research
consequences.

These changes may not be large on their own, but the accumulation of new
research questions, methodologies, and data can make your initial writing seem
wrongheaded. This is not to say that a PhD researcher should save all their
writing to the very end, rather the opposite: that they should do lots of
writing in order to flesh out and structure their thoughts, but not expect to
then be able to file that writing away and use it again. It is in fact the act
of writing and putting your thoughts into precise words itself that makes
you question your own preconceptions: ‘How should this be named/defined?’ ‘Why
do I believe this?’ ‘Does this argument stand up to scrutiny?’ With that said,
this blog post is not about those issues of monograph directly, but rather the
technical burdens that writing this way creates.

Feeding into the time pressure of the final–year write–up is that a supervisory
team are not editors, expert typesetters, figure makers, or publishers: nor
should they be expected to be. Granted, it is expected that a supervisor will
read what you write and give feedback on things, but it is ultimately up to the
PhD candidate to create the finished thesis. A supervisor gives advice but
should not hold your hand through the process. In fact, a supervisor that
insists on weekly meetings and continuous updates can be overbearing (in my
experience) and ruin your ability to think creatively. (These kinds of
supervisors are often to be found in situations where the success of the PhD
researcher is a tacit prerequisite to securing further funding for that
topic/supervisor/lab).

With respect to the thesis, historically the final product was made by passing
your manuscript to a typist who would create typewritten copy. However, since
the advent of computer typesetting, PhD candidates have taken this final burden
upon themselves. This has numerous advantages, not least the ability to pass
around copies more easily as digital files, receive comments directly on those
files, and go through multiple cycles of drafts without having to rewrite your
material by hand, as well as a stricter control of output. As a pretty
acceptable touch typist, I am at a personal advantage here. My advice, however,
is still to write notes by hand, as the pen–and–paper process better cements your
learning than skimming your fingers over a keyboard. I can better recall
something I have copied with my own hand than copy–pasted or typed.

Humanities

However, with this new–found control comes the caveat that (most) PhD candidates
are very similar to their supervisors: they are not editors, expert typesetters,
figure makers, or publishers. As a result, most researchers in the humanities
instinctively reach for Microsoft Word or equivalent and begin by creating
“thesis.docx”. It is surprising to me how little thought often goes into the
organisation of a document’s internals with regard to styles and markup, and how
quickly draft prose starts appearing in a blank document. Perhaps it is due to a
lack of suitable computer literacy but, despite the tools being there, I see
only a minority of humanities researchers using basic style controls and
reference managers; the rest must therefore laboriously go one–by–one through
their references, updating their footnotes, fonts and parenthetical citations
by hand. I do not mean that you should waste hours getting your fonts and
margins just right, but that many people seem to be unaware of the possibility
of using styles and reference managers that will in fact save them time in a
large and regularly–changing document.

This is perhaps exacerbated by the fact that most PhD researchers are broke,
using the same battered equipment they started their undergraduate with, and it
is not uncommon to see a 200–page, multi-image, multi-figure PhD thesis take
five minutes to open in Word on a five–year–old laptop, and then take twenty
seconds to scroll. Sometimes this effect can be ameliorated by starting each
chapter in a separate file (this is exactly how a previous supervisor sent me
their PhD thesis when I asked to read it), then merging the chapters together as
PDF. However, this makes altering each chapter unwieldy, especially if you need
to have multiple chapters open at once (for example, to move content from one
chapter to another), and multiple instances of Word running simultaneously can
be far more taxing on the old laptop’s processor than one larger document would
ever be.

Sciences

In the sciences, I hear that things are usually a little more organised. It is
at least common practice there to use a reference manager (unsurprisingly, and
as a result, reference manager systems are typically geared towards the needs of
science rather than humanities use). The more daring researcher may use LaTeX to
typeset their thesis. This may be their own install, but it is becoming more
common for institutions to provide LaTeX templates on an online platform such as
Overleaf. This is commonly thought of as TheRightWay™ to go about things, and
any LaTeX user will surely love to tell you about its superior typesetting
(kerning, ligatures etc.) and how its text–based format even allows them to
integrate their document as a git repository.

I know this because up until a few months ago I was that person. Who knows how,
but I came to love LaTeX as a teenager, and carried my beautifully–typeset
essays all through my music undergraduate (although their beauty stops at their
aesthetic). I wrote my Master’s thesis in LaTeX and marvelled at how Word would
not have been able to cope with the hundreds of pages of appendices I required.
It looked great and I felt great. That is, until I came across groff. Now I’m
trying to pare my documents down to the minimum requirements possible in order
to give myself greater room to write.

Perhaps surprisingly, my first true clash with groff came through Pandoc. As a
long–time Linux user, I had always heard the name of groff and its *roff
cousins being used in the context of manpages (terminal–based documentation
pages), and in the world of LaTeX many tools work both with groff and TeX. Groff
is part of an ancient hereditary line of typesetting systems stretching back
nearly 60 years. Groff (first released 1990) is the GPL–licensed version of
troff (mid 1970s), which was a typesetter–friendly version of roff (1969),
itself a portable version of RUNOFF (1964) for the IBM 7094. My initial
impression, then, of groff was of a dinosaur, better–suited to terminal and
ASCII output than creating modern PDFs. I dismissed groff out of hand as a
rudimentary and more obtuse predecessor to LaTeX.

It was during the writing of one of my earlier blog posts (in Markdown) that I
decided that I needed a nicely–typeset PDF copy to read in full before editing
and posting. I do not find it ideal to read blog posts directly in Vim, or even
copying the code into Word. For me, I can read better (and be less distracted)
when something is written as if it were on A4 paper or a physical book. I
resorted, then, to Pandoc to convert my Markdown code into a nice, readable PDF.
This is as simple as invoking pandoc -s -o post.pdf post.md and by default it
uses pdflatex as the backend to create the resulting PDF file. However, as I was
searching for margin options in the documentation, I came across the command
line option --pdf-engine=PROGRAM which allows Pandoc to switch to another
means of generating PDF. There are eleven options:

  • pdflatex
  • lualatex
  • xelatex
  • latexmk
  • tectonic
  • context
  • wkhtmltopdf
  • weasyprint
  • pagedjs-cli
  • prince
  • pdfroff

These can be split into three broad methods:

  1. Six TeX–based methods using four engines (pdflatex, lualatex, xelatex, context),
    latexmk being a frontend to pdflatex and tectonic a frontend to xelatex
  2. Four HTML to PDF converters requiring a two–step process from Markdown to
    HTML then to PDF
  3. pdfroff: on my system this is groff with the ms macro set

Let’s look at their outputs (p.2):

Figures: page 2 of the PDF produced by each engine, in order: pdflatex,
lualatex, xelatex, latexmk, tectonic, context, wkhtmltopdf, weasyprint,
pagedjs-cli, prince, pdfroff.

These outputs all look fine for most uses. Obviously, the LaTeX engines look
very good and the HTML engines are of varying quality (wkhtmltopdf is the best
of these in my opinion, the freemium prince is high quality but leaves a
watermark on the first page, weasyprint is okay, and pagedjs-cli was not very
acceptable given that it added extra blank pages and spacing as well as ignoring
my margin parameters). However, what surprised me was the output of groff which,
for something whose output I had only ever seen in manpages, produced quite a
professional–looking PDF.

Quality was not the only way in which groff surprised me: it has serious speed
too, especially in comparison to LaTeX. Let’s benchmark these PDF engines on my
most recent blog post (100 runs, images removed):

Figure: graph of runtimes of PDF engines. Benchmarked pandoc PDF engines (lower
is better).
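
A benchmark of this kind can be sketched in a few lines of Python. This is a minimal sketch rather than the script behind the graph: the function names are hypothetical, the filenames post.md and post.pdf are placeholders, and it assumes that pandoc and each engine are installed and on the PATH.

```python
import statistics
import subprocess
import time

# The eleven values accepted by --pdf-engine, as listed earlier in the post.
ENGINES = [
    "pdflatex", "lualatex", "xelatex", "latexmk", "tectonic", "context",
    "wkhtmltopdf", "weasyprint", "pagedjs-cli", "prince", "pdfroff",
]


def pandoc_command(engine, source="post.md", output="post.pdf"):
    """Build the pandoc invocation for one PDF engine."""
    return ["pandoc", "-s", f"--pdf-engine={engine}", "-o", output, source]


def time_command(command, runs=100):
    """Run a command `runs` times and return the mean wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, capture_output=True)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def benchmark(engines=ENGINES, runs=100):
    """Map each engine name to its mean runtime over `runs` conversions."""
    return {e: time_command(pandoc_command(e), runs) for e in engines}
```

Calling benchmark() in the directory containing post.md would return a dictionary of mean runtimes, which could then be plotted with any tool.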

The LaTeX engines are extremely slow, and this came as no surprise to me. At the
end of writing my Master’s thesis, I was often waiting over a minute for XeLaTeX
to compile my document with its multiple figures and musical examples. It may
look extremely beautiful, but it takes a long time to iterate the feedback (or
“edit–compile”) loop. This is the same with programming: if your compiler is
slow then it usually does not matter how optimised its output is, the slow
compiler is wasting valuable programmer cycles rather than cheap CPU cycles. I
believe also that a slow feedback loop hinders flow state, as every time you’re
waiting for the program/document to compile, you lose a little bit of
concentration. It is for this same reason that I cannot stand doing web design
using a framework where I cannot see the results of my changes almost instantly.
If I have to wait more than five seconds to see the results of my slight change
in padding, then I become frustrated with the process.

The HTML engines are, once again, of varying speed. Prince is by far the
fastest, but groff comes in a close second, with a much better quality. Groff is
notable, too, for its output size:

Figure: graph of sizes of resulting PDF files. Pandoc PDF engine output sizes
(lower is better).

I believe this is largely due to the fact that groff in my case embedded only
four fonts in its output PDF, whereas the LaTeX engines embedded up to twelve. I
would think therefore that the size advantage for groff would diminish on
longer, larger files. For me, the Markdown engine of choice is pdfroff.

Using groff as my preferred backend for converting Markdown to PDF piqued my
interest in the system, and soon I found the macro set called “Mom” that applies
numerous quality–of–life changes and sensible defaults to groff. Mom has a good
set of documentation, and it didn’t take me long to begin writing documents
using the macro set. I had no idea that groff could accomplish
citations/referencing (using refer), figures (by including PDF images created
using ImageMagick), diagrams (using pic), tables (using tbl), graphs (using
grap; those benchmark graphs above were created using grap), tables of
contents, and cover pages! Not only is groff quick to compile, but also quick to
write. Most commands are a period and a few letters on a new line, rather than
the complicated brace expressions of LaTeX.

I recently updated my CV (in LaTeX) and the process underscored how much of my
typesetting was just “try something and see how it looks”. Not only did this
take a long time to refresh using LaTeX, but no matter how I indented the file,
I could not make the markup look clean. I realise now how much I am fighting
LaTeX whenever I stray from the happy path (that path being almost exclusively
text and equations).

In groff, the typically all–caps commands stick out to me and, just like LaTeX,
are generally semantic rather than aesthetic (“create a heading” rather than
“create large bold text”). Although I have never used WordPerfect, I am reminded
of how people speak of its “Reveal codes” mode in hushed tones. This quick
writing works well with my preferred workflow of vim + terminal. I write a
Makefile with a “watch” rule that calls Mom’s frontend pdfmom in a loop with
inotifywait. As soon as I save a file in that directory, the PDF is compiled.
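
The same save–then–compile loop can be sketched in Python. Note that this is not the Makefile rule described above: it swaps inotifywait for simple polling of modification times, and the filenames thesis.mom and thesis.pdf and all function names are assumptions for illustration. It relies only on pdfmom writing the PDF to standard output.

```python
import os
import subprocess
import time


def snapshot(directory="."):
    """Map each regular file in the directory to its last-modified time."""
    return {
        entry.path: entry.stat().st_mtime
        for entry in os.scandir(directory)
        if entry.is_file()
    }


def has_changed(before, after, ignore=("thesis.pdf",)):
    """True if any non-ignored file was added, removed or modified."""
    def relevant(snap):
        return {
            path: mtime
            for path, mtime in snap.items()
            if os.path.basename(path) not in ignore
        }
    return relevant(before) != relevant(after)


def watch(source="thesis.mom", output="thesis.pdf", interval=0.5):
    """Recompile with pdfmom whenever a file in this directory changes."""
    before = snapshot()
    while True:
        time.sleep(interval)
        after = snapshot()
        if has_changed(before, after, ignore=(output,)):
            # pdfmom writes the PDF to stdout, so redirect it into the file.
            with open(output, "wb") as pdf:
                subprocess.run(["pdfmom", source], stdout=pdf, check=False)
            after = snapshot()
        before = after
```

The output file is ignored when comparing snapshots so that writing the PDF does not itself trigger another compile.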

In fact, many of the advantages of using groff are the same as LaTeX, but with
the added advantage of speed and simplicity. Just like with LaTeX, you can dive
into the code to change a macro but, unlike LaTeX, I have found Mom’s macros
fairly understandable and not nested ten layers deep in some ancient package.

I contain my files within a git repository and make use of all the advantages
contained therein, particularly stashing changes for later and working on new
material in a separate branch.

I use Mom’s .COLLATE macro to keep chapters in separate files, and maintain a
“thesis.order” file that describes which files are to be concatenated in what
order. The Makefile reads this file and simply cat’s the files together,
piping into pdfmom which handles the preprocessing with refer, tbl, grap and
pic.
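
As a rough Python equivalent of that Makefile step, the sketch below reads an order file, concatenates the chapters and pipes the result into pdfmom. It assumes that thesis.order lists one filename per line, and that pdfmom passes groff's standard preprocessor switches (-R, -t, -G and -p for refer, tbl, grap and pic) through unchanged; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def read_order(order_file="thesis.order"):
    """Return the filenames listed one per line, skipping blank lines."""
    lines = Path(order_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def concatenate(filenames, base_dir="."):
    """Join the files in the given order, as `cat` would."""
    return b"".join((Path(base_dir) / name).read_bytes() for name in filenames)


def build(order_file="thesis.order", output="thesis.pdf"):
    """Pipe the concatenated source into pdfmom and save the resulting PDF."""
    order_path = Path(order_file)
    source = concatenate(read_order(order_path), order_path.parent)
    result = subprocess.run(
        # Assumed flags: groff's switches for refer, tbl, grap and pic.
        ["pdfmom", "-R", "-t", "-G", "-p"],
        input=source,
        capture_output=True,
        check=True,
    )
    Path(output).write_bytes(result.stdout)
```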

The manpages for the preprocessors are eminently readable, with a strong focus
on tutorial rather than simply documentation of “what you should already know”,
but I would also recommend:

  • Mom’s own documentation page on refer.
  • James and Schaffter’s PDF document, especially for creating internal links.
  • Bentley and Kernighan’s paper on grap (warning: PostScript).
  • Raymond — Making pictures with GNU pic.

Below is a sample of the in–progress PhD. I don’t want laggy Word and inconsistent
styling issues. I don’t want KOMA-Script woes and minutes–long compile times. I
want something simple that works: groff!

Figure: Sample page from groff–powered thesis

Graph–based Analysis

Fri Mar 25 2022 15:01:15 GMT+0000 (Greenwich Mean Time)

The long slog of inputting both W1 and W2 into CANDR is complete. You can view
the result here. F, although its facsimiles have been uploaded to the site, is
for now lying with only its system and stave extents partially defined, and work
on the tracing–transcription of F has not properly begun. This is simply due to
time constraints. Perhaps F, too, will be encoded in the near future. What we do
have, however, is a database
containing the positional data of the musical notation of both W1 and W2. This
alone is something that I am proud of (the database file currently weighing in
at 231MB), but the most important issue is how to analyse this data in order
to extract features that will help me to answer my research questions.

One distinction that I find particularly useful from MEI is that of musical
domain. MEI correctly realises that there is no “one encoding to rule them
all” and provides tools and schemata for encodings that consider differences of
opinion, inconsistencies, and elements of performance practice. MEI accomplishes
this through the concept of simultaneous musical domains. MEI acknowledge SMDL
(the Standard Music Description Language) as the genesis of this idea,
formalising the concept of musical domain from Milton Babbitt. SMDL divides
the area of musical domain into four separate categories (definitions mine):

  • Logical domain: the “composer’s intentions”
  • Gestural domain: aspects of performance
  • Visual domain: how notation looks on the page
  • Analytical domain: post hoc analyses of music

MEI share the same terminology of domain with SMDL, but freely admit that they
prioritise ‘the visual domain over the gestural domain by (partly) conflating
the logical and the visual domains’. There is a certain inconsistency in the
formulation of the logical domain carried over from Babbitt’s 1965-era
musicological thought (although I do applaud him for opening his writing by
admitting ‘I am not a computer expert, and I am not a musicologist’). SMDL
defines the logical domain as ‘the basic musical content’ without first defining
what falls exactly under “basic”:

The logical domain is describable as “the composer’s intentions with respect
to pitches, rhythms, harmonies, dynamics, tempi, articulations, accents, etc”
… It can also be described as “the abstract information common to both the
gestural and visual domains.”

What unknown source the SMDL specification is quoting exactly notwithstanding,
SMDL leans heavily into a problematic concept of a commonly–understood authorial
intent between what is written on the page and what is performed. Rather it is
precisely those ‘pitches, rhythms, harmonies, dynamics, tempi, articulations,
accents, etc.’ that are the subject of discussion and should, in my view, not
form part of a “logical” domain of infallible truth, the implicit understanding
being that anything falling outside the “logical” domain must be “illogical”.

MEI are somewhat correct therefore to ‘conflate’ — although “conflate” seems a
little unfair to a decision that clearly understands the issue above — the
logical and visual domains, but in my view they uphold the same problematic
basis. MEI clearly prioritises visual aspects of notation above all else.
Indeed, much of MEI’s apparata are geared towards encoding Western notation:
pitches are referred to by their position in the twelve–note octave rather than,
for example, by their absolute frequency, there are multiple attributes for
measuring distances and lengths between objects on a page, definitions of music
begin primarily with how they are divided into systems and staves on a page. It
is difficult to escape the confines of common Western music notation when
encoding in MEI. Indeed, the gestural and analytical often appear as
afterthoughts to a primarily visual–logical encoding in MEI.

Using this same terminology of domain, CANDR has so far been encoding solely in
the visual domain. This is not to be confused with MEI’s conflated
visual–logical domain, where the position of an item is attached to its semantic
interpretation in a musical syntax; it is instead a purely visual encoding where
I have encoded simply what an element is and where it is. Each element on a
stave could be simplified down to a tag plus a set of Cartesian coordinates
describing the element’s absolute position or even its extent. For example, a
note and a clef could be encoded in a cod-XML as:
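
A minimal sketch of such a tag–plus–coordinates encoding, built here with Python's standard XML library. The element names, attribute names and coordinate values are all hypothetical and are not CANDR's actual schema.

```python
import xml.etree.ElementTree as ET


def encode_item(tag, x, y, width=None, height=None):
    """Encode one stave item as a tag plus its position (and optional extent)."""
    attributes = {"x": str(x), "y": str(y)}
    if width is not None and height is not None:
        attributes["width"] = str(width)
        attributes["height"] = str(height)
    return ET.Element(tag, attributes)


# A stave holding one clef (with an extent) and one note (position only).
stave = ET.Element("stave")
stave.append(encode_item("clef", 12, 40, width=8, height=20))
stave.append(encode_item("note", 57, 46))

print(ET.tostring(stave, encoding="unicode"))
```

Nothing in this encoding says which line the clef sits on or what pitch the note has: it records only what each item is and where it is.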

Such a domain has its uses: using this data I could very easily compare the gaps
between elements on a stave, or the size of those elements and answer questions
regarding the scribal identities and practices present in the MSS. Although
interesting questions, this would bring me no closer to finding out the
constitution and construction of clausulae, which are fundamentally a musical
construct. The data in its raw form says nothing as to what pitch the note is or
what stave line the clef sits on. This data must therefore be transferred into a
different domain.

Rather than the term “logical domain”, which for me carries too many formalist,
positivist connotations (a domain that, to be useful in an analysis, supposes to
include all possible valid interpretations), I prefer to use the term “semantic”.
A semantic domain — the indefinite article implying the existence of other
semantic domains — is an interpretation of meaning (one of perhaps many) from
the visual domain. Rather than assuming a one–to–one mapping of the visual onto
its meaning, I envisage a number of possible readings based upon just as many
tunable parameters: who (or what) is doing the interpretation, how (by what
process) and why (for what purpose)?

To convert from this pure visual domain into the semantic (musical) domain, I
developed two modules on the CANDR site: a transcribing module that takes a
stave or system and infers from its placement of stave items their semantic
attributes (for example linking a note to a stave line and a clef to infer its
pitch), and a walking module that moves over the first–level schema of the
sources, folios and facsimiles to feed the transcribing module staves in order.
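
To make the idea of inferring pitch from placement concrete, here is a minimal Python sketch, not CANDR's implementation. It assumes that each stave line has a known y coordinate, that a clef sits on a line and names the pitch of that line, and that each line or space above or below it is one diatonic step away. The function names and the default of a C clef marking C4 are assumptions, and notes beyond the outer stave lines are not handled.

```python
STEPS = "CDEFGAB"


def stave_position(y, line_ys):
    """Snap a y coordinate to the nearest line or space of a stave.

    line_ys lists the stave lines from bottom to top. Position 0 is the
    bottom line, 1 the space above it, 2 the next line, and so on.
    """
    candidates = []
    for i, line_y in enumerate(line_ys):
        candidates.append((2 * i, line_y))
        if i + 1 < len(line_ys):
            # The space between this line and the next one up.
            candidates.append((2 * i + 1, (line_y + line_ys[i + 1]) / 2))
    position, _ = min(candidates, key=lambda c: abs(c[1] - y))
    return position


def infer_pitch(note_y, clef_y, line_ys, clef_letter="C", clef_octave=4):
    """Name a note by counting diatonic steps from the clef's line."""
    note_pos = stave_position(note_y, line_ys)
    clef_pos = stave_position(clef_y, line_ys)
    absolute = clef_octave * 7 + STEPS.index(clef_letter) + (note_pos - clef_pos)
    octave, step = divmod(absolute, 7)
    return f"{STEPS[step]}{octave}"
```

Because the clef letter and octave are parameters, the same visual data can yield different readings, which is the point of treating this as one semantic interpretation among several.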

Transcribing Module

All transcribing modules inherit from an AbstractTranscriber class, which
defines the basic methods for transcribing a database item, such as fetching the
items from the database and support for beginning a transcription partway
through a stave. It is up to inheriting classes to implement exactly how the
transcription is carried out and what items are fetched from the database. For
example, TextTranscriber (for transcribing the words of a stave) implements
the function for retrieving items from the database thus (altered and commented
for clarity):

protected function get_items($record) {
	// "record" is stave, so a system is its "up"
	$system=$record->up();

	// get all links in this system
	$links=$system->links()->has('setting')->get();

	// get all the items associated with each link
	$link_items=$links->flatMap(function($a) {
		return $a->items;
	})->filter(function($a) use (&$record) {
		return $a->up->is($record);
	});

	// get all the items on this stave only
	// filter them to only be syllables
	$items=$record->items()
		->with('type')
		->whereIn(
			'itemtype_type',
			[
				'App\Models\Syllable',
				'App\Models\EditorialSyllable'
			]
		)->get();

	// create a new store of notes
	$note_store=new MEINoteStore();

	// get all notes on this stave
	// add them to the note store
	$record->items()
		->with('type')
		->where('itemtype_type', 'App\Models\Note')
		->get()->map(function($a) use (&$note_store) {
			$type=$a->type;
			$note_store->add_note(
				$type->comp_x,
				$type->comp_y,
				$type
			);
		});

	// create a unique set of all items on the stave,
	// including link items
	$syllables=[];
	$seen=new \Ds\Set();
	$all_items=$link_items
		->merge($items)
		->reject(function ($a) use (&$seen) {
			// reject items already seen; remember new ones
			$reject=$seen->contains($a->id);
			if(!$reject) {
				$seen->add($a->id);
			}
			return $reject;
		});

	// iterate through all the items, filtering by syllable
	foreach($all_items as $item) {
		$syllable=$item->type;
		$item_class=get_class($syllable);
		if(
			$item_class==='App\Models\Syllable' ||
			$item_class==='App\Models\EditorialSyllable'
		) {
			// using the note store, find the nearest note
			// to this syllable

			$nearest_note=$note_store->nearest_note(
				$syllable->centrepoint_x,
				$syllable->centrepoint_y
			);
			if($nearest_note) {
				// +1 to shift it right ever so slightly
				$syllable->centrepoint_x=
					$nearest_note->comp_x + 1;
				$syllable->centrepoint_y=
					$nearest_note->comp_y;
			}
		}
		
		// add to the list of syllables
		$syllables[]=$syllable;
	}
	
	// sort the items using a custom function, left to right
	usort($syllables, function($a, $b) {
		$c=$a->comp_x <=> $b->comp_x;
		if($c) {
			return $c;
		}
		return $b->comp_y <=> $a->comp_y;
	});
	return $syllables;
}

In essence, this function not only retrieves the stave’s syllables from the
database, but also ensures that they occur at the right place to be included in
a synchronisation link by moving them just right of the nearest note, and sorts
them left to right to be transcribed. TextTranscriber is a small
proof–of–concept for the transcription element, but CANDR also includes an
MEITranscriber implementation following the same principles. To transcribe
music rather than syllables is more complex, but MEITranscriber returns MEI
(in a CANDR–specific dialect) for a system of music (the limits of which will be
discussed later).

Walking Module

Similarly, all walking modules inherit from an AbstractWalker class,
whose main function walk() begins at any point in a source and walks forward,
calling the transcriber, until it reaches any one of many “stop” elements,
passed in as an array. For example, given a stave with two elements,
respectively listed as “start” and “stop”, the walker will call the transcriber
once on that stave with those parameters. However, if the “stop” elements are on
another stave, then the walker will call itself recursively on the element
above: walking on a stave without raising a “stop” signal will cause the walker
to then find the common system and transcribe the next stave in that system. If
the “stop” is not found there either, it will go up to the facsimile or even
folio level and keep transcribing until that “stop” element is found. The
TextWalker simply has to instantiate a TextTranscriber as its transcribing
class, and transcribe only the lowest stave (where the words are found) in a
system. Once again, there is an MEIWalker for creating MEI
documents.

How a walking module walks between first–level schema items
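
The climbing behaviour described above can be sketched in Python. This is a simplified illustration, not the AbstractWalker code: the Node class and function names are hypothetical, items are simply collected rather than passed to a transcriber, and the walk is assumed to stop just before the first “stop” element.

```python
class Node:
    """An item, stave, system, facsimile or folio in the source hierarchy."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def leaves(node):
    """Yield the lowest-level items under a node, in reading order."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def walk(start, stops):
    """Collect items from `start` until any element of `stops` is reached."""
    collected = []
    node, include_self = start, True
    while node is not None:
        siblings = node.parent.children if node.parent else [node]
        index = next(i for i, s in enumerate(siblings) if s is node)
        begin = index if include_self else index + 1
        for sibling in siblings[begin:]:
            for item in leaves(sibling):
                if any(item is stop for stop in stops):
                    return collected
                collected.append(item)
        # No stop found at this level: climb one level and carry on
        # with whatever follows the container that was just finished.
        node, include_self = node.parent, False
    return collected
```

If the “stop” is on the same stave, the loop never climbs. Otherwise each pass moves up one level (stave, then system, then facsimile) and continues with the next sibling at that level.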

Result

These two elements combine to create automatic transcriptions of the sources
into words and MEI. For example, navigating to
https://candr.tk/browse/facsimile/86/
will show you the facsimile of folio 43v in W2. If you alter that address to
https://candr.tk/browse/facsimile/86/text/,
then it will display the transcribed text for that facsimile: “regina tu laquei
tu doloris medicina fons olei uas honoris Tu uulneris medelam reperis egris
efficeris oleum unctionis Post ueteris quem”. Requesting
https://candr.tk/browse/facsimile/86/mei/
will yield an MEI document of the music on that page.

Behind the scenes, there is a fair amount of caching to speed up this process.
Every time a transcription is requested, CANDR first checks a cache record in
the database. If there is a cache hit then it simply returns that cache record.
This means that requesting the same transcription twice should not re-transcribe
the item, needlessly hitting the database, but returns a pre-computed record.
However, if a database element is altered after a transcription is cached then
the transcription may be invalid. CANDR notifies all the records that use that
element that their transcription may now be invalidated. For example, if a stave
has been altered then that change is propagated throughout the cache system:
caches for its stave, facsimile and folio are instantly deleted such that the
next time the facsimile transcription is retrieved, it must be re-transcribed
using the most up–to–date information. This way we can ensure that the caches
are never out of date.
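
The caching scheme can be illustrated with a small Python sketch. It is an in-memory stand-in under stated assumptions, not CANDR's database-backed cache: the Record and TranscriptionCache classes are hypothetical, and each record is assumed to know only its parent.

```python
class Record:
    """A stave, facsimile or folio, linked to its containing record."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


class TranscriptionCache:
    def __init__(self, transcribe):
        self._transcribe = transcribe  # the expensive transcription function
        self._cache = {}

    def get(self, record):
        """Return the cached transcription, computing it only on a miss."""
        if record.name not in self._cache:
            self._cache[record.name] = self._transcribe(record)
        return self._cache[record.name]

    def invalidate(self, record):
        """Delete the cache for a record and for everything containing it."""
        while record is not None:
            self._cache.pop(record.name, None)
            record = record.parent
```

Altering a stave would call invalidate on that stave, which clears the stave, facsimile and folio entries in one pass, so the next request for any of them is transcribed afresh.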

Considering once again the musical domains of the visual vs the semantic, the
transcriptions created so far have themselves considered only the visual aspects
of the source. We can transcribe staves, systems, facsimiles and folios but, more
often than not, music flows from one system to the next and from one folio to
another. We should conceive of a source not as a disparate set of
codicological items, but as a book intended to be read from one opening to the
next. The concept of the walker allows for this movement between codicological
items as previously described, but it needs to know where to start and stop. How
do we find elements to begin and end our transcription? I would like to
introduce another concept into this ontology: that of the musical “setting”.

A setting is a semantic rather than visual construct, but relies on visual cues
supplied by an editor (completed during the data input stage). I have previously
mentioned my concept of the synchronisation link which indicates that two
elements on adjacent staves occur simultaneously with one another. The beginning
of a piece of music links items from all staves together, and this special kind
of link is tagged as being a “setting”. Using these tags, a walker can walk from
one tag to the next, returning the music of a setting.
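
As a simplified sketch of walking from one tag to the next, the Python below splits an ordered run of items into settings wherever a setting tag occurs. The data shape (pairs of item name and optional setting title) and the titles used in the test are hypothetical, and the real walker operates on database records rather than a flat list.

```python
def split_into_settings(items):
    """Group an ordered run of items by the setting tag that precedes them.

    `items` is a list of (item_name, setting_title) pairs, where the title
    is None unless the item carries a "setting" tag. Items that appear
    before the first tag belong to no setting and are skipped.
    """
    settings = {}
    current = None
    for name, title in items:
        if title is not None:
            # A new setting begins at this tagged item.
            current = title
            settings[current] = []
        if current is not None:
            settings[current].append(name)
    return settings
```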

However, these are also tagged and titled by hand, being an editorial creation.
It is here that CANDR shades into replication of some of the functionality of
DIAMM. DIAMM lists the contents of W1 as a hand–curated inventory (human
mistakes included!), listing “compositions” with their titles, “composers”
(nearly all of whom are anonymous) and start/stop folios. However, DIAMM
considers pieces such as Alleluya. Nativitas gloriose as single items, whereas
I treat them as separate alleluia and verse settings. Regardless, these settings
can be categorised and listed in CANDR at
https://www.candr.tk/browse/setting/. The
advantage that CANDR brings is the possibility of then browsing the facsimiles
for that setting, and viewing a transcription of the words and music, all the
way to the granularity of each note.

This is the effective completion of the online database stage of CANDR. The next
analytical stages are completed offline by scraping data from the site.

MEI is a hierarchical encoding, by virtue of its tree–based XML format. An MEI
document (after header and boilerplate) begins with a base music element, such
as and then moves into that music’s components in a divide–and–conquer
strategy until it reaches a level where atomic elements can be encoded directly
without any children. Looking at MEI’s example encoding of Aguado’s Walzer
from the MEI sample encodings repository, lines 227–254 begin the encoding
proper by limiting the initial scope until we reach the first moment of the
first stave, a quaver rest. Greatly simplifying those lines to just contain
the initial rest, we get a tree structure:

This XML structure can also be visualised as a series of concentric boxes,
similar to the CSS box model:

This blog post is full of pretty pictures

Each element can only be linked to other elements in three, mutually exclusive
ways:

  1. Parent: the containing box of an element
  2. Children: the boxes contained within this element
  3. Siblings: other boxes that share the same parent element

The

in this example has one parent , one sibling
and five children: as well as others. These limits are
often powerful, as they enforce a strict design philosophy on a document such as
MEI. Elements are contained within other elements such that an element can only
have one parent. For example, a can only occur in one

at
a time. This makes parsing and analysing this data structure (a tree) simple as
it is guaranteed to be acyclic (i.e. when descending the structure it is
impossible to visit an element twice).
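The three relationships can be explored with Python's standard xml.etree.ElementTree. What follows is a minimal sketch on a toy, MEI-like fragment: the element names follow the measure, staff, layer and rest nesting that MEI uses, but the fragment itself is an assumption made for illustration and is neither a valid MEI document nor the sample encoding discussed above.

```python
import xml.etree.ElementTree as ET

# Toy fragment (an assumption for illustration, not valid MEI).
FRAGMENT = """
<measure>
  <staff n="1">
    <layer n="1">
      <rest dur="8"/>
      <note pname="d" oct="5" dur="8"/>
    </layer>
  </staff>
  <staff n="2">
    <layer n="1">
      <rest dur="8"/>
    </layer>
  </staff>
</measure>
"""

root = ET.fromstring(FRAGMENT)

# ElementTree stores only child links, so build a child -> parent lookup.
parent_of = {child: parent for parent in root.iter() for child in parent}


def children(element):
    """The boxes contained directly within this element."""
    return list(element)


def siblings(element):
    """Other boxes that share this element's parent."""
    parent = parent_of.get(element)
    if parent is None:
        return []
    return [other for other in parent if other is not element]


def walk(element, seen=None):
    """Depth-first descent; in a tree no element is ever visited twice."""
    seen = [] if seen is None else seen
    assert all(element is not s for s in seen)
    seen.append(element)
    for child in element:
        walk(child, seen)
    return seen


first_layer = root.find("./staff/layer")
print(parent_of[first_layer].tag)                          # staff
print([c.tag for c in children(first_layer)])              # ['rest', 'note']
print([s.tag for s in siblings(parent_of[first_layer])])   # ['staff']
print(len(walk(root)))                                     # 8
```

Because every element has exactly one entry in parent_of, the descent in walk() never needs to remember where it has been in order to terminate.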

Generally this works well for MEI as it is easy to conceive of musical structure
as such: rests in layers in staves in measures in sections in scores in
divisions in a musical body, and this does work well for elements that
occupy a single space in time, such as simple notes. On the whole it echoes how
common practice Western music notation is the de facto standard for encoding and
how music must be fit into its confines: for example notes are contained within
measures and cannot simply overflow a bar.

How, then, might we encode Notre Dame notation using such a model? Before
thinking of MEI exactly, let us consider it simply as a series of boxes. It is
simple to draw boxes around a system and its staves and also around smaller
items such as notes and ligatures, but in the midground we come across a thorny
issue: do we first divide a system into its constituent staves or into a series
of virtual timeseries boxes (which we could errantly name “ordines”)? Each has
its advantages and disadvantages, and the issue is roughly analogous to
MusicXML’s two encodings: timewise and partwise, either encoding measures of
music at–once or encoding entire parts or voices at–once. MEI is always
equivalent to MusicXML’s partwise encoding.

Two attempts to draw boxes around elements: top is partwise (staves first) and bottom is timewise (“ordines” first)
The above conceptualised as concentric boxes

Partwise

  • Advantages:
    1. Each part element is a sibling of its successor and predecessor.
    2. Staves can be considered at–once.
    3. Mirrors the exact lack of verticality in the notation.
  • Disadvantages:
    1. The information of which items occur simultaneously is lost. In common MEI
      or MusicXML, this can be inferred by counting durations, but in Notre Dame
      notation, rhythm is subjective and up to interpretation. We cannot rely on
      duration counting.
    2. Virtual items, such as the idea of a common tactus between parts, or a
      common ordo length, are also lost.

Timewise

  • Advantages:
    1. Everything that occurs together is grouped together.
    2. Polyphony can be easily extracted.
  • Disadvantages:
    1. Often to reach the next note in a stave, we have to traverse up the
      structure to reach a common ancestor.
    2. Implies an editorial synchronisation between parts as a first–class element.
    3. Cannot infer verticality between a subset of parts.

MEI’s solution to this is to keep the music encoded partwise, but use a @synch
attribute to break the encapsulation and link between other items in the
document. I use this exact solution to encode Notre Dame music in MEI, but
the data model is still unsatisfactory. There is a false idea of musical
encapsulation with virtual objects (such as ordines) which is then arbitrarily
broken to make connections that are not formalised in the model. This also makes
parsing the data much more difficult: more than simply viewing a node and the
limited relationships between parents, siblings and children, a parser must now
look at the node’s attributes and other objects in the tree, potentially
creating cycles. In short, it seems like a bit of a hack. In terms of the
structure, what happens if an element has two parents? In which “box” should it
belong?

How I link Notre Dame notation together using the synch attribute in MEI

In fact, MEI provides a whole selection of attributes in the att.linking class
to indicate relationships between items this way, such as @next and @prev
for temporal relationships. However, these still must be structured in the
element. Rather than encapsulation, this music that falls outside
common Western music boxes would be better encoded with links between items
being first class, in other words encoded entirely as linked data as a directed
graph. These kinds of graphs are not of the bar, line, pie, scatter ilk but
rather those of graph theory, what we might commonly term networks. Instead of
items preceding each other being known by the order of siblings, and items in
ligature being known by being children of a “ligature” element, each element
would link from one to the next, and an element in a ligature would have a link
signifying membership of a ligature. Synchronisation links are therefore simply
another link of the same class as temporal relationships or ligature membership.

We could for instance model this as RDF data. Consider the same music as linked
data (in Turtle format, using example namespaces for simplicity):

@prefix rdf:  .
@prefix ex:  .
@prefix it:  .


	ex:hasPitch  ;
	ex:next  ;
	ex:synch  .

	ex:ligatureType  ;
	ex:next  ;
	ex:prev  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:next  ;
	ex:prev  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  .

	ex:prev  ;
	ex:next  ;
	ex:synch  .

	ex:hasPitch  ;
	ex:prev	 ;
	ex:next  .

	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:hasPlica  ;
	ex:prev  ;
	ex:next  .

	ex:prev  ;
	ex:next  ;
	ex:synch  .

	ex:ligatureType  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  .

	ex:ligatureType  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:prev  ;
	ex:synch  .

	ex:hasPitch  ;
	ex:next  ;
	ex:synch  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:prev  ;
	ex:next  ;
	ex:synch  .

	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:inLigature  ;
	ex:prev  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:prev  ;
	ex:next  ;
	ex:synch  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:hasPitch  ;
	ex:prev  ;
	ex:next  .

	ex:prev 	;
	ex:synch  .

This is a big leap, and understandably much more difficult to conceptualise than
the box model, so it can be visualised as an extraordinarily complex digraph
(using
https://www.ldf.fi/service/rdf-grapher
for visualisation):

Very large image, zoom in or open image in new tab for detail

To understand this better, let’s look at the node “note/1/” (it is on the
centre–left of the visualisation). It has three links going out and two coming
in.

The first, ex:hasPitch indicates that this note has a pitch of F. It then has
two links, one going out ex:next and one coming in ex:prev, indicating that
the note’s next item is “ligature/1/”. “ligature/1/” also has a link indicating
that its previous item is “note/1/”. Finally, “note/1/” is linked using
ex:synch with “note/21/” in both directions, indicating that these two items
occur simultaneously. This is true throughout the graph (although the
visualisation tool struggles with the number of links), so that we can model the
data consistently without needing to place the items in boxes (encapsulation).
Traversal of this data structure is much simpler: all that we need to do is
examine an item’s links to see how it relates to other items. For example, we
could find all the notes with pitch F by examining the “ex:pitches/F/” node and
following the links backward (backtracking).
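Following links backward can be sketched with nothing more than a list of triples. The example below is a small illustration in plain Python rather than a real RDF store. It encodes only the links around “note/1/” described above; the extra F-pitched node "note/x/" and the helper names are assumptions added for the example.

```python
from collections import defaultdict

# (subject, predicate, object) triples for the links around "note/1/".
TRIPLES = [
    ("note/1/", "ex:hasPitch", "ex:pitches/F/"),
    ("note/1/", "ex:next", "ligature/1/"),
    ("ligature/1/", "ex:prev", "note/1/"),
    ("note/1/", "ex:synch", "note/21/"),
    ("note/21/", "ex:synch", "note/1/"),
    # Hypothetical extra note, added only so the query returns two hits.
    ("note/x/", "ex:hasPitch", "ex:pitches/F/"),
]

outgoing = defaultdict(list)  # subject -> [(predicate, object)]
incoming = defaultdict(list)  # object  -> [(predicate, subject)]
for subject, predicate, obj in TRIPLES:
    outgoing[subject].append((predicate, obj))
    incoming[obj].append((predicate, subject))


def backtrack(node, predicate):
    """Follow links of one type backward from node to their subjects."""
    return [s for p, s in incoming[node] if p == predicate]


print(len(outgoing["note/1/"]), len(incoming["note/1/"]))  # 3 2
print(backtrack("ex:pitches/F/", "ex:hasPitch"))  # ['note/1/', 'note/x/']
```

Every relationship, whether pitch, sequence or synchronisation, is looked up the same way, which is the point of treating links as first class.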

We can use this graph structure (although not in RDF for reasons I will explain
later) to analyse the same music.

Candr-analysis is a Python package for analysing a Notre Dame notation corpus
(although it could be further generalised to more notational forms). It
consists of multiple modules that can be linked together to: fetch information
(such as from the CANDR site), parse notation, convert a tree–based
representation into a graph, transform that graph into subgraphs, link elements
in graphs together, and generate analyses based upon those graphs.

Pipeline

Candr-analysis is based upon a data pipeline architecture, where analysis
modules are linked together to asynchronously pass data from one stage to the
next. The act of creating an analysis is to link pipeline modules together so
that the output from one feeds the input of the next.

The root of this is the Pipeline class, which receives PipeObjects (or
rather objects that inherit from PipeObject) and links them together.
Once all the objects have been added to the pipeline, then the pipeline is
instructed to begin running and wait for the objects to run. Each object runs in
its own process, so a resource–intensive component such as graph creation does
not necessarily slow down another component such as parsing. This makes creating
new analyses simple. For example, an initial stage of text analysis is to fetch
all the text transcriptions from the CANDR site. This is achieved by simply
running this small script:

from candr_analysis.scrapers.candr.rdf.setting import Setting as Scraper
from candr_analysis.transcribers.candr.text import Setting as Transcriber
from candr_analysis.utilities.pipeline import Pipeline
from candr_analysis.utilities.progress import CLIProgress as Progress
from candr_analysis.transformers.pickle import PickleSaver as Saver

# Shared progress reporter for the command line
progress = Progress()

# Link the stages: scrape the setting list, fetch each text
# transcription, then save the results to disk
pipeline = Pipeline()
pipeline.add(Scraper(progress=progress))
pipeline.add(Transcriber(progress=progress))
pipeline.add(Saver('output.pkl', flush=True))

# Run the pipeline and wait for it to finish
pipeline.start()
pipeline.stop()

It is clear to see from this fully–functional program that it accomplishes three
tasks:

  1. Setting up a pipeline;
  2. Adding the components of the website scraper, transcriber and a component that
    saves the result to disk;
  3. Running the pipeline and waiting for it to finish.

Pipelines of any complexity can be set up this way which allows for a dynamic
iteration of analysis ideas, and the ability to save results to disk at any
stage allows for analyses to be created, saved, resumed, re-run, and
meta-analyses to be created.
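The linking idea can be illustrated without the package itself. The following is a minimal, single-process sketch and not the Candr-analysis implementation: the class names mirror Pipeline and PipeObject, but their bodies, the process() generator method, the run() method and the three toy stages are all assumptions made for the example. The real objects each run in their own process.

```python
class PipeObject:
    """Base class for a stage: process() yields zero or more outputs per input."""

    def process(self, input_data):
        yield input_data


class Pipeline:
    """Chains stages so that the output of one feeds the input of the next."""

    def __init__(self):
        self._stages = []

    def add(self, stage):
        self._stages.append(stage)

    def run(self, seed=None):
        stream = [seed]
        for stage in self._stages:
            stream = self._feed(stage, stream)
        return list(stream)

    @staticmethod
    def _feed(stage, stream):
        # Lazily pass every item of the previous stage through this one.
        for item in stream:
            yield from stage.process(item)


class ToyScraper(PipeObject):
    def process(self, input_data):
        # Stand-in for listing the settings on a site.
        yield from ["setting/1/", "setting/2/"]


class ToyTranscriber(PipeObject):
    def process(self, input_data):
        # Stand-in for fetching a text transcription.
        yield (input_data, f"text of {input_data}")


class Collector(PipeObject):
    def __init__(self):
        self.saved = []

    def process(self, input_data):
        # Stand-in for saving to disk: keep a copy and pass the item on.
        self.saved.append(input_data)
        yield input_data


pipeline = Pipeline()
collector = Collector()
for stage in (ToyScraper(), ToyTranscriber(), collector):
    pipeline.add(stage)
pipeline.run()
print(collector.saved)
```

Because each stage is a generator, one input can fan out into many outputs (as the toy scraper does) or be dropped entirely, without the pipeline needing to know.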

Adding new functionality to Candr-analysis is as simple as inheriting from the
base class of PipeObject and defining a generator function process() that takes
in an input and yields sets of output to be passed to the next object in the
pipeline. Below are some objects that I have already created for my uses:

Scraping

A “scraping” PipeObject is one that views a data source, such as the CANDR
site, and generates metadata based upon that source. For example, the
rdf.setting scraper views the setting list page and uses its RDF markup to
create a rich list of settings in the sources.

Transcribing

A “transcribing” PipeObject takes a single data endpoint and fetches an
individual transcription of that point. For example, the candr.text transcriber
receives a CANDR endpoint object and fetches its respective text transcription.
Similarly the candr.MEI transcriber fetches the MEI transcription.

Parsing

A “parsing” PipeObject takes an unparsed input, such as one from the CANDR
site, and parses it into something more useful for further analysis. There can
be more than one parser for any input, for example a parser may only be
interested in certain subsets of the data. Text generally needs no parser, but
the MEI parser validates and transforms an MEI file into a graph–based
representation as outlined above. The MEI parser defines multiple relationships
(edge types in a multi–digraph, roughly analogous to the data model of RDF)
between items. These have been generalised so that they could be used with any
repertory of music, not just Notre Dame notation, and the graph can support more
kinds of links:

  • Sequential: B follows A
  • Synchronous (Hardsynch): A and B occur at the same time
  • Softsynch: A and B are some distance away, but their distance can be computed,
    the weight of this edge being defined as the reciprocal of the distance between
    the nodes.

Transforming

A “transforming” PipeObject performs some sort of transformation on parsed
data before it is analysed. For example, data may need to be weighted correctly
or converted from one data structure to another.

Analysing

An “analysing” PipeObject takes a corpus of data and performs some analysis on
it, generating some new artefact — typically a database — that can then be
interpreted.

Interpreting

An “interpreting” PipeObject extracts useful information from an analysis.
What that information is depends greatly on the type of analysis being done.

Utilities

A utility is some quality–of–life object that can help make the workflow
simpler. For example, I have found that a good workflow is to scrape and
transcribe in one step, taking all the necessary data from the network in one go
before moving onto following stages. The code above uses an object that dumps
its input directly to disk, and there is another object that can read data from
disk and send it into a new pipe. Examining these intermediary files is also
indispensable in debugging.

What remains, then, is the analysis of these graphs. Although the narrative
inroad to Notre Dame notation as a graph was through RDF, Candr-analysis
internally uses the more general–purpose library NetworkX to generate its
graphs during parsing, and SQLAlchemy for on–disk graphs. The reason
for this is that although RDF’s model fits my data model and is simple to
comprehend, it is designed for linking humanistic data using strings. There is
therefore no RDF implementation efficient enough for this use case. For example,
I could use SPARQL to query an RDF store for this dataset, but the number of
nodes that will be created and the density of connections between the nodes
would likely be far too much for a sensible RDF implementation to handle.

N-grams

A common methodology in computational linguistics is that of analysing a text’s
n-grams: splitting texts into tokens, then combining n sequential tokens.
For example, creating 2-grams (bigrams) of “the cat sat on the mat” would
yield the following set: “the cat”, “cat sat”, “sat on”, “on the”, “the mat”.
3-grams (trigrams) of the same text would be “the cat sat”, “cat sat on”, “sat
on the”, “on the mat”. These can be used to detect similarities and syntax
between texts other than simply counting word occurrences (which would be
1-grams or unigrams). N-grams can also detect word sequences and context.
Generating n-grams for CANDR text transcriptions is as simple as defining a
PipeObject that returns n-grams for input strings:

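import re

# PipeObject and NgramsTranscriptionRecord come from Candr-analysis;
# self._n and self._split_regex are assumed to be set elsewhere in the class.
class TextNGrams(PipeObject):
	def _split(self, raw):
		# Tokenise the raw transcription
		return re.split(self._split_regex, raw)
	def process(self, input_data):
		# Drop the empty tokens that re.split can produce
		split=list(filter(
			(lambda a: a !=''),
			self._split(input_data.transcription)
		))
		# Zip n staggered copies of the token list to form the n-grams
		ret=list(zip(*[split[i:] for i in range(self._n)]))
		if ret:
			yield NgramsTranscriptionRecord(
				input_data.subject,
				input_data.type,
				ret
			)
		return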
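The same zip-based windowing can be tried outside the pipeline. This standalone sketch reuses the technique from the class above on the “cat sat on the mat” example; the function name and the whitespace split pattern are assumptions for illustration.

```python
import re


def text_ngrams(text, n, split_regex=r"\s+"):
    """Return the n-grams of text as a list of token tuples."""
    tokens = [token for token in re.split(split_regex, text) if token != ""]
    # zip stops at the shortest slice, giving len(tokens) - n + 1 windows.
    return list(zip(*[tokens[i:] for i in range(n)]))


bigrams = text_ngrams("the cat sat on the mat", 2)
trigrams = text_ngrams("the cat sat on the mat", 3)
print([" ".join(gram) for gram in bigrams])
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print([" ".join(gram) for gram in trigrams])
# ['the cat sat', 'cat sat on', 'sat on the', 'on the mat']
```

A text with fewer than n tokens simply yields an empty list, which is why the class above only emits a record when the result is non-empty.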

However, the more difficult task is that of generating n-grams of music, as this
is a problem that has not yet been solved: n-grams generally cannot be
generated for more than a single stream of data. Monophonic music (such as chant)
could simply be tokenised into its constituent pitches, but polyphonic music
(such as the Notre Dame repertory) depends not simply on what pitches occur in
what order, but also what is happening in other voices at the same time. I
propose a method of extracting workable n-grams that are not represented by
tokens but by graphs, and maintain their synchronisation contexts to then be
analysed as a whole corpus graph.

Consider the same passage of music we looked at previously. We could more simply
represent it as a graph like so:

Simplified graph of music

In the upper voice, there are four notes in the first ordo: F, F, E, D. In the
lower voice, there are only two: C, D. Not everything is synched together — only
the initial F to the initial C — so we don’t know exactly where the D comes in
the lower voice. We must therefore make some assumptions. Since there are half
as many notes, we can assume that in this voice the notes move at roughly half
the speed of the upper voice. The C in the lower voice must change to a D
somewhere around the E in the upper voice. This is likely incorrect, but over
the course of millions of grams, these assumptions should average out.

If we wanted to create unigrams of the first ordo, we could create:

F→C
F→C
E→D
D→D

The next ordo (with only one note in the lower voice) would all be paired with
F:

|→|
F→F
G→F
F→F
G→F
A→F
A^→F

And so on. However, if we wanted to create bigrams, we could move with the same
ratio, but using pairs of notes. In the first ordo top voice, there are three
bigrams: FF, FE, ED. The lower voice only has one bigram: CD. Our graph bigrams
are therefore:

FF→CD
FE→CD
ED→CD

The lower voice has no trigram within the first ordo, so we must overflow into
the next ordo to create one:

FFE→CDF
FED→CDF
EDF→CDF
DFG→CDF
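One way to reproduce the pairings listed above is to give each upper-voice gram the lower-voice gram at the same proportional position. This is a sketch under that assumption, inferred from the lists themselves rather than taken from Candr-analysis; the function names are hypothetical, and the note lists are typed in by hand instead of being walked from a graph.

```python
def ngrams(notes, n):
    """Sliding windows of length n over one voice."""
    return [tuple(notes[i:i + n]) for i in range(len(notes) - n + 1)]


def pair_grams(upper, lower, n):
    """Pair each upper-voice gram with the lower-voice gram at the same
    proportional position (an assumed rule, inferred from the lists above)."""
    upper_grams, lower_grams = ngrams(upper, n), ngrams(lower, n)
    if not upper_grams or not lower_grams:
        return []
    return [
        (gram, lower_grams[i * len(lower_grams) // len(upper_grams)])
        for i, gram in enumerate(upper_grams)
    ]


def show(pairs):
    """Format pairs in the arrow notation used in this post."""
    return ["".join(u) + "→" + "".join(l) for u, l in pairs]


first_ordo_upper, first_ordo_lower = list("FFED"), list("CD")
print(show(pair_grams(first_ordo_upper, first_ordo_lower, 1)))
# ['F→C', 'F→C', 'E→D', 'D→D']
print(show(pair_grams(first_ordo_upper, first_ordo_lower, 2)))
# ['FF→CD', 'FE→CD', 'ED→CD']

# Trigrams: overflow into the next ordo by two notes (upper) and one (lower).
print(show(pair_grams(list("FFEDFG"), list("CDF"), 3)))
# ['FFE→CDF', 'FED→CDF', 'EDF→CDF', 'DFG→CDF']
```

When the lower voice is too short to form a gram of the requested size, the function returns nothing, which corresponds to the case where the ordo must be overflowed.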

By using the graph representation of the music, these n-grams cannot be confused
with other n-grams with other synchronisation patterns. This is because there
are also soft synchronisation (softsynch) links computed between elements of the
graph. Consider this polyphonic music:

A simple contrived example

This yields four bigrams: FF→C, FE→C, EF→C, FE→C. There is a duplicate bigram,
FE→C and this might incorrectly indicate that they are identical. The first
bigram is closer to its synchronisation C than the second and this should be
reflected in its n-gram. To solve this, the softsynchs are computed such that
every note is linked to every other by the reciprocal of its distance (here
shown in dashed lines). The first FE→C bigram has softsynchs of .5 and .33
respectively whereas the second has .25 and .2: the synchronisation link between
the voices in the second is weaker although the pitch content of the bigram is
identical.

N-grams in a corpus graph

An n-gram PipeObject can output subgraphs of a graph of a passage of music as a
series of n-grams to then be processed further. In my analysis, the n-grams are
further split up into their constituent voices, taking the mean of their
synchronisation weights, but split into hard vs soft synchs. For example, the
first bigram in the above example, FF→C would yield two records: FF and C. These
two records are linked by a hard synch of 1 and a soft synch of 0.5. The next
bigram, FE→C links FE and C with a hard synch of 0 and a soft synch of 0.42 (the
average of 0.5 and 0.33). These links are also linked back to the “subject” of
the link, for example the setting that the n-grams were generated from. This
corpus graph is then stored on disk as an SQLite database with millions of
connections made between n-grams.
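The hard and soft synch figures for the contrived example can be recomputed in a few lines. This is a sketch only: it assumes the hard-synched note sits at distance 1 and that each sequential step adds 1 (which matches the .5, .33, .25 and .2 quoted above), and it treats a gram's hard synch as 1 if any of its notes is hard-synched. Names are hypothetical, and the real records are stored in SQLite rather than as tuples.

```python
from statistics import mean

UPPER = list("FFEFE")   # upper voice of the contrived example
LOWER_NOTE = "C"        # hard-synched with the first upper note
HARD_INDEX = 0


def synch_weights(index):
    """Return (hard, soft) weights linking upper note `index` to the C.

    Assumption: the hard-synched note sits at distance 1 and each
    sequential step adds 1, so the soft weight is 1 / (steps + 1).
    """
    steps = abs(index - HARD_INDEX)
    if steps == 0:
        return 1.0, None
    return 0.0, 1.0 / (steps + 1)


def bigram_records(upper):
    """One (upper gram, lower gram, hard, mean soft) record per bigram."""
    records = []
    for i in range(len(upper) - 1):
        weights = [synch_weights(j) for j in (i, i + 1)]
        hard = max(h for h, _ in weights)
        soft = mean(s for _, s in weights if s is not None)
        records.append(("".join(upper[i:i + 2]), LOWER_NOTE, hard, soft))
    return records


for gram, lower, hard, soft in bigram_records(UPPER):
    print(f"{gram}→{lower}  hard={hard:.0f}  soft={soft:.3f}")
```

The two FE→C bigrams come out with different soft synch means, so they remain distinguishable even though their pitch content is identical.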

We can query this corpus database by taking another piece of music and
measuring how similar its grams are to edges in the graph. For example, another piece of
music may contain FE→A. This would match FE, but in our graph there is only
FE→C. A scoring function controls the weight of certain parameters in
calculating which grams more closely match the queried grams. We can tweak
parameters of the graph search to control for gram size, the strength of hard vs
soft synchs and “wrong synchs” such as the example just mentioned where there is
a synch from FE but to C rather than A.

By way of example, we could give hard synchs (which always have an edge weight
of 1) a scoring ratio of 10, and soft synchs a weight of 5. Wrong synchs are
weighted less, say 0.1. We calculate the difference between the bigrams as:

Where Hw is the hard synch weight, Hr is the hard synch ratio, Sw is the soft
synch weight and Sr is the soft synch ratio. Let us imagine that we query the
graph with FE→C with a hard synch of 1 and a soft synch of 0.5. This would
match the first bigram more than the second as the score for the first bigram
would be:

Whereas the second bigram’s score would be:

However, if we query FE→A with a soft synch mean of 0.64, then this would score
0 on hard and soft synch, so we calculate hard and soft synchs again with
the wrong synch and multiply by 0.1 to find the score:

We repeat this calculation for all the grams in the queried music and keep a
cumulative score of subjects (such as settings). Subjects that score small
differences over many grams may be inferred to be similar to the queried music.
The overall idea of this analysis is not to generate a single, binary truth “X
is the same music as Y”, but a view on a repertory (I can see this methodology
generalised to other repertories of music) where, given certain parameters and a
particular viewpoint, two pieces of music display a high degree of concordance
saying something more along the lines of “X, when split into bigrams and weighted
using these synch parameters, is most similar to Y but also scores highly on Z
etc.”


Back of the Envelope

Fri Feb 12 2021 13:05:00 GMT+0000 (Greenwich Mean Time)

To cut a long story short, I had vastly miscalculated the costs for using
the ✨ magic ✨ of Google, and they were going to send me a bill
for £200. Panic, delete account, please don’t charge me, I can’t afford that.
Well, that was a bit of a dead end, but I’m a lot wiser now (see last post). The
“RightWay” can be a productive way of working, if you have a budget and a whole
team of people to support. I have neither, so I returned to the world of the
pragmatic, and spent a few days migrating everything off Google and to Mythic
Beasts, an independent provider based in Cambridge with a fantastic name, which
I have used before. They charge me a few quid for the pleasure, which is a
small expense I am happy to deal with! Nothing is as magic or as new, my efforts
to automate everything have been practically for nought, and I will have to do
the ML myself on my own computer, but at least it works.

CANDR now has a temporary domain name: www.candr.tk, where I have begun to input
some of W2 to test the ML. More tweaking and cleaning up of the ML code has
improved the accuracy a surprising amount. The boring detail is to do with
weighting. Take, for example, the notes. I preprocess the data by (basically)
making two images. Image one (the input) is the original stave, and image two
(the output) is blank with the features I wish to detect marked up (it is a
little more complex than this, I’m not using pretty colours). Put simply, the
training phase of the ML takes both images and attempts to alter its neural
network to best recreate the output from the input.

The original stave against a pretty representation of the output

As you can see from the image, in this case there is many times more blank
“nothing” space than useful features, so the ML frequently got away with just
predicting nothing for every sample: the important notes and items were not
enough to get it off the ground, so to speak. To counteract this, during my
preprocessing, I now count the number of “nothings” and the number of features.
Say I have 100x more nothing than feature, I can then pass the inverse of that
into the training, such that if the ML predicts a note wrong, it is 100x more
likely to effect a change on the neural network than predicting a blank space
wrong.
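The weighting can be sketched independently of any ML framework. The snippet below is an illustration rather than the actual training code: the function names and the toy labels are assumptions, and a real training loop would pass weights like these to its loss function rather than computing an error by hand.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by how rare it is relative to the commonest class."""
    counts = Counter(labels)
    most_common = max(counts.values())
    return {label: most_common / count for label, count in counts.items()}


def weighted_error(predicted, actual, weights):
    """Mean per-sample error, where mistakes on rare classes cost more."""
    total = sum(weights[a] for p, a in zip(predicted, actual) if p != a)
    return total / len(actual)


# Toy stave: 100 blank samples for every note sample.
actual = ["nothing"] * 1000 + ["note"] * 10
weights = inverse_frequency_weights(actual)
print(weights)  # {'nothing': 1.0, 'note': 100.0}

# A lazy model that always predicts "nothing" is now heavily penalised.
always_blank = ["nothing"] * len(actual)
flat = {"nothing": 1.0, "note": 1.0}
print(weighted_error(always_blank, actual, flat))     # unweighted: about 0.01
print(weighted_error(always_blank, actual, weights))  # weighted: about 0.99
```

Without the weights the always-blank model looks about 99% correct; with them, its missed notes dominate the error.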

Yeah, there was a RAM emergency, the office had too much ... RAM?
I wish. Everyone remembers this episode of *The IT Crowd* for the “street countdown” storyline, but forget Jen’s exclusion from Douglas’ secret workout sessions. A reminder that the writer of *The IT Crowd*, *Father Ted* and *Black Books* is rightfully cancelled and permabanned from Twitter for being transphobic.

Another problem that I have run into is the ever–increasing size of my dataset.
My computer has a fair amount of RAM, and the dataset fits easily in that space
(currently sitting at around 2GB after preprocessing), however the RAM issue
lies in the implementation and what should hopefully be the last of the gory
technical detail. The language I am using, Python, is the most popular and
supported language for developing high–level ML in, as it is simple and fast to
develop in and easy to pick up with few gotchas. However, it gets itself in
trouble with extremely large datasets.

Python, like most modern programming languages, is garbage collected, in that
every time you make a new variable, it automatically allocates space for that
variable but crucially you do not have to tell it when you’ve finished with that
memory or manually delete everything as you go along. Python works out which
objects you’re no longer using (mostly by counting references, with a collector
that comes round every so often to sweep up reference cycles) and frees that
memory. AFAIK, it is quite lazy about the last step: anything that is still
referenced somewhere is kept, on the assumption that you’re intending to use it
later, and freed memory is not always handed back to the operating system.
Often this is correct, and for small programs it
doesn’t matter very much as all the memory is automatically freed when the
program finishes. However, for long–running programs (like this one!) we can
quickly run out of memory, compounded by another issue called memory
fragmentation which I won’t go into.

Under the hood, Python uses the allocator malloc to get memory from the
operating system. malloc takes a single argument: how much memory you would
like allocated. malloc does not warn the garbage collector that we’re running
out of memory, or force it to return memory to the operating system. malloc
often will keep allocating memory until something breaks. I believe this was
what was occurring with my program. Although the dataset was only 2GB and I have
many times that available on my computer, my program was passing through the
data multiple times, allocating different copies of the data as it went,
fragmenting and duplicating until it ran out of memory. I managed to mitigate
this somewhat by manually marking large variables as done with using Python’s
del, but often they still would not be garbage collected (del is a marker
for deletion, not a command to delete-right-this-very-second). This problem was
only going to get a thousand times worse when I increase the granularity of the
ML, and use more staves as training data. Even if I managed to completely quash
the memory fragmentation issue, I anticipate that my dataset will grow large
enough not to fit in RAM, even in the best of cases.
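What del actually does can be checked with a weak reference. The sketch below uses a hypothetical stand-in class and is not code from my program: it shows that del removes a name, and that the object itself is only released once nothing else refers to it. Releasing an object does not by itself guarantee that the process hands the memory back to the operating system.

```python
import gc
import weakref


class Batch:
    """Stand-in for a large chunk of training data."""


first_name = Batch()
second_name = first_name          # a second reference to the same object
probe = weakref.ref(first_name)   # lets us observe whether the object is alive

del first_name                    # removes one name, not the object
print(probe() is not None)        # True: second_name still refers to it

del second_name                   # last reference gone
gc.collect()                      # also sweep any reference cycles
print(probe() is None)            # True: the object has been released
```

A lingering copy of the data held by another variable, a list or a closure is therefore enough to keep it alive after del.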

I therefore must save my dataset to disk, and access it a bit at a time when
training. To do this, instead of saving all the data into one huge variable, I
developed a small DataFrame class (inspired by a popular data science library
base class of the same name) that wraps a SQLite database. Instead of keeping
the data in memory, before training it is packaged up into batches and saved
into the database file. This also has a side effect advantage of forcing the
memory to be reorganised contiguously. Each training batch fetches only the next
record from the database rather than the entire dataset, and so the size of the
dataset should now be limited only by the size of my hard disk. Disk is slow,
however: many times slower than RAM, and it takes upwards of half an hour of
crunching to package the data and save it to disk. I really don’t fancy fine
tuning it any further, so I now use that dataset as a cache that can be fed back
into the training rather than recalculated on each and every run.
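The idea of a disk-backed batch cache can be sketched with the standard library alone. This is not my DataFrame class: the class name, table layout and use of pickle are assumptions for illustration, and the example uses an in-memory database so that it leaves no file behind.

```python
import pickle
import sqlite3


class BatchStore:
    """Keeps pickled training batches in SQLite and yields them one at a time."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS batches "
            "(id INTEGER PRIMARY KEY, payload BLOB NOT NULL)"
        )

    def save(self, samples, batch_size):
        """Package samples into batches and write each as one row."""
        for start in range(0, len(samples), batch_size):
            payload = pickle.dumps(samples[start:start + batch_size])
            self._db.execute("INSERT INTO batches (payload) VALUES (?)", (payload,))
        self._db.commit()

    def __len__(self):
        return self._db.execute("SELECT COUNT(*) FROM batches").fetchone()[0]

    def __iter__(self):
        # Rows are read from the cursor as we go, so each batch is
        # unpickled only when the training loop asks for it.
        for (payload,) in self._db.execute("SELECT payload FROM batches ORDER BY id"):
            yield pickle.loads(payload)


store = BatchStore()                      # pass a filename to cache on disk
store.save(list(range(10)), batch_size=4)
print(len(store))                         # 3
print(list(store))                        # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Passing a filename instead of the default turns the store into a cache that a later run can iterate over without repeating the packaging step.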

Phew! Hopefully that should be the final technical tangle I get myself in.

There are still some silly little bugs plaguing the transcription interface, the
most notable of which being divisiones that rudely do not delete properly
when asked, sudden bouts of unresponsiveness, and plicae being applied to the
wrong notes. However, I can work around these, and most importantly I have added
a lovely big button to fetch the ML predictions from the database and populate
the stave for checking and fixing. Please enjoy this video of me transcribing
some staves with the help of ML.

I ramble and transcribe a system.

So how long is this going to take, and is it feasible? At the database prompt,
here are two useful commands to give me an idea (comments on the right hand
side).

SELECT COUNT(*) FROM folios                      -- number of folios
WHERE folios.deleted_at IS NULL;

SELECT AVG(idcount) AS average                   -- mean staves per folio
FROM (
    SELECT COUNT(staves.id) AS idcount           -- staves on each folio
    FROM staves
    INNER JOIN systems ON staves.system_id=systems.id
    INNER JOIN facsimiles ON systems.transcribe_id=facsimiles.id
    INNER JOIN folios ON facsimiles.folio_id=folios.id
    WHERE staves.deleted_at IS NULL
      AND systems.deleted_at IS NULL
      AND facsimiles.deleted_at IS NULL
      AND folios.deleted_at IS NULL
    GROUP BY folios.id
) AS subq;
Two database commands to calculate my heuristic.

The first command reveals that there are 1784 folios in the database. My
terminology is a little confusing as in the database, folios are defined as
manuscript surfaces that are named (n.b. named rather than numbered). For
example, “42r” is a different folio to “42v” because it has a different name,
even though we would normally call them two sides of the same folio. In practice,
folio is better approximated as page rather than folio, as one “folio” usually
has a single facsimile. This
figure of 1784 includes all the facsimile images of F, W1 and W2.

The second command indicates that, of the currently–defined staves, there are on
average 8.83 staves per folio. This means 1784 x 8.83=15752 staves to
transcribe in total. I have transcribed 273 of them in the past week with and
without the help of ML. Without ML it takes a gruelling eight minutes per stave
to get everything sorted, the majority of the time being taken clicking on all
the notes. With ML at its current accuracy, the time is taken down to between
two and three minutes per stave. The most successful predictions are staff lines
(almost perfect each time) and notes (mostly perfect). This is nearly a
threefold improvement and goes to show that this has not all been for nothing!

But can I do it all? Time for a back–of–the–envelope calculation. Here’s the
envelope:

1784 folios x 8.83 staves per folio x 3 mins per stave=47258 mins ÷ 60=787 hours ÷ 8 hour day=98 days=19 weeks ≈ 5 months?
This blog post was delayed because I couldn’t find an envelope for this bit: one of the downsides of going paperless

So, about five months of forty–hour weeks just sitting at the computer pointing
and clicking? This is such an interesting figure. It’s neither so small as to be
a simple endeavour, like a fortnight, nor so large as to be an unreasonable
task, like a year. Likely it would take a little longer due to various issues
that will surely crop up, but then again I anticipate that the ML will improve,
not by so much as to completely change matters, but by small increments. I think
five months is a good estimate.
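
For anyone else without an envelope to hand, the same sum can be written as a few lines of Python. This is only a sketch of the arithmetic above: the function name and the five-day working week are my assumptions (the latter follows from eight-hour days and forty-hour weeks), while the figures themselves are the ones quoted in this post.

```python
def estimate(folios, staves_per_folio, minutes_per_stave,
             hours_per_day=8, days_per_week=5):
    """Return each step of the back-of-the-envelope sum as a float."""
    staves = folios * staves_per_folio
    minutes = staves * minutes_per_stave
    hours = minutes / 60
    days = hours / hours_per_day
    weeks = days / days_per_week
    return {"staves": staves, "minutes": minutes, "hours": hours,
            "days": days, "weeks": weeks}


# Figures from the post: 1784 folios, 8.83 staves per folio on average.
with_ml = estimate(1784, 8.83, 3)      # upper end of two to three minutes
without_ml = estimate(1784, 8.83, 8)   # eight minutes per stave by hand

print(f"with ML:    {with_ml['hours']:.0f} hours, {with_ml['weeks']:.1f} weeks")
print(f"without ML: {without_ml['hours']:.0f} hours, {without_ml['weeks']:.1f} weeks")
```

At three minutes per stave this comes to just under twenty working weeks. At the eight minutes per stave it takes without ML, the same sum gives a little over fifty-two.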

I could say that it is all too much and I should trim things down, but on the
other hand the idea of having it all is so inescapably alluring. To be honest I am
not sure, with COVID as it is, that I’ll be doing much else this summer, and
will probably be stuck inside again anyway (sorry for the downer). Perhaps five
months of clicking will be the best use of my time‥?


A boring update: I should not be Google

Fri Dec 11 2020 20:11:00 GMT+0000 (Greenwich Mean Time)

Inside you there are two wolves: Do it the RightWay(TM) / Push directly to Production

I have always considered myself further to the right of this image. Nothing
that I have ever created has really had that many consequences, so I prefer to
“move fast and break things” because my projects have nearly always been for my
own benefit. Onoseconds have been entirely my mess to clean up when I
completely break my project. I have been influenced in the way I work by Oz
Nova’s article You are not Google, where he makes the case that it is
extremely tempting to over–engineer solutions when the simplest, most pragmatic
solution will probably do just fine. It is unlikely you really need that
99.99999% uptime.

This has kept me closer to the laissez–faire attitude, and I have always been
sceptical of solutions that will feed two birds with one scone, or even solve
all my issues at once! I have found that these solutions will likely bring in
more issues that I didn’t have to contend with before, for example instead of
just changing one thing I will have to go through a whole process to get the
same result (more of that later).

My last post was very much an example of that (perhaps too wild) attitude. It
was definitely more of a proof of concept of using such ML techniques to
recognise and locate features on ND polyphony manuscripts. However, the code
that got me there was rushed, hacky and buggy. For example, each model was a
hacked–together Python script. When I created a new model, I foolishly just
copied the script to a new file and changed the bits I needed, rather than
abstracting out the common functionality into other bits of program. I had
perhaps taken too much to heart that quote “premature optimisation is the root
of all evil”. Regardless, those scripts obviously needed cleaning up at
some point. The git repositories, too, were a mess (now cleaned up into a
different kind of Dockerfile mess).

CANDR, as it is, was written in that same vein, but has since been cleaned up.
Not as hacky, but written very simply in good ol’ PHP using a thin framework
and backed by the disk–based SQLite database. However, when I began to think
about how to deploy my machine learning models onto the website, into the cloud
(although I detest that expression), my infrastructure must become more
complex, and I must move closer to the “Right Way” of doing things.

As the machine learning models are fairly CPU intensive and benefit from being
accelerated by graphics cards (GPUs), it certainly will not do to have them
running on the same service as the website frontend: every time the models were
trained, the website would crash. Not good. I had to deploy the models onto a
different service to the website. However, they also need access to the
database, which at that point was locked on the hard drive of the website service.

I began therefore by migrating my simple SQLite database to a more flexible but
complicated PostgreSQL instance on Google Cloud (not the most ethical cloud
provider, but the best of a bad bunch). I went with PostgreSQL as MySQL (the
other option) is really quite difficult to work with when you have lots of
variable–width text fields.

I eventually landed on an infrastructure which works like so:

sequenceDiagram
participant C as CANDR Website
participant B as Codebase
participant I as GitLab CI
participant R as CANDR ML Runner
participant D as Database
participant W as CANDR ML Worker
B->>+I: Change code
I->>+I: Build website / runner / worker
I->>+C: Deploy with K8s
I->>+R: Deploy with K8s
I->>+W: Deploy with K8s
C->>+D: Alter database
loop Every 24 hours
R->>+D: Is database changed?
D->>+R: Yes
end
R->>+I: Trigger training pipeline
I->>+R: Begin training models
R->>+W: Provision training service on K8s
loop Every 5 seconds
I->>+R: Get training status
R->>+I: Training status
I->>+I: Display training status in pipeline
end
W->>+D: Get the model data
D->>+W: Model data
W->>+W: Training models
W->>+D: Save models
W->>+D: Get untranscribed data
D->>+W: Untranscribed data
W->>+W: Predicting untranscribed data
W->>+D: Save predictions
I->>+R: Get training status
R->>+I: Status done!
I->>+R: End training
R->>+W: Unprovision training service
C->>+D: Get prediction for stave
D->>+C: Prediction data

Sequence diagram of infrastructure

This began as a simple way to train the models, but through various small
necessities has become larger as a move towards the fabled “Right Way” of doing
things. This includes modern technologies such as Continuous Integration /
Deployment (CI), Docker images and Kubernetes (of which I have gone from zero
knowledge to a little knowledge). Simply put:

  • CI is a way to make a change to your code and have the resulting app or
    website built for you automatically.
  • Docker images are pickled versions of services that can be stored as a file,
    uploaded to a registry and then pulled from that registry and deployed at
    will.
  • Kubernetes is an architecture that can deploy Docker images with zero downtime
    by maintaining multiple running copies of those services and performing an
    automatic rolling update of those services. In addition, if you have passwords
    or secrets that need to be passed to those services, it can inject those at
    runtime rather than your Docker images being passed around with passwords
    stored inside.

What I initially wished to occur was that a regular job (say every 24 hours)
would spin off and train the machine learning models. However, I soon realised
that if I’m provisioning a large, expensive machine to train the models, I can’t
have it running all the time wasting money here and there. The machine that
serves the website costs only a penny an hour, but a machine powerful enough to
train machine learning models costs a pound an hour. I could not leave that
machine running, but needed to provision it when necessary, and only when there
is new data.

Therefore, I use a level of indirection: a “director” service in CANDR ML
Runner, provisioned on the same machine as the website. The runner checks
the status of the machine learning and is responsible for provisioning the ML
service via Kubernetes (i.e. flicking the big switch to turn the expensive
computer on). This frees up the website, and all other services can check the
runner service to see the status of the ML training. As the runner runs on the
same cheap machine as the website, it does not incur any extra cost and updates
the predictions only when the training and prediction have been completed. This
way, the website can retrieve the most up–to–date predictions from the ML.
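
The decision the runner makes can be sketched in a few lines of Python. This is not the runner’s actual code: the class, the function and its parameters are hypothetical names, and the rule (wait at least 24 hours, then train only if the database has changed since the last run) is my reading of the diagram and description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RunnerState:
    """What the (hypothetical) runner remembers between checks."""
    last_trained_at: Optional[datetime] = None
    training_in_progress: bool = False


def should_start_training(state, database_changed_at, now,
                          check_interval=timedelta(hours=24)):
    """Decide whether to switch the expensive machine on."""
    if state.training_in_progress:
        return False  # never provision a second worker
    if state.last_trained_at is None:
        return True   # nothing has been trained yet
    if now - state.last_trained_at < check_interval:
        return False  # too soon since the last run
    # Only pay for the big machine when there is new data to learn from.
    return database_changed_at > state.last_trained_at


now = datetime(2020, 12, 11, 20, 0)
state = RunnerState(last_trained_at=datetime(2020, 12, 10, 12, 0))
print(should_start_training(state, datetime(2020, 12, 11, 9, 0), now))
print(should_start_training(state, datetime(2020, 12, 9, 9, 0), now))
```

The first call prints True (the database changed after the last training run, more than a day ago) and the second prints False (nothing new since then, so the expensive machine stays off).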

Furthermore, the entire build process has been automated using GitLab’s CI. As
soon as a change is made to the master branch of the codebase, the website and
ML tools are rebuilt automatically and redeployed (again via Kubernetes) to
live. This adds another layer of complexity to the diagram above.

It finally works‥ but it is rather reminiscent of this video:

A familiar conversation with myself

As mentioned above, the Python scripts were rather hacked together and needed
some abstraction. I have now begun work on cleaning up these scripts into their
constituent functionality. Previously, I had long, repetitive code that did the
same work over and over for each model, but I have now successfully abstracted
work into individual classes to deal with:

  • Abstract model as base functionality for all derived models:
    • Staffline line model
    • Staffline ROI model
    • Staffline location model
    • Divisione model
    • Note model
  • Each model uses its respective data generator derived from an abstract data
    generator:

    • Staffline line data generator
    • Staffline ROI data generator
    • Staffline location data generator
    • Divisione data generator
    • Note data generator
  • Class for drawing stafflines from constituent detected segments
  • Image fetching
  • Training callbacks for status updating
  • Database driver
  • ML runner status updating
  • Driver app

Rather than a complicated mess of tangled code, the driver app now simply calls
each model one–by–one and calls train() followed by predict(), and the
abstract classes and data generators take care of the rest. The status is
updated via a simple JSON API on the runner which exposes the status to other
services. This is authenticated by public key encryption challenges for each
status update, so that only the worker is authorised to update the status.
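
The shape of that abstraction, as a minimal sketch: an abstract base class that fixes the train() and predict() interface, one toy derived model, and a driver loop that calls each model in turn and reports status as it goes. All class and function names here are hypothetical, the “model” is a deliberately silly stand-in rather than a neural network, and the status callback just collects strings instead of calling the runner’s JSON API.

```python
from abc import ABC, abstractmethod


class AbstractModel(ABC):
    """Base functionality shared by every derived model."""
    name = "abstract"

    @abstractmethod
    def train(self, transcribed):
        """Learn from staves that have already been transcribed."""

    @abstractmethod
    def predict(self, untranscribed):
        """Return a prediction for each untranscribed stave id."""


class MeanNoteCountModel(AbstractModel):
    """Toy stand-in: predicts the average note count seen in training."""
    name = "note"

    def train(self, transcribed):
        self.mean = sum(transcribed) / len(transcribed)

    def predict(self, untranscribed):
        return {stave_id: self.mean for stave_id in untranscribed}


def run_all(model_classes, transcribed, untranscribed, report=print):
    """Driver: train then predict with each model, one by one."""
    predictions = {}
    for model_class in model_classes:
        model = model_class()
        report(f"{model.name}: training")
        model.train(transcribed)
        report(f"{model.name}: predicting")
        predictions[model.name] = model.predict(untranscribed)
    return predictions


statuses = []
result = run_all([MeanNoteCountModel], [10, 20], ["s1", "s2"],
                 report=statuses.append)
print(result)
print(statuses)
```

Adding another model then means writing one more subclass and adding it to the list passed to the driver, rather than copying a whole script.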

My first conception of how this should work had to change. Initially, I had the
idea to train the models and predict the untranscribed staves in two discrete
stages. However, I soon found out that to train and then
save the models, I would have to save the neural networks that make up the ML
models in some format. There is the h5 format for this purpose, but getting that
format out was tricky, especially as each model when saved in this format came
out as several hundred megabytes in size. I thought that I could simply put that
data in a big column in the database, but it was so slow that it was timing out
the connection, and I really did not want to go down the route of making more
infrastructure just to store the model save files.

I have since altered the process to train and predict in one pass. Each model is
trained on the current data, the untranscribed staves are predicted, and then
that model must be thrown away as it cannot be saved. This is not too much of a
loss however, as I was planning to train the models from scratch each time
anyway.

Training models. Little ticking progress bar at the bottom shows some slow progress on training for notes.

I mentioned also that I was using Google’s Cloud Compute service for much of
this, and I have been toying around with their generous free trial before
spending any real money (however small). Google’s platform needfully sits to the
far left of that first image, doing everything the “Right Way” each time. If
anyone is Google, it’s Google. However, that means that for a small
single–person project, there are a lot of unnecessary hoops to jump through. I
found myself using their interface to grant privileges to myself saying, “Yes, I
will allow myself to use this or that particular function”. In a large company
or project, it is easy to see how this bureaucracy would be useful, with every
employee or team member having their own little walled garden and not
accidentally messing things up by pressing the wrong button. This way of working
was frustrating for me and just served to get in the way as I was going to grant
myself those privileges anyway!

However, I may have got a little too trigger happy granting myself privileges
when I came to the issue of using cloud GPUs to accelerate the ML tasks. Such
computing tasks are better suited to graphics cards and other accelerators than
to your computer’s CPU: you can think of a GPU as lots of little CPUs working
together to achieve a common goal, rather than one big CPU doing one thing at a
time. The CPU works well when you’ve got one single big task to do, but much of
the work in computing over the past few decades has gone into splitting big
tasks into individual independent tasks.

ML is one such task: each step requires the weights of hundreds
of tensors on a neural net to be adjusted. Using the CPU for this means that the
CPU is iterating over every tensor, adjusting the weights as it goes. The CPU is
fast, but this still takes time. The individual cores in a GPU are much slower,
but there are hundreds of them and they can update hundreds of tensors at a
time, speeding up the process by orders of magnitude.

I attempted to provision myself with a small GPU, to test some ML functionality
on this machine rather than the slower, cheaper machines and make sure that my
code had the right stuff installed to work with the GPUs. However, an error
came back, something along the lines of, “You do not have the necessary quota
to do this‥ go to the quota panel”. I therefore found this configuration panel
and saw that I had a quota of zero GPUs allowed. Surely another case of
granting myself permission! I upped my GPU quota from zero to a grand total of
one and was prompted for a reason. Ah, more bureaucracy! This reason will
surely just be sent back to me to grant myself permission! I put some silly
excuse like “i need gpu pls x” and clicked submit.

My inbox pinged:

Hello,

Thank you for contacting Google Cloud Platform Support.

This message is to confirm that we’ve received your quota request for project
‘candr-292514’. Quota increase requests typically take two business days to
process. If this quota request is very urgent, please respond to this email so
that our support agents can have full context when processing your quota
increase request.

Onosecond. This doesn’t go to me. This goes to a real person at Google. A few
agonising minutes went by while I looked to see (unsuccessfully) if I could
cancel the request. Another e-mail: denied (I wouldn’t grant myself access
either). I had to wait two days to make another request.

So, 48 hours later, I wrote a much more formal paragraph apologising for my
previous request, outlining my humble request for a GPU to do some ML, pretty
please. Clicked submit and just as quickly, denied. Rude, maybe they were still
peeved about my previous communiqué? They don’t provide a reason, so I reached
out to them asking why my request was denied (conveniently omitting my blunder
two days earlier). I was told that I had “insufficient service usage history”.
In other words, I was still a free trial baby and hadn’t put down any megabucks
on their services already, so they couldn’t (read: wouldn’t) give me one.

I can’t have moved all my infrastructure to Google for nothing, and I was happy
about the rest of the service, so I pushed another route further. I received a
much more helpful e-mail from someone who outlined my options going forward:

  1. Reapply for quota
  2. Apply for research credits
  3. Reach out to certified partners

Option one was an inevitability. I have not yet reapplied, but am hoping that my
continued usage of the platform will eventually allow me to use their hallowed
services.

Option two is interesting. It turns out that Google offer free usage of their
platform for research, but such credits must be applied for (they are
currently, of course, prioritising projects related to COVID-19) and I will be
applying to get more free usage that way. Their form asks for a link to my
research profile, so I took the time to update my University page with things
that make me look much more successful and qualified. I will be applying for
these credits as soon as I can think of the right words to write. There is also
the possibility of applying for a research grant through the University.

Option three is not useful: it is a list of companies that offer particular
services using the same platform, none of which were useful to me as I am
running bespoke services.

For now, and until Google deign to allow me to use their service further, all is
not lost. I can still use the platform as I was before, and I can use the same
tools (Docker/Kubernetes/etc.) to run those same ML tasks on my computer. It’s
not as cool or ✨ automatic ✨ magic, but I’m getting close to
the point where my own computer pretends to be the big computer in the sky with
the big GPU (although the big GPU is my own, or more precisely borrowed from my
brother who wasn’t using it).

I think that the main thing to keep in mind as I struggle to get to grips with
these technologies and ideas is that I am learning as I am going: six months ago
every technology listed here was alien to me, although I am looking forward to
getting over this technology hump and getting back to the musicology and
manuscripts.


Speed 3: Machine Learning

Thu Oct 01 2020 18:24:00 GMT+0100 (British Summer Time)

The git repositories that host CANDR are split into two: yockyrr/candr and
yockyrr/candr-data. The former hosts the codebase and scripts for CANDR to
operate, and the latter stores a dump of the database in a text format (a raw
SQL dump) along with the images uploaded to the database as well as the
necessary scripts used to transform to and from the database format (SQLite).
The project clones both of these repositories and keeps them in separate
directories, such that a change in the database does not effect a change in the
codebase, and vice versa.

I’ve been fixing bugs in the frontend of CANDR by uploading sets of facsimiles
and attempting to input a few staves here and there, seeing what issues crop up
along the way. Exactly as I had hoped, input is relatively fast, such that I can
trace the key features of a stave of ND music and save it to the database for
processing in under a minute. I am aiming to get a large proportion of the ND
repertory traced in this manner to create a large dataset for analysis by next
year. I have implemented an automatic guide on the site (the “Transcription
Wizard”) that tracks transcription progress and automatically leads you through
each step of the tracing and transcription process for an entire source of
music. The wizard tool takes you through the categorisation of a source to its
stave transcription, and each stave is quite simple to transcribe.

Labelled screenshot of stave editor window: 1. Toggle facsimile visibility 2. Toggle staffline visibility and below, add staffline 3. Clef visibility and add 4. Accidental visibility and add 5. Divisione visibility and add 6. Note visibility and add 7. Syllable visibility and add 8. Tools for adding ligatures and adding notes to ligatures 9. Multi-input tool. When toggled, as soon as an element has been added, automatically adds another of the same type 10. Delete selected element 11. Various tools for adding editorial elements

Using this editor, my personal workflow is thus:

  1. Trace the stafflines using the staffline tool (2). Use the multi-input
    function (9) to trace all of these at once. There are typically four to six. It
    is important to trace these first as they will form the basis for the pitch
    content of other elements such as clefs, accidentals and notes.
  2. Trace non-note data. I usually like to begin with the clef (3) as all staves need
    one of these (existing or editorial), then a starting accidental if present (4),
    finally using the multi-input tool once again to trace the divisiones (5).
  3. Trace the notes (6). Use the multi-input tool to click on the note heads,
    taking care to see which have plicae and which will be transcribed incorrectly
    by looking at the green staffline outlines.
  4. Go back over the notes and add plicae and manual shifts where necessary to
    correct the notes’ transcription.
  5. Add syllables to notes (if present) making a best guess as to which note this
    syllable is attached to.
  6. Finally, use the transcription wizard (not in screenshot) to mark this stave
    as transcribed.

Once all staves in a system have been traced, you can then use the
transcription wizard to mark the system as transcribed, and it will suggest
that you add system synchronisation tags to the system. This is a fancy way of
saying: add lines between staves that indicate that these events occur
simultaneously. A setting of music can then be broken down into “chunks” of
polyphony for processing in XML (see previous posts). It will be my aim during
analysis to detect which chunks can be grouped together to form clausulae and
then how these clausulae are passed around the repertory.

A chunk can be as small or large as you like, but I like to make as little
editorial input with regard to inferred rhythm as possible, relying on visual
cues to synchronise the staves, such as alignment and the size and shape of
divisiones on the page. The scribes of the manuscripts clearly intend some
moments to be synchronisation points. For example, it may be possible to imply
through rhythmic interpretation that two notes occur simultaneously, but it is
more obvious that two divisiones are simultaneous when they are written
directly above one another at the end of a system. In this way, it is usually
short, synchronous ordines of polyphony that are collected into chunks. Longer,
more hocketed passages I will usually shy away from making into smaller chunks
as I don’t wish to editorialise a rhythm upon polyphony that is not fully
understood. If such hocketed passages are to be passed around in clausulae,
then they are likely to be transmitted as single units: half a hocket doesn’t
make musical or poetic sense.

In August, I grew bored of just drawing some lines and boxes so decided to
source the public domain or permissibly–licensed manuscript images for the
three main sources (F, W1 and W2) and upload them in total to the website. The
images of these can be downloaded as a large set of numbered JPEGs from the
respective library websites, so I had a task at hand to organise these JPEGs
and label them, then upload them one–by–one to the website, creating a folio
and facsimile record for each, then grouping them all into facsimile sets. As
can be seen from the folder of ID-named files in the image store of the data
repository, this was a lot of files to upload! However, I did not complete
this mammoth and repetitive task alone. I wrote two shell scripts
(foliator.sh and uploader.sh) to streamline the tagging and uploading
process of these files.

The JPEGs as I had downloaded them were luckily saved in a numbered order, so
moving back and forth through the numbers in an image viewer moved back and
forth in the manuscript. By flicking through the manuscripts, I took a note of
where the foliation was broken, repeated, or needed editorialising.
foliator.sh simply prompts for four numbers:

  1. The filename of the first JPEG.
  2. The folio that this JPEG is a facsimile of (recto was assumed, and the first
    images of JPEG sets that began on verso were manually inserted).
  3. The final folio in this set.
  4. The “width” of the filename, e.g. a width of five would be numbered
    00001.jpg, 00002.jpg, 00003.jpg etc.

Finally, it asks for the filename extension (typically .jpg) and generates a
TSV (tab separated value) file which links each folio name to a calculated filename
for this folio. The script creates two records: one for recto and another for
verso. Broken foliation could be calculated by doing one pass for each
contiguous set then appending the files in vim, and filenames that had
prefixes were fixed using a mixture of sed and vim. After manual review,
this script gave me a file that contained folio labels and a matching JPEG for
each.
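
The core of that calculation, re-imagined as a short Python sketch rather than the original shell script: given the four numbers and the extension, pair each recto and verso with the next filename in sequence. The function name and argument names are mine, and the sketch assumes consecutive, unprefixed filenames with no gaps (the broken foliation and prefixes were handled by hand, as described above).

```python
def foliate(first_file, first_folio, last_folio, width, extension=".jpg"):
    """Return (folio label, filename) pairs, two per folio: recto then verso."""
    rows = []
    number = first_file
    for folio in range(first_folio, last_folio + 1):
        for side in ("r", "v"):
            # Zero-pad the file number to the requested width, e.g. 00001.jpg
            rows.append((f"{folio}{side}", f"{number:0{width}d}{extension}"))
            number += 1
    return rows


def to_tsv(rows):
    """Join the pairs into tab-separated lines."""
    return "\n".join(f"{label}\t{filename}" for label, filename in rows)


rows = foliate(first_file=1, first_folio=42, last_folio=43, width=5)
print(to_tsv(rows))
```

For folios 42 to 43 starting at file number 1 with a width of five, this pairs 42r with 00001.jpg, 42v with 00002.jpg, 43r with 00003.jpg and 43v with 00004.jpg.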

The second script, uploader.sh, reads in this generated TSV file and parses
it. After inputting user credentials for the CANDR website instance,
uploader.sh uses the site’s API to log in, create correctly–named folios and
facsimiles for each record, and uploads the image to the site’s database. Where
each image previously took a few minutes to tag and upload manually, whole
facsimile sets were uploaded in a matter of seconds. As of yockyrr/candr-data
commit c74fac5a, all of the three main sources have been uploaded.

Looking at all these facsimiles uploaded to the website was quite awe–inspiring.
These are all the images that I want to transcribe and it was actually rather
worrying to scroll through the images and think, “I have to transcribe all of
these”, then look at the little progress metric I added at the top which
reminded me that I had so far to go.

Cue soundtrack from Interstellar

When travelling, I like to continuously work out in my head how much farther it
is to go and how long it will take me to get there, especially on long boring
journeys where I know I will be travelling at a constant speed. I enjoy
consoling myself with a calculation, e.g. if I’m driving at 60mph, that’s one
mile every minute, so if home is 46 miles away then it’s going to take me about
46 minutes to get there. Then I like to compare how accurate I was when I
eventually arrive. Similarly, when I find it difficult to sleep, I likely make it
worse for myself by looking at the clock and thinking, “If I fall asleep right
this very second, I’ll sleep for exactly 4 hours and 24 minutes before my alarm
goes off” (n.b. this rarely results in me falling asleep any faster).

It came as no surprise to myself then that when looking at all the images I had
to transcribe, I did a quick back–of–the–envelope calculation to see how long
all this is going to take. This time two years ago, I was perhaps naively
setting off on transcribing a large proportion of W1 for my Masters, the
transcription of which took far longer than I anticipated (roughly three months
off and on) and I don’t think I’ve ever been the same. Moreover, I wasn’t
trying to capture as much information then as I am doing now, and input was
likely just as fast, if not faster.

I timed myself inputting a few staves, added a little on for redos and general
faff, multiplied by a ballpark figure for the number of staves per facsimile
and the number of facsimiles in the database and arrived at an alarming figure.
I won’t share that figure here, but suffice to say that at 40 hours of
transcription a week, I’d be lucky to finish transcribing my dataset by next
Christmas. No, not Christmas 2020, Christmas 2021. Who knows what the world
will look like by then? I shudder to think about that, as well as my mental
state after such an undertaking.

If any work has been done on using digital technologies to increase the speed
of music transcription, then the DDMAL team, mostly based at McGill, who have
worked on OMR (optical music recognition) for nearly two decades, would
likely have researched it. By using OMR, researchers have managed to
extract meaningful categorisations of music by developing numerous in–house
tools, the most interesting for this purpose listed on their website being
Gamera. Gamera styles itself as not simply a tool for OMR, but a tool for
generating OMR tools, a sort of meta-tool that has had equal success in text and
music. After dutifully downloading and installing Gamera, I set to opening a
facsimile of music and attempting to extract some meaningful features from it.

Gamera’s best go at ND repertory

Unfortunately, the feature extraction filters of Gamera didn’t perform well
here. The main issue with detecting staff notation is that elements are layered
on top of one another: elements such as clefs, accidentals and notes are layered
on top of the stafflines, and in the ND repertory notes are joined into
ligatures of no fixed shape or size. Unlike text, where the individual glyphs
that make up words can be extracted individually and recombined to make words,
staff notation requires a larger context to understand elements. For example,
whether a horizontal line is a staffline or ledger line requires looking further
afield to the edges of the stave to see whether that line continues. This makes
it difficult or impossible for filter–based feature extractions as used in
Gamera to extract features that rely on larger contexts. To attempt to remove
stafflines from common practice printed music, Gamera needs a special toolkit
which does not work on ND manuscripts.

However, DDMAL have more recently begun work that uses machine learning (ML) to
classify the pixels of early music facsimiles. For example, Pixel.js builds upon
Diva.js (an IIIF viewer), and is a web–based editor that allows users to
categorise the pixels of a manuscript: notes, ligatures, stafflines, text,
etc. What they call the “ground truth” of these pixel classifications can
then be fed into an ML algorithm (DDMAL uses a scheduler called Rodan) which can
then be trained on that input data. Then, when asked to predict the
classification of an unknown pixel from another manuscript, it can attempt to
predict what class of item that pixel belongs to. Of course, it is impossible to
guess a pixel from that pixel alone, so a window of image context is given to
the training data (typically 25 pixels square). The ML model is therefore asked
a simple question: given this 25×25 image, what is the class of the central
pixel?
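
To make that question concrete, here is a minimal sketch of cutting such a window out of an image, using nothing but nested Python lists. It is not DDMAL’s code: the function name is hypothetical, and filling the area beyond the image edge with a constant value is my assumption about how edges might be handled.

```python
def window(image, row, col, size=25, pad=0):
    """Return the size x size square of pixels centred on (row, col).

    `image` is a list of rows of pixel values. Positions that fall outside
    the image are filled with `pad` (an assumption, see above).
    """
    half = size // 2
    height, width = len(image), len(image[0])
    result = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            line.append(image[r][c] if inside else pad)
        result.append(line)
    return result


tiny = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(window(tiny, 1, 1, size=3))  # the whole image, centred on the 5
print(window(tiny, 0, 0, size=3))  # top-left corner, padded with zeros
```

A training example is then one such window paired with the class of its central pixel, and a 1000×1000 facsimile yields a million of them.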

Transcribing facsimiles with Pixel.js is incredibly laborious and difficult when you need pixel–perfect accuracy. It is easy, however, to ragequit.

All this sounds great, until you realise the timescales involved for generating
any ground truth data. DDMAL proudly state that using Pixel.js, one can
generate ground truth for a single facsimile of music in just eighteen
hours! This approach clearly won’t work in my case as I do not have the
time to generate pixel–perfect ground truth data. Furthermore, I am a little
sceptical of DDMAL’s approach, and their narrow focus on classifying pixels
rather than extracting features. Consider a best–case scenario: the ML model has
been exceptionally well trained and is asked to predict the classifications of a
facsimile that is 1000×1000 pixels. Say the model is 99% effective, and you are
then interested in extracting stafflines. Even at 99% accuracy (an unreal
standard), the staffline layer will not consist of neat stafflines, but a series
of correctly–predicted segments and incorrectly–predicted anomalies, which will
then have to undergo further processing to be smoothed out into “real”
stafflines.
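
To put a number on that best case, a quick sum (the function name is mine, and it assumes errors are counted per pixel and spread across the whole image):

```python
def misclassified_pixels(width, height, accuracy_percent):
    """Pixels predicted wrongly, using integer arithmetic to avoid rounding."""
    total = width * height
    return total * (100 - accuracy_percent) // 100


wrong = misclassified_pixels(1000, 1000, 99)
print(wrong)          # wrong pixels in the whole facsimile
print(wrong // 1000)  # average wrong pixels per row of 1000
```

That is 10,000 wrong pixels in a single facsimile, or ten in every row of a thousand on average, each one a potential gap or blot in a staffline.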

Remember: a staffline or other element isn’t a tangible item, but a notational
abstraction based on a collection of similarly–functioning notational elements.
That’s why I put “real” in big scare quotes here. When considering what a “real”
staffline is, aesthetic is everything and a staffline is only as good as its
visual manifestation. There is a tempting avenue in considering these old
manuscripts as inferior technology and making the case that if only we could
just get it to display in [notation editor of choice] then we would be okay.
There is something unsettling about this, a feeling like we’re colonising the
past somewhat and I believe it’s important to continuously remind ourselves that
our notational representations are just as valid as the originals. We’re losing
and gaining information during these transformations.

This focus on the pixel also ignores the context of the music. DDMAL correctly
acknowledge that there are layers to the facsimile (stafflines, notes, text,
etc.) but fail to mention that these layers overlap. When a note is on a
staffline, the staffline is still implied to lie behind, hidden by the note
which has been written on top. A pixel, or matrix of pixels, is not a perfect
representation of the context of the underlying music nor the notation on the
page, and a classification of a pixel as one thing or another fails when a pixel
could in fact be two or more things at once.

This is not to say that DDMAL’s method is not effective: they have seen success
in classifying pixels this way, although I have not seen much output from their
models beyond a few cherry–picked examples. I would contend that we need a
greater diversity in approach in order to detect diverse features in diverse
contexts. However, it is one thing to say it and another to actually go and do
it.

Rather than a single ML model that assigns every pixel to exactly one
class, I decided to create individual ML models for detecting particular
features, extracting features directly rather than classifying pixels. This has
the advantage that I can tailor my approach to the feature I am trying to
detect. I decided to start by detecting stafflines, as they are the most
time–consuming and repetitive elements to add, and I figured (perhaps
incorrectly) that they would be simple to detect.

One of the most widely used libraries for ML is TensorFlow. In the past it was
difficult for novices to get started with sometimes confusing libraries such as
TensorFlow, but recently simpler APIs such as Keras have made that much easier
by hiding some of the unnecessary configuration and providing sensible
defaults that can be tweaked later. ML has had such a high barrier to entry
for this very reason, plus the fact that every option and algorithm used has a
complicated name derived from a research paper. In reality, most of the things
that you want to do with ML have already been done, and it is just a matter of
finding the right blog post or StackOverflow question that works for you.

The most important step is trying to simplify the question you’re asking as much
as possible, and trying to get as much data for that question as possible.
Usually these go hand–in–hand: if you’re simplifying your question, you’re
likely to be splitting up your data into smaller and smaller chunks. For example
in this case, the question I’m looking to answer is: looking at a picture of a
stave, where are the stafflines? I want to better define that question and break
the problem down into smaller sub-problems to solve individually.

First of all, it is important to think about how the ML model “sees” the data.
It’s not looking at an image or sequence and making classifications. What it is
in fact doing is relating two matrices of numbers, and trying to guess the
second from the first. For example, the 25×25 window that DDMAL use is usually a
colour image. There are three components to a colour image: red, green and blue.
Say I’m only trying to find stafflines, the output is not a simple
classification but a probability: staffline or no staffline. Therefore, the
input is a matrix with shape 25, 25, 3 (i.e. width 25, height 25, depth 3 for
the 3 colour channels) and the output is a matrix of 2, 1 (width 2, height 1)
to make two probabilities for each of the classifications. The ML model doesn’t
know that the input matrix is an image and the output is a probability, it’s
just numbers that it’s trying to guess using the input numbers.
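
As a small illustration of "just numbers", the sketch below builds the two matrices as plain nested lists and reports their shapes. The `shape` helper and the random pixel values are stand-ins of mine, not part of DDMAL's or CANDR's code:

```python
import random

def shape(matrix):
    """Return the dimensions of a rectangular nested list, outermost first."""
    dims = []
    while isinstance(matrix, list):
        dims.append(len(matrix))
        matrix = matrix[0]
    return tuple(dims)

# A 25x25 colour window: one [red, green, blue] triple per pixel, scaled to 0-1.
window = [[[random.random() for _ in range(3)] for _ in range(25)]
          for _ in range(25)]

# The target the model tries to guess: [P(staffline), P(no staffline)].
target = [1.0, 0.0]

print(shape(window))  # (25, 25, 3)
print(shape(target))  # (2,)
```

Nothing in either structure says "image" or "probability"; that meaning exists only in how the numbers are prepared and interpreted.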

For instance, a naive interpretation would be to take your colour image, feed it
directly to the ML model and extract probabilities from each pixel. Say your
image is 1000×1000, then your input shape is 1000, 1000, 3 and the output is
1000, 1000, 1. This is not going to yield a good model as you’re expecting too
much from the model, and your training data is going to be only in the dozens
(one per stave). This is quite a few orders of magnitude off as we want at least
thousands of samples for training.

Better are approaches that simplify the question, such as DDMAL’s 25×25 pixel
window and the classification of a single pixel. I want to go down the same
route, but simplify my questions even further. For the problem of stafflines, I
see it being broken down like so:

  1. How many stafflines are there? (Staves commonly have between four and six).
  2. Which areas have stafflines in? (This is known as region of interest or ROI
    for short).
  3. Where in these areas are the stafflines?

The difference between DDMAL’s and my approach is that rather than asking the
pixel location of stafflines in an entire stave, I can first narrow my search by
finding general staffline parameters and areas, then doing a fine–grained
location search in those areas.

First: to define exactly the data I will be working on. Calling back to previous
posts, CANDR applies perspective transformations on facsimiles to isolate
systems and then applies another round of transformations to isolate staves
(perhaps there is another ML application here), so in this case I first need to
preprocess my data to gain an initial dataset of the perspective–transformed
staves, as seen by the CANDR stave editor. This is all done in JavaScript in
the browser, and so to get that same data for the ML models, I had to port my
JavaScript code exactly to Python and verify that viewing a stave in JavaScript
gives the same output as viewing a stave in Python (n.b. this is the third time
I’ve rewritten the same functions in a new language: I previously had to port
the same JavaScript code to PHP for the backend).

There are (broadly) three steps to ML:

  1. Training. Give the model a whole load of input → output cases and let the
    algorithm work itself out using the parameters you’ve given it to generate a
    neural network that best can predict the output given the input.
  2. Validation. Use more data to check that the training you’ve done is accurate,
    or at least good enough. This should be data that you haven’t used in your
    training. What we’re checking for here is that the ML model has learnt the
    underlying process behind input → output rather than just learning that set
    of data and being very good at replicating it.
  3. Prediction. Your model is complete! Use new input to predict some outputs.
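
The separation between training and validation data can be sketched as below. The 80/20 split, the fixed seed and the `split_samples` name are illustrative assumptions; libraries such as Keras offer their own ways of holding data back:

```python
import random

def split_samples(samples, validation_fraction=0.2, seed=0):
    """Shuffle samples and hold back a fraction that training never sees."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Stand-in (input, output) pairs; real samples would be image windows and labels.
samples = [(i, i % 2) for i in range(1000)]
training, validation = split_samples(samples)

print(len(training), len(validation))  # 800 200
```

Shuffling before splitting matters here because windows cut from the same stave sit next to each other in the data and look very alike.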

I am always looking at ways to segment and simplify my data so I started by
looking at colour. Do I need a colour image to predict? Not necessarily. In
fact, some transformation may help the ML do its job. To this end, I performed
edge detection analysis on my input images. This works very well for these inked
manuscripts as the difference between what is and what is not an element is
typically very sharp, and edge detection brings these sharp features into focus.

A sample stave, with edge detection applied
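
For readers unfamiliar with edge detection, here is a minimal sketch of one common method, the Sobel operator, written in plain Python. Which detector was actually used for CANDR is not stated above, so Sobel is an assumption for illustration only:

```python
def sobel_edges(gray):
    """Return the Sobel gradient magnitude of a greyscale image (list of rows).

    Border pixels are left at 0 for simplicity.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            edges[y][x] = (gx * gx + gy * gy) ** 0.5
    return edges

# A toy 7x7 "stave": light parchment (1.0) with one dark horizontal line (0.0).
stave = [[0.0 if y == 3 else 1.0 for _ in range(7)] for y in range(7)]
edges = sobel_edges(stave)

print(edges[2][3], edges[3][3], edges[4][3])  # 4.0 0.0 4.0
```

The response is strongest just above and below the inked line and zero in flat areas, which is the sharpening effect described above.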

To generate a model that answers the first question (how many stafflines are
there?), I tagged each stave that has already been transcribed with the number
of stafflines defined in that stave. The minimum and maximum number of
stafflines across the dataset were taken to normalise my data for training the
ML model (models work best when all numbers are between 0–1). Using this initial
data, moving full–height windows of each stave were generated using a fixed
width so that every staffline was visible in each window. The aim was to get the
ML model to count the horizontal stripes. Since all I was looking for was the
stripes to be horizontal, I generated three times more data synthetically by
flipping the images upside down, then back to front, then both.

Visualisation of a sliding window of width 20
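
The windowing and flipping described above might look something like this sketch, with images held as lists of rows. The function names and the `step` parameter are hypothetical, and the step size actually used is not stated:

```python
def column_windows(image, width=20, step=1):
    """Cut full-height, fixed-width windows from an image (a list of rows)."""
    total_width = len(image[0])
    return [[row[x:x + width] for row in image]
            for x in range(0, total_width - width + 1, step)]

def flips(window):
    """Return the window plus its three mirrored variants."""
    upside_down = window[::-1]
    back_to_front = [row[::-1] for row in window]
    both = [row[::-1] for row in window[::-1]]
    return [window, upside_down, back_to_front, both]

# A toy 4x60 "stave" whose pixel values just record their own position.
stave = [[(y, x) for x in range(60)] for y in range(4)]
windows = column_windows(stave)
augmented = [variant for w in windows for variant in flips(w)]

print(len(windows), len(augmented))  # 41 164
```

Each window yields three extra samples, all carrying the same staffline count as the original.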

I found that the width of the stripe did not affect the training accuracy, so I
chose a midrange window size of 20 pixels. The ML model was trained on each
fixed–width stripe, mapping an input of shape n, 20, 1 → 1, where n is the
height of the stripe and the output is a single number indicating a predicted
number of stafflines. The output, of course, is never a clean answer of “4”, “5”
or “6”, but lies somewhere in–between. I rounded my answer to the nearest whole
number. After training, this ML model can be used to figure out how many
stafflines to try to detect in the stave.
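
The normalisation and rounding steps amount to a pair of small functions. This is a sketch: the minimum of 4 and maximum of 6 are taken from the "four to six" range mentioned earlier, not from the real dataset:

```python
def normalise(count, lowest, highest):
    """Scale a staffline count into the 0-1 range used for training."""
    return (count - lowest) / (highest - lowest)

def denormalise(value, lowest, highest):
    """Map a 0-1 model output back to a whole number of stafflines."""
    return round(value * (highest - lowest) + lowest)

lowest, highest = 4, 6  # assumed minimum and maximum counts across the dataset

print(normalise(5, lowest, highest))       # 0.5
print(denormalise(0.43, lowest, highest))  # 5
```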

The second model, for finding regions of interest, uses a much larger, square
window (60×60 pixels) that moves over the entirety of the stave. If the end of a
staffline is contained in this
window, then it is marked as being a region of interest (ROI). If no
staffline–ends are in the window, then it is marked as not interesting. I used
the image flipping method to synthetically create transformed data. Since there
is so much data created here (roughly a few thousand windows per stave), I
elected to only use the training data in one “epoch” (i.e. only let the model
look at each sample once).

Visualisation of a sliding window of size 60

In prediction, a window of the same size is given to the model and it predicts
whether it contains the end of a staffline, as a probability in the range 0–1.
This probability is added to a “heatmap” of probabilities of the stave to
generate a masking layer to use for the final model. Bright spots will be
processed, and dark spots ignored.
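
One way to build such a heatmap is sketched below. It averages the probabilities of all windows that cover a pixel, which is my assumption about how the probabilities are combined; the 0.5 threshold for the mask is likewise illustrative:

```python
def roi_heatmap(height, width, predictions, size=60):
    """Accumulate per-window probabilities into a stave-sized heatmap.

    predictions: iterable of (top, left, probability), one per window position.
    Each pixel ends up holding the mean probability of the windows covering it.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top, left, probability in predictions:
        for y in range(top, min(top + size, height)):
            for x in range(left, min(left + size, width)):
                total[y][x] += probability
                count[y][x] += 1
    return [[t / c if c else 0.0 for t, c in zip(total_row, count_row)]
            for total_row, count_row in zip(total, count)]

# Two overlapping windows on a toy 60x90 stave: one confident, one not.
heatmap = roi_heatmap(60, 90, [(0, 0, 0.9), (0, 30, 0.1)])
mask = [[value > 0.5 for value in row] for row in heatmap]
```

Pixels covered only by the confident window stay bright and pass the mask, while the overlap and the low-probability area are filtered out.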

The ROI for the stave

The final model runs a similar moving window, but with a smaller dimension (20
pixels). I found this window size to be optimal for only containing one
staffline–end at a time. As we will only be looking at areas marked by the
previous ROI model as containing a staffline–end, the question is not the
existence of the staffline, but where is it? This is crucial: as the ML models
are only looking at numbers, it’s tough to train them to look at existence and
location simultaneously. The model was trained using windows that had
staffline–ends in, and once again I flipped the images to generate more data.
The transformation shape was therefore from 20, 20, 1 → 2, the output being an
x, y coordinate between 0–1 where 0, 0 is the top left of the window and 1, 1 the bottom right.
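
Because the output is relative to the window, it has to be mapped back to stave coordinates before it can be plotted. A minimal sketch, with a hypothetical function name and a window whose top-left corner is at (`left`, `top`):

```python
def to_stave_coordinates(top, left, prediction, size=20):
    """Convert a window-relative (x, y) in 0-1 into stave pixel coordinates."""
    x, y = prediction
    return left + x * size, top + y * size

print(to_stave_coordinates(top=40, left=100, prediction=(0.5, 0.25)))  # (110.0, 45.0)
```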

Visualisation of a sliding window of size 20 operating over the ROI of the stave

Heatmap output of the third model

In prediction, this is used once again to create bright spots on a heatmap.
Using the output from the number of lines predicted, we can make the assumption
that this number of bright spots will appear in the left half of the image, and
the same number in the right half. I therefore processed the image to find this
many bright spots in each half and matched them up: the highest spot in the left
matches with the highest spot in the right, the lowest with the lowest and so
on. We can then draw on the image where we believe the stafflines are, or use
these points as predictions for correction in the CANDR editor. This output was
created with only a few staves’ worth of data input, so naturally the prediction
is going to improve as more staves are input, giving these ML models more
training data. It is likely that I will be creating and refining more ML models
to detect other features, such as divisiones, clefs, accidentals, text syllables
and notes. I can foresee that these ML predictions will not serve as perfect
transcriptions of the images, but quick, mostly–there transcriptions that will
only require a little tweaking on my part.

Processed output from the final model
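
The matching step can be sketched as follows. It assumes the bright spots have already been reduced to a list of (x, y, brightness) peaks; how the peaks are found in the heatmap is not covered here, and the function name is hypothetical:

```python
def match_staffline_ends(spots, image_width, n_lines):
    """Pair the brightest spots in each half of the image, top to bottom.

    spots: list of (x, y, brightness) peaks taken from the heatmap.
    Returns one ((x1, y1), (x2, y2)) pair per staffline.
    """
    def brightest(half):
        chosen = sorted(half, key=lambda s: s[2], reverse=True)[:n_lines]
        return sorted(((x, y) for x, y, _ in chosen), key=lambda p: p[1])

    left = brightest([s for s in spots if s[0] < image_width / 2])
    right = brightest([s for s in spots if s[0] >= image_width / 2])
    return list(zip(left, right))

spots = [(5, 10, 0.9), (6, 30, 0.8), (4, 20, 0.2),  # left half, one weak spot
         (95, 12, 0.7), (94, 33, 0.9)]              # right half
lines = match_staffline_ends(spots, image_width=100, n_lines=2)

print(lines)  # [((5, 10), (95, 12)), ((6, 30), (94, 33))]
```

The weak spot at (4, 20) is dropped because only the two brightest spots per half are kept, which is where the line count from the first model earns its keep.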


How–to: Transform XML in Browser (Answer: Don’t)

Sun Aug 16 2020 21:12:00 GMT+0100 (British Summer Time)

In my first post, I mentioned that my next milestone was to be able to transform
XML using XSLT:

In–browser soon, for all browsers that support XSLTProcessor (that is, not
IE).

The harsh reality is that although XSLT is “supported” by browsers, you will be
tricked at every turn by a technology that, although basically functional, has
not been updated with any new API for over a decade. Your lovely, performant
stylesheet that worked fine using libxslt’s xsltproc will fail, seemingly for
no reason, in browsers and, as mentioned, will never work in IE.

Good luck! Here are the particular hurdles that I came up against, and how I
overcame, hacked, or just plain gave up on them. My goal was to be able to
transform the data that will be input into my app’s database into a “useful”
format, something that can be parsed by humans as well as computers and then
simply transformed into a viewable edition and other useful formats such as
MEI. However, I was deceived along the way by the call of “the right way to
do things” and stepped upon the path of XML, which led me directly to hell.
This is part of the story of how I am still trying to get out!

At the end of my last post, I had some lovely (and not so lovely) XML transform
(XSLT) stylesheets, as well as an API endpoint that could spit out some XML to
be transformed by these XSLTs. One small stylesheet can scrape the XML for
syllables, thus extracting the text of a setting (candrToSettingText.xsl).
Another huge and ugly stylesheet (candrSettingToMei.xsl) can transform
this XML into an odd dialect of MEI, and one final stylesheet
(MeiToABC.xsl) can convert this MEI oddity into ABC, a text–based music
typesetting language. My API can convert database tables into XML, then by
chaining XML transforms together I can convert from XML to ABC, then ABC to
printable editions in SVG or PDF etc. My issues arise with that second step:
chaining XML transforms.

I had initially shied away from performing XML transforms on the server as I
knew that candrSettingToMei.xsl in particular can take several seconds to run
on large settings due to its matrix transformations, and I did not wish to tie
up my backend with needless XML transforms every time a page is requested. I
believe an evil man once said, ‘Browser CPU cycles are free’, and he was
technically and unfortunately correct. If I could offload XML transforms and
their delays to the client rather than tying up my server, then I would be able
to dedicate more server time to generating XML rather than transforming it.

XSLTProcessor has been around in most browsers since the very beginning, and
this makes sense as there were some dark days of XHTML and stylesheets other
than CSS. It turns out that IE can in fact do XML transforms but it is
hidden within… an ActiveX control (shudders). Luckily, there is a simple wrapper
around these conflicting standards in murrayju/xslt, which exposes a
single function for doing XML transforms. Ideal! After importing the package
into my project, I created some classes to consume XSLT jobs and despatch them
to a Web Worker thread, as I do not want to halt the main thread to do XML
transforms, but instead keep them running in the background and use the main
thread to provide updates to the user.

This provided the first big sticking point: Web Workers cannot use
XSLTProcessor or similar.

I’ll be the first to admit: this makes zero sense. Obviously, Web Workers cannot
manipulate the DOM as they operate outside of the main thread, and we don’t want
them to create race conditions in the DOM. All DOM methods are off–limits to
them. For
some bizarre reason, this includes XSLTProcessor, perhaps because it is
technically manipulating a DOM. Not the DOM, but a DOM nonetheless.

A no–go then, but are there vanilla JavaScript libraries to perform the same
function as XSLTProcessor without using that API? Yes there are, kind of, but
they are incomplete and lack even basic functionality. Moreover, my stylesheets
use EXSLT for all different kinds of functions, and no JavaScript–based XSLT
processor shows any sign of an attempt to comply with any part of EXSLT.

So it looked like I would have to use XSLT on the main thread. This might hang
the browser for those few seconds it takes to transform the XML, but I supposed
that the page wasn’t doing anything else particularly, and needs must.

My original XML made liberal use of XInclude, which is a simple way in which
to include other XML files in your XML. All you have to do is to set a link to
another XML file, and provide a fallback (usually an error) in case that link
didn’t work for some reason, like so:

xinclude: /xml/set/7/ not found
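
For readers who have not met XInclude, the sketch below shows the general shape of an include with a fallback and resolves it by hand in Python. The `setting`, `set` and `error` element names, the `store` dictionary standing in for HTTP requests, and the `resolve_xincludes` function are all hypothetical. Only the href and the fallback message come from the example above:

```python
import xml.etree.ElementTree as ET

XI = "{http://www.w3.org/2001/XInclude}"

def resolve_xincludes(element, load):
    """Replace each xi:include child with the loaded element, or with an
    <error> element holding the xi:fallback text when load(href) is None."""
    for index, child in enumerate(list(element)):
        if child.tag == XI + "include":
            loaded = load(child.get("href"))
            if loaded is None:
                fallback = child.find(XI + "fallback")
                loaded = ET.Element("error")
                loaded.text = fallback.text if fallback is not None else ""
            else:
                resolve_xincludes(loaded, load)
            loaded.tail = child.tail
            element[index] = loaded
        else:
            resolve_xincludes(child, load)
    return element

document = ET.fromstring(
    '<setting xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="/xml/set/7/">'
    '<xi:fallback>xinclude: /xml/set/7/ not found</xi:fallback>'
    '</xi:include>'
    '</setting>'
)
store = {"/xml/set/7/": "<set id='7'/>"}  # stand-in for requests to the server

def load(href):
    source = store.get(href)
    return ET.fromstring(source) if source is not None else None

resolved = resolve_xincludes(document, load)
print(ET.tostring(resolved, encoding="unicode"))
```

Every include costs one call to `load`, which in the real application is one more request to the server.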

xsltproc has a great flag: --xinclude which, you guessed it, performs
‘XInclude processing on document input’, i.e. before processing. This works
great, but all the requests slow down the processing and increase the load on
the server, which ends up having to serve multiple XML files rather than doing one
huge SQL SELECT * across the database which, although not ideal, is a fair
sight better. Unfortunately, XSLTProcessor has no such flag so before
processing, the XML will have to undergo XInclude processing. MDN has a snippet
to resolve XIncludes using JavaScript and, after porting to CoffeeScript (my
language of choice, I have my reasons) it… didn’t work. Well, I suppose
that’s what you get for trying a page with code from 2008?

The alternative is to attempt to do the XInclude on the server when requesting
the XML. I first tried PHP’s support for XInclude, but ran into difficulties
straightaway. PHP’s development server is single–threaded only, so cannot handle
another request while another is active. Put another way, the server is
attempting to do XIncludes as it goes, and each XInclude it comes across is a
URL, making another request to the same server, which is already tied up trying
to deal with the XIncludes in the first place! The obvious solution would be to
use a “real” server rather than just a toy development server, but there must be
a better way than spawning hundreds of threads just to generate one XML file?

The answer was to generate a subpath in my routes. Each XML endpoint now has a
subpath called /xinclude/ which does an effective XInclude, but through PHP
rather than the XInclude API. Each generator function that generates my XML now
has a boolean flag added to its signature indicating whether to generate an
XInclude for an element or to include it directly, and that flag is passed
recursively down through the functions. This is an order of magnitude faster as
we are making a single request to the API rather than hundreds. When doing XSLT,
I’ll just have to remember to request the /xinclude/ subpath rather than the
base endpoint.
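
The flag-passing idea can be sketched in a few lines. The real generators are written in PHP; this Python version is only an illustration, and the `SETTINGS` data, the element names and the `/xml/stave/…/` paths are hypothetical:

```python
import xml.etree.ElementTree as ET

XI_NS = "http://www.w3.org/2001/XInclude"
ET.register_namespace("xi", XI_NS)

# Hypothetical data: each setting lists the ids of the staves it contains.
SETTINGS = {7: [1, 2]}

def stave_xml(stave_id, inline=False):
    """Build a stave; a fuller version would pass `inline` on to its children."""
    return ET.Element("stave", {"id": str(stave_id)})

def setting_xml(setting_id, inline=False):
    """Build a setting; link to its staves, or embed them when inline is True."""
    setting = ET.Element("setting", {"id": str(setting_id)})
    for stave_id in SETTINGS[setting_id]:
        if inline:
            setting.append(stave_xml(stave_id, inline=True))
        else:
            ET.SubElement(setting, f"{{{XI_NS}}}include",
                          {"href": f"/xml/stave/{stave_id}/"})
    return setting

linked = setting_xml(7)                 # what the base endpoint would return
embedded = setting_xml(7, inline=True)  # what the /xinclude/ subpath would return
```

With the flag set, the whole tree is assembled in one request and no XInclude processing is needed afterwards.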

Returning once again to XSLTProcessor, after busily refactoring my lovely job
controller into a simple procedural function (yes I know I should have worked on
the main thread before putting it into a worker, lesson learned), I finally had
a simple class that loaded XSLTs using AJAX and fed them one–by–one into
XSLTProcessor. Time to test it out:

Uncaught Exception
    columnNumber: 0
    data: null
    filename: "debugger eval code"
    lineNumber: 1
    message: ""
    name: ""
    result: 2153775105
    stack: "@debugger eval code:1:3\n"
    <prototype>: ExceptionPrototype { toString: toString(), name: Getter, message: Getter, … }

What is that, Firefox? This error message appeared as soon as I tried to import
the stylesheet using my perfectly valid XSLT. I have seen some cryptic messages
in the past, but this error decides not to give any indication of the type of
error that has occurred, apart from that number 2153775105. A good hour of
searching yielded only that its hex equivalent (0x80600001) is an error flag in
Firefox called NS_ERROR_XSLT_PARSE_FAILURE. Ah, would have been nice to know
that originally maybe, but it is something at least. It doesn’t like my XSLT. At
least Firefox gave me an error. Chrome just silently returned undefined and
left it at that. What the XSLT parse failure was, where, and how it was
incorrect however, I was given no clue, and I really wasn’t inclined to find
out, given that my debugging would just be adding and removing random lines
until it started working. No thank you, I would rather go down a thousand other
routes before trying that.
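
The decimal-to-hex conversion that unlocked the search is a one-liner, shown here in Python for convenience:

```python
result = 2153775105

print(hex(result))           # 0x80600001
print(result == 0x80600001)  # True
```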

After a quick look to see whether some kind soul had ported xsltproc to
WebAssembly (they haven’t, and I’m not about to start), I considered my options
on the server again. PHP has XSLTProcessor too, and it is a binding around
libxslt, the library behind xsltproc. This will definitely work, but I really
don’t want to tie up my
server. I considered starting background jobs and running another job server to
send jobs to a queue and send updates on those jobs to the user via AJAX then
pushing the result the same way, but the thought of polling and more JavaScript
made me nauseous as my previous run–in with that nightmare a few years ago left
me reeling. A simple check indicated that yes, XSLTProcessor works with my
stylesheet in PHP, and I might as well use that /xinclude/ endpoint subpath I
made to solve my previous issues, as it is miles faster.

By doing my includes in PHP rather than XSLT, I appear to have inadvertently
solved my own problem a different way as my transforms are much faster with only
the transforming to worry about and not the XInclude-ing. My transforms are
still slow, but bearable to perform on the server. I could simply set up a
pipeline of processors in PHP to generate the correct output. The final step,
converting to SVG or PDF, will require shelling out to abcm2ps, but should
be trivial in comparison to what I have just gone through.
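
The chaining itself is simple to express. The sketch below only builds the xsltproc command lines for the two stylesheets named earlier, each reading the previous step's output. It does not run them, and it is not the PHP pipeline described above. The input and intermediate file names are hypothetical, and the final abcm2ps step is left out:

```python
def xslt_pipeline(source, stylesheets):
    """Return one xsltproc command per stylesheet, each reading the previous output."""
    commands = []
    current = source
    for step, stylesheet in enumerate(stylesheets, start=1):
        output = f"step{step}.out"
        commands.append(["xsltproc", "-o", output, stylesheet, current])
        current = output
    return commands

commands = xslt_pipeline("setting.xml", ["candrSettingToMei.xsl", "MeiToABC.xsl"])
for command in commands:
    print(" ".join(command))
```

Each command list could then be handed to `subprocess.run(command, check=True)` on a machine with xsltproc installed.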


Database to Edition (through XML hell)

Sun Jul 19 2020 16:14:00 GMT+0100 (British Summer Time)

What to do when you’ve got a database full of tracings of music and you want to
convert those tracings into transcribed music, automatically? It seems that with
XML, every problem is a nail and once you find yourself in XML hell, it’s
difficult to get yourself out!

I became aware of XML gradually through programming experience. I learnt to code
through For Dummies books in the early 2000s: C/C++, Java/JavaScript and so,
perhaps like many early teenage programmers, I conceived of programming as a
procedural endeavour: the program “does stuff” to data, and keeps track of its
state using internal variables. The only reason we separate code into functions
is to keep us from typing out the same code again and again, and to stop us from
getting lost along the way! Recursion was a necessary evil, the future was
object oriented, and when can I start making video games?

XML always seemed to rear its head in these textbooks near the end. Time and
again a chapter would introduce “the way” that the author processes XML using
their favourite library, and time and again I would take one look at the HTML-ish
confusion of angle brackets and think, “Well, what’s it for?” These chapters
would get XML in, transform it using the library du jour and nearly always get
XML out. So what? How did this XML come about in the first place? HTML I could
understand, it is the language of structuring webpages, but these XML documents
seemed to have no real purpose to them!

As such, I have managed to avoid XML ever since, learning bits and pieces about
it, nearly always negative, choosing instead to define my own file types and
configurations. This worked well in my Masters’ thesis, where I was trying to
input data as quickly as possible. I defined my own input format where each
musical event could be represented by a single character. Clearly the
verbosity of XML would not have worked here!

XML as such remained a mystery to me until it became a necessary evil as part of
a short course I took on encoding music using the file format of the Music
Encoding Initiative (MEI) in 2019. According to its website, MEI is a
‘community–driven effort to define a system for encoding musical documents in a
machine–readable structure’. To understand how and why XML became the format of
choice for MEI, we must first understand how XML was (and by some still is) seen
as a panacea for structured data, and how when teenage me started reading blog
posts about XML over a decade ago, those writers were already sick and tired of
over a decade of XML hell.

The life of XML (or to give it its full name: Extensible Markup Language) really
begins with SGML (Standard Generalized Markup Language), a standard format for
defining markup languages. It is from here that we get XML’s characteristic angle
brackets. SGML became useful in the 80s and early 90s for defining structured
data in a human–readable format. One of those uses was TEI (Text Encoding
Initiative), and another was HTML. HTML’s fraught history is well known, but TEI
have been using SGML to create text markup since before I was born.

Although MEI is framed as a new and exciting digital humanities technology, it
is over twenty years old (beginning 1999), and has always been considered as a
remix of TEI for music, using XML rather than SGML (TEI has since changed to
XML, in 2002). Although MEI describes itself as ‘21st century’, its technologies
(XML, schema, DTD, XSLT) come straight from the 90s and those chapters I skipped
over in my textbooks, and nowhere is this more apparent than its reliance on XML
as a medium for markup.

Although there are the diehards that like to write XML by hand (my question is:
why?) it has since become apparent to me that the vast majority of XML is
generated by computer. This is all tied up in the idea of the “Semantic Web”,
another utopian vision where we would use XML (or more precisely, RDF) to
describe and categorise linked data. The semantic web, as a useful tool for the
masses is, outside of hackers who never gave up the dream, dead. First
described by Tim Berners-Lee in 2001, the semantic web quickly formed a dream
where everyone would have their own little website, and on their website they
would provide a short piece of (handwritten?) XML detailing who they were, who
they knew, what they do, what they like to do, etc.

Put like this, it’s no wonder it didn’t become popular. Turns out, people are
fuzzy and the medium in which they express themselves is sometimes just as, if
not more, important than the information that they are expressing. Search
engines and social networks, although they may use semantic technologies on
their backend, are extracting their information from text, images and video.

Ten years ago, I had a Facebook account, and Facebook was interested in getting
me to list my interests, my relationship status, my school, my work: semantic,
structured data. It seems now that social media networks don’t need you to input
that data: they already have inferred it from your data activity. Relying on
humans to accurately and precisely input their information as structured data
was always going to fail, even with the best intentions. It is now usually much
easier to infer data automatically from the original items than to attempt to
extract it yourself or, even worse, to ask other people to extract it for you!

However, like it or not, it seems XML is here to stay, but hopefully as
something machine–generated rather than handwritten. The aptly named WikiWikiWeb
page “XML Sucks” has an interesting entry:

Because XML is increasingly used for data interchange, it is too easily
promoted as a data storage model. XML is only a data encoding format.

This is a particularly useful distinction. XML should not be seen as the end
format, or storage for an item, but more of a way of transferring and processing
that item. Creating an MEI file doesn’t do anything, you’re left with a
thousand–line file that’s probably not going to come out right in Verovio
anyway. In this way, MusicXML has the right idea in XML use: as a way of sharing
data. No–one writes MusicXML by hand; it is generated and parsed by computer
programs. This is not to say that MEI doesn’t have its place, but rather that
MEI needs more programs that speak it, else it is just a really bad interface
for Verovio.

XML’s power lies in transmission and transformation. Granted, its syntax is not
as simple as JSON, YAML or S-expressions, but it is ubiquitous. XML has a
standard for transformations, XSLT, and it is (reasonably) well supported.
Lisp is not found on many machines, nor really standardised for this purpose,
and transformations of JSON usually have to pass through JavaScript (I am
however, eyeing jq to replace XML hell as an answer to all my woes).

For now at least, it is useful to generate XML so that it can be transformed
using XSLT into something else. I picked up a book on XSLT in a charity shop a
few months ago, and have discovered that XSLT is simultaneously invaluable and
frustrating.

XSLT is most useful as a set of transforms to incrementally move from some
useless XML to useful XML or text. My situation is that my project generates a
large database of tracings over manuscript elements (e.g. this clef is at point
(x, y) with dimensions (w, h) and is rotated by this much, this note is at point
(x, y) and has these other properties, etc.) I wish to transcribe those tracings
into staff notation, and end up with MEI at some point. XSLT seemed the way to
go. Using a built-in XML writer, my database now contains endpoints for querying
restricted selections and returning XML. To save from walking the entire
database, much of the XML is not generated at that point, but is set as an
XInclude, i.e. a link to another endpoint that will generate the XML for that
section. My first simple XSLT was one to extract text in order from this XML:

This twenty–line stylesheet (not the finest example of XSLT) finds all
or elements and outputs a simple text string
of the setting’s text.
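
The same extraction can be expressed outside XSLT. The Python sketch below is not the stylesheet itself, and the `syllable` element name and the sample document are guesses for illustration only:

```python
import xml.etree.ElementTree as ET

def setting_text(xml_source, tags=("syllable",)):
    """Join the text of every matching element, in document order."""
    root = ET.fromstring(xml_source)
    words = [el.text.strip() for el in root.iter() if el.tag in tags and el.text]
    return " ".join(words)

sample = ("<setting><stave><syllable>Al</syllable><note/>"
          "<syllable>le</syllable></stave></setting>")

print(setting_text(sample))  # Al le
```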

I was forced to use XSLT 1.0 as opposed to the more modern 2.0 or 3.0, as
neither 2.0 nor 3.0 appeared to be well–supported in anything other than some
closed–source CLI–only processors. Perhaps unwisely, I jumped straight from this
small example to a huge stylesheet that transforms and transcribes my tracings
into a subset of MEI. XSLT 1.0, being a technology over twenty years old at
this point, has little support for more complicated transforms, so much of this
stylesheet became my own implementations of functions that are now included in
most languages, such as matrix transformations. XSLT is far better at
transforming XML structure than value. If you want to move an element outside
another element, XSLT is perfect, but if you want to calculate the cofactors of
an 8×8 matrix, you’re on your own.
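
For comparison, this is roughly what the cofactor calculation looks like in a general-purpose language. It is a sketch using Laplace expansion, which is simple but slow for larger matrices, and it is not a translation of the stylesheet:

```python
def minor(matrix, row, col):
    """Drop one row and one column from a square matrix (a list of rows)."""
    return [r[:col] + r[col + 1:] for i, r in enumerate(matrix) if i != row]

def determinant(matrix):
    """Laplace expansion along the first row; fine for small matrices."""
    if len(matrix) == 1:
        return matrix[0][0]
    return sum(matrix[0][col] * cofactor(matrix, 0, col)
               for col in range(len(matrix)))

def cofactor(matrix, row, col):
    """Signed determinant of the minor at (row, col)."""
    return (-1) ** (row + col) * determinant(minor(matrix, row, col))

m = [[2, 0, 1],
     [1, 3, 2],
     [1, 1, 4]]

print(determinant(m))     # 18
print(cofactor(m, 0, 0))  # 10
print(cofactor(m, 1, 2))  # -2
```

In XSLT 1.0 each of these functions becomes a recursive named template that passes node sets around, which is where much of the stylesheet's bulk comes from.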

In a way, the XSLT for transforming the tracing XML to MEI was a better idea
than I initially conceived. Rather than polluting my database with more classes
and functions, I have managed to keep my backend functionality clear as a
utility for tracing manuscript to database, and a way of viewing and exporting
that data as XML. The XSLTs can be used as a way of further transforming that
output data.

Using that MEI XML, I then wrote another (smaller!) XSLT that transforms that
strict subset of MEI into ABC, a text markup language for generating musical
notation which is strikingly fast and good at representing transcriptions of
early music. Passing that ABC text output into abcm2ps, an ABC typesetter,
I can end up with a visual, automatic transcription of my tracing as well as an
MEI output using the same process.

This is still a WIP, but hopefully should be a feature in–browser soon, for all
browsers that support XSLTProcessor (that is, not IE).

© Copyright 2022, All Rights Reserved