This paper deals with the effect of exploiting background knowledge for improving an OMR (Optical Music Recognition) deep learning pipeline for transcribing medieval, monophonic, handwritten music from the 12th–14th century, whose usage has been neglected in the literature. Various types of background knowledge about overlapping notes and text, clefs, graphical connections (neumes) and their implications on the position in staff of the notes were used and evaluated. Moreover, the effect of different encoder/decoder architectures and of different datasets for training a mixed model and for document-specific fine-tuning based on an extended OMR pipeline with an additional post-processing step was evaluated. The use of background models improves all metrics and in particular the melody accuracy rate (mAR), which is based on the insert, delete and replace operations necessary to convert the generated melody into the correct melody. When using a mixed model and evaluating on a different dataset, our best model achieves, without fine-tuning and without post-processing, a mAR of 90.4%, which is raised to 93.2% mAR using background knowledge, reducing the remaining error by nearly 30%. With additional fine-tuning, the contribution of post-processing is even greater: the basic mAR of 90.5% is raised to 95.8% mAR, an error reduction of more than 50%.

The extraction of the symbols is addressed as a segmentation task. For this, a Fully Convolutional Network (FCN), which predicts pixel-based locations of music symbols, is used. Further post-processing steps extract the actual positions and symbol types. Furthermore, the pitch of the symbols can be derived from the position of the symbol on the staff relative to the position of the clef on the staff and other pitch alterations. The fundamental architecture used for this task resembles the U-Net structure. The U-Net structure is symmetric and consists of two major parts: an encoder part consisting of general convolutions and a decoder part consisting of transposed convolutions (upsampling). For the encoder part, several architectures, which are often used for image classification tasks, are evaluated. The decoder part was expanded so that it fits the architecture of the encoder part. Furthermore, a post-processing pipeline is introduced that aims to improve the symbol recognition using background knowledge.

Alternatively, the challenge can be approached as a sequence-to-sequence task. One prior work used a Convolutional Sequence-to-Sequence model to segment musical symbols. A common problem for segmentation tasks is the lack of ground truth. Here, the model is trained with the Connectionist Temporal Classification (CTC) loss function, which has the advantage that it is not necessary to provide information about the location of the symbols in the image; pairs of input scores and their corresponding transcripts are enough. The model was trained on user-generated scores from the MuseScore Sheet Music Archive to predict a sequence of pitches and durations. In total, the dataset consisted of about 17,000 MusicXML scores, of which 60% were used for training. Using data augmentation, a pitch and duration accuracy of 81% and 94% was achieved, respectively. The note accuracy, where both pitch and duration are correctly predicted, was 80%.

Figure 3 illustrates the general workflow. As depicted, the inputs to the general workflow are the raw documents. Since the scale of the systems can be different for each page, the staff line distance d_SL, which is the average distance from one staff line to the next staff line, is calculated as a preprocessing step in order to normalize the raw input pages to an equal scale. Additionally, the inputs are converted to grayscale color space.
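The scale-normalization step around d_SL can be sketched as follows, assuming the staff lines have already been detected; the function names and the target spacing of 10 px are illustrative, not values from the paper:

```python
import numpy as np

def staff_line_distance(line_positions):
    """Average vertical distance between adjacent detected staff lines (d_SL)."""
    ys = sorted(line_positions)
    return float(np.mean(np.diff(ys)))

def normalization_factor(d_sl, target_d_sl=10.0):
    """Scale factor that maps the page to a uniform staff line distance."""
    return target_d_sl / d_sl

def to_grayscale(rgb):
    """Luminance-weighted grayscale conversion of an H x W x 3 image."""
    return rgb @ np.array([0.299, 0.587, 0.114])

# Example: staff lines detected at these y-coordinates (in pixels)
lines = [100, 121, 139, 161]
d_sl = staff_line_distance(lines)   # (21 + 18 + 22) / 3, roughly 20.3 px
scale = normalization_factor(d_sl)  # roughly 0.49: the page would be downscaled
```

Resizing every page by its own factor brings all documents to the same effective staff scale before they enter the network.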
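The derivation of pitch from a symbol's staff position relative to the clef can be illustrated with a small diatonic lookup. This is a simplified sketch under assumed conventions (positions counted in staff steps, a clef marking a known reference pitch); real neume notation involves further cases such as accidentals:

```python
STEP_NAMES = ["C", "D", "E", "F", "G", "A", "B"]

def pitch_from_position(offset_from_clef, clef_step=4, clef_octave=4):
    """Derive a diatonic pitch from a note's position on the staff,
    counted in staff steps (line/space) relative to the clef.

    clef_step/clef_octave give the reference pitch the clef marks;
    the defaults stand for a clef marking G4 -- purely illustrative.
    """
    absolute = clef_octave * 7 + clef_step + offset_from_clef
    octave, step = divmod(absolute, 7)
    return f"{STEP_NAMES[step]}{octave}"

pitch_from_position(0)   # on the clef's own line: "G4"
pitch_from_position(3)   # three steps above: "C5"
pitch_from_position(-2)  # two steps below: "E4"
```

The point is that once the clef's line and the symbol's vertical position are known, pitch follows by pure diatonic step counting.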
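The melody accuracy rate (mAR) reported above is based on insert, delete and replace operations; a plain Levenshtein-based sketch is shown below (the exact normalization used in the paper may differ):

```python
def edit_distance(pred, ref):
    """Minimum number of insert/delete/replace operations turning pred into ref."""
    m, n = len(pred), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace or match
    return dp[m][n]

def melody_accuracy_rate(pred, ref):
    """1 minus the normalized edit distance; 1.0 means a perfect transcription."""
    return 1.0 - edit_distance(pred, ref) / max(len(ref), 1)

# e.g. one wrong note out of five costs a single replace operation:
melody_accuracy_rate("CDEFG", "CDEFA")  # 0.8
```

Under this reading, raising mAR from 90.4% to 93.2% shrinks the residual error from 9.6 to 6.8 percentage points, which is the "nearly 30%" relative reduction cited above.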