A systematic comparison of different approaches of unsupervised extraction of text from scholary figures [extended report]
Different approaches have been proposed in the past to address the challenge of extracting text from scholarly figures. However, so far a comparative evaluation of the different approaches has not been conducted. Based on an extensive study, we compare the 7 most relevant approaches described in the literature as well as 25 systematic combinations of methods for extracting text from scholarly figures. To this end, we define a generic pipeline, consisting of six individual steps. We map the existing approaches to this pipeline and re-implement their
methods for each pipeline step. The method-wise re-implementation allows to freely combine the different possible methods for each pipeline step. Overall, we have evaluated 32 different pipeline configurations and systematically compared the different methods and approaches. We evaluate the pipeline configurations over four datasets of scholarly figures of different origin and characteristics. The quality of the extraction results is assessed using F-measure and Levenshtein distance. In addition, we measure the runtime performance. The experimental results show that there is an approach that overall shows the best text extraction quality on all datasets. Regarding runtime, we observe huge differences from very fast approaches to those running for several weeks.