Date: September 7th to 8th, 2017
Location: Department of Language Science and Technology, Saarland University
During the last decade, the availability of historical data has increased dramatically, giving rise to a large body of research on historical corpora and archives. One of the main challenges with this type of data is processing: the data is erroneous (e.g., OCR errors), heterogeneous (including text, figures, tables, etc.), sometimes multilingual within one document (e.g., Latin or French in historical scientific texts), difficult to process with standard NLP tools, and difficult to structure both document-internally (e.g., titles, pages, paragraphs) and corpus/archive-wide (e.g., categorization into text types/genres/registers).
While there has been considerable work on processing historical data, collecting metadata and making it available is now moving into focus. Especially for (socio)linguistic analysis taking a variationist approach, considering metadata (e.g., text author, production time) is essential to generate valuable results.
In this workshop, we aim to bring together specialists in building historical corpora and archives as well as researchers who conduct empirical research on historical data and are interested in accounting for different variables, notably social variables. Each corpus/archive has its peculiarities. What are these, and how can we make the best use of them? Collections and corpora might be based on the same data sets, but interests in the kinds of metadata valuable for analysis vary.
The workshop provides a forum to discuss and improve our understanding of how to build and use historical data as efficiently as possible, accounting for a wider range of variables based on the availability of metadata.