SQAD - Simple Question Answering database

Question answering (QA) systems employ tools that process the input question, then search a knowledge base and provide a reasonable answer to the question. The presented SQAD database helps to measure and improve the accuracy of QA tools, as it offers all the relevant parts of the processing chain, i.e. the source text, the question and the expected answer.

The SQAD database consists of 8,566 records obtained from Czech Wikipedia articles. The record structure is as follows:

Example of a SQAD record:

In the first phase, the SQAD database consisted of plain texts. To support the comparison and development of question answering systems, including SBQA [1], SQAD was supplemented with automatic morphological annotations. The texts were processed with two tools: Unitok [2] for text tokenization and the Desamb [3] morphological tagger, which provides unambiguous morphological annotation of the tokenized texts (see the next example). Both tools are automatic systems whose accuracy is below 100%, so they occasionally make mistakes. To obtain high-quality data, the tagged texts were checked and corrected by semi-automatic and manual adjustments.

Example of SQAD text annotation:
<s>
Kdo	kdo	k3yRnSc1
je	být	k5eAaImIp3nS
autorem	autor	k1gMnSc7
novely	novela	k1gFnSc2
Létající	létající	k2eAgMnSc1d1
jaguár	jaguár	k1gMnSc1
<g>
?	?	kIx.
</s>
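
The annotation above can be read programmatically. The following is a minimal sketch, not part of the SQAD distribution, and it rests on several assumptions: that each token line holds three tab-separated columns (word form, lemma, morphological tag), that <s> and </s> delimit a sentence, that <g> is a "glue" marker meaning no space precedes the next token, and that a tag is a sequence of two-character attribute/value pairs. The names Token, parse_vertical, decode_tag and detokenize are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str
    glued: bool  # True when a <g> marker precedes this token


# The example sentence from above, with columns assumed to be tab-separated.
SAMPLE = (
    "<s>\n"
    "Kdo\tkdo\tk3yRnSc1\n"
    "je\tbýt\tk5eAaImIp3nS\n"
    "autorem\tautor\tk1gMnSc7\n"
    "novely\tnovela\tk1gFnSc2\n"
    "Létající\tlétající\tk2eAgMnSc1d1\n"
    "jaguár\tjaguár\tk1gMnSc1\n"
    "<g>\n"
    "?\t?\tkIx.\n"
    "</s>\n"
)


def parse_vertical(text: str) -> List[List[Token]]:
    """Split vertical text into sentences, each a list of tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    glue = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "<s>":            # sentence start
            current, glue = [], False
        elif line == "</s>":         # sentence end
            sentences.append(current)
            current = []
        elif line in ("<g>", "<g/>"):  # assumed glue marker
            glue = True
        else:
            word, lemma, tag = line.split("\t")
            current.append(Token(word, lemma, tag, glue))
            glue = False
    return sentences


def decode_tag(tag: str) -> Dict[str, str]:
    """Read a tag as attribute/value character pairs (an assumption)."""
    return {tag[i]: tag[i + 1] for i in range(0, len(tag) - 1, 2)}


def detokenize(sentence: List[Token]) -> str:
    """Rebuild the plain text, omitting the space before glued tokens."""
    out = ""
    for tok in sentence:
        if out and not tok.glued:
            out += " "
        out += tok.word
    return out


if __name__ == "__main__":
    sentence = parse_vertical(SAMPLE)[0]
    print(detokenize(sentence))
    print(decode_tag(sentence[2].tag))
```

Under these assumptions, detokenize restores the original question "Kdo je autorem novely Létající jaguár?", and decode_tag("k1gMnSc7") yields the pairs k=1, g=M, n=S, c=7. If the distributed files use a different column separator or glue marker, only parse_vertical needs to change.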