Language modelling
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Vít Baisa
State of the Art
The goal of a language model is (a) to predict the following word or phrase from a given history and (b) to assign a probability (a score) to any possible input sentence. In the past this was achieved mainly by n-gram models, known since the WWII era. Recently, however, deep learning has also reached language modelling, and it has turned out to be substantially better than Markov n-gram models.
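To make the n-gram idea concrete, here is a minimal sketch of a character trigram model with add-alpha smoothing; the toy corpus, the smoothing constant, and the vocabulary size are illustrative assumptions, not part of the course materials:

from collections import Counter, defaultdict

def train(text, n=3):
    # count how often each (n-1)-character history is followed by a character
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        counts[text[i:i + n - 1]][text[i + n - 1]] += 1
    return counts

def prob(counts, history, char, alpha=1.0, vocab=256):
    # add-alpha smoothing gives unseen characters a small non-zero probability
    c = counts.get(history, Counter())
    return (c[char] + alpha) / (sum(c.values()) + alpha * vocab)

def score(counts, sentence, n=3):
    # the sentence score is the product of the conditional probabilities
    p = 1.0
    for i in range(n - 1, len(sentence)):
        p *= prob(counts, sentence[i - n + 1:i], sentence[i])
    return p

model = train("we will build a simple character-based language model")
print(score(model, "a simple model"))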
References
- Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
- Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
- Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).
Practical Session
Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun. (Roughly: "God asked that the eventual construction for the classic use of technology, worth hundreds of millions of crowns, not take place." — a nonsensical sample sentence generated by the model.)
We will build a simple character-based language model and use it to generate natural-looking sentences. We need a plain text corpus and a fast suffix sorting tool (mksary).
An online demo is available at http://corpora.fi.muni.cz/cblm/generate.cgi
Getting necessary data and tools
- get the archive (or download from a stable copy):
wget nlp.fi.muni.cz/~xbaisa/cblm.tar
- unpack it in your directory:
tar xf cblm.tar
cd cblm
- get a large Czech model (3 GB): (stable copy)
scp aurora:/corpora/data/cblm/data/cztenten.trie .
- get a large plain text: (stable copy)
scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .
- if you want to work with English, download ententen_1M_sentences.txt.xz
mksary
git clone https://github.com/lh3/libdivsufsort.git
cd libdivsufsort
cmake -DCMAKE_BUILD_TYPE="Release" \
    -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"
make
cd ..
ln -s libdivsufsort/examples/mksary mksary
Training data
To build a new model, we need
- a plain text file (suffix .in) all in lowercase:
xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in
- to create a suffix array:
./mksary INPUT.in OUTPUT.sa
- and compute the prefix tree:
python build_trie.py FILE_PREFIX [MINFREQ]
The model will be stored in the file FILE_PREFIX.trie. The sketch below illustrates what the suffix array and the trie counts represent.
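The internals of build_trie.py are not reproduced here, but the underlying idea can be sketched: a suffix array lists the starting positions of all suffixes of the text in lexicographic order, so all occurrences of a context form one contiguous block in it, and binary search yields the counts of its continuations — the information the trie stores. The sorted() call below is only illustrative; mksary computes the same array far more efficiently.

import bisect

text = "abracadabra"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array

suffixes = [text[i:] for i in sa]  # materialized only for clarity

def count(prefix):
    # all suffixes starting with `prefix` are adjacent in the suffix array
    lo = bisect.bisect_left(suffixes, prefix)
    hi = bisect.bisect_right(suffixes, prefix + "\uffff")
    return hi - lo

# how often the context "ab" continues with "r":
print(count("abr"), "out of", count("ab"))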
Generating text
To generate a random text, just run
python alarm.py FILE_PREFIX.trie
You may try to generate a random sentence using the large 3 GB model:
python alarm.py cztenten.trie
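The exact sampling strategy of alarm.py is not shown here; a common scheme for character models, sketched below under that assumption, is to back off to the longest context the model knows and sample the next character in proportion to its counts (model maps contexts to Counters, as in the trigram sketch above):

import random

def generate(model, length=80, max_ctx=8, seed="the "):
    out = list(seed)
    while len(out) < length:
        for k in range(max_ctx, 0, -1):
            dist = model.get("".join(out[-k:]))  # longest known context wins
            if dist:
                chars, weights = zip(*dist.items())
                out.append(random.choices(chars, weights=weights)[0])
                break
        else:
            out.append(" ")  # no context matched: fall back to a space
    return "".join(out)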
Task
Change the training data and the training process to generate the most natural-looking sentences, either by
- pre-processing the input plain text (see the sketch after this list),
- setting the training parameters, or
- changing the training and generating scripts (advanced).
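As one example of the first option, a hypothetical filter script (call it filter.py; it is not part of the provided tools) could keep only medium-length sentences made of common characters, which tends to remove noisy lines from the training text:

import sys

ALLOWED = set("abcdefghijklmnopqrstuvwxyzáéíóúůýěščřžďťň .,")

for line in sys.stdin:
    s = line.strip().lower()
    if 20 <= len(s) <= 200 and all(c in ALLOWED for c in s):
        print(s)

It would slot into the pipeline in place of lower.py:

xzcat cztenten_1M_sentences.txt.xz | python filter.py > input.in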
Upload 10,000 random sentences to your vault together with the amended scripts and your models. Describe your changes and tuning in a README file, where you can also include some hilarious examples of random sentences.