Language modelling
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Vít Baisa
State of the Art
The goal of a language model is a) to predict a following word or phrase based on a given history and b) to assign a probability (a score) to any possible input sentence. In the past this was achieved mainly by n-gram models, known since World War II. Recently, however, deep learning has entered language modelling as well, and it has turned out to be substantially better than Markov n-gram models.
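To make goal a) concrete, here is a minimal sketch of a classical n-gram (here bigram) model: it estimates the probability of a next word by maximum likelihood, P(w2 | w1) = c(w1, w2) / c(w1). The toy corpus and function names are illustrative only.

```python
from collections import Counter

def train_bigram(tokens):
    """Count unigram and bigram frequencies from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = c(w1, w2) / c(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

tokens = "the cat sat on the mat".split()
uni, bi = train_bigram(tokens)
print(bigram_prob(uni, bi, "the", "cat"))  # 0.5: "the" occurs twice, once followed by "cat"
```

A real model would add smoothing for unseen bigrams; this sketch divides raw counts only.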
References
- Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
- Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
- Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).
Practical Session
An example of a generated sentence: "Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun." ("God requested that a possible construction for the classical use of technology, worth hundreds of millions of crowns, not take place.")
We will build a simple character-based language model and use it to generate natural-looking sentences. We need a plain text corpus and a fast suffix-sorting tool (mksary).
http://corpora.fi.muni.cz/cblm/generate.cgi
Getting necessary data and tools
In your directory:
wget nlp.fi.muni.cz/~xbaisa/cblm.tar
tar xf cblm.tar
cd cblm
- get a large Czech model (3 GB):
scp anxur:/tmp/cztenten.trie .
(or download from a stable copy)
mksary
git clone https://github.com/lh3/libdivsufsort.git
cd libdivsufsort
cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT" .
make
cd ..
ln -s libdivsufsort/examples/mksary mksary
Training data
To build a new model, we need
- a plain text, see the data directory; use lower.py
- to create a suffix array
./mksary INPUT.txt OUTPUT.sa
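What mksary computes can be shown with a naive Python equivalent: a suffix array is the list of starting positions of all suffixes of the text, sorted lexicographically. The function below is quadratic and only for illustration; libdivsufsort does the same job fast enough for gigabyte corpora.

```python
def suffix_array(text):
    """Naive suffix array: indices of all suffixes of `text`
    in lexicographic order. mksary produces the same array,
    only far faster (via libdivsufsort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

Sorted suffixes group every repeated substring into a contiguous block, which is what makes n-gram counting over the array efficient.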
- and compute the prefix tree:
python build_trie.py FILE.sa [MINFREQ] [OUTPUTFILE]
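The internals of build_trie.py are not shown here, but the kind of model it stores can be sketched as follows: count all character n-grams up to some length and keep only those reaching MINFREQ (a flat dict stands in for the prefix tree; the parameter names are assumptions).

```python
from collections import Counter

def build_ngram_model(text, max_len=8, min_freq=2):
    """Count all character n-grams of `text` up to max_len and keep
    those with frequency >= min_freq. A pruned model like this is what
    a .trie file conceptually contains."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}

model = build_ngram_model("abracadabra", max_len=4, min_freq=2)
print(model["abra"])  # 2
```

Raising MINFREQ shrinks the model and removes one-off contexts, at the cost of less specific predictions.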
The model is stored in a .trie file.
Generating text
To generate a random text, just run
python alarm.py FILE.trie
Task
Change the training process and the generating process to generate the most natural-looking sentences, either by
- pre-processing the input plain text,
- setting the training parameters,
- changing the generating process,
- or all of the above.
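As a starting point for the pre-processing option, here is a hypothetical cleanup pass. The bundled lower.py presumably lowercases the text; the extra normalization steps and the character set below are assumptions you should adapt to your corpus.

```python
import re

def preprocess(text):
    """One possible cleanup before training: lowercase, replace
    characters outside a whitelist (Czech letters, digits, basic
    punctuation) with spaces, and collapse runs of whitespace, so the
    model does not waste contexts on noise."""
    text = text.lower()
    text = re.sub(r"[^a-záčďéěíňóřšťúůýž0-9 .,!?-]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(preprocess("Ahoj,  SVĚTE!"))  # ahoj, světe!
```

Every character class you remove here is one the model can no longer emit, so aggressive filtering directly shapes the generated sentences.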
Upload 10,000 random sentences to your vault. Describe your changes and tunings in a README, where you can also include some hilarious examples of random sentences.