
Language modelling

IA161 Advanced NLP Course, Course Guarantee: Aleš Horák

Prepared by: Vít Baisa

State of the Art

The goal of a language model is (a) to predict a following word or phrase based on a given history and (b) to assign a probability (a score) to any possible input sentence. In the past this was achieved mainly by n-gram models, known since World War II. Recently, however, the buzzword deep learning has penetrated language modelling as well, and it has turned out to be substantially better than Markov n-gram models.
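
To make both goals concrete, here is a minimal sketch of the simplest Markov model, a word bigram model; the corpus file name and the function names are illustrative assumptions, not course code:

    from collections import Counter, defaultdict

    counts = defaultdict(Counter)
    with open('corpus.txt') as f:              # hypothetical corpus file
        for line in f:
            words = ['<s>'] + line.split() + ['</s>']
            for prev, cur in zip(words, words[1:]):
                counts[prev][cur] += 1

    def predict(prev):
        # goal (a): the most likely word following prev
        return counts[prev].most_common(1)[0][0] if counts[prev] else None

    def prob(prev, cur):
        # maximum-likelihood estimate of P(cur | prev); real models add smoothing
        total = sum(counts[prev].values())
        return counts[prev][cur] / total if total else 0.0

    def score(sentence):
        # goal (b): sentence probability as a product of bigram probabilities
        words = ['<s>'] + sentence.split() + ['</s>']
        p = 1.0
        for prev, cur in zip(words, words[1:]):
            p *= prob(prev, cur)
        return p

The neural models in the references below replace these sparse count tables with learned continuous word representations.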

References

  1. Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.
  2. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
  3. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
  4. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).

Practical Session

An example of a generated sentence (in Czech): "Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun." ("God requested that no eventual construction work take place for the classical use of technology worth hundreds of millions of crowns.")

We will build a simple character-based language model and use it to generate natural-looking sentences. We need a plain text corpus and a fast suffix sorting tool (mksary).

http://corpora.fi.muni.cz/cblm/generate.cgi
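
A suffix array is simply the list of starting positions of all suffixes of a text, sorted lexicographically. A toy sketch of the idea (mksary computes the same structure, just much faster and for gigabytes of text):

    text = 'banana'
    # sort suffix start positions by the suffix they begin
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    print(sa)                  # [5, 3, 1, 0, 4, 2]
    for i in sa:
        print(i, text[i:])     # a, ana, anana, banana, na, nana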

Getting necessary data and tools

  • wget nlp.fi.muni.cz/~xbaisa/cblm.tar
  • tar xf cblm.tar in your directory
  • cd cblm
  • get a large Czech model (3 GB):
    scp anxur:/tmp/cztenten.trie .
    (or download from a stable copy)

mksary

  • git clone https://github.com/lh3/libdivsufsort.git
  • cd libdivsufsort
  • cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT" .
  • make
  • cd ..
  • ln -s libdivsufsort/examples/mksary mksary

Training data

To build a new model, we need to

  • prepare a plain text file (see the data directory) and lower-case it with lower.py,
  • create a suffix array: ./mksary INPUT.txt OUTPUT.sa
  • and compute the prefix tree: python build_trie.py FILE.sa [MINFREQ] [OUTPUTFILE]

The model is stored in the .trie file.
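
The suffix array is what makes this counting fast: all suffixes sharing the same prefix are adjacent in it, so every frequent character n-gram and its frequency can be collected in a single scan. A sketch of this idea (an assumption about what build_trie.py does conceptually, not its actual code):

    def count_prefixes(text, sa, depth, minfreq):
        # Suffixes sharing a prefix of length `depth` form a contiguous
        # run in the suffix array; one pass finds each run and its length.
        counts = {}
        i = 0
        while i < len(sa):
            prefix = text[sa[i]:sa[i] + depth]
            j = i
            while j < len(sa) and text[sa[j]:sa[j] + depth] == prefix:
                j += 1
            if j - i >= minfreq and len(prefix) == depth:
                counts[prefix] = j - i
            i = j
        return counts

    text = open('input.txt').read()
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # toy stand-in for mksary's output
    freq = count_prefixes(text, sa, depth=4, minfreq=3)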

Generating text

To generate random text, run python alarm.py FILE.trie
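
The generator can then walk the stored frequencies: find the longest recorded history matching the end of the text so far and sample the next character proportionally to the counts. A hedged sketch of this idea (alarm.py may work differently; the counts interface, an n-gram-to-frequency dict, is assumed from the previous sketch):

    import random

    def generate(counts, maxn=8, length=200):
        text = ''
        while len(text) < length:
            # back off to shorter histories until some continuation is known
            for h in range(min(maxn - 1, len(text)), -1, -1):
                history = text[len(text) - h:]
                cands = {g[-1]: f for g, f in counts.items()
                         if len(g) == h + 1 and g.startswith(history)}
                if cands:
                    chars, freqs = zip(*cands.items())
                    text += random.choices(chars, weights=freqs)[0]
                    break
            else:
                break   # no continuation known at all: stop generating
        return text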

Task

Change the training and the generating process so that the generated sentences look as natural as possible, either by

  • pre-processing the input plain text (a small example follows this list),
  • tuning the training parameters,
  • changing the generating process,
  • or all of the above.
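
As one possible pre-processing step (the script name filter.py and the thresholds are assumptions, not a required solution), keep only lines that look like real sentences:

    # filter.py -- usage: python filter.py < input.txt > clean.txt
    import sys

    for line in sys.stdin:
        line = line.strip().lower()
        if not line:
            continue
        # keep long-enough lines made mostly of letters and spaces
        # (drops tables, headings and other markup noise)
        letters = sum(c.isalpha() or c.isspace() for c in line)
        if len(line) > 30 and letters / len(line) > 0.9:
            print(line)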

Upload 10,000 random sentences to your vault. Describe your changes and tunings in a README file, where you can also include some hilarious examples of the random sentences.