| 1 | = Language modelling = |
| 2 | |
| 3 | [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák |
| 4 | |
| 5 | Prepared by: Vít Baisa |
| 6 | |
| 7 | == State of the Art == |
| 8 | |
The goal of a language model is a) to predict the following word or phrase from a given history and b) to assign a probability (= score) to any possible input sentence. In the past this was achieved mainly by n-gram models, in use since the WWII era. Recently, however, deep learning has made its way into language modelling as well, and it has turned out to be substantially better than Markov n-gram models.
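
Both goals can be illustrated with a toy word-bigram model, sketched below in plain Python (a minimal illustration with add-one smoothing and a made-up corpus, not one of the course tools):

{{{
from collections import Counter

# Toy corpus; a real model is estimated from large texts.
words = "the cat sat on the mat . the dog sat on the mat .".split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
V = len(unigrams)  # vocabulary size, used for add-one smoothing

def prob(word, history):
    """P(word | history) with add-one (Laplace) smoothing."""
    return (bigrams[(history, word)] + 1) / (unigrams[history] + V)

def predict(history):
    """Goal a): the most likely next word after `history`."""
    return max(unigrams, key=lambda w: prob(w, history))

def score(sentence):
    """Goal b): the probability (score) of a whole sentence."""
    ws = sentence.split()
    p = 1.0
    for h, w in zip(ws, ws[1:]):
        p *= prob(w, h)
    return p

print(predict("the"))                     # 'mat'
print(score("the dog sat on the mat ."))
}}}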
| 10 | |
| 11 | === References === |
| 12 | |
| 13 | 1. Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155. |
| 14 | 1. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013). |
| 15 | 1. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013. |
| 16 | 1. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013). |
| 17 | |
| 18 | == Practical Session == |
| 19 | |
''Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun.'' [[BR]]
(An example of a generated Czech sentence; roughly: "God requested that the possible construction work for the classical use of technology, worth hundreds of millions of crowns, not take place.")
| 21 | |
We will build a simple character-based language model and use it to generate natural-looking sentences. We need a plain text corpus and a fast suffix sorting tool (mksary).
| 23 | |
| 24 | http://corpora.fi.muni.cz/cblm/generate.cgi |
| 25 | |
| 26 | == Getting necessary data and tools == |
| 27 | |
| 28 | * {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}} |
| 29 | * {{{tar xf cblm.tar}}} in your directory |
| 30 | * {{{cd cblm}}} |
| 31 | * get a large Czech model (3 GB): [[BR]] |
| 32 | {{{scp anxur:/tmp/cztenten.trie .}}} [[BR]] |
| 33 | (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]]) |
| 34 | |
| 35 | === mksary === |
| 36 | |
| 37 | * {{{git clone https://github.com/lh3/libdivsufsort.git}}} |
| 38 | * {{{cd libdivsufsort}}} |
* {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT" .}}}
| 40 | * {{{make}}} |
* {{{cd ..}}}
| 42 | * {{{ln -s libdivsufsort/examples/mksary mksary}}} |
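
To check what mksary computes on a small input: a suffix array lists the starting positions of all suffixes of the text, sorted by the lexicographic order of those suffixes. A naive Python equivalent (quadratic, for toy strings only; libdivsufsort does the same efficiently for gigabytes of text):

{{{
def suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
}}}

Because suffixes sharing a prefix end up adjacent in the array, the frequency of any character n-gram can be read off as the length of a contiguous block, which is what makes the array useful for the trie-building step.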
| 43 | |
| 44 | == Training data == |
| 45 | |
To build a new model, we need to
* prepare a plain text (see the {{{data}}} directory; use {{{lower.py}}} to lowercase it),
* create a suffix array: {{{./mksary INPUT.txt OUTPUT.sa}}},
* and compute the prefix tree: {{{python build_trie.py FILE.sa [MINFREQ] [OUTPUTFILE]}}}.
| 50 | |
The resulting model is stored in the {{{.trie}}} file.
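
Conceptually, the model holds counts of frequent character n-grams (prefixes), with n-grams rarer than MINFREQ pruned away. The sketch below only illustrates that idea with a flat dictionary; it is a hypothetical stand-in, not the actual {{{build_trie.py}}}, which reads the counts off the suffix array and stores them in a trie:

{{{
from collections import defaultdict

def ngram_counts(text, max_depth=8, min_freq=2):
    """Count character n-grams up to max_depth; prune rare ones."""
    counts = defaultdict(int)
    for i in range(len(text)):
        for n in range(1, max_depth + 1):
            if i + n > len(text):
                break
            counts[text[i:i + n]] += 1
    # MINFREQ pruning keeps the model small.
    return {g: c for g, c in counts.items() if c >= min_freq}
}}}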
| 52 | |
| 53 | == Generating text == |
| 54 | |
To generate random text, just run
| 56 | {{{python alarm.py FILE.trie}}} |
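
Generation from such a model typically samples the next character in proportion to the counts of the longest known context, backing off to shorter contexts when the current one is unseen. A rough sketch using the hypothetical counts from the previous section, not the actual {{{alarm.py}}}:

{{{
import random

def generate(counts, max_depth=8, length=200):
    """Sample text character by character from n-gram counts."""
    out = " "
    while len(out) < length:
        for n in range(max_depth - 1, -1, -1):  # back off: long to short
            history = out[len(out) - n:]
            cands = {g[-1]: c for g, c in counts.items()
                     if len(g) == n + 1 and g[:-1] == history}
            if cands:
                chars, weights = zip(*cands.items())
                out += random.choices(chars, weights=weights)[0]
                break
        else:
            return out  # no continuation found at any order
    return out
}}}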
| 57 | |
| 58 | === Task === |
| 59 | |
Change the training process and the generating process to generate the most natural-looking sentences, either by
* pre-processing the input plain text (see the sketch after this list),
* setting the training parameters,
* changing the generating process,
* or all of the above.
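
For the pre-processing option, even a trivial clean-up of the training text can help. An illustrative sketch (a hypothetical {{{filter.py}}}, not part of the archive) that keeps only lines which look like complete sentences:

{{{
import re
import sys

for line in sys.stdin:
    line = re.sub(r"\s+", " ", line).strip()
    # Keep lines that start with a capital letter and end with
    # terminal punctuation; drop headings, fragments and noise.
    if len(line) > 20 and line[0].isupper() and line[-1] in ".!?":
        print(line)
}}}

Run it as {{{python filter.py < INPUT.txt > CLEAN.txt}}} and train on the cleaned file.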
| 65 | |
Upload 10,000 random sentences to your vault. Describe your changes and tunings in a README, where you can also include some hilarious examples of the random sentences.