| 2 | |
| 3 | == Topical Similarity in Digital Mathematics Library == |
| 4 | |
| 5 | [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/sim_articles.png)]] |
| 6 | |
| 7 | * different machine learning methods, such as Random Projections, TF-IDF word weighting, Latent Semantic Indexing/Analysis, and Latent Dirichlet Allocation |
| 9 | |
| 10 | * 50,000+ fulltexts on http://dml.cz |
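To make the TF-IDF word weighting mentioned above concrete, here is a minimal sketch in plain Python. It assumes the textbook variant `tf * log(N / df)`; library implementations (gensim included) offer other variants and normalizations, and the function name `tfidf_weights` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (hypothetical): three already-tokenized abstracts.
docs = [
    ["graph", "theorem", "proof"],
    ["matrix", "theorem", "proof"],
    ["graph", "coloring", "proof"],
]
weights = tfidf_weights(docs)
```

A word that occurs in every document ("proof") gets weight zero, while a word unique to one document ("coloring") gets the highest weight, which is what lets rare, topic-bearing words dominate similarity computations.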
| 11 | |
| 12 | == Coping with Information Overload by Filtering of Big Data == |
| 13 | |
| 14 | [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/search.png)]] |
| 15 | |
| 16 | Life is searching: group '''similar''' items and narrow the focus of search in [your] Big Data. |
| 17 | |
| 18 | Similarity types range from '''plagiarism''' detection (similarity on n-grams and narrative similarity, which evolved into http://theses.cz) to '''thematic (topical) similarity'''. |
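Similarity on n-grams can be sketched as the overlap between the sets of word n-grams of two texts. The example below is only an illustration using word trigrams and the Jaccard coefficient; it is not a description of how theses.cz works, and the names `word_ngrams` and `ngram_overlap` are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard coefficient of the two n-gram sets: |A & B| / |A | B|."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

original = "every bounded sequence of real numbers has a convergent subsequence"
copied = "it is known that every bounded sequence of real numbers has a limit"
unrelated = "the spectrum of a compact operator is at most countable"

print(ngram_overlap(original, copied))     # high: long shared word runs
print(ngram_overlap(original, unrelated))  # no shared trigrams
```

Shared n-grams signal copied wording, whereas two texts on the same theme but in different words share few n-grams, which is why topical similarity needs the modeling methods described below.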
| 19 | |
| 20 | == Prehistoric Example: Project Ottův slovník naučný, 1998 == |
| 21 | |
| 22 | Levels of content processing: strings -> words and collocations -> semantics (word meaning) -> information (knowledge). |
| 23 | |
| 24 | Grabbing the essence (content) of documents: '''topical modeling'''. |
| 25 | |
| 26 | [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/ottuv_slovnik.png)]] |
| 27 | |
| 28 | |
| 29 | == Leading Edge Example: Automated Meaning Picking from Texts == |
| 30 | |
| 31 | [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/lda_topics.png)]] |
| 32 | |
| 33 | == Probabilistic Topical Modeling: Latent Dirichlet Allocation == |
| 34 | * topic: weighted list of words |
| 35 | * document: weighted list of topics |
| 36 | |
| 37 | [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/topical_mod.png)]] |
| 38 | |
| 39 | * all topics computed automatically from document corpora |
| 40 | |
| 41 | [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/allocation.png)]] |
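The two weighted lists above combine in a simple way: under the LDA representation, the probability of a word in a document is the topic-weighted sum of its per-topic probabilities. The sketch below uses hand-made toy numbers (not the output of any trained model) and hypothetical topic names purely to illustrate the data structures.

```python
# Topic: weighted list of words (each topic's weights sum to 1).
topics = {
    "algebra": {"group": 0.5, "ring": 0.3, "matrix": 0.2},
    "analysis": {"limit": 0.5, "integral": 0.3, "matrix": 0.2},
}

# Document: weighted list of topics (weights sum to 1).
document = {"algebra": 0.75, "analysis": 0.25}

def word_probability(word, doc_topics, topic_words):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )

vocabulary = {w for words in topics.values() for w in words}
for word in sorted(vocabulary):
    print(word, word_probability(word, document, topics))
```

In a real system both tables are estimated automatically from the corpus; only the way they are combined is shown here.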
| 42 | |
| 43 | == Content Similarity Results in EuDML == |
| 44 | Within the ''European Digital Mathematics Library (EuDML)'' project (EU CIP-ICT-PSP), we have developed and delivered technology for |
| 45 | '''similarity''' (gensim), document '''conversions''' (Braille), '''accessibility''' (math OCR), and NLP content '''normalization''' (Mathml2text). |
| 46 | |
| 47 | [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/eudml_sim.png)]] |
| 48 | |
| 49 | == Data Visualization and Representation == |
| 50 | |
| 51 | [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/data_vis.png)]] |
| 52 | |
| 53 | |
| 54 | == Award Winning Topic Similarity Framework '''gensim''' == |
| 55 | |
| 56 | Semantic similarity indexing and search of big (continuous stream of) data. Client (search) and server (indexing) |
| 57 | architecture. |
| 58 | |
| 59 | Developed by NLPlab PG student Radim Rehurek (awarded in the Ceska hlava competition in 2011). |
| 60 | |
| 61 | Implements leading-edge machine learning methods. |
| 62 | |
| 63 | Used in 40+ local, EU or worldwide projects. |
| 64 | |
| 65 | Typical deployment and fine-tuning scenario: expressing data as words (features) -> configuration of topic modeling of features -> setting of gensim methods and tuning parameters -> usage in an application with a proper visualization interface. |
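The stages of this scenario can be sketched end to end in plain Python. This is a stand-in under stated assumptions, not gensim's API: raw bag-of-words counts take the place of topic-model vectors, and the class `SimilarityIndex` and its methods are hypothetical names. Documents can be added one at a time, mirroring the indexing (server) side, while `query` plays the search (client) side.

```python
import math
from collections import Counter

def to_features(text):
    """Stage 1: express a document as a bag of lowercase words."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class SimilarityIndex:
    """Stand-in for the model and index stages: stores raw feature vectors.
    A real deployment would store topic-model vectors instead."""

    def __init__(self):
        self.ids, self.vectors = [], []

    def add(self, doc_id, text):
        """Index one more document (supports a continuous stream)."""
        self.ids.append(doc_id)
        self.vectors.append(to_features(text))

    def query(self, text, top_n=3):
        """Return the top_n (doc_id, score) pairs, most similar first."""
        q = to_features(text)
        scored = [(i, cosine(q, v)) for i, v in zip(self.ids, self.vectors)]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

index = SimilarityIndex()
index.add("a1", "planar graph coloring theorem")
index.add("a2", "spectral theory of compact operators")
index.add("a3", "graph coloring algorithms")
results = index.query("coloring of planar graph")
print(results)
```

The ranked list returned by `query` is what the final stage, the application's visualization interface, would present to the user.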
| 69 | |
| 70 | |
| 71 | == Conclusions == |
| 72 | |
| 73 | * similarity: plagiarism |
| 74 | * topical modeling |
| 75 | * thematic document filtering |
| 77 | * visualization |
| 78 | * semantic, meaning computations and modeling of natural language texts |
| 79 | |
| 80 | |
| 81 | Credits: Jiri Franek (illustrations) |
| 82 | |
| 83 | |
| 84 | |