= Topic Similarity = == Topical Similarity in Digital Mathematics Library == [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/sim_articles.png)]] * different machine learning methods as Random Projections, TFIDF word weighting, Latent Semantic !Indexing/Analysis, Latent Dirichlet Allocation * 50,000+ fulltexts on http://dml.cz == Coping with Information Overload by Filtering of Big Data == [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/search.png)]] Life is searching: group '''similar''' and narrow focus of search in [your] Big Data. Similarity types: from '''plagiarism''' (similarity on n-grams, narrative similarity, evolved into http://theses.cz) to '''thematic, topical similarity'''. == Prehistoric Example: Project Ottuv Slovnk naucny, 1998 == Levels of content processing: strings -> words and collocations -> semantics (word meaning) -> information (knowledge). Grabbing the essence (content) of documents: '''topical modeling'''. [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/ottuv_slovnik.png)]] == Leading Edge Example: Automated Meaning Picking from Texts == [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/lda_topics.png)]] == Probabilistic Topical Modeling: Latent Dirichlet Allocation == * topic: weighted list of words * document: weighted list of topics [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/topical_mod.png)]] * all topics computed automatically from document corpora [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/allocation.png)]] == Content Similarity Results in EuDML == Within ''European Digital Mathematics Library, EuDML'', project EU CIP-ICT-PSP we have developed and delivered technology for '''similarity''' (gensim), document '''conversions''' (Braille) and '''accessibility''' (math OCR), NLP content '''normalization''' (Mathml2text). [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/eudml_sim.png)]] == Data Visualization and Representation == [[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/data_vis.png, align=center)]] == Award Winning Topic Similarity Framework '''gensim''' == Semantic similarity indexing and search of big (continuous stream of) data. Client (search) and server (indexing) architecture. Developed by NLP Centre PG student Radim Rehurek (awarded in Ceska hlava competition in 2011). Leading edge machine learning methods implemented. Used in 40+ local, EU or worldwide projects. Typical deployment and ne-tuning scenario: expressing data as words (features) -> conguration of topic modeling of features -> setting of gensim methods and tuning parameters -> usage in an application with proper vizualization interface. == Conclusions == * similarity: plagiarism * topical modeling * thematic document ltering * visualization * semantic, meaning computations and modeling of natural language texts Credits: Jiri Franek (illustrations)