Changes between Version 1 and Version 2 of en/TopicSimilarity

Timestamp: Jun 6, 2014, 1:15:35 PM
Author: xkocinc
Comment: --

-                      v1
+                      v2
 = Topic Similarity =
+== Topical Similarity in Digital Mathematics Library ==
+[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/sim_articles.png)]]
+ * different machine learning methods, such as Random Projections, TF-IDF word weighting, Latent Semantic Indexing/Analysis, and Latent Dirichlet Allocation
+ * 50,000+ fulltexts on http://dml.cz
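As an illustration of one of the methods listed above, the following is a minimal sketch of TF-IDF word weighting combined with cosine similarity. It uses only the Python standard library; the toy corpus, the whitespace tokenizer, and all function names are assumptions made for this example and do not describe the pipeline actually deployed on dml.cz.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = {
    "a": "prime numbers and the distribution of prime gaps",
    "b": "distribution of gaps between consecutive prime numbers",
    "c": "heat equation on a bounded domain",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(corpus):
    """Return one sparse TF-IDF vector (a dict word -> weight) per document."""
    tokenized = {name: tokenize(text) for name, text in corpus.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        # Weight = raw term frequency * log inverse document frequency.
        vectors[name] = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(docs)
sim_ab = cosine(vectors["a"], vectors["b"])  # documents sharing vocabulary
sim_ac = cosine(vectors["a"], vectors["c"])  # documents with no shared words
```

In this toy setting the two number-theory sentences score higher than the unrelated pair, which shares no words and therefore scores zero. Methods such as Latent Semantic Indexing go further by comparing documents in a reduced latent space rather than on raw word overlap.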
+== Coping with Information Overload by Filtering of Big Data ==
+[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/search.png)]]
+Life is searching: group '''similar''' items and narrow the focus of search in [your] Big Data.
+Similarity types: from '''plagiarism''' (similarity on n-grams, narrative similarity, evolved into http://theses.cz) to '''thematic, topical similarity'''.
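The plagiarism end of this spectrum, similarity on n-grams, can be sketched as the Jaccard overlap of word n-gram sets. This is a simplified illustration, not the method used by theses.cz; the function names, the choice of word trigrams, and the sample sentences are assumptions made for the example.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) occurring in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(text_a, text_b, n=3):
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample texts.
original = "the quick brown fox jumps over the lazy dog"
copied = "yesterday the quick brown fox jumps over a fence"
unrelated = "topic models describe documents as mixtures of topics"

score_copied = ngram_overlap(original, copied)        # shares a copied passage
score_unrelated = ngram_overlap(original, unrelated)  # no shared trigrams
```

A high n-gram overlap signals copied passages, while two texts on the same theme but in different wording would score close to zero here. Capturing that second case is what thematic, topical similarity is for.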
+== Prehistoric Example: Project Ottuv Slovnik naucny, 1998 ==
+Levels of content processing: strings -> words and collocations -> semantics (word meaning) -> information (knowledge).
+Grabbing the essence (content) of documents: '''topical modeling'''.
+[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/ottuv_slovnik.png)]]
+== Leading Edge Example: Automated Meaning Picking from Texts ==
+[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/lda_topics.png)]]
+== Probabilistic Topical Modeling: Latent Dirichlet Allocation ==
+ * topic: weighted list of words
+ * document: weighted list of topics
+[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/topical_mod.png)]]
+ * all topics computed automatically from document corpora
+[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/allocation.png)]]
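The two-level representation described above (a topic as a weighted list of words, a document as a weighted list of topics) can be sketched with plain dictionaries. The topic names, words, and weights below are invented for illustration; in real LDA they are inferred automatically from the corpus, and that inference step is not shown here. The Hellinger distance is used as one common way to compare topic mixtures; it is an assumption of this sketch, not necessarily the measure used in the project.

```python
import math

# Topic: weighted list of words (hypothetical weights, each topic sums to 1).
topics = {
    "algebra": {"group": 0.5, "ring": 0.3, "field": 0.2},
    "analysis": {"integral": 0.5, "limit": 0.3, "field": 0.2},
}

# Document: weighted list of topics (hypothetical weights, each sums to 1).
documents = {
    "doc1": {"algebra": 0.9, "analysis": 0.1},
    "doc2": {"algebra": 0.8, "analysis": 0.2},
    "doc3": {"algebra": 0.1, "analysis": 0.9},
}


def word_distribution(doc_topics):
    """Mix the topic word lists by the document's topic weights."""
    dist = {}
    for topic, topic_weight in doc_topics.items():
        for word, word_weight in topics[topic].items():
            dist[word] = dist.get(word, 0.0) + topic_weight * word_weight
    return dist


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(total / 2.0)


doc1_words = word_distribution(documents["doc1"])
d12 = hellinger(documents["doc1"], documents["doc2"])  # similar topic mixtures
d13 = hellinger(documents["doc1"], documents["doc3"])  # different topic mixtures
```

Comparing documents by their topic weights rather than by their raw words is what allows two texts with little shared vocabulary to be recognized as topically similar.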
+== Content Similarity Results in EuDML ==
+Within the ''European Digital Mathematics Library, EuDML'' project (EU CIP-ICT-PSP), we have developed and delivered technology for
+'''similarity''' (gensim), document '''conversions''' (Braille) and '''accessibility''' (math OCR), and NLP content '''normalization''' (Mathml2text).
+[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/eudml_sim.png)]]
+== Data Visualization and Representation ==
+[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/data_vis.png)]]
+== Award Winning Topic Similarity Framework '''gensim''' ==
+Semantic similarity indexing and search of big data (including continuous streams), with a client (search) and server (indexing) architecture.
+Developed by NLPlab PG student Radim Rehurek (awarded in the Ceska hlava competition in 2011).
+Leading edge machine learning methods implemented.
+Used in 40+ local, EU or worldwide projects.
+Typical deployment and fine-tuning scenario: expressing data as words (features) -> configuration of topic modeling of features -> setting of gensim methods and tuning parameters -> usage in an application with a proper visualization interface.
+== Conclusions ==
+ * similarity: plagiarism
+ * topical modeling
+ * thematic document filtering
+ * visualization
+ * semantic, meaning computations and modeling of natural language texts
+Credits: Jiri Franek (illustrations)