Changes between Version 1 and Version 2 of en/TopicSimilarity


Timestamp:
Jun 6, 2014, 1:15:35 PM (7 years ago)
Author:
xkocinc
Comment:

--

  • en/TopicSimilarity

    v1 v2  
    1 1= Topic Similarity =
     2
     3== Topical Similarity in Digital Mathematics Library ==
     4
     5[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/sim_articles.png)]]
     6
     7 * different machine learning methods such as Random Projections, TF-IDF word weighting, Latent Semantic Indexing/Analysis, and Latent Dirichlet Allocation
     9
     10 * 50,000+ fulltexts on http://dml.cz
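
As an illustration of the first method listed above, the following is a minimal sketch of a Gaussian random projection in plain Python. It is not the code used on dml.cz; the function names, the vector sizes, and the toy vectors are assumptions made for this example. The idea is that multiplying high-dimensional document vectors by a random matrix yields much shorter vectors whose cosine similarities stay close to the original ones.

```python
import math
import random


def random_projection_matrix(n_features, n_components, seed=0):
    """Return an n_components x n_features matrix of Gaussian entries.

    Entries are scaled by 1/sqrt(n_components) so that squared vector
    lengths are preserved in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(n_components)]


def project(matrix, vector):
    """Multiply the projection matrix by a dense feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data (hypothetical): two near-duplicate 500-dimensional vectors.
rng = random.Random(1)
doc_a = [rng.random() for _ in range(500)]
doc_b = [x + 0.1 * rng.random() for x in doc_a]

matrix = random_projection_matrix(n_features=500, n_components=50)
low_a = project(matrix, doc_a)
low_b = project(matrix, doc_b)

# Compare similarity in the original and in the reduced space.
print(cosine(doc_a, doc_b), cosine(low_a, low_b))
```

The number of components (50 here) is the main tuning knob: fewer components mean less storage and faster comparison, at the cost of noisier similarity estimates.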
     11
     12== Coping with Information Overload by Filtering of Big Data ==
     13
     14[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/search.png)]]
     15
     16Life is searching: group '''similar''' items and narrow the focus of a search in [your] Big Data.
     17
     18Similarity types range from '''plagiarism''' detection (similarity on n-grams, narrative similarity; this work evolved into http://theses.cz) to '''thematic, topical similarity'''.
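
Similarity on n-grams can be sketched as follows. This is only an illustration of the general idea, not the method used by theses.cz: the function names are hypothetical, and the choice of word trigrams with Jaccard overlap is an assumption made for this example.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) found in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(text_a, text_b, n=3):
    """Jaccard overlap of the two texts' word n-gram sets, in [0, 1]."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


original = "the quick brown fox jumps over the lazy dog"
suspect = "a quick brown fox jumps over a sleeping cat"

# The two texts share 3 of their 11 distinct trigrams.
print(ngram_overlap(original, suspect))
```

A high overlap of long n-grams signals copied wording, whereas topical similarity can be high even when two texts share no n-grams at all.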
     19
     20== Prehistoric Example: Project Ottuv Slovnik naucny, 1998 ==
     21
     22Levels of content processing: strings -> words and collocations -> semantics (word meaning) -> information (knowledge).
     23
     24Grabbing the essence (content) of documents: '''topical modeling'''.
     25
     26[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/ottuv_slovnik.png)]]
     27
     28
     29== Leading Edge Example: Automated Meaning Picking from Texts ==
     30
     31[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/lda_topics.png)]]
     32
     33== Probabilistic Topical Modeling: Latent Dirichlet Allocation ==
     34 * topic: weighted list of words
     35 * document: weighted list of topics
     36
     37[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/topical_mod.png)]]
     38
     39 * all topics computed automatically from document corpora
     40
     41[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/allocation.png)]]
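
The two weighted lists above can be written down directly as data. The sketch below shows only the representation (a topic as weights over words, a document as weights over topics) and how the two combine into a word distribution for the document; it does not perform LDA inference. The topic names, words, and weights are invented for illustration.

```python
# Topic: weighted list of words (weights within a topic sum to 1).
topics = {
    "algebra": {"group": 0.5, "ring": 0.3, "field": 0.2},
    "analysis": {"limit": 0.4, "integral": 0.4, "field": 0.2},
}

# Document: weighted list of topics (weights sum to 1).
document = {"algebra": 0.75, "analysis": 0.25}


def word_distribution(doc_topics, topics):
    """Mix the topics' word weights according to the document's topic weights."""
    mixture = {}
    for topic, topic_weight in doc_topics.items():
        for word, word_weight in topics[topic].items():
            mixture[word] = mixture.get(word, 0.0) + topic_weight * word_weight
    return mixture


words = word_distribution(document, topics)
for word, weight in sorted(words.items(), key=lambda item: -item[1]):
    print(f"{word}: {weight:.3f}")
```

In actual topic modeling the direction is reversed: the word counts of the documents are observed, and the topic and document weights are estimated from them.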
     42
     43== Content Similarity Results in EuDML ==
     44Within the ''European Digital Mathematics Library'' (EuDML), an EU CIP-ICT-PSP project, we have developed and delivered technology for
     45'''similarity''' (gensim), document '''conversions''' (Braille), '''accessibility''' (math OCR), and NLP content '''normalization''' (Mathml2text).
     46
     47[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/eudml_sim.png)]]
     48
     49== Data Visualization and Representation ==
     50
     51[[Image(/trac/research/raw-attachment/wiki/en/TopicSimilarity/data_vis.png)]]
     52
     53
     54== Award Winning Topic Similarity Framework '''gensim''' ==
     55
     56Semantic similarity indexing and search over big data, including continuous streams. Client (search) and server (indexing) architecture.
     58
     59Developed by NLPlab PG student Radim Rehurek (awarded in the Ceska hlava competition in 2011).
     60
     61Leading edge machine learning methods implemented.
     62
     63Used in 40+ local, EU or worldwide projects.
     64
     65Typical deployment and fine-tuning scenario: expressing data as words (features) -> configuration of topic modeling of features -> setting of gensim methods and tuning parameters -> usage in an application with a proper visualization interface.
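
The stages of this scenario can be sketched end to end in plain Python. The code below is a stand-in written for this page and deliberately does not use the gensim API: all function names are hypothetical, and TF-IDF weighting with cosine similarity is assumed in place of a full topic model to keep the example short.

```python
import math
from collections import Counter


def tokenize(text):
    """Stage 1: express a document as words (features)."""
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Stage 2: weight each word by term frequency times log inverse document frequency."""
    n_docs = len(token_lists)
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({word: count * math.log(n_docs / doc_freq[word])
                        for word, count in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Stage 3: rank all other documents by similarity to the query document."""
    query = vectors[query_index]
    scored = [(cosine(query, vector), index)
              for index, vector in enumerate(vectors) if index != query_index]
    return sorted(scored, reverse=True)


# Toy corpus (hypothetical).
documents = [
    "topic models describe documents as mixtures of topics",
    "latent topics summarize documents in a digital library",
    "optical character recognition of mathematical formulae",
]

vectors = tfidf_vectors([tokenize(text) for text in documents])
ranking = most_similar(0, vectors)

# Stage 4: hand the ranking to the application or visualization layer.
for score, index in ranking:
    print(f"{score:.3f}  {documents[index]}")
```

In a real deployment the tuning happens mostly in the first two stages: which tokens count as features, and which model and parameters turn them into vectors.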
     69
     70
     71== Conclusions ==
     72
     73 * similarity: plagiarism
     74 * topical modeling
     75 * thematic document filtering
     77 * visualization
     78 * semantic, meaning computations and modeling of natural language texts
     79
     80
     81Credits: Jiri Franek (illustrations)
     82
     83
     84