Topic Similarity

Topical Similarity in Digital Mathematics Library

/trac/research/raw-attachment/wiki/en/TopicSimilarity/sim_articles.png

  • machine learning methods used: Random Projections, TF-IDF word weighting, Latent Semantic Indexing/Analysis (LSI/LSA), Latent Dirichlet Allocation (LDA)
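
As an illustration of one of the listed methods, the following is a minimal pure-Python sketch of TF-IDF word weighting followed by cosine similarity between documents. The toy corpus, the function names `tfidf_vectors` and `cosine`, and the base-2 logarithm are assumptions made for this example; it is not the implementation used in the library.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {word: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = term frequency * log2(N / document frequency).
        vectors.append({w: tf[w] * math.log2(n_docs / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus, invented for illustration only.
docs = [
    "prime numbers and number theory".split(),
    "number theory of prime ideals".split(),
    "heat equation and fluid flow".split(),
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # two number-theory documents
sim_far = cosine(vecs[0], vecs[2])    # number theory vs. heat equation
```

With this corpus the two number-theory documents score higher than the unrelated pair, because words shared by few documents receive larger weights than words that occur in many.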

Coping with Information Overload by Filtering of Big Data

/trac/research/raw-attachment/wiki/en/TopicSimilarity/search.png

Life is searching: group similar documents and narrow the focus of search in [your] Big Data.

Similarity types: from plagiarism (similarity on n-grams, narrative similarity; this work evolved into http://theses.cz) to thematic (topical) similarity.
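
Similarity on n-grams can be sketched as the overlap between the sets of word n-grams of two texts. The example below is a hedged illustration only: the Jaccard measure, the trigram size, and the names `word_ngrams` and `ngram_overlap` are assumptions for this sketch, not a description of how theses.cz works.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams (shingles) in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the n-gram sets of two texts, between 0 and 1."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Invented example sentences.
original = "the quick brown fox jumps over the lazy dog"
copied = "yesterday the quick brown fox jumps over a fence"
unrelated = "topic models describe documents as mixtures of topics"

copy_score = ngram_overlap(original, copied)      # shares 4 of 10 distinct trigrams
other_score = ngram_overlap(original, unrelated)  # shares none
```

A measure like this detects copied passages but not documents that discuss the same subject in different words, which is the gap that topical similarity is meant to fill.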

Prehistoric Example: Project Ottuv Slovnik naucny, 1998

Levels of content processing: strings -> words and collocations -> semantics (word meaning) -> information (knowledge).

Grabbing the essence (content) of documents: topical modeling.

/trac/research/raw-attachment/wiki/en/TopicSimilarity/ottuv_slovnik.png

Leading Edge Example: Automated Meaning Picking from Texts

/trac/research/raw-attachment/wiki/en/TopicSimilarity/lda_topics.png

Probabilistic Topical Modeling: Latent Dirichlet Allocation

  • topic: weighted list of words
  • document: weighted list of topics
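
The two weighted lists above can be written down directly as data structures. The sketch below uses invented topics, words, and weights; it does not infer anything from a corpus, it only shows how a word's probability in a document follows from the two lists and how documents can be compared by their topic weights. The Hellinger distance is one common choice for comparing such distributions and is an assumption here, not necessarily the measure used in the project.

```python
import math

# Hypothetical topics: each is a weighted list of words (weights sum to 1).
topics = {
    "algebra": {"group": 0.5, "ring": 0.3, "field": 0.2},
    "analysis": {"limit": 0.4, "integral": 0.4, "field": 0.2},
}

# Hypothetical documents: each is a weighted list of topics (weights sum to 1).
documents = {
    "doc_a": {"algebra": 0.9, "analysis": 0.1},
    "doc_b": {"algebra": 0.8, "analysis": 0.2},
    "doc_c": {"algebra": 0.1, "analysis": 0.9},
}

def word_probability(doc, word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in documents[doc].items()
    )

def hellinger(p, q):
    """Hellinger distance between two distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(total / 2.0)

p_group = word_probability("doc_a", "group")  # 0.9 * 0.5 + 0.1 * 0.0
near = hellinger(documents["doc_a"], documents["doc_b"])
far = hellinger(documents["doc_a"], documents["doc_c"])
```

Here `doc_a` and `doc_b` are close because both are mostly about the hypothetical "algebra" topic, while `doc_c` is far from both.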

/trac/research/raw-attachment/wiki/en/TopicSimilarity/topical_mod.png

  • all topics computed automatically from document corpora

/trac/research/raw-attachment/wiki/en/TopicSimilarity/allocation.png

Content Similarity Results in EuDML

Within the European Digital Mathematics Library (EuDML), an EU CIP-ICT-PSP project, we have developed and delivered technology for similarity (gensim), document conversion (Braille), accessibility (math OCR), and NLP content normalization (Mathml2text).

/trac/research/raw-attachment/wiki/en/TopicSimilarity/eudml_sim.png

Data Visualization and Representation

/trac/research/raw-attachment/wiki/en/TopicSimilarity/data_vis.png

Award Winning Topic Similarity Framework gensim

Semantic similarity indexing and search of big (continuous stream of) data. Client (search) and server (indexing) architecture.

Developed by NLP Centre PG student Radim Rehurek, who received an award in the Ceska hlava competition in 2011.

Implements leading-edge machine learning methods.

Used in 40+ local, EU or worldwide projects.

Typical deployment and fine-tuning scenario: expressing data as words (features) -> configuration of topic modeling of features -> setting of gensim methods and tuning parameters -> usage in an application with a proper visualization interface.

Conclusions

  • similarity: plagiarism
  • topical modeling
  • thematic document filtering
  • visualization
  • semantic, meaning computations and modeling of natural language texts

Credits: Jiri Franek (illustrations)

Last modified on Jul 31, 2014 2:06:27 PM
