Keyword & Keyphrase Extraction
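Keywords are the words that matter most in a particular text. Importance is hard to define computationally, so in practice (and in this demo) a frequency-based statistic is used: TF-IDF (term frequency-inverse document frequency). A word scores highly when it is frequent in the text at hand but rare in a reference collection of texts.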
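As a rough illustration of the statistic, the following minimal sketch scores the words of one document against a tiny reference corpus. The function name tf_idf, the toy corpus, and the smoothed IDF variant are assumptions made for this example; the demo itself may compute TF-IDF differently.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each word of doc_tokens against a reference corpus (a list of token lists)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of reference documents that contain the word.
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumption; several variants exist) avoids division by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        # Term frequency is normalised by document length.
        scores[word] = (count / len(doc_tokens)) * idf
    return scores

# Toy reference corpus and document, both invented for illustration.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang in the tree".split(),
]
doc = "the parser tags the parser output".split()

scores = tf_idf(doc, corpus)
top = sorted(scores, key=scores.get, reverse=True)[:1]
print(top)
```

Here "the" and "parser" are equally frequent in the document, but "the" occurs in every reference document and is therefore ranked lower.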
To compute such a statistic, we need to know how frequent words are in general, which is why we have to collect a corpus of texts. Perhaps surprisingly, word frequencies follow a Zipfian distribution, in which a word's frequency is roughly inversely proportional to its frequency rank: for example, the most common English word, "the", accounts for about 7% of all word occurrences, and the second most common, "of", for about 3.5%. In a typical text, about half of the distinct words appear only once (these are called hapax legomena).
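The frequency counts described above can be sketched as follows. The helper names rank_frequency and hapax_legomena and the sample sentence are hypothetical, and a sentence this short only illustrates the mechanics; it cannot reproduce the percentages of a large corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, relative frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    total = len(tokens)
    return [
        (rank, word, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]

def hapax_legomena(tokens):
    """Return the words that occur exactly once, in alphabetical order."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

# Invented sample text; a real frequency list needs a large corpus.
tokens = "the cat and the dog and the bird saw the fish".split()

table = rank_frequency(tokens)
hapax = hapax_legomena(tokens)

for rank, word, freq in table[:3]:
    print(rank, word, round(freq, 3))
print(hapax)
```

On a large corpus, plotting relative frequency against rank on logarithmic axes should give a roughly straight line if the distribution is Zipfian.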
In many languages, keyphrases are noun phrases. To extract them, we need syntactic information about the particular language. In English, for example, a keyphrase can consist of zero or more adjectives followed by one or more nouns, optionally followed by "of" and another keyphrase. Successful keyphrase extraction therefore requires high-quality part-of-speech tagging, syntactic analysis, and statistical measures such as TF-IDF.
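The English pattern described above (adjectives, then nouns, then optionally "of" and a nested keyphrase) can be sketched as a small matcher over tagged tokens. This is only an illustration under stated assumptions: the input is tagged by hand with Universal POS-style labels (ADJ, NOUN), and the function names are hypothetical. A real pipeline would take its tags from a tagger and rank the candidates with a statistic such as TF-IDF.

```python
def match_keyphrase(tagged, start):
    """Return the end index (exclusive) of a keyphrase starting at `start`, or None."""
    i = start
    # Zero or more adjectives.
    while i < len(tagged) and tagged[i][1] == "ADJ":
        i += 1
    # One or more nouns.
    noun_start = i
    while i < len(tagged) and tagged[i][1] == "NOUN":
        i += 1
    if i == noun_start:
        return None  # no noun found, so this is not a keyphrase
    # Optional "of" followed by another keyphrase.
    if i < len(tagged) and tagged[i][0].lower() == "of":
        nested_end = match_keyphrase(tagged, i + 1)
        if nested_end is not None:
            return nested_end
    return i

def extract_keyphrases(tagged):
    """Scan left to right and collect every non-overlapping keyphrase."""
    phrases = []
    i = 0
    while i < len(tagged):
        end = match_keyphrase(tagged, i)
        if end is None:
            i += 1
        else:
            phrases.append(" ".join(word for word, _ in tagged[i:end]))
            i = end
    return phrases

# Hand-tagged example sentence (the tags are assumed, not produced by a tagger).
tagged = [
    ("High", "ADJ"), ("quality", "NOUN"), ("tagging", "NOUN"),
    ("improves", "VERB"),
    ("the", "DET"), ("extraction", "NOUN"), ("of", "ADP"),
    ("important", "ADJ"), ("noun", "NOUN"), ("phrases", "NOUN"),
    (".", "PUNCT"),
]

phrases = extract_keyphrases(tagged)
print(phrases)
```

The matcher is greedy and recursive, so a chain such as "extraction of important noun phrases" is returned as a single candidate rather than as two separate ones.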