TextPerience – Experience the Text


Every day the number of texts grows. Nobody can read all of them, yet you can recognize what a text is about without reading it. That is possible thanks to natural language processing.

Find Out More

About Natural Language Processing


Natural Language Processing (NLP) is an applied field of computer science. It deals with natural languages such as English, Czech or Tagalog. The aim of NLP is to make computers understand human languages, or at least act as if they did.

This demo shows a small part of NLP. You can see what can be automatically extracted from files. This task is closely related to automatic text summarization, semantic searching, information extraction, opinion mining, and question answering.

Try it!

Experience the Text – TextPerience


Upload a document or enter a URL

OR

Extract text from the document

The text extraction quality depends on the data format and the way it was created.

See what we can recognize in the text

We use linguistic analysis, corpus tools, and mathematics.

Evaluate

Tell us whether our assumptions are correct.

TextPerience brings Knowledge


From Data to Information

You upload a file, which is just a stream of bytes: the data. We convert it to text and extract meta-information about the file.

From Information to Knowledge

Our tools extract specific information and convert it to knowledge.

We can connect the extracted knowledge to large knowledge bases. The result is called linked data.

Language Tools

The result depends on how well we can process a particular language.

We think in a language-independent way, but since our own language (Czech) is spoken by only 10 million people, we focus on less-resourced languages.

General and Domain-specific NLP

The result also depends significantly on the domain we work in. Our tools (and this demo) are general, but we also adapt them to specific needs.

Let's Get In Touch!


Are you interested in what we do? Join us!

Do you have a question? Do not hesitate to contact us.

Detect file type and extract text


File type detection (or MIME type detection) can be done in a naive way – using the file extension.

A more sophisticated way is to read a few bytes of the file and detect the MIME type according to known patterns. This method is slightly slower but much more precise than the first one.
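
The idea can be sketched in a few lines of Python. The patterns below are a hand-picked sample for illustration only; a real detector such as Apache Tika knows hundreds of them.

```python
# Illustrative magic-byte detection; the demo itself uses Apache Tika.
MAGIC_PATTERNS = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",         # also the DOCX/ODT container
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def detect_mime(path):
    with open(path, "rb") as f:
        head = f.read(8)                      # a few leading bytes suffice
    for magic, mime in MAGIC_PATTERNS.items():
        if head.startswith(magic):
            return mime
    return "application/octet-stream"         # unknown: generic fallback
```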

Currently, we use Apache Tika to detect 700 different MIME types.

Text can be extracted from widely used document formats such as PDF, PPT, DOC, and ODT, but also from images via OCR (optical character recognition). This method is relatively slow; nevertheless, it can be effective if the language of the document is known.

Explore all modules

Detect language and encoding


Surprisingly, it is very difficult to guess the encoding of a text without knowing its language. Conversely, it is very difficult to guess the language if the encoding is not known.

For this reason, probabilistic models and machine learning are used. In this demo, we use the Python packages chardet and langid. However, langid is less precise on nonstandard input such as Czech written without diacritics (very common in internet communication).
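
A minimal sketch of how these two packages fit together (the file name is hypothetical):

```python
import chardet
import langid

with open("document.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)           # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess["encoding"] or "utf-8")
lang, score = langid.classify(text)   # e.g. ('en', -54.3)
print(guess["encoding"], lang)
```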

Our own tools include chared, which detects the encoding of a text in a known language.

Language detection can be based on searching for frequent words. However, this method does not work well for short texts, so n-gram methods are usually used instead. For example, the trigram "the" is very frequent in English but unusual in Czech. Even so, closely related languages remain difficult to distinguish, and multilingual documents pose another challenge.
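
To illustrate the n-gram idea, here is a toy trigram profiler with the classic out-of-place comparison measure; real detectors train on much more data and smooth their statistics.

```python
from collections import Counter

def trigram_profile(text, top=100):
    # Rank the most frequent character trigrams of a text.
    text = f" {text.lower()} "
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; trigrams unseen in the language profile
    # get the maximum penalty. The smaller the sum, the better the match.
    return sum(abs(rank - lang_profile.index(g)) if g in lang_profile
               else len(lang_profile)
               for rank, g in enumerate(doc_profile))
```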

Explore all modules

Boilerplate removal


In the case of web pages (HTML documents), we usually do not want the whole page. Apart from data that TextPerience does not use, some HTML elements bring noise into the text. Imagine a web page like this one, with many buttons saying Try me or Back to top. The words try and back would then receive much higher scores than they actually deserve.

In a web page, we distinguish the content from other texts (menus, buttons, disclaimers, copyright notices) – the boilerplate. For boilerplate removal, we use Justext, which leaves just the text of a web page.

Boilerplate removal works much better if the language of the page is known.
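
A minimal sketch with the justext Python package (the URL is hypothetical); note that we must pass a language-specific stoplist, which is one reason language detection helps:

```python
import requests
import justext

response = requests.get("https://example.com/article")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:   # keep only the main content
        print(paragraph.text)
```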

Explore all modules

Tokenization


To examine the words of a text, we first need to recognize them. This sounds like an easy task: in English, Czech, and other European languages, words are delimited by spaces or punctuation. But imagine a sentence like "I don't like John's sister." How many words are there – five, six, seven? Is 's a word?
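
A toy regex tokenizer for the example sentence; production tokenizers handle many more cases:

```python
# Splits off English clitics such as n't and 's before matching plain words.
import re

TOKEN = re.compile(r"\w+(?=n't)|n't|'(?:s|re|ve|ll|d|m)|\w+|[^\w\s]")

print(TOKEN.findall("I don't like John's sister."))
# ['I', 'do', "n't", 'like', 'John', "'s", 'sister', '.']
```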

Have you ever seen a Japanese or Chinese text? There are no spaces, so delimiting words by spaces does not work for these languages. That is also why our demo does not handle them.

Explore all modules

Morphological analysis


Before identifying keywords, we need to identify the grammatical categories of words. In most languages, keywords are typically nouns. In languages with rich inflection (such as the Slavonic languages, e.g. Czech, Slovak, and Russian), we also need to identify other grammatical categories such as gender, number, and case.

Morphological analysis identifies the possible grammatical categories of each word: for example, barks is either a verb or a noun. It also returns the base form of a word, sometimes called the lemma. For barks, it is bark (the singular form).
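
As an illustration, NLTK's WordNet lemmatizer handles the barks example; this is an assumption for demonstration, and the demo may use its own analyser.

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("barks", pos="n"))  # 'bark' (noun reading)
print(lemmatizer.lemmatize("barks", pos="v"))  # 'bark' (verb reading)
```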

Explore all modules

Tagging


The output of morphological analysis is ambiguous. In the previous example, we saw that barks can be either a noun or a verb. In the sentence The dog barks loudly, however, barks is clearly a verb.

The aim of tagging is to select the right grammatical categories for a particular context. Sometimes this is pretty easy; in sentences such as "The complex houses married and single soldiers and their families.", however, tagging is very difficult.

Taggers can be built upon the syntactic rules of a particular language, or they can be trained on previously tagged data.
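
For illustration, the example sentence can be tagged with NLTK's pre-trained tagger, one such data-trained tagger (not necessarily the one used in the demo):

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The dog barks loudly.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB'), ('.', '.')]
```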

Explore all modules

Keyword & Keyphrase Extraction


Keywords are words that are important in a particular text. Importance is difficult to define in terms of computer processing; usually (and also in this demo), a frequency-based statistic called TF-IDF is used.

To calculate TF-IDF, we need to know how frequent words are in general, so we have to collect large amounts of text. Surprisingly, the distribution of words follows the Zipfian distribution: the most common English word, "the", accounts for about 7% of all word occurrences, the second most common word, "of", for about 3.5%, and so on. In every text, about half of the words appear only once; these are called hapax legomena.
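
A compact sketch of the TF-IDF computation over a toy corpus; real systems take the document frequencies from large reference corpora.

```python
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)              # term frequency in the document
    df = sum(1 for d in corpus if term in d)     # document frequency in the corpus
    idf = math.log(len(corpus) / (1 + df))       # inverse document frequency
    return tf * idf

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["the", "sun", "shines"]]
print(tf_idf("dog", corpus[0], corpus))   # positive: "dog" is distinctive here
print(tf_idf("the", corpus[0], corpus))   # near zero: "the" appears everywhere
```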

In many languages, keyphrases are noun phrases, so extracting them requires syntactic information about the particular language. In English, for example, a keyphrase can consist of zero or more adjectives, followed by one or more nouns, optionally followed by of and another keyphrase. Successful keyphrase extraction therefore needs high-quality tagging, syntactic analysis, and statistical measures such as TF-IDF.
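
The English pattern above can be written, approximately, as an NLTK chunk grammar. Since a regular expression cannot recurse, the trailing "of + keyphrase" part is unrolled to one optional level here, and of is approximated by the preposition tag IN; this is an illustration, not the demo's actual implementation.

```python
import nltk

# KP: adjectives* nouns+ (preposition adjectives* nouns+)?
grammar = r"KP: {<JJ>*<NN.*>+(<IN><JJ>*<NN.*>+)?}"
parser = nltk.RegexpParser(grammar)

tagged = [("optical", "JJ"), ("character", "NN"), ("recognition", "NN"),
          ("of", "IN"), ("scanned", "JJ"), ("documents", "NNS")]
print(parser.parse(tagged))
```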

Explore all modules

Named Entity Recognition


Names of persons, organizations, and places appear in many texts and are often considered important. Named entities comprise proper names (people, places, organizations, products, artworks, etc.) but also abbreviations, dates, times, e-mails, and domains. A named entity can consist of several tokens, including punctuation, for example John F. Kennedy.

Named entities can be detected via substring search in databases: for example, we can find place names in huge databases such as GeoNames. There are two problems with this approach: the number of entities grows constantly (new products, new person names), and entities can be ambiguous (Jackson, for example, can be either a personal name or a location). Both problems can be addressed with machine learning techniques.
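
The naive database approach in miniature; the tiny gazetteer below is made up for illustration, while real systems query resources such as GeoNames.

```python
GAZETTEER = {
    "John F. Kennedy": "PERSON",
    "Jackson": "PERSON or LOCATION",   # ambiguous without context
    "Prague": "LOCATION",
}

def find_entities(text):
    # Plain substring lookup: fast, but blind to context and new names.
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

print(find_entities("Jackson gave a talk about John F. Kennedy in Prague."))
# [('John F. Kennedy', 'PERSON'), ('Jackson', 'PERSON or LOCATION'), ('Prague', 'LOCATION')]
```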

Explore all modules

Linking with Knowledge Bases


After extracting information about the file type, language, keywords, keyphrases, and named entities (and possibly much more), we can discover the knowledge contained in the text. In this demo, we link keywords and entities to Wikipedia articles. In advanced applications, we can link them to many other knowledge bases and dictionaries. The pieces of knowledge form a huge network of concepts in which we can discover new concepts not mentioned in the original text.
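
A hedged sketch of the linking step using the public Wikipedia search API; the demo's exact mechanism may differ.

```python
import requests

def wikipedia_link(keyword):
    # Query the opensearch endpoint and return the first matching article URL.
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "opensearch", "search": keyword,
                "limit": 1, "format": "json"},
    )
    data = response.json()              # [query, titles, descriptions, urls]
    return data[3][0] if data[3] else None

print(wikipedia_link("natural language processing"))
# e.g. https://en.wikipedia.org/wiki/Natural_language_processing
```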

Explore all modules