Text and Data Mining

A How Do I Guide that covers resources for Text and Data Mining through the University Library

Glossary of terms

API (Application programming interface): A set of defined rules that allows two applications or computer programs to communicate with one another. APIs are often used to transfer data, and can be useful for downloading large amounts of data from a website for the purposes of TDM. Some of the resources in this guide have developed their own APIs intended for use with their own materials.

Artificial Intelligence (AI): The use of computer systems to perform tasks usually undertaken by a human or requiring human intelligence. There is frequent and increasing intersection between the fields of AI and TDM. See the Artificial Intelligence guide for more on AI.

Corpus: A collection of texts in a structured, machine-readable format intended to be used in text mining. This term may also refer to a general collection of written works, often by the same author or on the same topic.

Data mining: The process of using computational analysis tools to discover patterns, relationships, and trends in structured sets of data.

Deep learning: One of the main types of AI, in which an artificial neural network aims to simulate the learning behaviour of the human brain using large amounts of data. Deep learning is a subset of machine learning. See the Artificial Intelligence guide for more on AI.

DOI (Digital object identifier): A unique, persistent identifier, made up of a string of numbers, letters, and symbols, for a document such as a journal article. It acts as a stable link to the original digital object. DOIs are often preferred to URLs where possible because they continue to work when a document's web address changes, which avoids issues such as link rot.

Freeware: A type of software that has no cost for the user, where the rights and licensing vary and are decided upon by each publisher.

GUI (Graphical user interface): A type of user interface in which programs can be interacted with through visual objects and graphics, as opposed to a text-based interface. GUIs are often more accessible than other kinds of user interfaces and are preferable for novice users.

Machine learning: One of the main types of AI, in which computer systems learn and adapt from experience with little human intervention or explicit programming. See the Artificial Intelligence guide for more on AI.

Metadata: Meaning 'data about data', metadata provides information about data rather than the content of the data itself. This can include things like the purpose of the data, its ownership and source, and relationships to other data. Metadata is often used in conjunction with structured text when performing text mining.

OCR (Optical character recognition): The use of computer technology to recognise text within a digital image and make it machine-readable. This allows for greater accessibility of physical texts that have been stored digitally.

Stop words: Words within a stop list that are unimportant in TDM exercises and therefore need to be filtered out. These are generally words in any language that appear disproportionately often. In English, for example, these are words such as "the", "a", "is", and "are".
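
As a rough illustration, stop word filtering might be sketched in Python as follows. The stop list here is a tiny, hypothetical sample invented for this example; in practice a longer list supplied with a text mining tool would normally be used.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "a", "is", "are", "of", "and", "in"}

def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop any stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words("The cat is in the garden and the dogs are asleep")
print(filtered)
```

The words that remain are the ones most likely to carry meaning for later analysis, such as term frequency counts.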

Stylometry: The analysis of linguistic or literary style, usually involving comparisons between writers or genres. Text mining methods can be used to perform this kind of analysis.

Text mining: Also referred to as text analysis or text data mining, this is the computational analysis of unstructured, natural texts, such as literature, to find insights on the patterns, relationships, and trends in the language used.

XML (Extensible markup language): A markup language and file format for storing and transporting data, using defined rules to encode data so that it is both machine-readable and human-readable.
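
To give a sense of what "machine-readable" means here, the following is a small sketch that reads an XML record using Python's standard library. The record, including its element and attribute names, is invented for this example and does not reflect the format of any particular resource in this guide.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element and attribute names are made up for illustration.
SAMPLE = """<article id="a1">
  <title>On Whales</title>
  <author>H. Melville</author>
  <body>Call me Ishmael.</body>
</article>"""

# Parse the XML text into a tree and pull out the fields of interest.
root = ET.fromstring(SAMPLE)
record = {
    "id": root.get("id"),               # attribute on the root element
    "title": root.findtext("title"),    # text of the <title> child
    "author": root.findtext("author"),
    "body": root.findtext("body"),
}
print(record)
```

Here the title and author act as metadata, while the body holds the text that would be mined.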

TDM methods

Classification: A data mining method that assigns data points to predefined classes, using algorithms that learn how the features of the data relate to a qualitative (categorical) variable.

Clustering analysis: A method in both text and data mining in which words or data points are grouped based on similarity. These groups are called "clusters", and each contains words or points that are more similar to one another than to anything appearing in other clusters.

Collocation analysis: A text mining method highlighting words or terms that frequently appear together, or are "associated" with one another. A group of words that can be characterised in this way is referred to as a collocation. This can provide insight into the meanings associated with particular words throughout a corpus.
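
A very simplified sketch of the idea in Python is shown below. It only counts which words appear near a chosen word, using raw counts and ignoring sentence boundaries; dedicated tools typically also apply a statistical measure of association. The function name, sample text, and window size are assumptions made for this example.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words appearing within `window` words of each occurrence of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            # Words before the node plus words after it, excluding the node itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

text = "Strong tea is nice. She drinks strong tea daily. Strong winds blew."
result = collocates(text, "strong", window=1)
print(result.most_common(3))
```

In this toy text, "tea" is the word that most often sits next to "strong".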

Concordance analysis: Also known as keyword in context (KWIC), this is a text mining method which generates a list of every occurrence of a given word along with the context in which it appears. This is presented in the form of a certain number of words before and after the keyword for each of its appearances.
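
The following Python sketch shows one way a simple concordance could be produced. The function name, sample text, and the choice of two words of context on each side are assumptions made for this example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

text = "The whale surfaced at dawn. Nobody on deck saw the whale dive again."
rows = kwic(text, "whale", width=2)
for left, kw, right in rows:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{kw}] {right}")
```

Lining the keyword up in a column makes it easier to scan how the word is used across a text.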

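n-gram: A sequence of n successive items, usually words, but also numbers, symbols, or punctuation, taken from any given sequence of text. A two-word sequence is called a bigram and a three-word sequence a trigram. N-grams are usually analysed using term frequency methods, and tracking their frequency can reveal patterns in language use over time.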
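
As an illustration, word n-grams can be extracted and counted with a few lines of Python. This is a minimal sketch; the function name and sample phrase are chosen for this example only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n successive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)          # n = 2 gives bigrams
bigram_counts = Counter(bigrams)     # term frequency applied to the bigrams
print(bigram_counts.most_common(1))
```

Changing the second argument to 3 would produce trigrams instead.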

Named entity recognition (NER): Also referred to as automatic name recognition, this is a method of text mining in which names of people, places, and things are identified and organised into pre-defined categories.

Part-of-speech tagging: A text mining method that labels the words in a corpus with their parts of speech, such as nouns and verbs, as well as grammatical characteristics such as their tense, number, and case.

Sentiment analysis: Also known as opinion mining, this text mining method involves the identification of opinions and emotions in a text using a scoring system in order to determine the overall tone.
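
One simple form of scoring system is a word list in which each word carries a positive or negative value. The Python sketch below assumes a tiny, made-up lexicon and is far cruder than the tools used in practice; for instance, it ignores negation such as "not good".

```python
import re

# Hypothetical toy lexicon: the words and scores are invented for illustration.
LEXICON = {"good": 1, "great": 2, "happy": 1, "bad": -1, "awful": -2, "sad": -1}

def sentiment_score(text):
    """Add up the lexicon scores of the words in the text (unknown words score 0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(t, 0) for t in tokens)

def tone(score):
    """Turn a numeric score into an overall tone label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = "The food was great but the service was awful and sad"
score = sentiment_score(review)
print(score, tone(score))
```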

Term frequency: A text mining method identifying the number of times a particular word or phrase appears within a document or corpus, also known as frequency distribution. This method may be combined with inverse document frequency, in which the weight of a frequently occurring word is reduced according to the number of documents or texts in the corpus that contain that word. The combination is referred to as TF-IDF and is helpful for identifying words that are distinctive to particular texts in a corpus, as opposed to words that appear frequently throughout the entire corpus.
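
The Python sketch below works through one common way of calculating TF-IDF: the term's share of the words in a document, multiplied by the logarithm of the total number of documents divided by the number of documents containing the term. Other weighting variants exist, and the three short documents are invented for this example.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, already split into words.
docs = {
    "doc1": "the whale swam past the ship".split(),
    "doc2": "the ship left the harbour".split(),
    "doc3": "the captain watched the whale".split(),
}

def tf_idf(docs):
    """Return {document: {term: score}} using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

scores = tf_idf(docs)
for term, value in sorted(scores["doc2"].items()):
    print(term, round(value, 3))
```

Because "the" occurs in every document, its score is zero, while "harbour", which occurs in only one document, scores higher than "ship", which occurs in two.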

Topic modelling: A text mining method in which the topics of documents or texts are inferred from words that tend to appear together. There are several algorithms used to perform this kind of analysis, with the most popular being Latent Dirichlet Allocation (LDA). The blog post "Topic modeling made simple enough" by Ted Underwood provides a more in-depth explanation of this method.