# Categorical Data

### Introduction to Categorical in Statistics eXplorer A data set is in eXplorer defined as a collection of data items, where an item may, for instance, represent a country in a collection of statistical data.

The characteristics of a data item are described by a collection of variables. A variable can be defined as a property, or a characteristic, of a data item that may vary from one item to another or over time. As an example here, the data items represent countries. The variables may represent various characteristics of the countries such as population size, percentage for various ageing groups or a category such as belong to a certain Continent or a classified Human Development Index (HDI).

A multivariate data set is simply a data set including two or more variables. The items of a multivariate data set can be thought of as points in a multidimensional space where each dimension represents a variable. A standard format used for structuring multivariate data is to use an m-by-n matrix including m rows, usually representing data items, and n columns, usually representing variables. The EXCEL table above displays a small example of a world data matrix where the first two columns represents the identification (ISO value and Name) and column C and beyond are data variables. Data variables are the eXplorer classified into two different types Numeric  and Categoric based on common classification taxonomy used in most visualization and data mining literature.. Categorical data can then be different names (Continent) or a classification that provide enough information to order the items (HDI Level).

The definition of variable types for a data set is important since various data types may require different visual analytic techniques since a single visualization technique is rarely appropriate for all types of data. Specifically, techniques used for numerical variables are often based on a numerical difference or similarity between data items. Categorical data on the other hand does not include any distance measure comparable to a numerical distance, and hence other visual analytical techniques may need to be used.

text Figure: The Human Development Index (HDI) is a composite statistic of life expectancy, education, and income per capita indicators. A country scores higher HDI when the life expectancy at birth is longer, the education period is longer, and the income per capita is higher. It is used to distinguish whether the country is a developed, a developing or an underdeveloped country. The UN report covers 185 member states of the United Nations (out of 193). Countries fall into four broad human development categories: Very High Human Development, High Human Development, Medium Human Development and Low Human Development.

A multivariate data set is simply a data set including two or more variables. The items of a multivariate data set can be thought of as points in a multidimensional space where each dimension represents a variable. A standard format used for structuring multivariate data is to use an m-by-n matrix including m rows, usually representing data items, and n columns, usually representing variables. The table below displays a small example of a world data matrix where the first two columns represents the identification (ISO value and Name) and column C and beyond are data variables.

text

Page manager: mikael.jern@liu.se
Last updated: 2016-11-23