If you want to solve difficult questions - the ones that distinguish a senior data engineer or scientist from a junior one - you need to have a deep understanding of data types.
Programming languages use data types to abstract away the inner workings of the machine and provide data practitioners with tools for working with data. Instead of having to write machine code, the savvy data scientist or engineer can work with data types, which offer a set of procedures for collecting, manipulating, and reshaping data as needed.
Python, a very high-level object-oriented programming language, is renowned for its data types and the different methods that accompany them.
In this article, we’ll dive deep into the data types that Python offers the data practitioner.
Python has multiple different data types, each one representing a ‘real-world’ class of data:
Under the hood, each data type is implemented as a Python class. In the terms of object-oriented programming languages, classes are code templates, which (without going into the how of it) are extremely useful for programmers because they come with built-in methods. These methods allow you to work with data types without having to write your own functions.
One useful built-in function is type(). It returns the data type of a value. For example:
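As a minimal sketch, here is type() applied to a number and a sentence. The variable names are only illustrative:

```python
# type() reports the class of any Python object.
number_type = type(4)
text_type = type("This is a string")

print(number_type)  # <class 'int'>
print(text_type)    # <class 'str'>
```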
The example above shows how Python thinks that the number 4 is of class ‘int’ (integer) and the sentence “This is a string” is of class ‘str’ (string).
Let’s now turn our attention to numeric data types.
Numeric data types represent numbers. As a data practitioner, these are the types that you will work with most often.
Integers represent positive or negative whole numbers, such as -234, 0, and 42. Python has no limit on the size of the integer that you want to represent, so the only barrier to writing the longest possible integer is your computer’s memory.
If you want to work with bases other than 10, Python lets you write integers in binary, octal, or hexadecimal. Simply put a prefix of the type “0 (zero) + letter” in front of the number to change bases (0b for binary, 0o for octal, 0x for hexadecimal):
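A short sketch of the three prefixes, with illustrative values. Whatever base you write the literal in, Python stores an ordinary integer:

```python
binary_value = 0b1010  # base 2
octal_value = 0o17     # base 8
hex_value = 0xFF       # base 16

# All three are plain ints and print in base 10 by default.
print(binary_value, octal_value, hex_value)  # 10 15 255

# bin(), oct() and hex() convert back to prefixed strings.
print(bin(10), oct(15), hex(255))  # 0b1010 0o17 0xff
```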
You can use the usual arithmetic operations on integers:
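The operators could be sketched like this; the values 7 and 2 are arbitrary picks for illustration:

```python
a, b = 7, 2

total = a + b            # 9
difference = a - b       # 5
product = a * b          # 14
quotient = a / b         # 3.5 (true division always returns a float)
floor_quotient = a // b  # 3 (floor division stays an int)
remainder = a % b        # 1
power = a ** b           # 49

# Integers grow as large as memory allows, with no overflow.
big = 2 ** 100
print(big)  # 1267650600228229401496703205376
```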
When it comes to integers, however, there are more than just arithmetic operations to work with. Have a browse through the Python documentation to get inspired.
Complex numbers are declared with the form <real part> + <imaginary part>j, with no space before the j:
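A small sketch with an arbitrarily chosen value, showing the declaration and a few of the attributes every complex number carries:

```python
z = 3 + 4j

print(z.real)         # 3.0
print(z.imag)         # 4.0
print(abs(z))         # 5.0 (the magnitude)
print(z.conjugate())  # (3-4j)

# Multiplying by 1j rotates the number by 90 degrees in the complex plane.
rotated = z * 1j
print(rotated)        # (-4+3j)
```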
The majority of data practitioners seldom use them, but those who do (e.g. people who are in the gaming industry or involved in geospatial data analysis) use them extensively.
For those who are curious, check out what Python methods have to offer when it comes to manipulating complex numbers.
Float is shorthand for floating-point number: a number with a fractional part, written with a decimal point. Examples include 3.14 and -22.2. Python offers a wide range of built-in methods when it comes to floats.
You will probably use floats on a daily basis, so the implementation details are very important but rarely known by data practitioners.
Just like the majority of programming languages, Python implements floating-point numbers as binary (base-2) fractions. This means that a fraction like 1/10, which has no finite binary expansion, cannot be stored with perfect accuracy.
Sure, when you type 0.1 into Python, it seems to return exactly what you would expect:
But in reality, this is a display trick. The real number saved by Python is in base-2, meaning that it looks like this when translated to our decimal system:
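One way to look behind the display, sketched here with the standard-library decimal module, is to convert the float to a Decimal, which reveals the exact stored value:

```python
from decimal import Decimal

# The default display looks clean...
print(0.1)  # 0.1

# ...but converting the float to Decimal exposes the exact binary value.
exact = Decimal(0.1)
print(exact)  # 0.1000000000000000055511151231257827021181583404541015625

# Asking for more digits than the default display shows the same thing.
print(format(0.1, ".20f"))  # 0.10000000000000000555
```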
A floating-point number is accurate to roughly 15 to 17 significant decimal digits. When Python displays a float, it does not print the full stored value; it prints the shortest decimal string that converts back to exactly the same float, which is why 0.1 shows up as 0.1. This is important - the display is tidy, but the stored value is still an approximation. For the majority of applications, this doesn’t matter because you can round the final result to the decimal places needed after you’ve finished with your arithmetic operations.
If your work does require high accuracy, though, you’d better be aware of the pitfalls of floating-point arithmetic.
Just as a fun and informative example, floating-point arithmetic does not always behave like we would expect from a mathematical standpoint:
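The classic demonstration is adding 0.1 and 0.2. The math.isclose() call at the end is one possible way to compare floats with a tolerance, shown here as a sketch rather than a recommendation:

```python
import math

total = 0.1 + 0.2

print(total)         # 0.30000000000000004
print(total == 0.3)  # False

# Comparing with a tolerance sidesteps the tiny representation error.
print(math.isclose(total, 0.3))  # True
```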
That’s because the 0.1 is not exactly 0.1, but rather a very long number which only approximates 0.1.
If accuracy is what you’re striving for, you have three options:
Sequence data types are ordered collections of elements that group several values together. The reason why strings, lists, and tuples are treated together is that they are all sequences below the surface.
And sequences come with a useful built-in feature: the slicing operator for accessing elements of the sequence. We will look further into how a sequence can be sliced to get the desired element with each data type.
Strings are a sequence of Unicode characters. You declare them with either single quotes (' ') or double quotes (" "):
If you want to write a multi-line string, use a triple quote (""" """):
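A brief sketch of the three quoting styles, with made-up example text:

```python
single = 'declared with single quotes'
double = "declared with double quotes"

# Triple quotes let a string span several lines; the line break is kept.
multi_line = """first line
second line"""

print(type(single), type(double))  # <class 'str'> <class 'str'>
print(multi_line.splitlines())     # ['first line', 'second line']
```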
As we mentioned, strings are sequences of characters. Because of this, each character (or element of a sequence) can be accessed via slicing. Slicing refers to getting part of a sequence through the slicing operator: [index] for a single element, or [start:stop] for a range of elements.
But first, what is an index?
In a sequence, each element has a value and an index, which gives its position relative to the beginning of the sequence. For example, the string “STRING” is composed of the characters (values) “S”, “T”, “R”, “I”, “N”, “G”. Each character in the string is assigned an index starting from 0:
Use the slicing operator to retrieve a value from the specified index:
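Using the same “STRING” example, a sketch of how indexes map to characters and how square brackets retrieve them:

```python
word = "STRING"

# enumerate() pairs every character with its index: 0 S, 1 T, 2 R, ...
for index, character in enumerate(word):
    print(index, character)

first = word[0]     # 'S'
last = word[-1]     # 'G' (negative indexes count from the end)
middle = word[1:4]  # 'TRI' (start is included, stop is excluded)

print(first, last, middle)  # S G TRI
```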
Slicing is an amazingly powerful tool when working with strings (and sequence data types). If you want to explore slicing further, this is a great introduction. You can delve into the other built-in methods that are available with strings by checking the Python documentation. The latter will come in handy when cleaning text data for textual analysis.
The Python list is an ordered array of items and it happens to be one of the most useful data types for organizing your data.
Declare a list by placing the list items inside square brackets [ ], with each element separated by a comma:
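A minimal sketch with illustrative contents, showing a declaration plus a few everyday operations:

```python
fruits = ["apple", "banana", "cherry"]
mixed = [1, 2.5, "three", True]  # items do not have to share a type

fruits.append("date")  # lists can grow in place
fruits[0] = "apricot"  # and individual items can be replaced

print(fruits)       # ['apricot', 'banana', 'cherry', 'date']
print(fruits[1:3])  # ['banana', 'cherry']
print(len(mixed))   # 4
```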
There are three main advantages of a list:
Lists are used extensively for short-term storage of items and for lookups. Make sure that you use the Python documentation to discover everything that lists have to offer.
Just like lists, tuples are ordered sequences of data items. However, tuples are not mutable. Immutability means that they cannot be changed once they’ve been created. Because of this, they are more compact in memory than lists and slightly cheaper to create, but they’re not as flexible.
You declare a tuple by placing values within parentheses () and separating them with commas:
As with any sequence, you can get elements from a tuple with slicing. Unlike lists, though, trying to change an element through the slicing operator raises a TypeError:
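A sketch with an illustrative tuple called point: reading works as for any sequence, while assignment is rejected. The try/except is only there so the snippet runs to completion:

```python
point = (3, 4, 5)

print(point[0])   # 3
print(point[1:])  # (4, 5)

# Assigning to an index of a tuple is not allowed.
try:
    point[0] = 10
except TypeError as error:
    message = str(error)
    print(message)  # 'tuple' object does not support item assignment
```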
Tuple objects are extremely efficient in space management, and they’re often used for assigning several values at once through packing and unpacking:
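The original example is not available, so the following is an assumption about what was meant: tuple packing and unpacking, which assigns several names in one statement. The size comparison uses sys.getsizeof(), whose numbers are specific to CPython:

```python
import sys

# Packing two values into a tuple and unpacking them into two names.
x, y = 1, 2

# The same mechanism swaps values without a temporary variable.
x, y = y, x
print(x, y)  # 2 1

# In CPython, a tuple takes less memory than a list with the same items.
tuple_size = sys.getsizeof((1, 2, 3))
list_size = sys.getsizeof([1, 2, 3])
print(tuple_size < list_size)
```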
Dive into the Python documentation to understand everything that tuples can do for you.
A dictionary is a collection of items in the form of key-value pairs. The key is the reference which points to the value in the pair. In other languages, dictionaries are also called hash tables or associative arrays.
Dictionaries can be created by placing a comma-separated list of key: value pairs within braces, for example:
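A small sketch; the dictionary name and its contents are made up for illustration:

```python
capitals = {"France": "Paris", "Japan": "Tokyo"}

empty = {}                             # an empty dictionary
from_pairs = dict([("Peru", "Lima")])  # built from (key, value) pairs

print(capitals)    # {'France': 'Paris', 'Japan': 'Tokyo'}
print(from_pairs)  # {'Peru': 'Lima'}
```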
Dictionaries are extremely efficient at retrieval operations, so they’re widely used in backend applications which need to return values given a specified key:
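Retrieval might look like the sketch below, again with an illustrative dictionary. Square brackets raise a KeyError for a missing key, whereas get() returns a default instead:

```python
capitals = {"France": "Paris", "Japan": "Tokyo"}

print(capitals["France"])  # Paris

# get() avoids a KeyError when the key is absent.
print(capitals.get("Spain"))             # None
print(capitals.get("Spain", "unknown"))  # unknown

# The in operator checks whether a key exists.
has_japan = "Japan" in capitals
print(has_japan)  # True
```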
New values can be added by simply specifying new key-value pairs:
Dictionaries are mutable and existing keys can easily be paired with new values:
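Both operations use the same assignment syntax, as this sketch with a hypothetical stock dictionary shows:

```python
stock = {"apples": 3}

# Assigning to a key that does not exist yet adds a new pair.
stock["pears"] = 2

# Assigning to an existing key replaces its value.
stock["apples"] = 5

# update() adds or overwrites several pairs at once.
stock.update({"pears": 4, "plums": 7})

print(stock)  # {'apples': 5, 'pears': 4, 'plums': 7}
```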
Their efficient read speed and flexible use come at a cost. Their superior performance necessitates more space allocation than other data structures (for instance, sets in the next section), so you need to consider the trade-off when designing applications.
Take a look at other characteristics of dictionaries by reading the Python docs.
A set is an unordered collection of unique (non-duplicated) items. You declare a set by placing items within curly braces and separating them with a comma:
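A quick sketch with illustrative values. Note that empty curly braces create a dictionary, so an empty set has to be made with set():

```python
colors = {"red", "green", "blue", "red"}  # the duplicate "red" is dropped
print(len(colors))  # 3

empty_set = set()  # {} on its own would be an empty dict

# Converting a list to a set is a common way to remove duplicates.
unique_digits = set([1, 1, 2, 3, 3])
print(sorted(unique_digits))  # [1, 2, 3]
```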
As it’s unordered, you won’t be able to use the slicing operator on a set. If you try to access set elements with slicing, Python raises a TypeError:
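A sketch of what happens, with the error caught so the snippet runs to the end:

```python
colors = {"red", "green"}

# Sets have no positions, so subscripting is rejected.
try:
    colors[0]
except TypeError as error:
    message = str(error)
    print(message)  # 'set' object is not subscriptable
```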
Slicing throws an error because sets were not designed to perform ordered retrieval tasks. Instead, they were made to allow for efficient set operations like membership checks, unions (combining two sets together), intersections (figuring out which elements belong to both sets), and more:
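These operations could be sketched as follows, using two small illustrative sets. The results are sorted only to make the printed order predictable:

```python
evens = {0, 2, 4, 6}
small = {0, 1, 2, 3}

print(2 in evens)  # True (membership check)

union = evens | small         # elements in either set
intersection = evens & small  # elements in both sets
difference = evens - small    # elements only in evens

print(sorted(union))         # [0, 1, 2, 3, 4, 6]
print(sorted(intersection))  # [0, 2]
print(sorted(difference))    # [4, 6]
```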
Python is rich with set-specific built-in methods to make your life easier.
The Boolean data type takes two values: True or False. Unlike SQL and other languages, there is no third option in the logical world of Python. The Boolean values are used for logical checks as part of conditional statements (if-else) or looping (while).
Let’s look at a couple of examples:
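A minimal sketch of Booleans in a comparison, a conditional, and a loop. All names are illustrative:

```python
is_larger = 5 > 3  # comparisons produce Boolean values
print(is_larger, type(is_larger))  # True <class 'bool'>

# A Boolean decides which branch of an if-else runs.
if is_larger:
    label = "larger"
else:
    label = "not larger"
print(label)  # larger

# A while loop repeats for as long as its condition is True.
countdown = 3
while countdown > 0:
    countdown -= 1
print(countdown)  # 0
```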
Boolean values come with a little trick in Python. Some values are implicitly interpreted as True or False (also called truthiness). For example, the value None is evaluated as False. In the same vein, 1 and 0 are evaluated as True and False respectively:
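The built-in bool() makes truthiness visible. This sketch also includes empty and non-empty containers, which follow the same rule:

```python
print(bool(None))              # False
print(bool(0), bool(1))        # False True
print(bool(""), bool("text"))  # False True
print(bool([]), bool([0]))     # False True

# Truthiness is what an if statement checks behind the scenes.
items = []
status = "has items" if items else "empty"
print(status)  # empty
```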
This 0 and 1 evaluation derives from the implementation of Boolean values - they are a subtype of integers beneath the surface.
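The subtype relationship can be checked directly, and it is why Booleans take part in arithmetic. A short sketch:

```python
print(issubclass(bool, int))  # True
print(True == 1, False == 0)  # True True

# Because True counts as 1, summing Booleans counts the True values.
flags = [True, False, True]
true_count = sum(flags)
print(true_count)  # 2
```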
Check the Python documentation for the truth value testing and comparisons ingrained in Python.
From here, there are multiple paths to take that will further expand your knowledge: