Indexing
Basics and Simple Accessing Methods
It is implausible that you will always need an entire dataset when doing analysis, so being able to efficiently pick subsets of data is an important skill. pandas
has some vital indexing functions that add logical parameters when working with a DataFrame
or Series
.
[]
and .
Using brackets and periods to index in pandas
are the most basic methods of indexing, and thus allow for simpler subset selection. Both are limited to column selection, so you won’t be able to select the first row of an object with df[1]
or df.1
.
This indexing is done based on the column name — if we had a column named "columnA", we could select it using df['columnA']
or df.columnA
.
Selecting multiple columns is also possible by nesting a list within your brackets: df[['columnA', 'columnB', 'columnC']]
. This is not possible when using .
to select attributes — you are additionally unable to use .
to select:
-
Columns that match names of
pandas
methods —max
,head
,describe
, etc. -
Identifiers, including keywords, whitespace, and literals. If your column has a space in its name (which can happen often), you cannot select it with `.`
Slicing
If we consider simple []
and .
indexing to be column selection, we can consider :
indexing to be row selection. You still start by using brackets, but when :
is included, the rows are now selected instead of columns. Only our rows can be selected when using simple slicing, but as you will soon see, slicing is a tool that is used heavily in more complex indexing.
:
slicing is inclusive, with an implied 0 for the start and the object’s last index implied for the end. Thus, the input print(df[:])
selects all rows and is the same as print(df)
. You can add a negative sign to the end index to exclude the last n elements.
If you want to skip certain elements, you can add another colon after the stop index, which would take the following form: df[0:4:2]
. There is an implied skip of 1 (meaning no skip at all), meaning print(df[:]) == print(df[::]) == print(df)
. You can add a negative sign in front of the skip parameter to slice the object in reverse order.
isin()
The isin
method (derived from is in) is a pandas
method that mirrors the in
keyword in other programming languages. It’s a Series
method that is passed a list (or list-like object), returning True
or False
depending on the value’s inclusion in the list.
The potential applications are simple to imagine — what if I only wanted to select certain states from a DataFrame
containing information on the United States? How about information when the records include the day of the week, and I only want Mondays and Thursdays? These desires are very easily translated into code.
days = pd.Series(['Monday', 'Thursday', 'Wednesday', 'Saturday', 'Saturday', 'Monday'])
print(days.isin(['Monday', 'Thursday']))
0 True 1 True 2 False 3 False 4 False 5 True
print(days[days.isin(['Monday', 'Thursday'])])
0 Monday 1 Thursday 5 Monday
DataFrames
have their own isin
method, but as we’ll demonstrate in the next section, using the Series
method is better when combined with the loc
/iloc
functions.
Examples
For the following, we use the code np.random.randn(num_rows, num_columns)
to generate a random DataFrame
. The point is not for the output values to match, but rather the rows/columns.
Return the column A
from DataFrame df
.
Click to see solution
# method 1: []
print(df['A'])
# method 2: .
print(df.A)
0 -2.676859 1 0.110410 2 1.263104 3 0.161416 4 -0.213868
Return columns A
, B
, and D
from DataFrame df
.
Click to see solution
print(df[['A', 'B', 'D']])
A B D 0 0.461834 0.456688 -1.061509 1 1.003698 1.115509 0.120536 2 0.814746 2.793606 -0.281329 3 0.766533 0.138788 0.479603 4 -0.084290 -0.141935 0.755774
Multi-dimensional Indexing: loc
and iloc
Fortunately, there are many ways to select multiple dimensions for subsetting, and they generally allow conditionals to further specify the data points you want.
The loc
method allows for label-based indexing, while iloc
is primarily integer-based. These are two of the best functions to use when indexing, so we’ll be using plenty of examples to demonstrate their versatility.
We can use either loc
or iloc
to isolate a single column or Series
. The differences between the two are strict; if you try to use row/column indices for loc
or row/column names with iloc
, you’ll get an error.
# option 1: loc
just_the_year = myDF.loc[:, 'Year']
# option 2: iloc, matches option 1
just_the_year = myDF.iloc[:, 0] # Year is the first column.
print(just_the_year)
0 1987 1 1990 2 1990 3 1990 4 1991 5 1991 6 1991 7 1991 8 1991
The comma is important here — everything before the comma indicates row selection, while everything after indicates column selection. You should recognize the :
from slicing, and we can now slice on both dimensions thanks to loc
!
The same two techniques can both be used for multiple rows and columns as well. Something to keep in mind is that .iloc
slicing is exclusive, contrasting the inclusivity of standard slicing and .loc
slicing.
year_month = myDF.loc[:, ('Year', 'Month')]
year_month = myDF.iloc[:, 0:2]
print(year_month)
Year Month 0 1987 10 1 1990 10 2 1990 10 3 1990 10 4 1991 10 5 1991 10 6 1991 10 7 1991 10 8 1991 10
putting_it_together = myDF.loc[0:2, ['Year', 'Month', 'DayofMonth']]
putting_it_together = myDF.iloc[0:3, 0:3]
print(putting_it_together)
Year Month DayofMonth 0 1987 10 14 1 1990 10 15 2 1990 10 17
Lots of loc
and iloc
examples can look similar, but when we change the index of a DataFrame
, we get to see some crucial differences:
list_1 = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
list_2 = ['Apple', 'Banana', 'Coffee', 'Nothing', 'Oatmeal']
list_3 = [1, 3, 6, 8, 1]
column_names = ['day_of_week', 'breakfast', 'rating']
index_values = [1,1,2,2,3]
myDF = pd.DataFrame(zip(list_1, list_2, list_3), columns=column_names, index=index_values)
print(myDF)
day_of_week breakfast rating 1 Monday Apple 1 1 Tuesday Banana 3 2 Wednesday Coffee 6 2 Thursday Nothing 8 3 Friday Oatmeal 1
print(myDF.loc[0:2,:])
day_of_week breakfast rating 1 Monday Apple 1 1 Tuesday Banana 3 2 Wednesday Coffee 6 2 Thursday Nothing 8
Since loc
is symbol-based, we will get all the rows with the index of 0, 1, or 2, instead of getting the first 3 rows. We get a much different result when directly swapping loc
with iloc
:
print(myDF.iloc[0:2,:])
day_of_week breakfast rating 1 Monday Apple 1 1 Tuesday Banana 3
Boolean Statements
Utilizing logical statements when indexing is vital — you will commonly encounter situations where using sequences to select data does not suffice. When the values you want are scattered throughout the DataFrame
, you’ll want conditional statements.
Let’s say we want only Mondays and the days where no breakfast was eaten. We can combine conditional statements as follows:
monday_or_breakfast = myDF.loc[(myDF.loc[:, "day_of_week"]=="Monday") | (myDF.loc[:, "breakfast"]=="Nothing"), :]
print(monday_or_breakfast)
day_of_week breakfast rating 1 Monday Apple 1 2 Thursday Nothing 8
Nice! The parentheses ()
are critical here — pandas
doesn’t know how to evaulate multiple statements with |
alone and will throw an error. The following will not work (and also looks worse):
monday_or_breakfast = myDF.loc[myDF.loc[:, "day_of_week"]=="Monday" | myDF.loc[:, "breakfast"]=="Nothing", :]
You can use &
for the logical AND just as you can use |
for the logical OR:
apple_and_one = myDF.loc[(myDF.loc[:, "breakfast"]=="Apple") & (myDF.loc[:, "rating"]==1), :]
print(apple_and_one)
day_of_week breakfast rating 1 Monday Apple 1
Examples
"flights_sample.csv" example
Click to see solution
Take the following example from the "flights_sample.csv" file:
import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/flights_sample.csv")
print(myDF.head())
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 0 1987 10 14 ... NaN NaN NaN 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN 3 1990 10 18 ... NaN NaN NaN 4 1991 10 19 ... NaN NaN NaN
This call to head
would be the equivalent to the line myDF.loc[0:4, :]
.
Let’s say we want only the data from 1990 — the best strategy is to use a conditional.
# option 1: nested .loc call
love_the_90s = myDF.loc[myDF.loc[:, 'Year'] == 1990, :]
# option 2: nested bracket selection; equivalent to 1
love_the_90s = myDF.loc[myDF['Year'] == 1990, :]
print(love_the_90s)
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN 3 1990 10 18 ... NaN NaN NaN
Cool! This gives us what we want… but what’s with the weird nesting? The thing with conditionals is that the statement only evaluates to True
or False
— boolean values, in other words. In our example, the statement myDF['Year'] == 1990
on its own would evaluate to:
0 False 1 True 2 True 3 True 4 False
Obviously this only gives us information on Year
and its values, but now we know which rows contain the information we want. We now nest our row selection in its proper location within our outer .loc
call, then use :
to select all columns of myDF
. This is why nesting was necessary.
Now, how do we repeat this example with .iloc
? We can try swapping .loc
with .iloc
directly in Option 1, but we’ll run into the error message ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
. This is a long way of saying we can only include numbers or a boolean list when using .iloc
.
Knowing the above knowledge, we see that Year
is the first column in myDF
, corresponding to 0 (since pandas
uses 0-indexing). This works for the inner call, but notice how we said iloc
works on boolean lists, not boolean Series
. If we try to use iloc
on the outer call, we get another error, meaning that we need to keep it as loc
:
# option 3: nested .iloc call
love_the_90s = myDF.loc[myDF.iloc[:, 'Year'] == 1990, :]
print(love_the_90s)
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN 3 1990 10 18 ... NaN NaN NaN
Return the entries for which country
is in Europe within DataFrame people
.
Click to see solution
people = pd.DataFrame({'first_init': list('SRSE'), 'last_init': list('MVNS'),
'country': ['Ireland', 'Singapore', 'Japan', 'Netherlands']})
print(people.loc[people.country.isin(['Ireland', 'Netherlands']), :])
first_init last_init country 0 S M Ireland 3 E S Netherlands