# 2.1.3 Pandas intro

The Pandas library is a core part of the Python data science ecosystem.
It provides easy-to-use data structures and data analysis tools.

Pandas has some great resources for getting started, including guides
tailored to those familiar with other software for manipulating data; see the
[getting started guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started).

For now, we’ll stick to just what we need for this course.

In [2]:
import pandas as pd

## Structures

Pandas has two main **labelled** data structures: 
- Series


In [3]:
s = pd.Series([0.3, 4, 1, None, 9])
print(s)

0    0.3
1    4.0
2    1.0
3    NaN
4    9.0
dtype: float64


- DataFrame

In [4]:
import numpy as np

df = pd.DataFrame(np.random.randn(10,2), index=np.arange(3, 13), columns=["random_A", "random_B"]) 
df

    random_A  random_B
3   1.425158 -0.169013
4  -0.299078  0.244578
5   0.503473 -0.465702
6   1.245454 -0.106239
7   0.027438 -1.415794
8  -1.414463 -0.493611
9  -0.623091 -0.350707
10  1.775940 -1.448867
11  1.201266 -0.084514
12  1.041766 -1.319784
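
To make "labelled" concrete, here is a minimal sketch using made-up data (the names `heights` and `people` and their values are illustrative only). It shows that a Series carries one set of labels (its index), while a DataFrame carries both row labels and column labels.

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
# (labels and values are invented for illustration)
heights = pd.Series([1.62, 1.80, 1.75], index=["ana", "ben", "cat"])

# A DataFrame has row labels (index) and column labels (columns)
people = pd.DataFrame(
    {"height_m": [1.62, 1.80, 1.75], "age": [31, 25, 47]},
    index=["ana", "ben", "cat"],
)

print(heights.index.tolist())   # row labels of the Series
print(people.index.tolist())    # row labels of the DataFrame
print(people.columns.tolist())  # column labels of the DataFrame
```

If no index is given, as in the cells above, Pandas falls back to integer labels starting at 0.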



Once we have data in these Pandas structures, we can perform some useful
operations such as:

- `info()` (`DataFrame` only)
   - prints a concise summary of a `DataFrame`

- `value_counts()`
   - returns a `Series` containing counts of unique values in the structure


In [5]:
s = pd.Series(np.random.randint(0,2,10))
print(s)

print("\nvalue counts:")
print(s.value_counts())

0    0
1    0
2    0
3    1
4    0
5    1
6    0
7    0
8    1
9    0
dtype: int64

value counts:
0    7
1    3
dtype: int64
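
The cell above only demonstrates `value_counts()`. As a small sketch of `info()` as well, the example below uses a made-up DataFrame called `demo` (not part of the course data). Note that `info()` prints its summary and returns `None`, whereas `value_counts()` returns a new Series.

```python
import numpy as np
import pandas as pd

# Small made-up DataFrame with one missing value in column "a"
demo = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": ["x", "y", "x"]})

# info() prints the index range, column names, non-null counts and dtypes
result = demo.info()
print(result)  # info() itself returns None

# value_counts() on a single column returns a Series of counts,
# indexed by the unique values
counts = demo["b"].value_counts()
print(counts)
```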



We’ll see more on how to use these structures, and other Pandas
capabilities, later.

## Indexing

Again, we’re just covering some basics here. For a complete guide to
indexing in Pandas, see the
[indexing user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).

Pandas allows us to use the same basic `[]` indexing and `.` attribute
operators that we’re used to with Python and NumPy. However, Pandas also
provides the (often preferred) `.loc` indexer for label-based access and the
`.iloc` indexer for position-based access.

### `[]` Indexing

For basic `[]` indexing, we can select columns from a DataFrame and
items from a Series.

#### DataFrame

In [6]:
# select a single column
print("single column from DataFrame, gives us a Series:")

df["random_A"]

single column from DataFrame, gives us a Series:


3     1.425158
4    -0.299078
5     0.503473
6     1.245454
7     0.027438
8    -1.414463
9    -0.623091
10    1.775940
11    1.201266
12    1.041766
Name: random_A, dtype: float64

In [8]:
# select two columns
print("two columns from DataFrame, gives us a DataFrame:")

df[["random_A", "random_B"]]

two columns from DataFrame, gives us a DataFrame:


    random_A  random_B
3   1.425158 -0.169013
4  -0.299078  0.244578
5   0.503473 -0.465702
6   1.245454 -0.106239
7   0.027438 -1.415794
8  -1.414463 -0.493611
9  -0.623091 -0.350707
10  1.775940 -1.448867
11  1.201266 -0.084514
12  1.041766 -1.319784


Note that we can't do:

In [12]:
df[5]

KeyError: 5

because `[]` on a DataFrame looks up a column label, and `5` is a row label, not a column label. With a Series (a single column), however, `[]` does select items by their index label:

#### Series

In [13]:
# select single item
print("single item from Series, gives us an item (of type numpy.int64, in this case):")

s[2]

single item from Series, gives us an item (of type numpy.int64, in this case):


0

In [14]:
# select two items
print("two items from Series, gives us a Series:")

s[[2, 4]]

two items from Series, gives us a Series:


2    0
4    0
dtype: int64
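
One point worth checking is that `[]` with an integer on a Series whose index holds integers is a label lookup, not a position lookup. The sketch below uses a made-up frame (`frame` and `col` are illustrative names) whose row labels start at 3, so the difference is visible.

```python
import pandas as pd

# Made-up data with integer row labels that do NOT start at 0
frame = pd.DataFrame({"x": [10, 20, 30]}, index=[3, 4, 5])
col = frame["x"]  # a Series with index labels 3, 4, 5

# On the Series, [] looks up the label 5 (here, the last item)
print(col[5])

# On the DataFrame, [] looks for a column labelled 5, which does not exist
try:
    frame[5]
except KeyError as err:
    print("KeyError on DataFrame:", err)

# On the Series, the label 0 does not exist, so this fails too
try:
    col[0]
except KeyError as err:
    print("KeyError on Series:", err)
```

With the default index (0, 1, 2, ...), labels and positions coincide, which is why `s[2]` above behaves like ordinary list indexing.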

### Attribute Access

Similarly, we can access a column of a DataFrame, or an item of a
Series, as an attribute. However, this only works when the label is a
valid Python identifier (for example, it contains no spaces and does not
start with a digit).

In [7]:
df.random_A

3     1.425158
4    -0.299078
5     0.503473
6     1.245454
7     0.027438
8    -1.414463
9    -0.623091
10    1.775940
11    1.201266
12    1.041766
Name: random_A, dtype: float64
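
As a rough illustration of the identifier restriction, the sketch below builds a made-up frame with one column label that is a valid identifier and one that is not (the label `"random B"`, with a space, is invented for this example). Python's built-in `str.isidentifier()` tells us which labels can be written after a dot.

```python
import pandas as pd

# Made-up frame: one label is a valid Python identifier, the other is not
frame = pd.DataFrame({"random_A": [1, 2], "random B": [3, 4]})

print("random_A".isidentifier())  # True: frame.random_A can be written
print("random B".isidentifier())  # False: dot syntax cannot express this label

# Attribute access and [] return the same column
print(frame.random_A.equals(frame["random_A"]))

# [] works for any label, including ones that are not identifiers
print(frame["random B"].tolist())
```

Because `[]` always works, it is the safer choice when column names come from an external file.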

### `.loc`

`.loc` provides label-based indexing. `.loc` can also be used for
slicing and we can even provide a `callable` as its input! However, here
we’ll just show single item access.


In [8]:
df.loc[5]

random_A    0.503473
random_B   -0.465702
Name: 5, dtype: float64

In [9]:
# and for a Series
s.loc[2]

0
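
The slicing and callable forms mentioned above can be sketched as follows, using a small made-up frame (the name `frame` and its values are illustrative). One detail to be aware of is that a `.loc` label slice includes both endpoints, unlike ordinary Python slicing.

```python
import pandas as pd

# Made-up frame with integer row labels 3..7
frame = pd.DataFrame({"x": [-2, -1, 0, 1, 2]}, index=[3, 4, 5, 6, 7])

# Label slice: BOTH endpoints (4 and 6) are included
middle = frame.loc[4:6]
print(middle)

# Callable: it receives the frame and returns a boolean mask to select rows
positive = frame.loc[lambda d: d["x"] > 0]
print(positive)

# A row label and a column label together select a single value
print(frame.loc[5, "x"])
```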


### `.iloc`

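`.iloc` provides integer position-based indexing: positions count from 0, regardless of the index labels. This closely resembles Python and NumPy slicing. Again, we'll just show single item access. Note that for our DataFrame, position 5 is the sixth row, whose label is 8.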

In [16]:
# for DataFrame
df.iloc[5]

random_A   -1.414463
random_B   -0.493611
Name: 8, dtype: float64

In [15]:
# and for a Series
s.iloc[2]

0
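
To contrast position-based and label-based access side by side, here is a minimal sketch on a made-up frame (the name `frame` and its values are illustrative) whose labels start at 3. With `.iloc`, slices exclude the end position, as in plain Python, and negative positions count from the end.

```python
import pandas as pd

# Made-up frame: positions are 0..3, labels are 3..6
frame = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[3, 4, 5, 6])

# Position 0 is the first row, whose label is 3
first = frame.iloc[0]
print(first.name, first["x"])

# Position slice: end is excluded, so this gives positions 1 and 2
print(frame.iloc[1:3])

# Row position and column position together select a single value
print(frame.iloc[-1, 0])

# The same cell reached by label and by position
print(frame.loc[3, "x"] == frame.iloc[0, 0])
```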