Pandas: Reading and Writing Files

Functions for reading and writing files

data.csv

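We will feed this data to pandas. The last row has no Duration value, so pandas reads it as NaN. The columns are padded with spaces here for readability; if the real file contains that padding, pass skipinitialspace=True to read_csv() so the header names do not keep the leading spaces.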


# data.csv 
Duration,   Pulse,  Maxpulse,   Calories
60,         110,    130,        409.1
60,         117,    145,        479.0
60,         103,    135,        340.0
,         109,    175,        282.4

read_csv(), .columns

read_csv(): reads a CSV file and returns its contents as a DataFrame
.columns: lists the column headers of the DataFrame


# test.py
import pandas as pd
df = pd.read_csv('data.csv')    # read the file into a DataFrame
print(df.to_string())
print(df.columns)

   Duration  Pulse  Maxpulse  Calories
0      60.0    110       130     409.1
1      60.0    117       145     479.0
2      60.0    103       135     340.0
3       NaN    109       175     282.4

Index(['Duration', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')
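
The sample file is shown with padded columns. As a minimal sketch, assuming that padding really is present in the file, the snippet below reads the same rows from an in-memory string and compares the column names with and without skipinitialspace=True. io.StringIO and the names CSV_TEXT and raw are used here only to keep the example self-contained; they are not part of the original script.

```python
import io

import pandas as pd

# Same rows as data.csv, including the padding after each comma.
CSV_TEXT = """Duration,   Pulse,  Maxpulse,   Calories
60,         110,    130,        409.1
60,         117,    145,        479.0
60,         103,    135,        340.0
,         109,    175,        282.4
"""

# Default parsing keeps the spaces that follow each comma in the header names.
raw = pd.read_csv(io.StringIO(CSV_TEXT))
print(list(raw.columns))

# skipinitialspace=True skips the spaces that follow each delimiter.
df = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)
print(list(df.columns))
print(df.shape)  # (number of rows, number of columns)
```

With the clean column names, later lookups such as df['Pulse'] work as expected.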

describe()

describe(): returns 8 summary values for each numeric column. The Duration examples refer to the sample output below, which comes from a run where the three non-null Duration values were 50, 60 and 70.
count: The number of non-null values. (For the Duration column, count = 3.)
mean: The average value. (For the Duration column, (70+60+50)/3 = 60.)
std: The sample standard deviation. (For the Duration column, std = 10.)
min: The minimum value. (For the Duration column, min = 50.)
25%: The 25th percentile (first quartile).
Imagine sorting each column from lowest to highest value. A quarter of the way through the list you reach a value that is larger than about 25% of the values; that value is the 25th percentile.
50%: The 50th percentile (median).
75%: The 75th percentile (third quartile).
max: The maximum value. (For the Duration column, max = 70.)


import pandas as pd
df = pd.read_csv('data.csv')
print(df.describe())

       Duration       Pulse    Maxpulse    Calories
count       3.0    4.000000    4.000000    4.000000
mean       60.0  109.750000  146.250000  377.625000
std        10.0    5.737305   20.155644   85.148904
min        50.0  103.000000  130.000000  282.400000
25%        55.0  107.500000  133.750000  325.600000
50%        60.0  109.500000  140.000000  374.550000
75%        65.0  111.750000  152.500000  426.575000
max        70.0  117.000000  175.000000  479.000000
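
To see where these numbers come from, the Pulse column can be checked by hand. This is a small sketch, assuming the four Pulse values from data.csv; the variable names (pulse, summary, manual_mean, and so on) are illustrative only.

```python
import pandas as pd

# Pulse values from data.csv.
pulse = pd.Series([110, 117, 103, 109], name="Pulse")

summary = pulse.describe()

# mean is the sum divided by the number of non-null values.
manual_mean = pulse.sum() / pulse.count()

# describe() reports the sample standard deviation (ddof=1).
manual_std = pulse.std(ddof=1)

# quantile() interpolates linearly between the sorted values by default.
q25, q50, q75 = pulse.quantile([0.25, 0.5, 0.75])

print(summary)
print(manual_mean, manual_std, q25, q50, q75)
```

The hand-computed values should line up with the Pulse column of the describe() output above.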

head()

head(n) returns the first n rows of the DataFrame; n defaults to 5.
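
A short sketch of head(), assuming a small DataFrame built in memory from two of the data.csv columns (the names first_two and default_rows are illustrative):

```python
import pandas as pd

# Two columns taken from data.csv; None becomes NaN.
df = pd.DataFrame({
    "Duration": [60.0, 60.0, 60.0, None],
    "Pulse": [110, 117, 103, 109],
})

first_two = df.head(2)     # first 2 rows
default_rows = df.head()   # up to 5 rows; this frame only has 4

print(first_two)
print(len(default_rows))
```

When the DataFrame has fewer rows than requested, head() simply returns all of them.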

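dropna()

dropna() removes rows that contain missing (NaN) values. With axis=0 (the default) it drops rows; it returns a new DataFrame and leaves the original unchanged.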


import pandas as pd
df = pd.read_csv('data.csv')
print("Before drop\n", df.to_string())
df = df.dropna(axis=0)
print("After drop\n", df.to_string())

Before drop
    Duration  Pulse  Maxpulse  Calories
0      60.0    110       130     409.1
1      60.0    117       145     479.0
2      60.0    103       135     340.0
3       NaN    109       175     282.4

After drop
    Duration  Pulse  Maxpulse  Calories
0      60.0    110       130     409.1
1      60.0    117       145     479.0
2      60.0    103       135     340.0
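
As a hedged illustration of the options dropna() accepts, the sketch below rebuilds the same data in memory and compares dropping rows, dropping columns, and checking only one column with subset. The variable names are illustrative only.

```python
import pandas as pd

# Same values as data.csv; None becomes NaN in the Duration column.
df = pd.DataFrame({
    "Duration": [60.0, 60.0, 60.0, None],
    "Pulse": [110, 117, 103, 109],
    "Maxpulse": [130, 145, 135, 175],
    "Calories": [409.1, 479.0, 340.0, 282.4],
})

rows_dropped = df.dropna(axis=0)          # drop rows that contain any NaN
cols_dropped = df.dropna(axis=1)          # drop columns that contain any NaN
pulse_only = df.dropna(subset=["Pulse"])  # only look for NaN in Pulse

print(rows_dropped.shape)
print(cols_dropped.shape)
print(pulse_only.shape)
print(df.shape)  # the original DataFrame is unchanged
```

Dropping rows loses one observation, while dropping columns loses the whole Duration column, so the choice depends on which matters more for the analysis.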

Features

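A subset of columns selected from a DataFrame is often called the features (for example, the inputs to a model). Index the DataFrame with a list of column names to select them.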


import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
features = ['Duration', 'Pulse']
print(df[features])

   Duration  Pulse  Maxpulse  Calories
0      60.0    110       130     409.1
1      60.0    117       145     479.0
2      60.0    103       135     340.0
3       NaN    109       175     282.4

   Duration  Pulse
0      60.0    110
1      60.0    117
2      60.0    103
3       NaN    109
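
A related detail worth sketching: indexing with a list of names returns a DataFrame, while indexing with a single name returns a Series. The example assumes the same values as data.csv; the names X and pulse are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Same values as data.csv; None becomes NaN in the Duration column.
df = pd.DataFrame({
    "Duration": [60.0, 60.0, 60.0, None],
    "Pulse": [110, 117, 103, 109],
    "Maxpulse": [130, 145, 135, 175],
    "Calories": [409.1, 479.0, 340.0, 282.4],
})

features = ["Duration", "Pulse"]

X = df[features]       # list of column names -> DataFrame
pulse = df["Pulse"]    # single column name   -> Series

print(type(X).__name__, X.shape)
print(type(pulse).__name__, pulse.shape)
```

Selecting columns does not drop rows, so the row with the missing Duration is still present in X.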