Functions for Reading and Writing Files
data.csv
- This is the sample data we will load into pandas
# data.csv
Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
,109,175,282.4
read_csv(), .columns
-
read_csv(): reads a CSV file and returns a dataframe
.columns: lists the column headers of the dataframe
$ test.py
import pandas as pd
df = pd.read_csv('data.csv')  # read the CSV file into a dataframe
print(df.to_string())
print(df.columns)
   Duration  Pulse  Maxpulse  Calories
0      60.0    110       130     409.1
1      60.0    117       145     479.0
2      60.0    103       135     340.0
3       NaN    109       175     282.4
Index(['Duration', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')
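One detail worth checking when a CSV has a space after each comma: by default pandas keeps that space as part of the column name, so a lookup such as df['Pulse'] can fail. The sketch below is only an illustration. It uses a hypothetical in-memory string (raw) instead of a real file, and the skipinitialspace option of read_csv, to show the difference.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV with a space after every comma (not a real file).
raw = "Duration, Pulse, Maxpulse, Calories\n60, 110, 130, 409.1\n, 109, 175, 282.4\n"

# Default parsing: the leading space stays in the header names.
plain = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops the whitespace that follows each delimiter.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(plain.columns))
print(list(clean.columns))
```

If the column names printed by .columns show a leading space, passing skipinitialspace=True (or removing the spaces from the file) is one way to get the plain names.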
describe()
-
describe(): Returns 8 summary values for each numeric column
Note: the Duration figures in this section come from a run where the three non-empty Duration values were 70, 60 and 50, not the 60, 60, 60 listed in data.csv above.
count: The number of non-null values. (For the Duration column, count=3)
mean: The average value. (For the Duration column, (70+60+50)/3 = 60)
std: The standard deviation.
min: The minimum value. (For the Duration column, min val=50)
25%: The 25th percentile (first quartile).
Imagine sorting each column from lowest to highest value. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than the other 75%.
50%: The 50th percentile (median).
75%: The 75th percentile (third quartile).
max: The maximum value. (For the Duration column, max val=70)
import pandas as pd
df = pd.read_csv('data.csv')
print(df.describe())
       Duration       Pulse    Maxpulse    Calories
count       3.0    4.000000    4.000000    4.000000
mean       60.0  109.750000  146.250000  377.625000
std        10.0    5.737305   20.155644   85.148904
min        50.0  103.000000  130.000000  282.400000
25%        55.0  107.500000  133.750000  325.600000
50%        60.0  109.500000  140.000000  374.550000
75%        65.0  111.750000  152.500000  426.575000
max        70.0  117.000000  175.000000  479.000000
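To see where the Duration figures come from, each statistic can be recomputed by hand on a single column. This is a minimal sketch, assuming the three non-empty Duration values are 70, 60 and 50 (the values the figures above imply); the Series name and the summary dictionary are illustrative only.

```python
import pandas as pd

# Illustrative Duration column: three values plus one missing entry (None becomes NaN).
duration = pd.Series([70.0, 60.0, 50.0, None], name="Duration")

summary = {
    "count": duration.count(),        # counts non-null values only
    "mean": duration.mean(),          # NaN is skipped by default
    "std": duration.std(),            # sample standard deviation (ddof=1)
    "min": duration.min(),
    "25%": duration.quantile(0.25),   # linear interpolation between sorted values
    "50%": duration.quantile(0.50),
    "75%": duration.quantile(0.75),
    "max": duration.max(),
}

for name, value in summary.items():
    print(name, value)
```

The same eight numbers should appear in duration.describe(), since describe() reports these statistics for each numeric column.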
head()
- head(n) returns the first n rows of the dataframe; the default is n=5
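A short sketch of head() in use. To keep it self-contained it reads the four sample rows from an in-memory string (csv_text, a stand-in for data.csv) rather than from disk.

```python
import io
import pandas as pd

# In-memory stand-in for data.csv so the example runs without a file on disk.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
,109,175,282.4
"""
df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)    # first 2 rows only
default_head = df.head()  # asks for 5 rows; only 4 exist, so all 4 come back

print(first_two)
print(len(default_head))
```

When the dataframe has fewer rows than requested, head() simply returns every row instead of raising an error.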
dropna()
- Drops the rows that contain empty (NaN) values from the dataframe
import pandas as pd
df = pd.read_csv('data.csv')
print("Before drop\n", df.to_string(), sep="")
df = df.dropna(axis=0)  # axis=0 drops rows (not columns) that contain NaN
print("After drop\n", df.to_string(), sep="")
Before drop
   Duration  Pulse  Maxpulse  Calories
0      60.0    110       130     409.1
1      60.0    117       145     479.0
2      60.0    103       135     340.0
3       NaN    109       175     282.4
After drop
   Duration  Pulse  Maxpulse  Calories
0      60.0    110       130     409.1
1      60.0    117       145     479.0
2      60.0    103       135     340.0
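Before dropping rows it can help to see how many values are actually missing, and to limit the drop to the columns that matter. The sketch below assumes the same four sample rows, held in an in-memory string (csv_text) as a stand-in for data.csv; isna() and the subset parameter of dropna() are standard pandas.

```python
import io
import pandas as pd

# In-memory stand-in for data.csv so the example runs without a file on disk.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
,109,175,282.4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop a row only when Duration is missing; NaN in other columns would be kept.
cleaned = df.dropna(subset=["Duration"])
print(cleaned.to_string())
```

dropna() returns a new dataframe and leaves df unchanged, which is why the notes assign the result back with df = df.dropna(axis=0).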
Features
- A subset of columns selected from the dataframe is called the features.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
features = ['Duration', 'Pulse']
print(df[features])
   Duration  Pulse  Maxpulse  Calories
0      60.0    110       130     409.1
1      60.0    117       145     479.0
2      60.0    103       135     340.0
3       NaN    109       175     282.4
   Duration  Pulse
0      60.0    110
1      60.0    117
2      60.0    103
3       NaN    109
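A small sketch of how column selection behaves, again using an in-memory copy of the sample rows (csv_text is a stand-in for data.csv). Indexing with a list of names gives a dataframe, while indexing with a single name gives a Series.

```python
import io
import pandas as pd

# In-memory stand-in for data.csv so the example runs without a file on disk.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
,109,175,282.4
"""
df = pd.read_csv(io.StringIO(csv_text))

features = ["Duration", "Pulse"]
selected = df[features]  # list of names -> DataFrame with just those columns
pulse = df["Pulse"]      # single name   -> Series

print(type(selected).__name__, selected.shape)
print(type(pulse).__name__, pulse.shape)
```

Selecting columns does not remove rows, so the row with the missing Duration is still present in selected unless dropna() is applied first.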