File in/out: How to import CSV files into Python using Pandas

Comma separated values (CSV) files are a type of text file commonly used to store data. In a CSV file, each line of text contains values separated by commas. CSV files can be imported into Python in different ways (e.g. csv.reader, numpy.loadtxt, etc.). One useful method is to import CSV files into Pandas dataframes.
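As a rough illustration of these options, the sketch below reads the same small CSV text in three ways. The sample values and variable names are made up for illustration, and io.StringIO stands in for a file on disk.

```python
import csv
import io

import numpy as np
import pandas as pd

# Made-up CSV text standing in for a file on disk (illustrative values only)
csv_text = "0,1.5,2.5\n1,1.6,2.4\n2,1.7,2.3\n"

# csv.reader yields each row as a list of strings
rows = list(csv.reader(io.StringIO(csv_text)))

# numpy.loadtxt parses the values into a 2-D array of floats
arr = np.loadtxt(io.StringIO(csv_text), delimiter=',')

# pandas.read_csv returns a dataframe; header=None says the file has no header row
df = pd.read_csv(io.StringIO(csv_text), header=None)

print(rows[0])    # first row as a list of strings
print(arr.shape)  # (number of rows, number of columns)
print(df.dtypes)  # one data type per column
```

With csv.reader the values stay as strings and need converting by hand, whereas numpy.loadtxt and read_csv convert them to numbers on the way in.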
Pandas package. Pandas is a Python package that structures data as a dataframe and provides functions to manipulate numeric and time series data, similar to the way statistical packages such as R and Stata structure data. The name comes from “panel data”, an econometrics term for structured, multidimensional data sets.
Let’s write a function called readfile to import an example CSV file into a Pandas dataframe. The CSV file contains 5 channels of data for a calibration routine, but we only want data from the first 3 channels. The CSV file is available for download here.
import pandas as pd
import numpy as np

def readfile(file, fq, timescale):
    df = pd.read_csv(file, sep=',', header=None)
    df.rename(columns={0: 'sample', 1: 'thumb', 2: 'index',
                       3: 'channel3', 4: 'channel4', 5: 'channel5', }, inplace=True)
    # create time and customise units of time
    if timescale == 'second':
        time = np.arange(0, len(df['sample']) / fq, 1 / fq)
        xlab = 'Time (sec)'
    elif timescale == 'minute':
        time = np.arange(0, len(df['sample']) / fq, 1 / fq) / 60
        xlab = 'Time (min)'
    elif timescale == 'hour':
        time = np.arange(0, len(df['sample']) / fq, 1 / fq) / (60*60)
        xlab = 'Time (hour)'
    # fix uneven column lengths between time and samples, if needed
    if len(time) > len(df['sample']):
        time = time[:-(len(time)-len(df['sample']))]
    # assign dataframe values to variables
    sample = df['sample']
    thumb = df['thumb']
    index = df['index']
    return sample, thumb, index, time, xlab
Line 1-2. Import the Pandas and NumPy libraries.
Line 4. Define the readfile function and have it take the arguments file (name of data file), fq (frequency at which data were sampled, in Hz) and timescale (whether we report/plot data in seconds, minutes or hours).
Line 5. Call the function read_csv to read in the CSV file, where values are separated by commas, and don’t read column names from the first row of data (header=None). Assign the data to the dataframe df.
Line 6-7. Use rename to create column names for the data. Sample numbers are stored in column 0, and data from the thumb and index finger are stored in columns 1 and 2 respectively.
Line 9-17. Write conditional statements using if and elif to customise whether time and the x-axis label xlab are reported in seconds, minutes or hours.
Line 19-20. If the time vector is longer than the sample column, trim the extra values so that both are the same length. (Some data acquisition systems inadvertently record an extra sample in one channel, and np.arange with a non-integer step can return one element more than expected; the time vector and the data must be the same length to be plotted against each other.)
Line 22-25. Assign data from different columns to variables, and have the function return these variables.
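The time-vector steps in lines 9-20 can also be written more compactly. The sketch below is one possible alternative, not the function above: it builds the time vector from integer sample indices, so its length always matches the number of samples, and it raises an error for an unrecognised timescale. The helper name make_time and the UNITS lookup table are hypothetical.

```python
import numpy as np

# Hypothetical lookup table: timescale -> (seconds per unit, x-axis label)
UNITS = {
    'second': (1, 'Time (sec)'),
    'minute': (60, 'Time (min)'),
    'hour': (60 * 60, 'Time (hour)'),
}


def make_time(n_samples, fq, timescale):
    """Return a time vector with exactly n_samples points, plus an axis label."""
    if timescale not in UNITS:
        raise ValueError(f"timescale must be one of {list(UNITS)}")
    seconds_per_unit, xlab = UNITS[timescale]
    # Dividing integer sample indices by fq avoids the non-integer step
    # in np.arange, which can yield one element more than expected
    time = np.arange(n_samples) / fq / seconds_per_unit
    return time, xlab


time, xlab = make_time(n_samples=50, fq=10, timescale='second')
print(len(time), time[-1], xlab)
```

Because the length of the time vector is fixed by n_samples, no trimming step is needed afterwards.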
See these posts for refreshers on functions and conditional statements. Now, let’s use this function to extract and plot data from the CSV file:
import matplotlib.pyplot as plt
# IPython magic: show plots inline (Jupyter or IPython notebooks only)
%matplotlib inline

sample, thumb, index, time, xlab = readfile(file="001.csv", fq=10, timescale='second')

fig = plt.figure(figsize=(11, 7))
plt.plot(time, thumb)
plt.xlabel(xlab)
plt.ylabel('Thumb (a.u.)')
plt.savefig('fig1.png')
Figure 1: Thumb data (a.u.) plotted against time (sec).
Play around with specifying timescale as 'minute' or 'hour', and see what this does to the plot.
Summary
To import data from a CSV file into a Pandas dataframe, we use the read_csv function to get the data in and use rename to label our columns. Data from each column are assigned to variables for further analysis.
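As a possible shortcut, read_csv can label and select columns while it reads the file, through its names and usecols parameters. The sketch below assumes the same six-column layout used in this post (sample number plus 5 channels); the in-memory sample values are made up for illustration.

```python
import io

import pandas as pd

# Made-up rows in the six-column layout: sample number plus 5 channels
csv_text = (
    "0,0.10,0.20,0.0,0.0,0.0\n"
    "1,0.11,0.21,0.0,0.0,0.0\n"
    "2,0.12,0.22,0.0,0.0,0.0\n"
)

names = ['sample', 'thumb', 'index', 'channel3', 'channel4', 'channel5']

# names= labels the columns (the file has no header row);
# usecols= keeps only the columns we want
df = pd.read_csv(io.StringIO(csv_text), names=names,
                 usecols=['sample', 'thumb', 'index'])

print(df.columns.tolist())
print(df.shape)
```

This skips the separate rename step and never loads the unused channels, which may help with larger files.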