Python How to Call a Function Pandas Read Csv
CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.
Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.
- Load CSV files to Python Pandas
- 1. File Extensions and File Types
- 2. Data Representation in CSV files
- Other Delimiters / Separators – TSV files
- Delimiters in Text Fields – Quotechar
- 3. Python – Paths, Folders, Files
- Finding your Python Path
- File Loading: Absolute and Relative Paths
- 4. Pandas CSV File Loading Errors
- Advanced Read CSV Files
- Specifying Data Types
- Skipping and Picking Rows and Columns From File
- Custom Missing Value Symbols
- CSV Format Advantages and Disadvantages
- Additional Reading
Load CSV files to Python Pandas
The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the "read_csv" function in Pandas:
# Load the Pandas libraries with alias 'pd'
import pandas as pd

# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("filename.csv")

# Preview the first 5 lines of the loaded data
data.head()
While this code seems simple, an understanding of a few fundamental concepts is required to fully grasp and debug the operation of the data loading process if you run into issues:
- Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
- Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
- Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
- CSV data formats and errors – common errors with the function.
Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.
1. File Extensions and File Types
The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.
- Data is stored on your computer in individual "files", or containers, each with a different name.
- Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
- Computers determine how to read files using the "file extension", that is the code that follows the dot (".") in the filename.
- So, a filename is typically in the form "<random name>.<file extension>". Examples:
- project1.DOCX – a Microsoft Word file called Project1.
- shanes_file.TXT – a simple text file called shanes_file
- IMG_5673.JPG – An image file called IMG_5673.
- Other well known file types and extensions include: XLSX: Excel, PDF: Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music etc. See a complete list of extensions here.
- A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.
File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will do on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.
To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.
- In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options or File Explorer Options, as it is now called > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
- In Mac OS: Open Finder > In menu, click Finder > Preferences, Click Advanced, Select the checkbox for "Show all filename extensions".
2. Data Representation in CSV files
A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.
CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / press enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.
An example table data set and the corresponding CSV-format data is shown in the diagram below.
Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
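To make the representation concrete, here is a minimal sketch that builds a small CSV string in memory and loads it with read_csv. The column names and values are made up for illustration, and io.StringIO simply stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first row holds the column names,
# commas separate columns, and newlines separate rows.
csv_text = "name,age,city\nAlice,34,Dublin\nBob,28,Cork\n"

# read_csv accepts any file-like object, so a StringIO works like a file
data = pd.read_csv(io.StringIO(csv_text))

print(data)        # the table form of the text above
print(data.shape)  # (number of rows, number of columns)
```

Saving the same text to a file with a .csv extension and passing the filename to read_csv should give the same DataFrame.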
Other Delimiters / Separators – TSV files
The comma separation scheme is by far the most popular method of storing tabular data in text files.
However, the choice of the ',' comma character to delimit columns is arbitrary, and can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.
When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
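As a rough illustration of the sep parameter, the sketch below loads the same made-up table twice, once tab-separated and once semicolon-separated. The data is hypothetical and held in memory rather than in real files.

```python
import io

import pandas as pd

# The same hypothetical table, stored with two different delimiters
tsv_text = "name\tage\nAlice\t34\nBob\t28\n"
ssv_text = "name;age\nAlice;34\nBob;28\n"

# sep tells read_csv which character separates the columns
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")
ssv_data = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both calls should produce identical DataFrames
print(tsv_data.equals(ssv_data))
```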
Delimiters in Text Fields – Quotechar
One complexity in creating CSV files arises if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to create these fields.
The quote character can be specified in Pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation marks ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.
In the example shown, a semicolon-delimited file, with quotation marks as a quotechar, is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.
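The behaviour described above can be sketched with a small in-memory example. The names and nicknames here are invented; only the sep and quotechar parameters come from the text.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data; the first NickName value
# contains a semicolon, so it is wrapped in quotation marks.
ssv_text = (
    "Name;NickName;Age\n"
    'Shane;"Shaney; The Boss";30\n'
    'Anna;"Annie";25\n'
)

data = pd.read_csv(io.StringIO(ssv_text), sep=";", quotechar='"')

# The quoted semicolon stays inside the field instead of creating a new column
print(data.loc[0, "NickName"])
print(data.shape)
```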
3. Python – Paths, Folders, Files
When you specify a filename to Pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.
Finding your Python Path
Your Python path can be displayed using the built-in os module. The os module provides operating-system-dependent functionality to Python programs and scripts.
To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.
# Find out your current working directory
import os
print(os.getcwd())
# Out: /Users/shane/Documents/blog

# Display all of the files found in your current working directory
print(os.listdir(os.getcwd()))
# Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']
In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.
Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
File Loading: Absolute and Relative Paths
When specifying file names to the read_csv function, you can supply either absolute or relative file paths.
- A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
- An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux)
It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.
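The difference between the two kinds of path can be checked with the standard os.path functions. This is only a sketch: the "data" folder and "test_file.csv" name are the illustrative names used above, and no file needs to exist for these calls to work.

```python
import os

# A relative path: interpreted starting from the current working directory
relative_path = os.path.join("data", "test_file.csv")

# The matching absolute path, built from the current working directory
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # is the first path absolute?
print(os.path.isabs(absolute_path))  # is the second path absolute?

# '..' moves up to the parent directory; normpath collapses it
collapsed = os.path.normpath(os.path.join("data", "..", "test_file.csv"))
print(collapsed)
```

Either form can be passed to read_csv. Building paths with os.path.join rather than hard-coding separators is one way to keep relative paths portable between Windows, Mac, and Linux.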
4. Pandas CSV File Loading Errors
The most common errors you'll get while loading data from CSV files into Pandas will be:
- FileNotFoundError: File b'filename.csv' does not exist
  A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
- UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
  A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'.
- pandas.parser.CParserError: Error tokenizing data.
  Parse Errors (named pandas.errors.ParserError in recent Pandas versions) can be caused in unusual circumstances to do with your data format – try to add the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
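The first two errors can be reproduced and handled in a few lines. The sketch below is an illustration rather than a recommended recipe: the file names are invented, the files live in a temporary folder, and it assumes the problem file was saved as Latin-1. It uses the encoding parameter of read_csv as an alternative to re-saving the file.

```python
import os
import tempfile

import pandas as pd

file_error_seen = False
decode_error_seen = False

with tempfile.TemporaryDirectory() as tmp_dir:
    # 1. Loading a file that does not exist raises FileNotFoundError
    missing_path = os.path.join(tmp_dir, "missing.csv")
    try:
        pd.read_csv(missing_path)
    except FileNotFoundError:
        file_error_seen = True

    # 2. A hypothetical file saved as Latin-1 with accented characters
    latin_path = os.path.join(tmp_dir, "latin1.csv")
    with open(latin_path, "w", encoding="latin-1") as f:
        f.write("name,drink\nRenée,café\n")

    # read_csv assumes UTF-8 by default, so these bytes cannot be decoded
    try:
        pd.read_csv(latin_path)
    except UnicodeDecodeError:
        decode_error_seen = True

    # Passing the file's real encoding avoids the error
    data = pd.read_csv(latin_path, encoding="latin-1")

print(file_error_seen, decode_error_seen)
print(data)
```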
Advanced Read CSV Files
There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:
Specifying Data Types
As mentioned before, CSV files do not contain any type information for data. Data types are inferred through examination of the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.
Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters.
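A short sketch of dtype and parse_dates together, using made-up in-memory data. The column names are hypothetical, and the dates are written in ISO format so that no extra date options are needed.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data with a text, an integer, and a date column
csv_text = "name,age,birth_date\nAlice,34,1990-05-17\nBob,28,1996-11-02\n"

data = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32},  # explicit types instead of inference
    parse_dates=["birth_date"],            # parse this column as datetimes
)

# Inspect the resulting column types
print(data.dtypes)
```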
Skipping and Picking Rows and Columns From File
The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Similarly, the usecols parameter can be used to specify which columns in the data to load.
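The three parameters can be tried out on a tiny made-up table. In this sketch the data is hypothetical; note that when skiprows is given a list, the indices count lines of the file, so line 0 is the header row.

```python
import io

import pandas as pd

# Hypothetical data: a header line followed by four data rows
csv_text = "id,name,score\n1,a,10\n2,b,20\n3,c,30\n4,d,40\n"

# nrows: read only the first two data rows
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# skiprows with a list: skip file lines 1 and 3 (line 0 is the header)
without_some = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

# usecols: load only the named columns
two_cols = pd.read_csv(io.StringIO(csv_text), usecols=["id", "score"])

print(first_two)
print(without_some)
print(two_cols)
```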
Custom Missing Value Symbols
When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN are: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
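As a small illustration, the sketch below marks two custom tokens as missing. The data and the '??' and '.' tokens are invented for the example; the default tokens such as 'NA' are still recognised alongside the custom ones.

```python
import io

import pandas as pd

# Hypothetical export where missing salaries appear as '??', '.', or 'NA'
csv_text = "name,salary\nAlice,50000\nBob,??\nCara,.\nDan,NA\n"

data = pd.read_csv(io.StringIO(csv_text), na_values=[".", "??"])

# Count how many salary values were read as missing
missing_count = int(data["salary"].isna().sum())
print(data)
print(missing_count)
```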
# Advanced CSV loading example
data = pd.read_csv(
    "data/files/complex_data_example.tsv",  # relative python path to subdirectory
    sep='\t',  # Tab-separated value file.
    quotechar="'",  # single quote allowed as quote character
    dtype={"salary": int},  # Parse the salary column as an integer
    usecols=['name', 'birth_date', 'salary'],  # Only load the three columns specified.
    parse_dates=['birth_date'],  # Interpret the birth_date column as a date
    skiprows=10,  # Skip the first 10 rows of the file
    na_values=['.', '??']  # Take any '.' or '??' values as NA
)
CSV Format Advantages and Disadvantages
As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will encounter as you load, store, and exchange data in CSV format:
On the plus side:
- CSV format is universal and the data can be loaded by almost any software.
- CSV files are simple to understand and debug with a basic text editor
- CSV files are quick to create and load into memory before analysis.
However, the CSV format has some negative sides:
- There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
- There's no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
- File encodings can become a problem if there are non-ASCII compatible characters in text fields.
- CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using zip compression.
As an aside, in an effort to counter some of these disadvantages, two prominent data science developers in both the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.
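On the compression point in the list above, Pandas can write and read zipped CSV files directly. The sketch below is a rough demonstration with made-up, highly repetitive data written to a temporary folder; the file names are hypothetical, and real data will not necessarily compress as well.

```python
import os
import tempfile

import pandas as pd

# Hypothetical, highly repetitive data
data = pd.DataFrame({"id": range(1000), "value": [0.5] * 1000})

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "data.csv")
    zipped_path = os.path.join(tmp_dir, "data.csv.zip")

    # Write the DataFrame back to CSV, with and without zip compression
    data.to_csv(plain_path, index=False)
    data.to_csv(zipped_path, index=False, compression="zip")

    plain_size = os.path.getsize(plain_path)
    zipped_size = os.path.getsize(zipped_path)

    # read_csv infers the compression from the '.zip' extension
    restored = pd.read_csv(zipped_path)

print(plain_size, zipped_size)
print(restored.shape)
```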
Additional Reading
- Official Pandas documentation for the read_csv function.
- Python 3 Notes on file paths, working directories, and using the os module.
- Datacamp Tutorial on loading CSV files, including some additional os commands.
- PythonHow Loading CSV tutorial.
Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/