How to read and write to a CSV File using Pandas

Reading a CSV from a file is a very simple affair, and something that you are likely going to have to do many times during your career as a data scientist.

As an example lets read an image dataset I gathered of all the paintings I could find about the Nativity.

I will not go into detail on how I was able to gather this dataset, but if you are curious I have previously created a video showing how I gathered images using Panda data frames. and BeautifulSoup. You can find it here: https://www.youtube.com/watch?v=yMEMUf3vEvc

Let’s first import the pandas library so we can read our CSV:

import pandas

To understand how to read CSV let’s use Python’s help function:

help(pandas.read_csv)
Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer:Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal:str='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file into DataFrame.

    Also supports optionally iterating or breaking of the file
    into chunks.

    Additional help can be found in the online docs for
    `IO Tools <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>`_.

    Parameters
    ----------
    filepath_or_buffer : str, path object or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is
        expected. A local file could be: file://localhost/path/to/table.csv.

        If you want to pass in a path object, pandas accepts any ``os.PathLike``.

        By file-like object, we refer to objects with a ``read()`` method, such as
        a file handler (e.g. via builtin ``open`` function) or ``StringIO``.
    sep : str, default ','
        Delimiter to use. If sep is None, the C engine cannot automatically detect
        the separator, but the Python parsing engine can, meaning the latter will
        be used and automatically detect the separator by Python's builtin sniffer
        tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
        different from ``'\s+'`` will be interpreted as regular expressions and
        will also force the use of the Python parsing engine. Note that regex
        delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``.
    delimiter : str, default ``None``
        Alias for sep.
    header : int, list of int, default 'infer'
        Row number(s) to use as the column names, and the start of the
        data.  Default behavior is to infer the column names: if no names
        are passed the behavior is identical to ``header=0`` and column
        names are inferred from the first line of the file, if column
        names are passed explicitly then the behavior is identical to
        ``header=None``. Explicitly pass ``header=0`` to be able to
        replace existing names. The header can be a list of integers that
        specify row locations for a multi-index on the columns
        e.g. [0,1,3]. Intervening rows that are not specified will be
        skipped (e.g. 2 in this example is skipped). Note that this
        parameter ignores commented lines and empty lines if
        ``skip_blank_lines=True``, so ``header=0`` denotes the first line of
        data rather than the first line of the file.
  ...

There are quite a lot of parameters that read_csv will accept, for many use-cases that you might find. I will not go into detail on those for now. Let’s just read our CSV:

pd = pandas.read_csv("nativity_dataset.csv")

display(pd)
Image URLWeb page URLLabels
0https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/nativity-1…NaN
1https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-adorat…NaN
2https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/winter-lan…NaN
3https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-nativi…NaN
4https://d3d00swyhr67nd.cloudfront.net/w944h944…https://artuk.org/discover/artworks/the-adorat…NaN
215https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Herri_…NaN
216https://preview.redd.it/4tc6qepayw521.jpg?widt…https://www.nationalgallery.org.uk/paintings/g…NaN
217https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Adorac…NaN
218https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Koelne…NaN
219https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:The_vi…NaN

220 rows × 3 columns

It was that easy! You will notice that I used the display function to show the panda dataframe quite nicely. If you used the print() function it would have looked quite ugly:

print(pd)
                                             Image URL  ... Labels
0    https://d3d00swyhr67nd.cloudfront.net/w1200h12...  ...    NaN
1    https://d3d00swyhr67nd.cloudfront.net/w1200h12...  ...    NaN
2    https://d3d00swyhr67nd.cloudfront.net/w1200h12...  ...    NaN
3    https://d3d00swyhr67nd.cloudfront.net/w1200h12...  ...    NaN
4    https://d3d00swyhr67nd.cloudfront.net/w944h944...  ...    NaN
..                                                 ...  ...    ...
215  https://upload.wikimedia.org/wikipedia/commons...  ...    NaN
216  https://preview.redd.it/4tc6qepayw521.jpg?widt...  ...    NaN
217  https://upload.wikimedia.org/wikipedia/commons...  ...    NaN
218  https://upload.wikimedia.org/wikipedia/commons...  ...    NaN
219  https://upload.wikimedia.org/wikipedia/commons...  ...    NaN

[220 rows x 3 columns]

Ommitting the pandas display function

If you are too lazy to call display you can ommit it.

For instance:

pd
Image URLWeb page URLLabels
0https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/nativity-1…NaN
1https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-adorat…NaN
2https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/winter-lan…NaN
3https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-nativi…NaN
4https://d3d00swyhr67nd.cloudfront.net/w944h944…https://artuk.org/discover/artworks/the-adorat…NaN
215https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Herri_…NaN
216https://preview.redd.it/4tc6qepayw521.jpg?widt…https://www.nationalgallery.org.uk/paintings/g…NaN
217https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Adorac…NaN
218https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Koelne…NaN
219https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:The_vi…NaN

220 rows × 3 columns

What happened there?

The python notebook called the display function for us because it was the last value in the the python block.

Changing the column names

If you don’t like the column headers you can easily change it when reading the CSV:

pd = pandas.read_csv("nativity_dataset.csv", names=["Precise Image URL", "Precise Source URL", "Precise Labels"])
pd
Precise Image URLPrecise Source URLPrecise Labels
0Image URLWeb page URLLabels
1https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/nativity-1…NaN
2https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-adorat…NaN
3https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/winter-lan…NaN
4https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-nativi…NaN
216https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Herri_…NaN
217https://preview.redd.it/4tc6qepayw521.jpg?widt…https://www.nationalgallery.org.uk/paintings/g…NaN
218https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Adorac…NaN
219https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Koelne…NaN
220https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:The_vi…NaN

221 rows × 3 columns

It was almost what we wanted. But seems like the previous column names are still there. No worries, it is easy to fix.

pd = pandas.read_csv("nativity_dataset.csv", header=0, names=["Precise Image URL", "Precise Source URL", "Precise Labels"])
pd
Precise Image URLPrecise Source URLPrecise Labels
0https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/nativity-1…NaN
1https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-adorat…NaN
2https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/winter-lan…NaN
3https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-nativi…NaN
4https://d3d00swyhr67nd.cloudfront.net/w944h944…https://artuk.org/discover/artworks/the-adorat…NaN
215https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Herri_…NaN
216https://preview.redd.it/4tc6qepayw521.jpg?widt…https://www.nationalgallery.org.uk/paintings/g…NaN
217https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Adorac…NaN
218https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Koelne…NaN
219https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:The_vi…NaN

220 rows × 3 columns

You will notice that we have a column without any data(“Labels”). Let’s delete it. In [ ]:

del pd["Precise Labels"]

pd
Precise Image URLPrecise Source URL
0https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/nativity-1…
1https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-adorat…
2https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/winter-lan…
3https://d3d00swyhr67nd.cloudfront.net/w1200h12…https://artuk.org/discover/artworks/the-nativi…
4https://d3d00swyhr67nd.cloudfront.net/w944h944…https://artuk.org/discover/artworks/the-adorat…
215https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Herri_…
216https://preview.redd.it/4tc6qepayw521.jpg?widt…https://www.nationalgallery.org.uk/paintings/g…
217https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Adorac…
218https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:Koelne…
219https://upload.wikimedia.org/wikipedia/commons…https://commons.wikimedia.org/wiki/File:The_vi…

220 rows × 2 columns

The column is gone. Let’s save the panda dataframe to a CSV

pd.to_csv("nativity_dataset_updated.csv")

That was easy, right? There is a lot more that we can do with pandas and CSVs but I am sure this will help you get started.

That will be all for now. Happy Coding!

RESOURCES

Google Colab Notebook:

Open In Colab

Github Repository:

https://github.com/armindocachada/working-with-pandas-series

Submit a Comment

Your email address will not be published. Required fields are marked *