
Selecting columns when reading a CSV into pandas 

13K views

Are you reading a CSV file into pandas? Your analysis might not require all of the columns - and you can save a lot of memory by selecting only those that you need. In this video, I show you how to select specific columns from a CSV file, either by name or by position.
Jupyter notebooks for my YouTube videos are all at: github.com/reuven/youTube-notebooks
And don't forget to subscribe to "Better Developers," with free, weekly articles about Python: BetterDevelopersWeekly.com/
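
For reference, here's a minimal sketch of both approaches using pandas' usecols parameter; the file name and the specific columns are placeholders, not from the video:

    import pandas as pd

    # Select only the columns you need, by name...
    df = pd.read_csv("taxi.csv",
                     usecols=["passenger_count", "trip_distance", "total_amount"])

    # ...or by zero-based position:
    df = pd.read_csv("taxi.csv", usecols=[3, 4, 16])

Either way, the unneeded columns never end up in the data frame, which is where the memory savings come from.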

Science

Published: 3 Mar 2022

Comments: 19
@zidnely 2 years ago
I love the way you explain this. Thanks for a NICE video!
@ReuvenLerner 2 years ago
So glad to hear you enjoyed it; thanks for the kind words!
@zidnely 2 years ago
@ReuvenLerner Sir, please help me read the values of a specific column in multiple CSV files.
@ReuvenLerner 2 years ago
@zidnely You'll need to read from multiple files, grabbing only the column of interest in each one, and concatenate them together. Try this: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-NBKMDWBWwwI.html
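
A minimal sketch of that approach, with a made-up glob pattern and column name:

    import glob
    import pandas as pd

    # Read just the column of interest from each CSV file, then stack
    # the pieces into a single data frame:
    dfs = [pd.read_csv(filename, usecols=["price"])
           for filename in sorted(glob.glob("data/*.csv"))]
    df = pd.concat(dfs, ignore_index=True)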
@KT-oz1md 7 months ago
Thank you so much! This is what I've been looking for
@ReuvenLerner 7 months ago
Great to hear -- glad I could help!
@KunjaBihariKrishna a year ago
I really fell in love with CSV files. I recently started learning Python to handle transaction data and capital gains calculations with pandas.
@ReuvenLerner a year ago
CSV isn't the best format, but it's definitely the most popular. Glad you're using Python for this sort of thing, and that you enjoyed the video!
@KunjaBihariKrishna a year ago
@ReuvenLerner I'm getting by on YouTubers and some ChatGPT (which seems drunk half the time). But all the time I used to spend on video games is now spent on programming, because it apparently scratches whatever itch the games were scratching. I've got quite a fun project lined up: I basically need to create a sort of persistent dataframe that I (probably) need to save locally. I need to save the average price of several thousand different cryptocurrencies for each day, going back to the date each currency was listed. I first worked on connecting a Google sheet directly to the CoinGecko API (which has that data), but quickly realized that this would mean each transaction sends a request to the API, which would take hours given the rate limit. So I basically have to set up a script that runs for several days or weeks and slowly puts together a local database of cryptocurrency prices. I suppose that could be appended to a CSV, and pandas could then load it into a dataframe in the scripts where I calculate the dollar value of a list of transactions.
@ReuvenLerner a year ago
@@KunjaBihariKrishna If you can get the data, then you can create a data frame and analyze it - and it doesn't matter whether it's weather data, crypto rates, or the price of oil. That's the amazing thing about this technology; it works with anything you can throw at it, and the limits are typically going to be size (i.e., how much can fit in memory), time (i.e., how long it takes to process), and your understanding. And yes, it is addictive!
@KunjaBihariKrishna a year ago
@ReuvenLerner Thanks for your reply. I'm going to try it first with a small list of currencies, so I can deal with size limitations later. I have so many basics to figure out. For example, I've managed to create a script that aggregates interest payments into one payment per day (which cuts the total transaction amount by 80%). However, after using groupby, I'm left with a dataframe that only contains the columns that were relevant to the aggregation, and I haven't figured out how to put the original CSV back together. I wonder if, after the aggregation, I need to create a new dataframe with the same header, put the aggregated transactions in there (while keeping the column positions correct), and then merge it with the original dataframe? But then I'd also have to remove each aggregated row from the original first, I guess. Oh well, I'm just thinking out loud. I'm obviously very new at this. Watching these videos saves me a lot of trial and error.
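
One rough sketch of that aggregate-and-recombine idea, under assumed column names ("date", "description", "amount") that will almost certainly differ from the real file:

    import pandas as pd

    df = pd.read_csv("transactions.csv", parse_dates=["date"])
    is_interest = df["description"] == "interest"

    # Collapse the interest rows to one row per day, keeping every column:
    # text columns take their first value in each group, amounts are summed.
    daily_interest = (
        df[is_interest]
        .groupby(df.loc[is_interest, "date"].dt.date)
        .agg({"date": "first", "description": "first", "amount": "sum"})
        .reset_index(drop=True)
    )

    # Put the aggregated rows back alongside the untouched ones:
    result = pd.concat([df[~is_interest], daily_interest], ignore_index=True)
    result = result.sort_values("date")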
@juhakumpula8070 a year ago
If you're getting "Usecols do not match columns, columns expected but not found", just add sep=';' (or whatever your separator is).
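
In code, that looks something like this (the semicolon-delimited file and column names are made up for illustration):

    import pandas as pd

    # usecols can only match the header once the separator is correct:
    df = pd.read_csv("data.csv", sep=";", usecols=["name", "score"])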
@ReuvenLerner a year ago
Yes, the separator not matching is one of the most common problems I encounter.
@shaswatachowdhury9032 5 months ago
Awesome! Thank you very much!
@ReuvenLerner 5 months ago
My pleasure!
@hadikarimi2818 2 years ago
Great explanation! I have precipitation values in all the columns. How can I select all of my columns that start with D_, which are 7400 datetime columns (for example, D_20001201)?
@ReuvenLerner 2 years ago
If you have a list of columns, then you can get only those columns from a data frame. So if you say

    mycols = ['D1', 'D2', 'D3']
    df[mycols]

you'll get all rows in df, but only the columns D1, D2, and D3. Given that, one way to get all columns that start with D_ could be:

    mycols = [one_colname for one_colname in df.columns if one_colname.startswith("D_")]
    df[mycols]
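
The same filtering can also happen while the CSV is being read, since usecols accepts a callable that is applied to each column name (the file name below is just a placeholder):

    import pandas as pd

    # A minimal sketch: keep only columns whose names start with "D_"
    # as the file is read, instead of loading everything first.
    df = pd.read_csv("precipitation.csv",
                     usecols=lambda colname: colname.startswith("D_"))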
@aliosaid1374 a year ago
How can I choose only a range of columns, such as 5 to 9? I tried 5:9 and it doesn't work.
@ReuvenLerner a year ago
The documentation says that you need to pass a list-like object. 5:9 is translated into a slice object when it's inside of [], so that won't work. But range(5,9) might work - or at worst, list(range(5,9)). I'll look into this more, and maybe I'll have a video about it in a few days!
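
A quick sketch of that suggestion (the file name is made up, and positions are zero-based):

    import pandas as pd

    # A slice like 5:9 isn't list-like, but a range of integer positions is;
    # this keeps the columns at positions 5, 6, 7, and 8:
    df = pd.read_csv("data.csv", usecols=range(5, 9))

    # If a plain range object ever gives trouble, a list definitely works:
    df = pd.read_csv("data.csv", usecols=list(range(5, 9)))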