Vinayak Mehta - Extracting tabular data from PDFs with Camelot & Excalibur - PyCon 2019 

PyCon 2019

"Speaker: Vinayak Mehta
Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn't work.
This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs using the current ecosystem of libraries and tools and demonstrate how Camelot and Excalibur solve this problem better and in a scalable manner. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables in pandas DataFrames. You can also download them as CSVs or Excel files.
Slides can be found at: speakerdeck.com/pycon2019 and github.com/PyCon/2019-slides"
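As a rough illustration of the workflow the description mentions (detect tables, get them as pandas DataFrames, save them as CSV), here is a minimal sketch using Camelot's `read_pdf` entry point. The helper names `csv_name` and `extract_tables`, the output naming scheme, and the choice of `flavor="lattice"` are assumptions made for this example, not something taken from the talk.

```python
from pathlib import Path


def csv_name(pdf_path: str, index: int) -> str:
    """Hypothetical naming scheme: report.pdf, table 0 -> report-table-0.csv."""
    return f"{Path(pdf_path).stem}-table-{index}.csv"


def extract_tables(pdf_path: str, out_dir: str = ".", pages: str = "1"):
    """Extract tables from a text-based PDF and write one CSV per table.

    Returns the extracted tables as a list of pandas DataFrames.
    """
    # Imported lazily so this module still loads where Camelot is not installed.
    import camelot

    # "lattice" expects tables drawn with ruling lines; "stream" is the
    # alternative for tables separated only by whitespace (assumed choice here).
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")

    frames = []
    for i, table in enumerate(tables):
        frames.append(table.df)  # each extracted table exposes a DataFrame
        table.to_csv(str(Path(out_dir) / csv_name(pdf_path, i)))
    return frames
```

Calling `extract_tables("report.pdf", pages="1-3")` on a hypothetical `report.pdf` would return one DataFrame per detected table; each table also has a `to_excel` method for spreadsheet output.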

Published: May 4, 2019

Comments: 18
@EricPalmer_DaddyOh 5 years ago
Awesome. I'm going to try this out soon on open data pdf files. Looks like just what I need?
@christianlira1259 4 years ago
Thank you, Vinayak Mehta, for the great presentation and the tons of work you did. This is a great tool, and I look forward to reading about your OCR capabilities.
@Yelonek1986 2 years ago
Awesome, thanks for this library! It works like a charm.
@muhammadahsam1346 4 years ago
Awesome library, but how do we swap the columns after converting the table to Excel or CSV format?
@venkateswaraotella6581 1 year ago
I need to extract the document as it is. Where do I need to change the code?
@hayathbasha4519 3 years ago
Hi,
@csdevendrajain9114 4 years ago
Ghostscript does not work on my PC. I have done everything, like adding it to the path and environment variables, but every time an error says this app can't run on my PC and access is denied. I am on Windows 8.1.
@Mach7RadioIntercepts 3 years ago
Nice talk! Monty Python LOL. Dude, I knew I was going to be a big MP fan when I was punished in grade school for acting out the stoning scene in "The Life of Brian".
@srikantpadhy9476 4 years ago
Do Camelot and Excalibur work for scanned PDFs?
@amitkumdixit 4 years ago
Not working, it failed miserably. It only showed the first row of the tables. Tabula gave me a perfect result. I wanted to extract a table from a bank account statement.
@ShiquanWang 5 years ago
For the first question saying no good tool to convert a PDF file to HTML with its original layout/look.