How to perform text analytics in R on Multiple PDF Documents

Text analytics is the automated process of translating large volumes of unstructured text into quantitative data to uncover insights. This tutorial shows how to perform text analytics (text mining) in R. It focuses on analyzing multiple PDF documents in R / RStudio, using the pdftools library to read the PDFs and the tm library for the text analysis. Three documents are loaded simultaneously and compared. Script available at www.buymeacoffee.com/opaldona...
If you wish to support me in the creation of future content you can buy me a coffee here: www.buymeacoffee.com/opaldonaldc
Thanks for your support
Learning R can be intimidating; however, R for Everyone, a book written by Jared Lander, is one I use to simplify many concepts in R. The book provides extensive hands-on practice and sample code. You will learn basic program control, data import, manipulation, visualization, and more. You can purchase the text here: amzn.to/3ov8QaW
💥💥💥DISCLAIMER💥💥💥
The link provided is an affiliate link, which means I may receive a small commission at NO ADDITIONAL cost to you if you decide to purchase this text. This is a great text for beginners or persons who need a refresher.
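
The workflow summarized in the description (read several PDFs with pdftools, then analyze them with tm) might look roughly like the sketch below. This is not the author's script, which is only available through the link above. The helper name `build_pdf_tdm` and the choice of cleaning options are assumptions made for illustration.

```r
library(pdftools)  # pdf_text() returns one string per page
library(tm)        # VCorpus(), TermDocumentMatrix(), findFreqTerms()

# Hypothetical helper: build a term-document matrix from every PDF in `folder`.
build_pdf_tdm <- function(folder = ".") {
  files <- list.files(path = folder, pattern = "\\.pdf$", full.names = TRUE)
  if (length(files) == 0) stop("No PDF files found in: ", folder)

  # One document per PDF: the pages of each file are collapsed into one string.
  texts <- vapply(files,
                  function(f) paste(pdf_text(f), collapse = "\n"),
                  character(1))
  corpus <- VCorpus(VectorSource(texts))

  # Cleaning options are applied while the matrix is built.
  TermDocumentMatrix(corpus, control = list(
    tolower = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE
  ))
}

# Usage (the path is a placeholder):
# tdm <- build_pdf_tdm("path/to/pdfs")
# findFreqTerms(tdm, lowfreq = 20)   # terms that occur at least 20 times
```

Each column of the resulting matrix is one PDF, so the documents can be compared term by term.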

Published: 7 Aug 2024

Comments: 73

@prometeo34 2 years ago

Madame, you are one of the best teachers I have seen...well done! Thanks so much for these videos.

@DataCentricInc 2 years ago

You are welcome Carlos

@ehecatl3830 2 years ago

Your English is very clear. Thanks.

@christopherkhaddockphd9511 1 year ago

This video is excellent. You have a tremendous talent for teaching!

@DataCentricInc 1 year ago

Thank you 😊

@rebeccadsolson1207 2 years ago

You are a great teacher! Such clear explanations. Thank you so much!

@DataCentricInc 2 years ago

You're very welcome Rebecca! Glad it was helpful.

@dawitzewde6654 1 year ago

You're fantastic, as always. Thanks so much for your help.

@pcsksa5 2 years ago

That's brilliant. Thank you for sharing.

@kemalgunay 2 years ago

Very helpful content, thanks for sharing

@DataCentricInc 2 years ago

Thank you Kemal😊

@carlitofernandes5491 2 years ago

Fantástico, thanks, i search skill work most pdfs, obrigado, from Brazil

@DataCentricInc 2 years ago

Thank you Carlito

@MsBambi01 2 years ago

Thank you for a great video! It has helped me so much :)

@DataCentricInc 2 years ago

You are welcome

@vengateshprasathramamurthy2801 2 years ago

Great Video! Thank you!

@DataCentricInc 2 years ago

You are welcome Vengatesh

@kats_pajamas6908 2 years ago

thank you so much! Amazing video!

@DataCentricInc 2 years ago

You are welcome @kats_pajamas

@saqibwarriach 1 year ago

What is missing from the Data Centric Inc tutorial series is the annotation of function words and content words as a pre-processing step. It would be great to get your insights, with hands-on annotation and removal of function words prior to analysis.

@andreubrito11 1 year ago

Very good tutorial!!

@DataCentricInc 1 year ago

Thanks 😊

@brianisinga918 1 year ago

This is fantastic. Thank you. Could you kindly consider making a video on how to remove the first, say, 5 lines from several PDF files and merge them? Or rather, how to combine data from different PDF files after the 5th line/row.
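
One possible approach to the request above, sketched in base R: `pdf_text()` from pdftools returns one string per page, so each file can be split into lines, trimmed, and then concatenated. The helper name `drop_leading_lines` and the sample strings are made up for illustration; with real files, the page vectors would come from something like `lapply(files, pdftools::pdf_text)`.

```r
# Hypothetical helper: join the pages of one document, split them into lines,
# and drop the first `n` lines.
drop_leading_lines <- function(pages, n = 5) {
  lines <- unlist(strsplit(paste(pages, collapse = "\n"), "\n", fixed = TRUE))
  lines[seq_along(lines) > n]
}

# Stand-ins for the per-page text of two PDFs (invented sample data).
pages_a <- c("h1\nh2\nh3\nh4\nh5\nbody a1\nbody a2", "body a3")
pages_b <- c("h1\nh2\nh3\nh4\nh5\nbody b1")

# Trim each document, then merge the remaining lines into one vector.
merged <- unlist(lapply(list(pages_a, pages_b), drop_leading_lines, n = 5))
merged
```

What counts as a "line" depends on how the PDF was laid out, so the cutoff may need adjusting per file.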

@jahanzebtube 1 year ago

Great explanation; you explain the concepts in an easy way and make it look simple. Thank you. I was running the same code and came across a problem. I was wondering if you could shed some light on it. When I run the Corpus function, it gives this error: Error in file(con, "r") : invalid 'description' argument. Can you please help?

@alancelaya3123 8 months ago

THANKS FOR THE TUTORIALS... I HAVE A QUESTION: I need to apply OCR on pdfs before starting to analyze them do you have a tutorial about this issue?

@lowperformer_berlin 1 year ago

Hey, really cool video! Thank you very much! I have one question about the results of line 21 (frequency analysis). If we do not count the words with that function, what do the numbers in the [...] brackets tell me?

@vincentdepaulsavarimuthu779 1 year ago

really you are great madam.

@DataCentricInc 1 year ago

Thank you

@universoflearningacademy9503 3 months ago

i tried lots of time by creating different project but always object database not found. what I will do when I run this pdfdatabase

@affanasif7506 1 year ago

How can I find the frequency of some particular words? For example, I want to know the frequency of certain words like "Technology", "blockchain", "peer to peer transaction", "new systems", etc.
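
A term-document matrix built with tm's default tokenizer holds single words, so multi-word phrases such as "peer to peer transaction" would not appear in it as one term. One way to count such phrases is to search the raw text directly. The sketch below uses base R only; the helper name `count_phrases` and the sample texts are assumptions for illustration.

```r
# Hypothetical helper: case-insensitive count of each phrase in each text.
# Returns a matrix with one row per text and one column per phrase.
count_phrases <- function(texts, phrases) {
  # Lower-case and collapse whitespace so line breaks do not split a phrase.
  clean <- gsub("\\s+", " ", tolower(texts))
  counts <- vapply(phrases, function(p) {
    hits <- gregexpr(tolower(p), clean, fixed = TRUE)
    vapply(hits, function(h) sum(h > 0), integer(1))  # no match is coded -1
  }, integer(length(clean)))
  matrix(counts, nrow = length(clean),
         dimnames = list(names(texts), phrases))
}

# Stand-ins for the text of two PDFs (invented sample data).
texts <- c(
  doc1 = "Peer to peer transaction costs fall.\nA peer to peer\ntransaction is fast.",
  doc2 = "New systems and blockchain technology."
)
result <- count_phrases(texts, c("peer to peer transaction", "blockchain", "new systems"))
result
```

Because the match is a plain substring search, a phrase like "technology" would also be counted inside "biotechnology"; a word-boundary regular expression would be needed to avoid that.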

@shehurufai9273 2 years ago

I look forward to working with you for my PhD thesis. Hope you will respond soon.

@DataCentricInc 2 years ago

Hi Shehu, how may I be of assistance?

@christopherbrown576 1 year ago

How do you search for individual specific terms, rather than frequently used terms? Thanks!

@igwegbehenrychinaza7908 1 year ago

Thank you, ma'am. Kindly share the link to download the PDFs so that I can repeat what you did at home. Thanks in anticipation.

@SanjayFuloria 1 year ago

Thank you very much. I have a problem. When I run the Corpus function to create the pdfdatabase, I get the following error: PDF error: Unknown Metadata type: 'XMP'. Could you please help me with that?

@justdrawing9207 2 years ago

Hello, thank you for your videos, they help us so much! Please how many papers we can analyze? We can analyze more than 3 PAPERS ??

@DataCentricInc 2 years ago

You are welcome JustDrawing. You can analyze more than three. I have done up to 30 and you could probably do more.

@justdrawing9207 2 years ago

@@DataCentricInc Thank you so much professor 🙏🏻🙏🏻🙏🏻

@agustincsn 1 year ago

I tried and followed the scripts given but when I load command opinion

@17Adamovic 2 years ago

thank you for the great work/video! One question, what would be the line to run to search for a specific set of words?

@DataCentricInc 2 years ago

Thanks 17Adamovic. If you watch parts 2 & 3 of text analytics on PDF, you will see additional ways to analyze the content on page level, document level and filter by words. Kindly see the following titles: How to perform Text Analytics on PDF Documents in R? Multiple PDF Analysis in R

@17Adamovic 2 years ago

@@DataCentricInc As I'm brand new to learning R, and need it to do some research work for a professor, I've been watching and learning from your videos! I did watch the other parts, but I don't believe the search/count of a specific word was shown, unless I missed it. You show us how to filter or search for the most frequent words, but I was wondering if we could simply count the occurrences of a specific word, like "cyber".

@DataCentricInc 2 years ago

@@17Adamovic Kindly see the code below, which you can use to filter the frequency of words in the Term Document Matrix. Hope this helps :)

inspect(opinions.tdm[c("cyber"),]) # search for specific words
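
The lookup in the reply above can be tried on a small in-memory example. In the sketch below, `toy.tdm` and the helper `term_counts` are hypothetical names, and the three strings stand in for the text of three PDFs. The helper reports a row of zeros for a term that is not in the matrix, since indexing a matrix by a missing row name would otherwise stop with an error.

```r
library(tm)

# Toy documents standing in for the text of three PDFs (invented sample data).
docs <- c("Cyber risk and cyber insurance",
          "Blockchain technology overview",
          "Cyber policy and technology")
toy.tdm <- TermDocumentMatrix(
  VCorpus(VectorSource(docs)),
  control = list(tolower = TRUE, removePunctuation = TRUE)
)

# Hypothetical helper: per-document counts for the requested terms.
# Terms that are absent from the matrix come back as rows of zeros.
term_counts <- function(tdm, terms) {
  m <- as.matrix(tdm)
  out <- matrix(0, nrow = length(terms), ncol = ncol(m),
                dimnames = list(terms, colnames(m)))
  found <- intersect(terms, rownames(m))
  out[found, ] <- m[found, , drop = FALSE]
  out
}

counts <- term_counts(toy.tdm, c("cyber", "technology", "ledger"))
counts           # one row per term, one column per document
rowSums(counts)  # total occurrences across all documents
```

If the matrix was built with `stemming = TRUE`, the rows hold word stems, so the search term would need to be given in its stemmed form.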

@17Adamovic 2 years ago

@@DataCentricInc Ahh!! You are the best... thank you!

@17Adamovic 2 years ago

@@DataCentricInc Do you have a video on the cleaning code needed to avoid missing search words that contain an apostrophe (like cyber's)? When I apply the cleaning code from your current videos, such as removePunctuation, stopwords, tolower, stemming, removeNumbers and bounds, and then search for a specific word, it still misses the words with apostrophes in them, even if I change the search term to, say, "cybers", since the previous code might remove the apostrophe.
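
A possible explanation, offered as an assumption rather than a diagnosis: removing punctuation turns "cyber's" into "cybers", and stemming may then reduce it further, so neither "cyber's" nor "cybers" is guaranteed to exist as a term. PDF text also often uses the typographic apostrophe (’) instead of the straight one. One way to sidestep both issues is to strip possessive endings before the other cleaning steps. The helper name `strip_possessives` below is hypothetical.

```r
# Hypothetical helper: normalise apostrophes, then drop possessive "'s" endings.
strip_possessives <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)  # typographic apostrophe -> straight
  gsub("'s\\b", "", x, perl = TRUE)          # "cyber's" -> "cyber"
}

cleaned <- strip_possessives(c("Cyber's impact", "the firm\u2019s cyber policy"))
cleaned
```

With tm, a function like this can be applied to a corpus before building the matrix, for example `tm_map(corpus, content_transformer(strip_possessives))`. Note that it also shortens contractions such as "it's" to "it".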

@zachabenz 2 years ago

Thank you for your interesting video. May I ask where to get the "tm" package, please?

@DataCentricInc 2 years ago

Hi Zacha B, you can type the following line to install the tm library: install.packages("tm")

@zachabenz 2 years ago

@@DataCentricInc Thank you very much. 👍🙏

@harmandeepsingh8903 2 years ago

Great work, ma'am. I have one question, if you could help me with it: suppose we take only a specific term from the PDF and then want to analyze that specific term. Is it possible?

@DataCentricInc 2 years ago

Yes it is possible to focus on a term.

@harmandeepsingh8903 2 years ago

Thank you for response, please do a video on that for your subscriber

@agatabreczko6388 2 years ago

Hello! When I am writing the code, in the line 9: "pdfdatabase

@DataCentricInc 2 years ago

Ensure you run the line to require pdftools

@josephjohns1336 2 years ago

Could you please make a video about how to scrape, clean, and visualize data from within tables in a pdf using R? Preferably not a video that uses the tabulizer library or family of libraries. Only pdf tools please.

@DataCentricInc 2 years ago

Hi Joseph, I will take a look at this and let you know.

@DataCentricInc 2 years ago

Hi John, you can look out for this video next Monday.

@itumelengmosala5335 2 years ago

Am continuing to struggle with the code below. Giving me error list.files(path = folder , pattern = "pdf$") folder

@DataCentricInc 2 years ago

Hi itumeleng, I have asked you to send me an email. Check the previous replies I have sent.
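
For readers stuck at the same step: `list.files()` needs its `path` argument to exist before the call, so a variable such as `folder` has to be assigned first. The sketch below creates a throwaway folder with empty placeholder files so it can run anywhere; the folder and file names are made up, and with real data `folder` would simply point at the directory holding the PDFs.

```r
# Throwaway demo folder with placeholder files (assumption: replace `folder`
# with the path to your own PDFs and skip the file.create() step).
folder <- file.path(tempdir(), "pdf_demo")
dir.create(folder, showWarnings = FALSE)
file.create(file.path(folder, c("a.pdf", "b.pdf", "notes.txt")))

# Define `folder` first, then list only the files that end in ".pdf".
# full.names = TRUE keeps the folder in each path, so later steps can open
# the files regardless of the current working directory.
files <- list.files(path = folder, pattern = "\\.pdf$",
                    full.names = TRUE, ignore.case = TRUE)

# Stop early with a clear message if nothing was found.
if (length(files) == 0) stop("No PDF files found in: ", folder)
basename(files)
```

Checking that `files` is not empty before the later steps makes a wrong path easier to spot.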

@kripa_dristi 2 years ago

Can you please make a video on text mining in search of pdf online by using one keyword

@DataCentricInc 2 years ago

Thanks for your feedback Kripa however I need a little more clarity on this request. Is it that you want to search for a PDF file on the web using R and then perform text mining on the results?

@kripa_dristi 1 year ago

@@DataCentricInc can you directly implement text mining to search & download any pdf available in web or from any publisher

@er2759 2 years ago

Hello, thank you for the great videos!! I have some issues with line 4; it's not working. I sent you an email, hopefully you can help me. The error is: Error in lapply(files, pdf_text) : object 'files' not found

@DataCentricInc 2 years ago

Hi ER, it is difficult to diagnose the problem from this error alone. Ensure you run the line that creates the files variable, just to be safe, because skipping it could cause this error as well.