
"Probabilistic scripts for automating common-sense tasks" by Alexander Lew 

Strange Loop Conference
75K views

As engineers, we love automating tedious tasks. But when those tasks require common-sense reasoning, automation can be difficult. Consider, for example, cleaning a messy dataset full of typos, NULL values, numbers in the wrong units, and other problems. People have little trouble fixing these errors by hand, but it can be difficult to express the rules for doing so programmatically.
In this talk, I'll introduce a new declarative-programming approach for automating common-sense reasoning tasks: probabilistic scripting. Probabilistic scripts encode (possibly uncertain) domain knowledge declaratively, and leave it to the compiler to synthesize an efficient inference algorithm that will solve the task at hand. This is all made possible by recent advances in the field of probabilistic programming, in particular programmable inference engines.
I will demonstrate how this technique can be used to design and implement a scripting language for automating real-world data-cleaning tasks, which achieves state-of-the-art accuracy on data-cleaning benchmarks. More broadly, attendees will come away with a sense of how probabilistic programming can be used to bring common-sense reasoning to the automation of all sorts of tasks.
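
(Not from the talk itself, but as a rough sketch of the idea the abstract describes: the snippet below scores candidate corrections for a misspelled city by combining a population-based prior with a crude typo likelihood, which is the kind of Bayesian weighing that a probabilistic script declares and an inference engine automates. All names, numbers, and helper functions here are made up for illustration.)

```python
from difflib import SequenceMatcher

# Hypothetical prior knowledge: population counts act as a prior over cities.
# These figures are illustrative only, not real data.
city_population = {
    ("Beverly Hills", "CA"): 34_000,
    ("Beverly Hills", "MO"): 600,
    ("New York", "NY"): 8_400_000,
}

def typo_likelihood(observed: str, intended: str) -> float:
    """Crude likelihood that `observed` is a typo of `intended`:
    higher string similarity -> higher probability."""
    similarity = SequenceMatcher(None, observed.lower(), intended.lower()).ratio()
    return similarity ** 5  # sharpen so near-exact matches dominate

def city_posterior(observed_city: str) -> dict:
    """Posterior over the intended (city, state) given the observed string:
    P(city | observed) is proportional to P(observed | city) * P(city)."""
    total_pop = sum(city_population.values())
    scores = {}
    for (city, state), pop in city_population.items():
        prior = pop / total_pop
        scores[(city, state)] = typo_likelihood(observed_city, city) * prior
    norm = sum(scores.values())
    return {key: score / norm for key, score in scores.items()}

if __name__ == "__main__":
    # A record that says "Beverlly Hills" with no state is most likely the
    # Californian city, simply because far more people live there.
    print(city_posterior("Beverlly Hills"))
```

The real models in the talk are richer than this (latent records, error models per column, and programmable inference via Gen), but weighing candidate explanations by prior times likelihood is the core intuition.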
Alexander Lew
MIT Probabilistic Computing Project
Alex Lew is a Ph.D. student at MIT's Probabilistic Computing Project, and a lead researcher for Metaprob, an open-source probabilistic programming language embedded in Clojure(Script). He aims to build tools that empower everyone to use probabilistic modeling and inference to solve problems creatively. Before coming to MIT, Alex designed and taught a four-year high-school computer science curriculum at the Commonwealth School in Boston. A native of Durham, NC, he also returns home each summer to teach at the Duke Machine Learning Summer School (and spend time with his family and their dogs!).

Science

Published: 30 Jun 2024

Comments: 74
@nuclearlion 4 years ago
As a former trucker, could have used a trigger warning for that low bridge :) anything less than 13' 6" clearance is potentially traumatic. Loving this talk, thank you!
@grex2595 1 year ago
Don't look up can opener bridge.
@davidknipe1179 4 years ago
10:20: "At HP in Cambridge they've made a neon sign of the formula." I used to work at that office. That sign was removed a few years ago. The rumour was that it was judged to be intimidating to customers. And in fact the sign predated the takeover of that company (Autonomy) by HP, so it's not true to say that HP made the sign. I believe they also removed one that said "S = k log W" (entropy) at the same time.
@rban123 4 years ago
When you said “cleaning data is the most time-consuming part”....... I felt that
@GopinathSadasivam 4 years ago
Excellent, Concise and Clear presentation! Thanks!
@123TeeMee 4 years ago
Wow, amazing talk! Really enjoyed it. I can see this being refined into a powerful tool that many people can and should use.
@mateuszbaginski5075 2 years ago
That's awesome. Maybe one tiny but significant step closer to formalizing and operationalizing common sense.
@catcatcatcatcatcatcatcatcatca
”We can’t just train a neural network on a slide-deck and tell it to use common sense” captures quite well how far we have gotten in three years. More formal rules might still provide better data, and there is no way ChatGPT could parse such a dataset in one go (and so it can’t utilise the already-included data properly). But still, we now can. And if properly implemented it is likely on par with human common sense, if not just better. It might not even need the slide-deck! That’s wild.
@sau002 4 years ago
I like the scenario based approach. Demonstrate the problem from an end user's perspective and then arrive at the solution.
@DKL997 4 years ago
Great talk! A very useful subject, and presented quite well.
@aikimark1955 4 years ago
Wonderful presentation, Alex.
@Spookyhoobster 4 years ago
This seems so crazy, definitely going to give Gen a look. Thanks for the video.
@jimmy21584 4 years ago
That was a fantastic talk. I was waiting for some pseudocode or a deeper technical breakdown of how they implemented it, but it was great for an introduction.
@ukaszMarianszki 4 years ago
What? You got both actual code [or if what he showed wasn't actual code, then it's most certainly pseudocode], and a bit of a peek at how they actually implemented it [although not a complete breakdown].
@jjurksztowicz 4 years ago
Great talk! Excited to dig in and try Gen out.
@artzoc 4 years ago
Excellent and superb! And a little bit unbelievable. Thank you.
@fabriceaxisa 4 years ago
I implemented such an algorithm in 2016. I am always happy to learn more about it.
@gregmattson2238 3 years ago
Wow, three talks, three times my mind has been blown. In particular, the idea of iterating over the data itself to get valid values is just plain insane. I may actually use Alloy, PClean, Stabilizer and Coz.
@thestopper5165 4 years ago
This is a nice way of presenting stuff that good data-mungers have done since I was a grad student (last millennium). Before everyone gets all wet about it, you need to consider that the proposed solution *requires data that is not part of the original table* - specifically, population by city by state. May as well just assume that your dodgy data can be saved by some other data you've got lying around. Good old *Deus ex machina*.

Also, not for nuthin'... *rental prices are also subject to typos* - so in the example at 16:00 the sign of the outcome is reversed if there is a typo in the rental price. Let's say it should have been $400 rather than $4000 - although there will be more available rentals in Beverly Hills CA, very few of them will be at $400/rm per mth (or more accurately, $400/mth will be further from the mean for BH, CA than it is for BH, MO).

TL;DR: CompSci guys need to learn actual statistics before they learn statistics buzzwords - and not just a 1-semester Intro Stats that people take as undergrad, but something with some genuine meat. Once you accept that *everything* is subject to typos, the number of tailored kludges needed to clean a set of data is arrived at by iteration.

And that's leaving aside things like spatial data: if you know that a polygon representing a property boundary can't have self-intersections, what is the correct mechanism when ST_IsValid() fails? (Answer: it depends). If parcel boundaries and property boundaries can overlap, and sets of properties and parcels represent a specific set of coordinates, what is wrong when the total area of the parcels exceeds that of the properties? (Answer: it depends).

Part of my role used to involve cleaning a geospatial dataset that contained 3 million property boundaries, and 3.7 million parcel boundaries, that cover a state. It had to be done every month, even though a very large proportion of the data had identifiers that showed that the data was unchanged for the month (more polygon ST_IsValid() errors would pop up in the supposedly unchanged data than total actual changes due to subdivisions or consolidations). *FML*, in other words.
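
(A hedged illustration of the commenter's point, not anything from the talk: if the rent column is also allowed to be noisy, the model weighs "Beverly Hills, CA with the rent as written" against "Beverly Hills, MO with a digit typo in the rent". Every probability below is invented purely to show the structure.)

```python
# Toy joint inference over (state, rent_has_typo) for a "Beverly Hills"
# listing at $4000/month with the state missing.  All numbers are invented
# for illustration; none of them come from the talk or from real data.

state_prior = {"CA": 0.98, "MO": 0.02}   # population-weighted prior over states
p_rent_typo = 0.01                        # prior chance the rent field gained an extra digit

# Plausibility of the observed rent under each hypothesis: $4000 taken as
# written, or an intended $400 that gained a digit.
rent_likelihood = {
    ("CA", False): 0.30,   # $4000/month is unremarkable in Beverly Hills, CA
    ("MO", False): 0.001,  # ...but very unusual in Beverly Hills, MO
    ("CA", True):  0.02,   # an intended $400/month is rare in CA
    ("MO", True):  0.20,   # ...and plausible in MO
}

posterior = {}
for state, p_state in state_prior.items():
    for has_typo in (False, True):
        p_typo = p_rent_typo if has_typo else 1 - p_rent_typo
        posterior[(state, has_typo)] = p_state * p_typo * rent_likelihood[(state, has_typo)]

total = sum(posterior.values())
for hypothesis, weight in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(hypothesis, round(weight / total, 4))
```

With these made-up numbers the CA-without-typo hypothesis still dominates, but the point stands: the answer depends on the error rates you assume for every column, not just the city field.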
@arielspalter7425 4 years ago
Fantastic talk.
@drdca8263 4 years ago
Wow! This sounds really cool! I wonder if this would work for... OK, so, where I was working until recently (I left to go to grad school), they had an "auto assign" process, where certain tasks are assigned to different people according to a number of heuristics that we had to change from time to time. I'm wondering if pclean (or if not pclean, something based on Gen?) could help with that. Because putting in all those heuristics got complicated.
@verduras 4 years ago
fantastic talk. thank you
@nikhilbatta4601 4 years ago
Thanks for this, it helped me open my mind and think of a new way to solve a problem.
@Larszon89 3 years ago
Very interesting talks. Any news on the package? When will it be available? I tried looking for the related articles on arXiv that were supposed to contain the source code for the examples as well, but didn't manage to find it either.
@adamdude 6 months ago
The most important question is: does doing data cleaning this way improve your final result? Like, does it make your final conclusions about the data more accurate to the real world? Does it help you gain insight you didn't already have? Does it make it more or less likely to reveal new true conclusions about the data? I'd imagine that by making so many assumptions about messy data, you're not learning anything new from that data.
@ewafabian5521 4 years ago
This is brilliant.
@aristoi 4 years ago
Great talk.
@Here0s0Johnny 4 years ago
good talk, but couldn't this screw up the dataset? you might just remove outliers and fill the table with biases.
@Zeturic 4 years ago
Isn't that true of any attempt at cleaning a dataset?
@timh.6872 4 years ago
I wonder if certain human-computer interaction situations can be framed as an inference over unknowns. Not from the standpoint of making a computer guess the intent of the human, but from the "embedded domain knowledge" standpoint. Logic programming and super high-level type theory also have this mindset, but approach it from a very cut-and-dried formal proof perspective, where the programmer has to fully specify what they mean and then the computer checks to see if that makes sense according to deterministic rules. I'm not sure if this probabilistic programming works in this way, but it seems to open up an avenue to quantify and reason about unknowns.

Consider a self-employed user that wants to have their computer organize documents for their business. Between hard logical constraints (need to know the "shape" of the data before reasoning about the contents, need to know about the overall format of the files before figuring out the shape) and some existing knowledge (preferences and opinions the user holds, internal implementation knowledge, general encyclopedic knowledge in areas the user looks up), the computer should be able to use this probabilistic inference to start asking "useful" questions, in the sense that statistical uncertainty in the data set indicates a need to consult the user. "I found some "FooBar" forms from customers X, Y, and Z. They don't match the current categories very well, where should they go?"

If probabilistic programming could actually work in that way, it starts to make "interactive" interfaces possible, in the sense that it provides a question-asking heuristic. A computer transforms from a petulant tool that must be placated and carefully calibrated into a useful assistant.
@borstenpinsel 4 years ago
2009: this is the web, user input must be cleaned and validated, best to give them a list of states. 2019: let's assume a person meant to post an ad for Beverly Hills, CA, simply because more people posted ads there before.

Reminds me of my dad trying to order a "Big Mac" at Burger King. Of course they insisted that they don't have a Big Mac. Should they have assumed he wanted a "Whopper"? Maybe this is exactly what he wanted and he just mixed up the names, or maybe he wanted a Big Mac and mixed up the stores. In this case he would have been disappointed.

I guess you could conjure up all kinds of shenanigans when you know there's a computer program guessing missing inputs. Like listing a ton of fake apartments in a rural area so that any following erroneous listing would be placed in the wrong state. Just think how many times you thought "I don't quite understand, he probably means this and that" and you turned out to be wrong. This is all very interesting but also kind of dangerous.
@buttonasas 4 years ago
That's why the feature of the added metadata seems really good to me - you could run this script anyways and then choose whether to discard dirty data or keep it. Or maybe change your approach afterwards.
@RegularExpression1 4 years ago
Having done many database conversions of one kind or another over the years, I can say that sometimes it is necessary to "make a call." If you're lacking a State but have a ZIP, fine. But if you have a policy number that begins with R but don't know the name of the insurance carrier, and the person is a government employee, you have a pretty good sense they're on Federal BCBS and you can safely make the change. The damage done if it is wrong is a denied claim, which is what you'd have either way. In real life you get data that is lousy and you can't just refuse to work with it. You refine and re-refine until the data is as clean as you can get it. I like this guy's implementation of Bayesian techniques, and one could see a strong, canned solution for basic functionality with Bayes. I like it.
@benmuschol1445 4 years ago
I mean, obviously validate input where possible. This talk is referring to one of the many, many circumstances where you don't control the data set but have to do some sort of analysis. There are obviously dumb ways to use this tech. Don't use it in a web app like the one in your example. But it's still a helpful and pretty innocuous technology.
@ukaszMarianszki 4 years ago
Well, this is meant more as a tool for data scientists to use to clean their data sets for use with things like neural networks, which usually can detect outliers and in fact benefit from having them in the data set in some cases. So I don't really think you would actually use this to validate user input in your live database.
@fedeoasi 4 years ago
I thought this was one of the best talks at Strange Loop this year. Does anyone know if pclean is available somewhere? I see that Gen is available but I found no mention of pclean in the Gen project or anywhere else.
@pgoeds7420 4 years ago
27:28 Might those be independent typos or copy/paste from the same landlord?
@DylanMaddocks 4 years ago
When he first proposed the problem I immediately thought neural networks would be the best way to go. It really makes me wonder how many of the problems that are being solved by neural networks would be better solved by this probabilistic programming approach. I'm also wondering how fast/slow the expectation maximization algorithm is, because that could be a big constraint.
@tkarmadragon 4 years ago
The reason why GPT-2 and other GAN AIs are so powerful is because they are also using probabilistic programming to generate their own training sets ad infinitum.
@Muzika_Gospel 4 years ago
Excellent
@anteconfig5391 4 years ago
Is it possible to use what was seen in this video to write summaries of what was in a chapter of a book or website?
@wujacob4642 4 years ago
In the preliminary results slide, PClean and HoloClean have exactly the same scores (Recall 0.713, F1 score 0.832). Is it a coincidence (looks unlikely), or does HoloClean have something in common with PClean?
@joebloggsgogglebox 4 years ago
The text description under the video says that metaprob is written in clojurescript, but the video makes it clear that Julia was used. Is there also a clojurescript version?
@phillipotey9736 4 years ago
Amazing, manually programmable AI. Let me know where I can get this scripting language.
@sefirotsama 4 years ago
I still don't see in what areas I can use this sort of programming, other than guessing fuzzy values in bulky data.
@chaosordeal294 4 years ago
Does this method ever render demonstrably better results than just ejecting dirty (and "suspected dirty") data?
@quasa0 4 years ago
It does; the quality of the results is comparable to the results of by-hand data cleaning, and in most cases, I believe, it would be much better than a human-made script, just because we are removing human error and bias from the equation.
@y.z.6517 3 years ago
Ejecting an entry just because New York was written as NewYork would lead to unrepresentative data.
@y.z.6517 3 years ago
If a row has 10 columns, one bad column still allows the other 9 to be used. Better than ejecting all 10. An alternative approach is to take the average value for the 1 invalid value, so it's equivalent to being ejected.
@mjbarton7891 4 years ago
What is the benefit of generating the most likely value for the record? How is it more beneficial than eliminating the record from the data set entirely?
@alchemication 4 years ago
Hi Mike, this is only a simple example. Imagine if your data set has 100 columns and a random record is missing values in only 2 columns. By discarding this observation entirely you might be losing important information. Now imagine that 90% of your data has at least 1 column missing. Are you seeing a pattern here? Hope it helps.
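
(A quick back-of-the-envelope check of the scale of the problem this reply describes; the missingness rates below are illustrative, not from any real dataset.)

```python
# If each of 100 columns is independently missing with probability p, the
# chance that a given row is fully complete is (1 - p) ** 100, so even a
# small per-cell missingness rate makes "drop any row with a missing value"
# throw away most of the data.
for p in (0.005, 0.01, 0.02, 0.05):
    surviving = (1 - p) ** 100
    print(f"per-cell missingness {p:.1%}: {surviving:.1%} of rows survive complete-case deletion")
```

At a 2% per-cell missingness rate only about 13% of rows would survive, which is roughly the "90% of rows have at least one missing column" situation described above.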
@brianrobertson781 4 years ago
Two reasons: 1. You might not be able to delete any records (for instance, real-time service delivery, or the data has a sufficiently high error rate that you'd be eliminating most of the data set); and 2. You don't know which records are at issue - you have to validate both complete and incomplete fields, and a likelihood score tells you where to focus your efforts. I really could have used this 10 years ago. And whole armies of financial analysts are about to get automated.
@F1mus 4 years ago
Wow so interesting.
@jn-iy3pz 4 years ago
Typo in slide at 5:20, if statement should be `if r[:city] in cities[s]`
@buttonasas 4 years ago
Same mistake was in the @pclean part, it seems...
@walterhjelmar1772 4 years ago
Really interesting talk. Is the pclean implementation publicly available?
@SaMusz73 4 years ago
As said in the talk, it's at github.com/probcomp/Gen
@gabbrii88 4 years ago
@SaMusz73 It is not in the source code of Gen. I think they did not make it available as a built-in library in Gen.
@lorcaranr 4 years ago
So why do we trust what the user entered for the rent? Perhaps they entered that incorrectly. If they can't spell, then surely they can also make a mistake entering the rent?
@tkarmadragon 4 years ago
You are right. Actually the beauty of the proposed software is that if you do not trust the rent, then you can choose to incorporate the median rent. It's your call as the admin.
@guilloutube 4 years ago
Great work. Awesome. Is it open source? I didn't find the pclean source code. I'll take a look at Gen.
@SaMusz73 4 years ago
Go there: github.com/probcomp/Gen
@PinataOblongata 4 years ago
This might be a silly question, but I don't understand how you could possibly get such messy data in the first place - you don't create online web forms where people can just enter whatever they like, you use drop-down boxes where they have to choose an option that will be 100% correct unless they somehow click on the wrong one without realising or purposefully mislead. You can't even type numbers into date fields anymore for this reason! Is this more a problem for paper forms that are scanned, or am I missing something?
@buttonasas 4 years ago
Yes - something like scanning paper forms. You never know when you get some nulls in your data, even when completely digital.
@averageengineeer 4 years ago
Based on the naive Bayes theorem?
@Keepedia99 10 months ago
Not the point of the talk, but I wonder if we can make programs guess/correct their own bugs by checking the distribution of their other outputs.
@MaThWa92 4 years ago
Wouldn't making estimators for P(B|A) and P(A) from the dataset and then using these to evaluate the same dataset be an example of overfitting?
@remram44 1 year ago
Using probabilistic methods and a priori knowledge to generate data encoded as definite facts (rather than something that would come with confidence values) seems really dangerous. In your example, your data is no longer an unbiased source to conclude anything about the distribution of rents in the US, because you assumed the conclusion to produce the data: of course the results will show that rents are higher in California than Missouri, because otherwise you would have "corrected" a good portion of it as typos! It would be nice to have a framework that goes end-to-end, providing you with a way to check those assumptions.
@PopeGoliath 4 years ago
I'm 20 minutes in, and the domain knowledge he needs to have to write the checker seems like exactly what the data set is designed to unveil. Is this a chicken/egg problem? Your study is to answer questions about a knowledge domain, but you need to already have that knowledge to error check your data. Where does it start? Edit: At 22 minutes he says he'll get to that. Oh good.
@user-lt9oc8vf9y 1 year ago
This tool sounds like it could start discriminating real quickly if you don't pay attention to what the columns mean.