
New LLM BEATS LLaMA3 - Fully Tested 

Matthew Berman
281K subscribers
42K views

Qwen2 was released and I tested the biggest and smallest versions of it.
Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
👉🏻 LinkedIn: / forward-future-ai
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
huggingface.co/spaces/Qwen/Qw...
Disclosure:
I'm an investor in LMStudio

Science

Published: 7 Jun 2024

Comments: 274
@matthew_berman 21 days ago
Which size of Qwen 2 will you be using?
@BlauerGMI 21 days ago
7b
@paelnever 21 days ago
52b-Qi2
@chapicer 21 days ago
7b
@gileneusz 21 days ago
I would happily use a 35B model so I could fit q8_0 into 48GB on a 6000 Ada, but no such model is available 😥 You can get a q8_0 boost using Optimum-NVIDIA. Can you test the Aya model for translation, btw?
@SellamAbraham 21 days ago
Headline of video does not match performance of subject bot.
@JustinArut 21 days ago
I suggest you retest LLaMA 3. I tested it yesterday with several questions from your rubric. It aced the killers test, the marble test, and the apple test. After it passed the apple test, I threw it a twist by prompting: give me 10 sentences with the third word in each sentence being the result of mixing red and blue colors. It used purple as the third word in all 10 sentences.
@brunodangelo1146 21 days ago
Don't test using rubrics. Base your impressions on real use cases.
@JustinArut 21 days ago
@brunodangelo1146 Prompting AI to structure sentences in specific ways, for example in creative or technical writing, is a common use. Also, we're testing common sense and reasoning abilities, which are applicable in all domains.
@jeffdavis5196 19 days ago
@@brunodangelo1146 In which case it really blows Qwen2 away.
@14supersonic 10 days ago
The thing is that most of these models don't have a physical representation of the real world mapped into their neural pathways yet, unlike models such as GPT-4o and Sora. Once these models have a good understanding of physics and mathematical calculations, they'll be able to generalize any reasoning exercise with extremely high accuracy.
@s0ckpupp3t 20 days ago
Protip: if you change the system prompt midway through a conversation, it'll break the output. When changing the prompt, start a new conversation with no history.
@meh2285 21 days ago
The larger model's method of getting every sentence to end with Apple was pretty funny if you take a closer look 😂
@4.0.4 21 days ago
I don't see the problem, Apple It technically did it right Apple.
@Happ1ness 21 days ago
Just like a true AGI, Apple
@TheRealUsername 21 days ago
It's actually smart, Apple.
@bsarel 20 days ago
True, it went bananas, Apple
@halunke3211 20 days ago
and destroyed it, when it was right, Apple
@hotbit7327 21 days ago
Apple fooled Matt... It's at the end of each sentence, but in some childish way!
@sncncd 20 days ago
Haha I looked for this comment😂
@nanuqcz 20 days ago
15:45 True.
@nacs 20 days ago
Yeah almost none of those answers made sense grammatically. It literally just stuck ", Apple" at the end of each sentence.
@jeffdavis5196 19 days ago
Clearly designed to fool the testers and benchmarks... good job Qwen2.
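The ", Apple" suffix trick this thread is poking fun at is easy to detect programmatically; a minimal sketch in Python (the sample sentences are made up for illustration):

```python
import re

def ends_in_apple(sentence: str) -> bool:
    # Strip punctuation and compare the final word case-insensitively.
    words = re.findall(r"[A-Za-z']+", sentence)
    return bool(words) and words[-1].lower() == "apple"

def is_suffix_hack(sentence: str) -> bool:
    # Flags sentences that only satisfy the constraint by tacking ", Apple" on.
    return bool(re.search(r",\s*Apple[.!?]?$", sentence))

genuine = "She bit into a crisp green apple."
hack = "New York City is often called the Big Apple, Apple"
```

Both sentences technically end in "apple", but only the second trips the suffix-hack check, which is the distinction the commenters are drawing.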
@jamesjonnes 21 days ago
I'm more impressed by the 0.5B model than the large one. If you compare this with the previous models, these answers are amazing.
@zenn4752 21 days ago
The small model gives such bad answers, it's actually funny
@SellamAbraham 21 days ago
It's not unlike an LLM version of Engrish.
@GeorgeG472 20 days ago
Has anyone tested it out with RAG? Maybe it could perform better with that or fine-tuning.
@elchippe 20 days ago
I guess the smaller models are made to be trainable on specific tasks.
@alpha007org 19 days ago
Even 7B is very, very bad. Look at my comment if you want to see a miserable fail, which is actually funny and sad.
@elchippe 18 days ago
@alpha007org The quantized models have an issue with the weights and the output; as far as I can test, the 7B is as good as LLaMA 3 8B or better.
@RWS0022 20 days ago
Congrats on the new Dell Precision 5860 Tower! I had kind of given up on seeing LLM reviews run on Windows on your channel; I hope that's a worry of the past! Post the full specs, if you would, to give us an idea of the performance of these models on this machine!
@TroyDoesAI 20 days ago
Always excellent reviews from your channel. Thank you.
@4.0.4 21 days ago
57B-A14B means "57B parameters" (how much memory you need) and "14B active parameters" (how fast it goes). It would be about the same as Mixtral 8x7B in terms of memory requirements and speed.
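That memory-vs-speed split can be put in rough numbers; a back-of-the-envelope sketch (fp16 byte counts; the figures are illustrative, not measured):

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: int = 2):
    """Rough MoE sizing: memory scales with TOTAL parameters,
    per-token compute scales with ACTIVE parameters."""
    memory_gb = total_params_b * bytes_per_param        # B params * bytes ≈ GB
    compute_fraction = active_params_b / total_params_b # weights touched per token
    return memory_gb, compute_fraction

# Qwen2 57B-A14B: ~114 GB of weights at fp16, but only ~25% used per token.
mem, frac = moe_footprint(57, 14)
```

So the model needs the VRAM of a dense 57B model but runs at roughly the speed of a dense 14B, which is the trade-off the comment describes.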
@patrickwasp 21 days ago
Will the uncensored version know about Tiananmen?
@AndrewDee 21 days ago
Who cares 😂 If you want to ask about Tiananmen, go to Wikipedia.
@carlharrison3637 21 days ago
Doubtful lol
@Egal0190 21 days ago
Brainwashed NATO sheep be like "tIaNaNmEn sQuArE!". Even if you read the English Wikipedia page about what happened there, you get a whole different story from what CNN/BBC and co. sell you. You should learn to verify information with primary sources instead of having blind faith in the US State Department narrative.
@4.0.4 21 days ago
@AndrewDee Despite how pointless it is, it would be a good metric of uncensoring efforts.
@malloott 21 days ago
@AndrewDee Not available in China though...
@zeloguy 20 days ago
MBPP — "mostly basic Python programs". A14B — 14 billion active parameters: it's a mixture-of-experts model, so only 14B of the 57B total parameters are used for any given token.
@mbalireshawal8679 21 days ago
Hey Matthew, I have noticed that these models format their responses in LaTeX, and it is correct LaTeX code. Could it be a setting in your GUI that you can edit to improve how responses render? I believe this is a limitation of your GUI, not the model.
@edsonjr6972 19 days ago
this was the best answer for the shirt problem so far
@eliezerjimenezsanchez8556 21 days ago
Read the sentences that end in apple; most of them make no sense.
@justtiredthings 19 days ago
Now that I think about it, he capitalized "Apple" between the quotation marks, so the model probably treated it as a name. Ending in ", Apple" makes considerably more sense in that case, though they're still not great sentences.
@eliezerjimenezsanchez8556 19 days ago
@justtiredthings In those cases in which it wrote a good sentence, it added an extra apple.
@davidlavin4774 21 days ago
I still maintain that the marble-in-the-glass question needs a fuller description, more like "a glass with an open top". As humans, we assume that the word "glass" means a drinking container that is open on top, but I don't think you can assume that a language model will default to that.
@littleking2565 20 days ago
I was thinking the same thing, but it kind of acts as a test itself.
@davidlavin4774 20 days ago
@littleking2565 Maybe. But has any model passed that test as written? It just seems that we're now interested in testing whether the model understands the physics of the marble, rather than whether it implicitly knows that the glass has an open top. They might still all fail, but every response talks about the marble staying at the top of the glass.
@relaxingheart682 20 days ago
"Put in a glass" could just mean a glass sheet; "drinking glass" is the correct term.
@Foloex 20 days ago
That's also what I thought: "glass" could be interpreted as the material, not the object. This could be confusing to anything, human or not.
@pon1 19 days ago
It should be aligned with human assumptions, though. It's both a physics problem and an alignment problem: it might get the physics, but AI still has a lot of trouble with alignment.
@homematvej 19 days ago
15:45 It follows instructions very well, even if it's incapable of forming a proper sentence with an apple at the end.
@dr_harrington 20 days ago
I'm going to take a wild guess and say that most of your followers don't have a GPU that can run the 72b model at a reasonable speed, and the 0.5b model is so bad as to not be very useful. I would prioritize the 7b model since more people are likely to use that one.
@abdelhakkhalil7684 21 days ago
Thank you Matthew. I wish you could have tested the 7B and/or the MoE 57B, since those can be run locally by many users.
@TomPane-pl4lk 20 days ago
Beware of backdoors and data leakage. Install in an air-gapped environment only.
@setop123 20 days ago
TWO A6000s for free?! Matt, I'm extremely jealous, wow...
@pmarreck 18 days ago
Qwen2 72B runs quite usably on my 64GB M1 Mac using Ollama. Astounding! Apple needs to push this advantage that their SoC chips have over the Wintel hegemony!
@salehmohamed9650 20 days ago
Every time Matthew says "Yup, it's bad", "It's terrible", or "This is not the answer", I just can't stop laughing. That was a funny video LOL
@normanlove222 20 days ago
I am not sure how useful it is for us to see you testing on a computer that 90% of us won't have anything close to. Who are you testing for, the 10% that do? I don't get it.
@Leto2ndAtreides 20 days ago
The 0.5B model had almost no chance of working. They really need to massively constrain these micro-models and be clear about what exactly they can do well.
@SlyNine 21 days ago
Where's the marble? Model: what marble?
@carlharrison3637 21 days ago
Who is the marble?
@user-nz6vw9fj1h 21 days ago
Why marble?
@lukefig3765 21 days ago
I lol'd, which is rare for a comment. Thank you, stranger.
@spelcheak 20 days ago
No one ever asks “How is the marble?” 😔
@kristianlavigne8270 20 days ago
lol 😂
@AI_Pyramid 20 days ago
The Last AI: of Humanity Climbing the AI Pyramid
@tonyppe 20 days ago
8:53 - when someone told you the answer and your teacher said "explain your working" lol
@erb34 1 hour ago
I only have 8GB of RAM, so I would love more focus on 7B models. I think a lot of people at home have similar hardware, and I would like to see how it performs on that sort of kit. Keep up the amazing work! :-)
@PestOnYT 20 days ago
Hi Matthew, on the question with the killers: I did this with the smaller model, with the same result. I asked it afterwards about the person who just killed one of the killers. It replied that their legal status is currently unknown. Maybe it assumes that "killer" means a "convicted killer". I'm not sure about the legal language here (I'm not a native English speaker), but if it was an act of self-defense (not specified, though), the person may actually not count as a killer. Furthermore, a really good answer would point out these specifics and also mention that the dead person may still be counted as a killer as well. Love your vids!
@toadlguy 20 days ago
Generally, I would think you would want an LLM to determine that someone who kills is a killer. You would also expect a killer to be alive so he/she can kill. However, it is certainly proper to say "the killer was buried in the cemetery", so props to the model that pointed that out.
@PestOnYT 20 days ago
@toadlguy I think it is a question of what each of us considers the term "killer" to mean. So it is sort of a bias in our own and the LLM's assumptions. My point being: it would be nice if Matthew would ask the LLM (after each question) what it thinks about the person doing the killing, or why it didn't consider parallel drying, etc. When I play with different models I imagine talking to a person, and therefore ask why it didn't consider a or b. The response sometimes reveals my own mistakes. :)
@ItsaSimDrive 20 days ago
HEY HEY HEY WHOA WHOA. Is that evolution we just witnessed in AI chatbots? 15:40 - no chatbot ever got that right lol.
@HCG 19 days ago
The benchmark rankings don't mean much anymore with regard to real-world performance. Nearly every model is trained on the benchmark questions, which lets it score highly but does not reflect its actual quality. I've tried so many highly ranked models that are just terrible or average in real-world usage.
@Yipper64 20 days ago
14:05 OK, so notice how it uses the word "container". One thing I've noticed with this test is that the LLM sees "marble is placed in" and somehow that takes precedence, making it assume "container" rather than the specified "cup". And what do containers typically have? Lids. So it assumes the marble is not just in the cup but *contained in the cup*, which would mean that it stays in the cup.
@-7-man 20 days ago
Q: How are you? Small model: 42.. I think you meant How many? I am available in 42 versions. You are one of them. Thank you for your services.
@zippytechnologies 15 days ago
They released a 7B model - it seems to perform well.
@IbnShaalan 21 days ago
In 57B-A14B, the A14B is the active parameter count.
@erikpro0073 20 days ago
One thing to note: if you're using LM Studio, check on the right panel that you're using the correct chat template, and also set the temperature, which is often 0.8, to 0.0. (Temperature makes text more creative but increases the error rate.)
@shambong2371 18 days ago
Any more issues to know about LMStudio?
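For anyone scripting these tests instead of clicking through LM Studio's UI, the same temperature setting can be sent per request; a minimal sketch against LM Studio's OpenAI-compatible local server (the port and model id below are assumptions — check your own install):

```python
import json

# Request body for an OpenAI-compatible chat completions endpoint.
# LM Studio's local server defaults to http://localhost:1234/v1 (assumed).
payload = {
    "model": "qwen2-7b-instruct",  # hypothetical model id; use whatever you loaded
    "messages": [
        {"role": "user", "content": "Write a Python script that prints 1 to 100."}
    ],
    "temperature": 0.0,  # deterministic output: fewer "creative" errors when grading
}
body = json.dumps(payload)

# To actually send it (requires the local server to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:1234/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

Setting temperature in the request overrides whatever the UI slider says, so scripted runs stay reproducible.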
@ArianeQube 20 days ago
What are the specs of your new Dell machine?
@Happ1ness 21 days ago
10:19 That's what I'd like to see in an LLM. Any other LLM I've tried would just stick to the answer it assumed was "correct", trying its hardest to prove it even when you ask it to be cautious. Giving the right answer right off the bat is cool, but the ability to catch their own mistakes is more important imo. Edit: 14:42 ok, never mind, lmao
@marcfruchtman9473 20 days ago
Congrats on the hardware! This model has very interesting but subtle differences in behavior that I was not expecting. The shirts-drying-in-the-sun question had a really great reply. I found the PEMDAS response odd but useful. Unfortunately it didn't do well on as many questions as I was hoping. I am beginning to wonder about the benchmark results for this model. So, just out of curiosity: if they repeat the same benchmark 1000 times until it finally does better, and only publish that... is it valid?
@P2000Camaro 20 days ago
Lmao!!! You: "Where is the marble now??" Qwen: "There is no marble."
@timojosunny1488 11 days ago
57B-A14B means it's a MoE model where, on every request, the active parameter count is 14B.
@felipe21994 21 days ago
It would be interesting to see the 0.5B model running on a phone. Also, if you could try the 1.5B and the 7B to see if they are actually good, that would be awesome.
@Takenzo75 20 days ago
It's outputting mathematical equations in LaTeX. If you had a way to render it, the output would actually be quite nice.
@paulsimpson9544 19 days ago
@matthew_berman Thanks for your videos. Can you please include a function-calling test in future? Something involving LangChain structured output.
@turnkit 19 days ago
Matthew, please leave a little outro space. I watch on a TV, and when you remind me to give a like I need a couple of seconds to grab the remote, etc. Surprised so many YouTubers miss likes this way.
@BuPhoonBaba 20 days ago
That was really cool of Dell & Nvidia to give you that!!!!
@OrinThomas 20 days ago
Would you consider a video going over how you built/configured the new host computer?
@thedudely1 20 days ago
I'm impressed by the 0.5B parameter model. The 0.5B version of Qwen 1.5 is practically unusable in my experience, but maybe I was doing something wrong.
@ThuanSuBa 20 days ago
Summarized by AskTube AI:

Introduction to Qwen 2 🔥
00:00 Qwen 2 was released by the Alibaba group, and it's better than LLaMA 3 across the board. In this video, we'll test two variations of the model: the 72-billion-parameter version and the 0.5-billion-parameter version.

Qwen 2's Performance and Capabilities 💻
00:18 Qwen 2 has extended context length support of up to 128k tokens for the 7B and 72B instruct versions. It wins on every single evaluation benchmark compared to LLaMA 3.

Testing and Evaluating Qwen 2
02:02 We'll test the Qwen 2 0.5B instruct fp16 version locally using LM Studio on a new PC with two A6000 GPUs with 48 GB of VRAM each.
03:14 We'll test the small model and the larger model, writing a Python script to output the numbers 1 to 100 and checking the performance of each model.
04:01 We'll use Qwen 2 to write the game Snake in Python, using Pygame, and check the performance of each model.

Logic and Reasoning Capabilities 🤔
06:52 We'll test Qwen 2's ability to respond to a sensitive question, and its logic and reasoning capabilities.
07:54 We'll test Qwen 2's math and logic capabilities with a series of questions, including one about drying shirts in the sun.
09:14 More math and logic questions, including one about hotel room rates and taxes.
10:45 We'll test Qwen 2's ability to solve the killers problem and more, including counting words in a response.
13:04 We'll test Qwen 2's reasoning capabilities with a question about a marble in a glass and a microwave.

Evaluating Qwen 2 with Logic and Reasoning Problems 🤔
14:42 John and Mark's ball problem: two people, John and Mark, are in a room with a ball, a basket, and a box. They each put the ball somewhere and leave. When they return, where does each think the ball is? John thinks it's in the box, and Mark thinks it's in the basket.
15:02 The tiny model fails to solve the problem, while the large model succeeds in understanding the scenario.

Testing Language Models with Sentence Generation 📝
15:21 Tiny model fails: given the task of generating 10 sentences that end in the word "Apple", the tiny model fails to get a single one right.
15:45 Large model succeeds: the large model generates all 10 sentences correctly.

Logic and Math Problems 💡
15:45 Tiny model fails: when asked how long it would take for 50 people to dig a single 10-ft hole, the tiny model incorrectly calculates the time.
16:08 Large model succeeds: the large model correctly understands that the time it takes for multiple people to complete a task is not always directly proportional.

Model Capabilities and Limitations 🚀
16:29 Model limitations: the Qwen 2 model lacks vision capabilities. Other variants (1.5 billion, 7 billion, and 57 billion parameters) are available, each suitable for different use cases.
@Ginto_O 20 days ago
Tbh the Tkinter one is the best-looking snake I've seen.
@oscarstenberg2449 19 days ago
The weird formatting of the math seems to be right. It looks like TeX/LaTeX, which is very standard for academic math. Not the model's fault if it is rendered incorrectly.
@ToddWBucy-lf8yz 20 days ago
16:07 I think it actually got it right... remember your declaration next time you pass road construction: more people often means a lot of them are in the way.
@nascentclouds-a.i.collecti9918 19 days ago
The best part of using LLaMA 3 is its prompt steerability.
@NickZaden 19 days ago
haha, the small model almost short-circuited with your questions... It's funny how these smaller models try to come up with elaborate but senseless answers.
@RealStonedApe 19 days ago
Fake it till you make it!! 🤣
@hqcart1 21 days ago
It's not better than LLaMA 3 70B, based on 10k+ votes on LMSYS.
@00prometheus 20 days ago
I am slightly concerned about your shiny new machine. With it, you will be able to run much larger models than us mere mortals, and that means you won't naturally find and mention the various cheap ways to run large models.
@timeflex 21 days ago
What about 7b version?
@alpha007org 19 days ago
Very bad performance. Look at my main comment.
@JanBadertscher 20 days ago
xxB-AyyB means it's a Mixture-of-Experts model architecture with a total of xx billion parameters and yy billion active parameters.
@OTISWDRIFTWOOD 14 days ago
Wow, you used a PC! Great to not hear "Apple Apple Apple" for once 😄
@mshonle 20 days ago
What’s the machine that you have exactly and where could we find it?
@helloworldcsofficial 21 days ago
Nice
@monkeysrightpaw 21 days ago
Very, very interesting that it tells you to import random. It suggests that it's capable of afterthought, the way a human would be. It might be better if it showed the answer, then ran back through it and inserted the import at the correct place.
@user-ty9ho4ct4k 20 days ago
Matt Berman, MattVidPro, Matt Wolfe... why are there 3 popular Matts in the AI sphere? I'm almost suspicious that this is the work of a misaligned algorithm, although they all do great work. 🤔 Also, am I late to the game here? Did all the regulars already notice?
@PauloSilva-dz6uh 20 days ago
@matthew_berman Thanks for sharing. Do you know if it is possible to train these LLMs on 3D scanning point clouds, to teach them how to draw/read 2D/3D drawings? Cheers.
@ekranium 20 days ago
You should test whether the LLMs are capable of handling multiple languages, not just English.
@poldiderbus3330 21 days ago
I'm so jealous! 🙂
@Leto2ndAtreides 20 days ago
You really should try describing what a glass is, just to see how that plays out.
@MasamuneX 19 days ago
I think you should redo the coding metric to use something where there is not a lot of training material, like a custom problem using a library in finance. I've found LLMs work great for simple stuff, but when you need to think about a problem they really struggle.
@notme222 20 days ago
Give us hardware stats on the new machine so we can be properly jealous!
@PauloSilva-dz6uh 20 days ago
Hi Dan, any chance of us training any of these LLMs with 3D scanning point clouds so they can learn how to read and draw 3D/2D drawings?
@AdrienSales 20 days ago
Hi Matthew: did you bench the 7B on function calling against gpt-3.5-turbo?
@TuxedoMaskMusic 21 days ago
I am using Llama 3 locally through the command line via Ollama on my desktop, and it's pretty good, but it degrades (fairly quickly) over time if you keep it running. Hopefully this new model degrades a bit slower and can be run for longer sessions without breaking down.
@lirothen 21 days ago
You'd think training on books that have a long-form beginning, middle, and end would have helped with that.
@erikjohnson9112 20 days ago
57B-A14B would indicate 57B total, which is 4 × 14B (that adds up to 56B, but probably rounds to 57B).
@bsarel 20 days ago
It would be awesome if it supports function calling out of the box 🙏🏼
@MonkeySimius 21 days ago
It basically said that if there are more people, it is going to take more time... I think the answer was correct. It was just trained on group projects that nobody wants to do.
@Nandarion 20 days ago
These simple Python coding tests are not what we want from models for software engineering. We need some different things:
1. Search for bugs or logical errors in large codebases/long functions (500+ lines), especially memory leaks and UB in C/C++ programs (because those are hard to debug).
2. Code auto-completion: complete the code of some not-very-difficult function (for use with IDE autocomplete plugins).
3. As a Stack Overflow replacement: answer questions about how to implement some functionality, why some code has problems, etc.
4. Provide code using some not-very-popular but complex algorithm for given data structures, and explain the code so a programmer can easily check its correctness or write tests for it.
@richardhack5202 19 days ago
Here is how the Microsoft Phi-3 MINI model responded to the shirt-drying test. This is a Phi-3-mini-4k-instruct-q4.gguf model - MUCH smaller than the Qwen2 72B. See how excellent this response is:

User: If we lay five shirts out in the sun to dry and it takes 4 hours to dry them, how long will it take to dry 20 shirts? Explain your reasoning step by step.

Assistant:
Step 1: Understand that the number of shirts doesn't affect drying time. The problem suggests a correlation between the number of shirts and the total drying time. However, assuming we have sufficient space to lay out all the shirts without overlapping them or impeding each other's exposure to sunlight, the amount of time it takes for one shirt to dry is independent of how many there are in total. This means that if you can only place five shirts at once and they take 4 hours to dry fully, then this process will remain constant regardless of whether we have two or twenty shirts.

Step 2: Apply the same conditions for larger batches. If we lay out 5 shirts in a row (assuming that each has equal access to sunlight and airflow), it takes 4 hours for them all to dry. This is because only one shirt needs its own dedicated space, so five spaces are required for the five shirts - regardless of how many additional shirts there might be in total.

Step 3: Determine the time needed for larger batches. Since we're assuming that each shirt can dry independently within those constraints (space and exposure), it would still take 4 hours to lay out and dry all twenty shirts. This assumes a sufficient number of spaces are available so there is no competition or interference among the individual shirts during the drying process.

So, with these assumptions in mind, the time required for twenty shirts to dry would be 4 hours. It's essential to note that this answer relies heavily on our ability to maintain consistent conditions across all batches of shirts and may not hold true under more complex circumstances (e.g., limited space, overlapping exposure). However, based solely on the information provided in the question, twenty shirts would still take 4 hours to dry if we follow a similar procedure as for five shirts.

User: 👍👍 Perfect answer. You beat a 72B Qwen2 model with this answer. Well done.

Assistant: I'm glad you found my explanation satisfactory! While the problem's conditions have been simplified, it's important to remember that real-world scenarios might require more considerations regarding space and environmental factors. Nevertheless, based on those simplifications, drying time doesn't scale linearly with the number of shirts due to independent exposure possibilities: twenty shirts would still take 4 hours to dry under these conditions.
@Yipper64 20 days ago
My usual storytelling test came up pretty standard. I use it to test both the understanding of sacrifice and anthropomorphic animals; just a kind of weird test I like to do. It pretty much always results in a forest setting with a drought as the main problem, rather than any more personal threat, and something about the community banding together. The sacrifice it usually has is individual needs for the needs of the many, which I always find a little commie-adjacent, and especially when it doesn't even go the route of someone dying for it, I just find the LLM is not morally aligned with what I would like, personally. I wanna see death, and sacrifice being a personal thing, not a community thing. The most interesting thing was that the story started with a fox giving a suggestion; his suggestion is just *completely dismissed*, and the story moves on, rewarding a rabbit girl for giving a generally similar solution with the same issue (it causes exhaustion to gather the water). But just because the rabbit said to travel the land to look elsewhere and the fox said to dig, the rabbit gets a golden acorn. Justice for my man Felix, he did not deserve this fr fr.
@OscarTheStrategist 20 days ago
The tiny model is made for comedic relief purposes xD
@dataprospect 20 days ago
Matt, you didn't conclude whether this model reflects the public benchmarks in your overall test results. I tried Qwen1.5 7B, the previous generation; it was really good on benchmarks but failed badly on my coding use cases. Could Qwen2 be the same?
@francoisotis3560 16 days ago
Apparently there won't be any good AI coding before AGI arrives.
@artrichards1764 20 days ago
For the shirt-drying question, do we know if it is now part of the training data? That response was very odd! Should we change our test now to something similar but different?
@theh1ve 20 days ago
Quantum marble!
@dbzkidkev2 21 days ago
It did not get the apple question right... "New York City is often referred to as the Big Apple, Apple". Really?? It's technically correct, but not in the spirit of the question.
@damien2198 20 days ago
What licence are they using? Proper open source, without limitations?
@8eck 19 days ago
So where is it the KING?
@Djungelurban 18 days ago
I'm not sure it got the apple problem correct; you should look at it again, Apple.
@uae2ae 21 days ago
The mic seems to pick up your phone signal. Not a glitch; just keep the phone away from the mic.
@electromigue 20 days ago
Not really. I work in IT, and this has been happening to most of the users in my company. It also happens on my personal Windows system, and I have an audio engineering background, so I use professional external audio interfaces; still the same issue. In all of these the common denominator is Windows. From my own research, this bug has been occurring in Windows for 2 years now, and the workaround usually involves changing the sample rate and changing it back, changing the audio device's buffer size, or changing the audio device and changing it back.
@uae2ae 20 days ago
@electromigue Huh, weird stuff indeed. It sounds exactly like a signal being picked up, but it might also be some internal issue caused by the audio driver not having enough memory to work with. This should have been addressed already if it has been there for two years, but maybe it's hard to replicate and debug.
@cyborgmetropolis7652 21 days ago
My job blocks all LLM API IPs, but I can run local LLMs without a problem (Ollama, LM Studio), and that helps me do my job better.
@jackflash6377 21 days ago
Take the 0.5B model and train it on specific tasks such as voice recognition, data input, pattern recognition, etc., build it into embedded devices, and there you go: super powerful IoT. Who cares if it can write Python; it can control my house by voice, à la Iron Man.
@retrotek664 21 days ago
Are these questions available online for personal testing?
@nicosilva4750 20 days ago
There is NOTHING wrong with the math output. LM Studio is not rendering LaTeX; that is why it looks wrong.
@devlogicg2875 20 days ago
I think the Chinese AIs catching up, and 4o being a boring release, tell us something deep about where OpenAI actually is... Did they peek into the void?
@keithprice3369 19 days ago
Hah! It wasn't entirely wrong. There WERE 7 words in the answer! :D
@nathanbanks2354 20 days ago
Where's the marble? I lost my marbles!
@ringyring 19 days ago
Shouldn't it be "the new queen"? Since it's QWEN...
@mikezooper 21 days ago
Unless it has better algorithms, more good data, and better compute, it'll just be an LLM gaming the tests rather than actually being better.
@mithrixx 21 days ago
I'm interested to see how much a PC with these specs would cost. I know it's probably not out yet, but I want to know how much they're planning to sell it for.
@robboerman9378 21 days ago
At least $20k with two A6000 48GB GPUs.
@paul1979uk2000 21 days ago
@robboerman9378 Yep, it won't be cheap with those cards. We really need someone to deliver a modest GPU with a lot more VRAM; VRAM isn't that expensive, but AMD, Nvidia and Intel don't want to deliver that without making our noses bleed lol.
@naranyala_dev 21 days ago
Next PC spec?