A bit of a different video today about something that's been on my mind. I know it's a bit of a rant and more or less a clip from my livestream, but I thought some people might benefit from it! Let me know if you like this type of content as well. If so, I am happy to do more "lecture-style" videos on various topics.
Hell, yes! Also media. The problem is particularly noticeable in libraries, where one project's "issues" tend to infect other, "innocent" projects, which thought they would just be consuming an "API" and instead ingested a can of worms. Sci/media projects tend to be rife with platform and abandonware dependencies, idiosyncratic build frameworks, undocumented behavior, hard-coded constants, nonstandard object persistence, premature optimization, low-level programming abuse, non-adherence to coding standards, and any other offense to maintainability and code safety imaginable. It's not your imagination.
Often the code I see from people with a scientific background also tends to use a lot of single-letter variable names instead of descriptive ones that could aid in understanding the code. Has this been your experience as well? I was surprised it wasn't mentioned.
Having recently started as a "real" software engineer after finishing my PhD, I recognize many of these problems. We did do version control and unit-testing for our research software, but I often passed up on good software documentation in favor of writing the actual research articles. I've also had many requests from colleagues to share my code for making high-quality graphs. Most of the time I had to reply with: "You can have my code, but it won't work directly on any other data than mine. Please take my code as-is, and use it as an example to try writing something of your own." I know I could have made my graphing tools much more modular and general, but at the end of the day I needed to have my thesis finished.
Honestly, using version control and doing proper testing is still pretty good! In my opinion, software is only really as useful as its documentation, but if the code was meant as a script for a single publication and not meant for reuse, I think it's acceptable to have less documentation... As long as people can still understand the code enough to replicate the results!
As someone who has worked in both pure software development and pure CS research positions, I completely agree. Especially when it comes to documentation and peer review of code, I'm shocked by the lack of standardization. Asking a researcher for access to their code is a true roll of the dice.
This has been my experience as well, but I definitely started on the academic side. The moment you leave the academic bubble, you start to realize how poor the software standards actually are in academia.
I worked as a research assistant in a chemistry laboratory that primarily deals with simulation. The lab head still uses a FORTRAN code for nucleation simulations. I believe that code is at least 20 years old. When I tried to read it, it had variables named 'xxx' and 'yyy'.
In my experience, Fortran is very convenient for that kind of work. (Fast, low-level, clean syntax, built-in support for matrices and complex numbers... it feels like the right tool for the job. With C, on the other hand, it often feels like working against the language to get it to do what I want.) A determined scientist can write unreadable code in any language, though.
@@altaroffire56 A lot of people say Fortran is bad, but I 100% agree with you. Fortran is usually seen as a bad programming language only because the people programming in it use bad programming practices.
@@LeiosLabs Let's not forget that what are now bad programming practices may have been the best available at the time: short identifiers due to length restrictions, terse programs due to small screen sizes, few inline comments due to file size limitations, etc. It is the improvements in technology that have allowed us to write more human-readable code.
My experience with researchers writing code was that the piece of software they needed most was git. So much version-control-via-making-copies-and-emailing-it-to-yourself.
This was very helpful. I'm going to look more into JOSS. As a Physicist interested in Scientific computing, unit testing seems like almost a foreign concept, and I feel fairly inadequate compared to my computer science peers. I've had enough exposure to the importance of version control prompting me to learn git myself. For anyone else in a similar position, look at the MIT Missing Semester Jan2020 IAP for similar computer sciencey-"filler" education. More videos about CliMA would be cool : )
I was in your exact position at the start of my PhD. I knew about unit testing, but never "needed" it for my code and used version control, but couldn't really get my peers to use it, so I was stuck. It was an uphill battle for me, but learning proper programming practices helped out my research tremendously!
@@LeiosLabs thanks for the tip. I feel fairly lucky in this regard as there's so much to learn from online, that hopefully I'll have it easier. Content like yours helps so thank you once again
I work in the DSP field and we work closely with people in academia. I 100% agree with what you say. So much time could have been saved if the code handed to us was written better or even followed the paper. I think a big thing is that some older people in academia have the attitude of "if you used simulations, you didn't solve the problem." I personally think it's weird to see people not use software as a tool for verification on both generated and real data.
Competitive programmers may be able to help. You can get relatively clean and simple code for very complex new algorithms if you ask competitive programmers. We are trained to code common algorithms really quickly, and we occasionally search for better (faster, more memory-efficient, working online, etc.) algorithms to implement so we can use them as "secret weapons" during contests. As an example: given a tree graph of N nodes, it is widely known that you can find its centroid decomposition in O(N log N) time. However, a quick Google search will lead you to a paper demonstrating O(N) centroid decomposition which has no code. To verify, we usually just read the paper, code the algorithm ourselves, and stress test it against the verified slower algorithm with thousands of randomly generated cases. Might it be possible for researchers to get competitive programmers to verify their work?
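The stress-testing workflow described above can be sketched in a few lines. This is a minimal illustration in Python, using maximum-subarray sum as a stand-in for the paper's algorithm (the O(N) centroid decomposition itself is far too long for a comment); the function names here are invented for the sketch:

```python
import random

def slow_max_subarray(a):
    # Verified brute-force reference: O(n^2), simple enough to trust by inspection.
    return max(sum(a[i:j]) for i in range(len(a)) for j in range(i + 1, len(a) + 1))

def fast_max_subarray(a):
    # Kadane's algorithm, O(n): the "new, faster" implementation under test.
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def stress_test(trials=1000):
    # Thousands of small random cases: disagreements surface quickly,
    # and the failing input is shown so it can be minimized by hand.
    for _ in range(trials):
        a = [random.randint(-10, 10) for _ in range(random.randint(1, 20))]
        assert fast_max_subarray(a) == slow_max_subarray(a), a

stress_test()
```

Small inputs are deliberate: when the two implementations disagree, a short failing case is easy to debug by hand.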
How do you get them to work together? Academia has a very traditional structure. Software engineers are sometimes hired as "lab assistants," which means everyone ignores them until the last minute, when some bug shows up in a big mess of unreadable code. If they do get accepted into an academic position, like PhD student or postdoc, they are pressured to publish their own work ASAP or get lost.
Thank you for posting this. Going through my PhD now, I experience many of these pains that you've clearly outlined here. If we could continue to grow this discussion and build a scientific community more embracing of software engineering practices, starting with git and code re-usability, the long term gains would certainly outpace the short term learning pains.
Congrats on your PhD! Thank you for your perspective on research software engineering. I have never seen courses offered at my university teaching scientists how to write good software; in the end it comes down to teaching yourself. I work in the same field (PhD candidate in computational fluid dynamics with LBM / physics) and I've seen lots of bad code as well, due to the points you've discussed. But that's not always the case. The incentive to write clean code is there at least once you work on software as a team. We refactor on a regular basis and make sure every line is properly documented. Because of the teamwork, version control becomes a necessity as well. Testing code is actually most of the work. If code is not testable and the results are not reproducible, it is trash code, no matter the field. The main incentive for our software project was actually hardware (GPU) efficiency and performance: no other software on the market is capable of comparable performance, so we had to write our own. Regarding job prospects, research software engineering is not a dead end at all. If you really master scientific programming, you don't have to apply for a job, because companies will apply for your time.
I think we have almost the same perspective here. I'm happy to see more people writing the best code they can given the circumstances! As a note: I've been considering doing a video on heterogeneous computation (CPU / GPU) in Julia for HPC relatively soon.
I don't know about your team, but many scientists seem to think "testing" means running the whole 40-hour experiment and comparing the results with the last run. When software engineers talk about testing, they mean small, readable unit tests which take seconds to run and verify automatically. But again, we can't blame self-taught programmers for not knowing all the best practices.
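To make the contrast concrete, here is what a unit test looks like in practice. A hedged sketch in Python (the `trapezoid` integrator and the test names are invented for illustration); each test pins down one small, fast property instead of rerunning a whole experiment:

```python
import math

def trapezoid(f, a, b, n):
    """Trapezoidal rule for the integral of f over [a, b] with n intervals."""
    h = (b - a) / n
    return h * (f(a) / 2 + sum(f(a + i * h) for i in range(1, n)) + f(b) / 2)

def test_trapezoid_is_exact_for_linear_functions():
    # The rule integrates straight lines exactly: int_0^3 (2x + 1) dx = 12.
    assert math.isclose(trapezoid(lambda x: 2 * x + 1, 0.0, 3.0, 10), 12.0)

def test_trapezoid_converges_on_sine():
    # int_0^pi sin(x) dx = 2; with 1000 intervals the error is tiny.
    assert abs(trapezoid(math.sin, 0.0, math.pi, 1000) - 2.0) < 1e-5

# Runs in milliseconds, not 40 hours.
test_trapezoid_is_exact_for_linear_functions()
test_trapezoid_converges_on_sine()
```

A runner like pytest would pick up the `test_*` functions automatically; the point is that each check is named after the property it verifies and is fast enough to run on every change.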
This is an essential topic for research, and more incentives should be given towards research software development. Much high-quality research depends on how well a simulation or model has been formulated and executed. Better programming practices in research will lead to better research.
Right on point. I am currently trying to refactor an old academic codebase consisting of Matlab, Python, Java, and C++ glued together using Matlab, and it is just a nightmare. And yes, Matlab is evil: you often see thousands of lines of code without encapsulation and with a huge namespace. I genuinely think that many more people would have used the code if it had been written in a more professional fashion.
That sounds awful! I have had similar experiences, though nowhere near *that* bad. On the other hand, at least they are giving you time to refactor! I really feel we need to be honest about the fact that software is how people conduct research. Poor documentation / software practices are precisely the same as keeping a bad lab notebook / sloppy methodologies.
Thank you so much for posting this video. What I've often heard about algorithms papers is that when a paper claims great performance, the implementation is frequently very costly and doesn't actually outperform the current solution. Of course, there are also some genuine breakthroughs.
This video speaks to me so much. I was a software engineer/systems engineer before going back to grad school, and I was the only computationally focused person in my neuroscience lab. There were other folks who knew how to program (and some who couldn't do more than a stats script), but writing "good" code (as loaded as that is) was just not a priority, because no one else was ever going to see it (there was no avenue to share, and no one wants to replicate results anyway). Lo and behold, my code ends up being pretty useful for some other work (related to TBI), and it is fortunately well documented, so I was able to share it. It's far from perfect, and finding the balance of where to stop because it was good enough was a huge challenge. I would've loved to submit it to a journal and get it more polished, but there was no value in that (at least relative to the other priorities I had to graduate). I wish I knew how to help push the culture forward in this space. I left academia after graduating, so I'm afraid I'm not being very helpful. I've started publishing again recently around my volunteer work, so maybe that's my avenue to help.
I'm glad you are still thinking about helping out! I think creating well-documented code is already a good step forward. If people can use your code more easily, they will start to see the value in good programming practices.
I wish someone would make a tutorial where the student can follow along and learn to make a Julia package that does something trivial, where the point is learning to make a package and put it on GitHub, plus all the documentation, tests, branches, and all that.
I recently came across JOSS and made a submission to it. In doing so, I found that there were lots of things I didn't know, including writing tests, documenting with Sphinx, and properly packaging the code. At least I knew a bit of git, which I learned in my spare time making side projects totally unrelated to my research. These things we have to learn by ourselves; the institution does not provide such training, and many of my colleagues don't care about these things at all. And I also dislike Matlab: arrays don't even start at 0.
Arrays starting at 1 isn't a problem. R, Julia, Lua, etc. all start at 1. In functional / array programming languages it doesn't matter: you'd basically write the same code in, e.g., Haskell (a 0-indexed functional language). Function composition instead of loops, always!
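A tiny sketch of that point, in Python for illustration (the variable names are mine): written compositionally, the code never touches an index, so whether the language counts from 0 or 1 simply never comes up.

```python
from functools import reduce

data = [3.2, 4.1, 5.0, 5.9]

# Index-based style: the base index matters, and off-by-one bugs live here.
total_loop = 0.0
for i in range(len(data)):
    total_loop += data[i]

# Compositional style: a fold over the values; no index ever appears.
total_fold = reduce(lambda acc, x: acc + x, data, 0.0)

assert total_loop == total_fold  # same additions, in the same order
```

The same fold is one line in Haskell, R, or Julia; only the index-based version cares about the starting index.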
Thank you for this video. I'm a 3rd-year doctoral student in applied math, specifically in the scientific computing subdisciplines you mention. I'm currently finalizing a moderately sized (about 4000 lines of C) codebase to be open-sourced along with a paper submission. There's a serious crunch-time feeling, which is causing various holes in the documentation as well as crappy, inefficient fixes. You're definitely right: writing well-documented code feels impossible when one is also supposed to be pushing out theoretical breakthroughs of some flavor. On the other hand, it is also very hard to write code that works without a strong grounding in the theory of a subject.
Thank you! That's exactly why I did not go into research :(. I was shocked that in operations research, code was not standardized, shared, or reviewed! People are publishing results (sometimes modified or cherry-picked) that are impossible (or long and hard) to verify. We need more people to pay attention to this problem, which in my opinion slows down research and hinders its credibility! It's also true in economics, especially with the infamous case of a wrong, badly reviewed research paper, with a badly written Excel file, driving much of modern politics regarding public debt based on false assertions :(!
Excellent video. I empathise with the points you've made. They're not only relevant in academia but also in commercial settings, where there is pressure to release and not enough resources are committed towards building robust systems. Or, conversely, systems are engineered well but the solutions aren't scientifically rigorous.
I don't often see production code, but don't doubt there are similar issues there! In general, we need to do better to write high-quality software whenever possible!
And you're probably only talking about software engineering or at least from that perspective. However, as a PhD student in the field of structural engineering, I wasn't specifically trained to write code, but some/many problems can't be solved without a bit of coding. Now, imagine that horrible google-stackoverflow-slapped-together-frankenstein code. But most of the time, if it works it works and I'm more than happy I made something that actually does what it should. And indeed, usually, once the publication is done, the project is set aside, as well as the code. Fully agree though!
I tried to speak both from the academic perspective (where there is no incentive to write clean code) and the software engineering perspective (where there is no incentive for software engineers to stay in academia). I also left out some content that was particularly ranty about academia. I want to start a conversation, not an argument.
@@LeiosLabs I hear you, and I wanted to acknowledge this from my own experience in academia. I'm curious how it will evolve over time, as programming becomes increasingly important (I think).
Good video. I think we all have the right to access the source code, and all work in general, that was funded by taxes. It's annoying to pay for papers when I have already paid for them.
I think there's a lot of researchers that see Matlab as a necessary evil. There's just such a developed ecosystem of tools and labs are reluctant to migrate. I would like to see some sort of bounty system for migrating Matlab code to Julia. In ultrasound physics there's packages like Field II and FOCUS for Matlab which I would like to see migrated and I'd be happy to chip in to some fund to make that happen.
Yeah, matlab for experimental work is one thing. If there are no other packages that allow users to connect with their experiments, then it's the only option. I would love to see Julia take over the role of Matlab, though. Have you gone on the Julia Slack or Discourse to see if there is development in the areas you need for your research?
@@LeiosLabs I've looked, but nothing has anywhere close to feature parity with the Matlab packages I referenced. To dethrone Matlab you need a critical mass of labs to switch over, and that's not going to happen quickly if necessary packages are missing. Unfortunately, the labs that are reluctant to switch are the very ones best suited to migrate the libraries to Julia. It's a bit of a chicken-and-egg problem. What do you think of a bounty system to reward open-source research programmers?
Watching this video while working on a Matlab App Designer web app for a paper. PhD in chemistry. Everything you say is true. I've been watching Python tutorials lately; I hope to escape this landscape and become a proper developer, because I know full well this is not the proper way to do programming.
Hello James, congrats on your PhD, and thanks for bringing up this topic. As a computational scientist, I have needed to write a lot of different codes from my bachelor's project through my postdoc, yet only very recently did I learn about git. Such a shame! Regarding publishing codes and making them open access, I totally agree: at the very least, it forces some documentation to accompany the code itself. However, I'm not really sure whether reviewing the codes is a good idea. Imagine that for a particular work you need to write in different languages, e.g. bash scripts for Linux, Matlab, and Fortran. That doesn't happen all the time, but it's possible, and expecting the reviewers to be familiar with all of these languages doesn't seem realistic to me. Moreover, if a code is supposed to model a phenomenon, shouldn't its results be the main concern in the review process? Perhaps it would help if publishers made it mandatory to publish the accompanying code (+ proper documentation) with each paper, even without a formal review, and let the readers/users decide for themselves.
This is a great perspective and good discussion! I still say that we need some sort of review. If a reviewer doesn't know fortran (or the language the code is in) and is reviewing for a journal, that's totally fine, but *someone* should know fortran in the review process. A good solution would be to link most paper reviews to JOSS, so people can review the software independent of the scientific review. Part of the current problem is due to the fact that people don't know the languages / methods used in the field, but still review for that field. I would argue that if they can't read the code, they shouldn't review the paper. Obviously, this is unrealistic in practice, so for a good first step, we should at least publish code with the paper (like you said).
The issue is not about the correctness of the code, rather the reproducibility of the results which is (was maybe) a pillar of the scientific process. You can always describe your algorithms with pseudocode, but that's not feasible for large projects. Moreover, people in academia usually want to keep their software for themselves, so as to retain an advantage above potential competitors. That's why services like CodeOcean are gaining popularity. They effectively provide an interface "shield" between your code and the outside world, allowing people to experiment with it as a black box (very important for scripting and interpreted languages).
As a theoretical physics researcher, it is hard to imagine software engineers complaining about funding; literally all the funding at my university goes either to software engineers or to bio/med researchers :p
That's interesting to hear / thanks for the input! I am sure a lot of funding goes into software engineering, but is it for research software or for general-purpose software?
Have you had any GUI-related problems with academically developed, commercial research software? I find that research software GUIs tend to follow anti-design-patterns similar to ones encountered in other topic-specific software fields like music composing, as showcased by Tantacrul on his channel in his "Music Software & Interface Design" series. I watch his "diatribe" videos because I find it healing after long hours working with research software GUIs.
Peer-reviewing research code is next to impossible because you need a constant supply of academics with interdisciplinary knowledge: for example, someone who has a very good understanding of computational quantum mechanics for a specific experiment AND of simulation methods in Python, including familiarity with a particular tech stack.
This is something that troubles me and makes me feel weird about pursuing research long-term (pursuing a PhD). I love it, but I feel it requires a lot of reforms.
I see the research code problem. Additionally, the code is usually not maintained after publishing, so even if it was well developed, there is no guarantee it still works when someone becomes interested in it. These days that happens not after 10 years, but after just a few. (For instance, I suspect the CUDA version, compute capability, and GPU architecture would be the issue in your case. A Matlab code would probably keep running longer.) But this is quite a hard problem, and I think it should not depend on an individual's effort; ideally, there would be some systematic support. Thanks for focusing on a fundamental problem.
Totally agreed! I think a lot of researchers use the fact that software is constantly evolving as an excuse *not* to review it. The way I see it, maintenance of code is a huge issue as well and hard to do right. I think the best we can do is provide the version numbers and such at publication; this will allow the code to be run for at least a decade or so. At this stage, just reviewing code is already a big step forward.
This is only tangentially related but, as someone who learned to program long before learning any higher mathematics or physics, it has always irked me that variables and constants in mathematical formalism are even worse than Hungarian notation in programming. Too many papers fail to define all their domain-specific symbols, and, if you read papers from 100 years ago, you have to chase down obsolete definitions. That's before you even get to all the situations that mix multiple definitions of the same symbol (like elementary charge and Euler's number) in the same equation. It's just begging for people to make mistakes. Mathematics could stand to import some best practices from software development.
Yeah! I completely agree with this! In some codes, theta is an angle. In others, it's temperature. It's hard to keep everything straight and is another complaint I have about a lot of research scripts! I am alright with it iff there is a comment somewhere in the code to some text that has similar notation or if there is a table of symbols somewhere available, but most people don't do this.
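As a concrete sketch of that "table of symbols" idea, in Python (the names, values, and the script it supposedly belongs to are all hypothetical, purely for illustration):

```python
import math

# Symbol table for this script (mirrors the notation of the paper it implements):
#   T     : temperature [K]
#   theta : polar angle [rad]  (NOT temperature here!)
#   E     : energy [J]
#   k_B   : Boltzmann constant [J/K]
k_B = 1.380649e-23  # Boltzmann constant [J/K]

def boltzmann_factor(E, T):
    """Return exp(-E / (k_B * T)) for energy E [J] at temperature T [K]."""
    return math.exp(-E / (k_B * T))
```

Even when single-letter names stay (to match the paper), a header like this resolves the theta-vs-temperature ambiguity at a glance.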
@@LeiosLabs Make that 3. I learned Matlab in university and actually sort of like it, but I'm genuinely interested in what you have to say about its flaws.
Hello! I found your video through hearing of "research software engineers," and I am very curious. I understand your present concerns with your current career. I am looking for advice. I received a bachelor's degree with math and cs. I am personally very interested in getting into the HPC background and would like some recommendations because I am not sure where to even begin. I am considering going back to school too, but I am also not sure about funding and such. What would you recommend to do if you were me? Also, I would like to try out my own pet project to get some introduction to the subject. I've heard of OpenMPI and some other things for parallel computing. If you could suggest a beginner level project, what would you recommend?
So there are a bunch of different "ways to start" with RSE. The best way is probably to just e-mail people, state your background, and ask if they need some programming help. You will learn the tools along the way. As for a pet project, it's kinda hard to say. There's a difference between parallelism and *distributed* parallelism that you need for HPC. Going from parallel to distributed is not an easy step, but can only really be done if you have a cluster available to mess around with. You can still learn the tools necessary, though (MPI, mainly, but also CUDA for GPUs). I might recommend looking into the Julia ecosystem at this time as I know people are looking for help with their distributed setup and having your name associated with those tools will probably help out if people are looking for Julia positions. That said, most of the RSE code is either in C(++) or Fortran, so knowing those languages might be a bit more useful.
@@LeiosLabs Thank you for the prompt response! I can definitely have a look into those languages and technologies you mentioned. When you say e-mailing people, do you mean university professors or people via LinkedIn? I imagine professors. One issue I came across was one professor couldn't take my assistance unless I was a student at their university. Can some professors be flexible regarding that?
@@thej680 Not necessarily university professors. People who are writing papers or software you are interested in. Most of the code for RSE is open source and you can probably start collaborating on github pretty quickly.
I once had a summer project improving someone's simulation software, and it was dreadful wading through the pages of nonsense code. (Also, nice shirt change at 6:37)
As a high school graduate, is it economically worthwhile to get into research in fields related to machine learning, computer science, and software engineering if I'm very interested in them?
Yeah, definitely! If you want to go down the pure research route, there is funding, and there are interesting research opportunities. If you want to pursue these fields in industry, there is also plenty of funding available. My point here is just that software engineering is not well integrated into research in all fields. It's getting better, and it's a good idea to ride the wave now while people are starting to see the need for better integration.
Yikes. Physicist here. I can confirm everything you say here. I've used code written in Fortran 20 years ago in my research, and it's constantly bugging out with every Fortran compiler update. Why? Because it was no one's priority to rewrite it in a more convenient language. When the project fell into my hands, I had some time to get results, and then I moved on. Someone is going to inherit this code and probably do the same thing. Then there is my analysis code in Mathematica. It contains my own implementation of a few different algorithms described in papers, but there is no way to check if my implementation really is bulletproof. Collaborators don't really care as long as it produces publishable results. What gives me some peace of mind is that in an active field of research, a result wildly different from the expected one will come under more scrutiny.
Fortran is a perfectly convenient language for scientific computing; you could even argue that it is the best for certain applications. Chances are that your code was badly written from the get-go (maybe using non-standard, compiler-specific extensions). Fortran is and always will be backward compatible with previous versions of the standard, but compilers issue warnings if you use obsolete constructs.
@@LeiosLabs The problem is that you can't really defend any other solution because that's all researchers know and telling them to use anything else is seen as trying to "break something that works"
In my opinion, the fact that research code is so bad (for computer science research) is inexcusable. Many times a CS undergrad could write code that actually works, with better documentation, that can be run by anyone on any machine and reproduced. It's just laziness, partially incompetence, and I don't think the code should be written at all if it is going to be bad. It's just that you have people who aren't trained as software engineers (yet are in computer science!) writing code and it's complete spaghetti. Do researchers know what they're doing? Yes, when it comes to writing papers. Should researchers be writing code? In my opinion, no not really. It's a waste of time if you're terrible enough at software engineering because it won't be reproducible anyways. If researchers HAVE to write bad code though, I would really hope that those who do just put in the readme something along the lines of "we just did this so we wouldn't get yelled at by reviewer #2 for not having an implementation. Our software is terrible and probably doesn't even work, and we barely remember how to get it to work because we're writing another paper already. You're better off writing your own implementation." That way I can stop wasting my time with the code, because 99% of the time it's easier to implement something from scratch with good software engineering practices than it is to try to get something built with terrible software engineering practices to work.
It definitely wasn't! I've had this discussion at least a dozen times with other folks in a similar position as I am, and we all kinda agreed on these points. I hope all is well!
Yeah, I tried to make sure people knew I was biased about that part. I have not found a single use-case in my own research where using matlab would be better than python/julia, and most of the time, it's flat-out infuriating, losing me hours of time. I appreciate that others have a different perspective and find the language useful, but I do not think I will ever be able to get over my biases about it.
@@LeiosLabs I completely agree; Matlab has its quirks and aggravates me just as any other language would. Personally, I was taught Matlab in my engineering degree, with a math backbone, so Matlab's syntax for math makes some sense. That being said, counting from 1 is stupid.
@@Ddddddddddd381 Counting from 1 is the least aggravating thing, honestly. I also don't mind the formula specification because I was trained as a physicist. I genuinely hate how radically different its syntax is from everything else, but I could see why some people like that. It's everything else that bugs me, like how:
- you can only have 1 function per file
- loops are slow
- structs / classes are poorly optimized
- good luck with graph / tree methods
- you cannot edit files outside of the provided IDE, otherwise Matlab doesn't recognize the change
- it's licensed and checks for that license on startup, which is doubly annoying when running Matlab on distributed nodes, because it then checks licenses n times, where n = number of nodes. The license, combined with the radically different syntax, actually makes the software predatory because it locks users into a system that they *have* to pay for and cannot escape from easily.
- it crashes almost every time I try to run anything reasonably complex.
I mean, if it's an esoteric language, it's an esoteric language. It might have some form of historical precedence as well, but it still boggles my mind how people put up with it in 2020 when other, better languages exist for prototyping.
@@LeiosLabs What a fantastic explanation, thank you so much! I haven't personally come across those issues because I haven't done anything really complex with it, but I really do respect your expert opinion.
It was there, under "languages and frameworks"; I just gave Julia as my example instead of numpy or Matlab. Its underlying libraries were also there under the same section, with BLAS and LAPACK. Again, there is way too much research software out there, so it was not possible to list it all on that slide.
28 right now. Started doing research at 21, so I am still relatively new to the scene. I was doing software development well before that, though! In addition, the thoughts presented here came about from long discussions with my peers, so it's not like the arguments made were only from my own perspective!