Rahul Gopinath
Lecturer at the University of Sydney, Australia. ശ്രീദേവി's Dad. I work at the junction of Software Engineering and Cybersecurity. Interested in Program Analysis, Automatic Repair, Mutation Analysis, Specification Mining, Grammar Based Generators and Parsing. My website is at https://rahul.gopinath.org

Just recently read the paper "Delving into ChatGPT usage in academic writing through excess vocabulary" by Kobak et al. Their premise (from the abstract) is that the [models] can produce inaccurate information, reinforce existing biases, and can easily be misused. So the authors analyse PubMed abstracts for vocabulary changes, and identify certain words that have become more common post-LLM. They find that words such as "delves", "showcasing", "underscores", "intricate", "excel", "pivotal", "encompassing", and "enhancing" all show increased usage, and are hence suspect.

While this data is indeed interesting, I wonder why LLMs tend to use these words. Aren't LLM outputs supposed to be a reflection of the data they are fed in training? Surely that means these words are more common in some training data set than we expect?
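The core of the excess-vocabulary idea can be sketched in a few lines: compare a word's relative frequency in abstracts before and after a cutoff date. This is only a toy illustration of the method, not the authors' actual analysis code, and the sample abstracts are made up:

```python
# Toy sketch of an excess-vocabulary analysis (illustrative, not Kobak et al.'s code):
# compare a word's relative frequency in abstracts before vs. after a cutoff.
from collections import Counter

def word_freq(abstracts):
    """Relative frequency of each word across a list of abstracts."""
    counts = Counter(w for a in abstracts for w in a.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def excess_usage(pre, post, word):
    """Ratio of a word's relative frequency post-LLM vs pre-LLM."""
    f_pre = word_freq(pre).get(word, 0.0)
    f_post = word_freq(post).get(word, 0.0)
    return float('inf') if f_pre == 0 else f_post / f_pre

# Hypothetical abstracts, purely for illustration.
pre = ["we study mutation analysis", "we study parsing"]
post = ["we delve into mutation analysis", "this delves into parsing intricacies"]
print(excess_usage(pre, post, "delves"))  # word absent pre-LLM
```

A word that never appears in the pre-LLM corpus but does afterwards gets an infinite ratio here; the real analysis is of course far more careful about corpus size and normalisation.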

Our paper "Empirical Evaluation of Frequency Based Statistical Models for Estimating Killable Mutants", on evaluating models for estimating equivalent and killable mutants, was accepted by ESEM 2024. The paper is here. #ESEM2024 #Equivalentmutants #mutationanalysis

I am visiting ANU Canberra on Friday. If you are around and are interested in what I do, please come talk to me.

Parsing JSON is indeed a minefield. However, a commenter on HN has a suggestion: use PostScript instead of JSON. It has a binary format, has comments, and generally looks much better. Here is their provided example:

    <<
    /first_name (John)
    /last_name (Smith)
    /is_alive true
    /age 27
    /phone_numbers [
        << /type (home) /number (212 555-1234) >>
        << /type (office) /number (646 555-4567) >>
    ]
    /spouse null
    >>

And I agree, it is much better than JSON. There are many other interesting things to like here. For one, the keys are symbolic: a single character `/` marks each key in a dictionary, which reduces visual clutter to a great extent. Using `<<` and `>>` for dictionaries is also great; dictionaries are among the largest data-carrying units in such formats, and it is better to use two characters for their delimiters. Using `(` and `)` for strings provides distinct starting and ending delimiters, which is visually easier to parse than `"`. There are no commas in arrays or dictionaries, removing the question of trailing commas. Overall, PON (PostScript Object Notation) is much better designed than JSON for human readability.
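To make the notation concrete, here is a minimal Python sketch that serializes a plain dict into this PON-like style. The function name and the exact layout are my own invention for illustration; this is not an actual PON library:

```python
# Illustrative sketch: serialize Python values into the PostScript-like
# object notation ("PON") discussed above. Not an official library.
def to_pon(value, indent=0):
    pad = "    " * indent
    if isinstance(value, dict):
        # /key value pairs, one per line, wrapped in << ... >>
        items = "\n".join(
            f"{pad}    /{k} {to_pon(v, indent + 1).lstrip()}"
            for k, v in value.items()
        )
        return f"{pad}<<\n{items}\n{pad}>>"
    if isinstance(value, list):
        # no commas between array elements
        items = "\n".join(to_pon(v, indent + 1) for v in value)
        return f"{pad}[\n{items}\n{pad}]"
    if isinstance(value, str):
        return f"{pad}({value})"   # parenthesized strings
    if value is None:
        return f"{pad}null"
    if isinstance(value, bool):
        return f"{pad}{'true' if value else 'false'}"
    return f"{pad}{value}"         # numbers

print(to_pon({"first_name": "John", "is_alive": True, "age": 27}))
```

Note that the `bool` check must come before falling through to the numeric case, since Python booleans are also integers.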

Are you attending the Singapore Summer School on Fuzzing? Here's what my students and I have planned for Monday, fitting in alongside more great talks and tutorials by Abhishek Arya, Marcel Böhme, Lim Min Kwang, Mathias Payer, and Thuan Pham. Details at fuzzing.comp.nus.edu.sg

I am co-founding a new startup! #InputLab creates test data for thousands of formats from electronic invoices to retail orders, covering all input features – and we just got an 800k€ initial funding to start as a #CISPA spin-off in September.

We are #hiring, especially experts in constraint solving and test generation – and we are looking out for #collaboration partners and early #adopters from industry and public service. Check us out at inputlab.net! #startup #XML #softwaretesting

Is fussing the Australian spelling for fuzzing?

One great thing about LLMs such as GPT is that I no longer have to suffer through the very condescending "Oh, you are asking for X, but surely you meant Y?" followed by the question being closed as duplicate/irrelevant, etc.

A very important visitor to #UsydSE #USyd


From the left: Andreas Zeller, me, Jack, Nelum, and Alistair. Danushka (taking the picture) is not in the frame.

USyd software engineering team with @andreaszeller@mastodon.social on our recent trek.

USyd Software Engineering Team