The Declining Performance of ChatGPT Plus/GPT-4 Over the Past 4-6 Weeks

If you follow the AI space closely, you may have read about or watched OpenAI's first developer conference earlier this week. It is hard not to be amazed by what they announced, both in terms of current product usage and the new products themselves.

This post is not about that. It is about the performance decline I have noticed over the past 4-6 weeks. This is purely based on my personal experience (and no, I haven’t done scientific research on it).

So what is the TL;DR?

  1. ChatGPT 4 (web version) performance is noticeably worse for both writing and coding tasks over the past 4-6 weeks.
  2. GPT-4 Turbo’s reasoning capability seems to be worse than GPT-3.5’s or GPT-4’s.
  3. I am still a ChatGPT Plus subscriber and still use the OpenAI API for my chatbot.

Let me share more details.

ChatGPT 4 performance is noticeably worse for both writing and coding tasks over the past 4-6 weeks

As someone who uses the ChatGPT Plus web interface daily, I can painfully notice the performance issues, especially over the past 4-6 weeks. What are the symptoms?

For writing

  • The writing quality (especially tone of voice and the ability to follow detailed instructions) is noticeably worse.
  • It repeatedly fails to follow revision requests for writing. It got bad enough that I started paying Anthropic to use Claude Pro.
  • For the past year, I have developed the habit of relying on ChatGPT extensively for writing, proofreading, etc., and I was afraid that this had made me too lazy to try new tools. Well, no more: now I use Claude Pro more and more for drafting, content review, and other writing tasks.
    • I also like Claude’s much longer context window vs. ChatGPT 4 (for now, until GPT-4 Turbo is widely rolled out).
  • Claude is still quite bad at basic math, though. For example, I often need a meta description for each blog post (for SEO purposes), so I write something like this very often: “Give me 5 different meta descriptions for the above blog post content, in different styles, with the purpose of encouraging users to click and read the blog post content. The meta description has to have a maximum of 140 characters, including spaces.”
    • Claude repeatedly gave me much longer meta descriptions, even after I told it to cut them short.
    • ChatGPT used to do this task well, but not in the past 4-6 weeks.
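Incidentally, the 140-character constraint is easy to verify programmatically rather than eyeballing it. A minimal sketch (the candidate descriptions below are made up for illustration, not real model output):

```python
# Flag generated meta descriptions that exceed a character limit.
# len() counts spaces, matching the "including spaces" requirement.
def check_meta_descriptions(candidates, max_chars=140):
    """Return (text, length, within_limit) for each candidate."""
    return [(text, len(text), len(text) <= max_chars) for text in candidates]

# Illustrative candidates only
candidates = [
    "Why ChatGPT quality seems to have dropped, and which tools to try instead.",
    "An overly long meta description that keeps rambling on and on, well past the "
    "point where any search engine would stop displaying it in the results page.",
]
for text, length, ok in check_meta_descriptions(candidates):
    print(f"{'OK' if ok else 'TOO LONG'} ({length} chars): {text[:40]}...")
```

A check like this could even be looped back to the model automatically: regenerate until every candidate passes.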

Side note: I also noticed that recently, when you ask ChatGPT to write an entire article for you, it refuses to do so, which I think is a good step; it will help reduce spammy content online. Previously, it was too easy to ask ChatGPT to write a 4,000-word article about a specific topic: it would first give you the outline, ask for feedback, then proceed to write the entire 4,000-word article. It no longer does that.

For coding

  • ChatGPT (web version) gets lost easily in coding tasks; it can’t seem to remember the code it wrote just a few minutes ago, even within the same session.
  • It fails to follow detailed instructions to correct a coding issue. For example, I gave it the entire code for my application, then shared an example from another project with a function I wanted to include.
    • Then I asked GPT-4 to use the example and revise the code for my application. Its response was so off the mark that it was of no use to me. I tried to steer ChatGPT back in the right direction a few times, but it still couldn’t do it.
    • When I repeated the same exercise on Phind, it gave me exactly what I needed after one try. (Caveat: I have just started trying Phind, so I don’t know yet how it will perform vs. ChatGPT for coding overall, but the first impression is good.)
    • For those who like specifics, the example I gave ChatGPT is this. I told it that I liked step 6 in the example, where the model was asked to evaluate its own response to see if it sufficiently answered the user’s query. ChatGPT failed to use this example and revise my application code to include this function.
# Step 6: Ask the model if the response answers the initial user query well
    user_message = f"""
    Customer message: {delimiter}{user_input}{delimiter}
    Agent response: {delimiter}{final_response}{delimiter}

    Does the response sufficiently answer the question?
    """
    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]
    evaluation_response = get_completion_from_messages(messages)
    if debug: print("Step 6: Model evaluated the response.")
  • Its ability to debug is noticeably worse. 🙁

GPT-4 Turbo’s reasoning capability seems to be worse than GPT-3.5’s or GPT-4’s

What do I mean by this?

Well, like many people, I was eager to try GPT-4 Turbo because it is a lot cheaper than GPT-4 and has a much longer context window. As mentioned earlier, I couldn’t use the GPT-4 API for my chatbot because it is too expensive. I recently implemented a self-evaluation step for the chatbot before its reply is shown to users. The question is: “Does the response sufficiently answer the user’s question?”

GPT-4 Turbo fails repeatedly at this step, while GPT-3.5 and GPT-4 do fine. I am using the exact same code and prompts; the only change is the model name passed to the API. I tested this across multiple questions/prompts.
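To make the comparison concrete, here is a hedged sketch of that self-evaluation step. The completion function is injected so the snippet runs without an API key; the stub “models” and their answers are illustrative assumptions standing in for real API output, and the system prompt is my paraphrase, not the exact one from my chatbot.

```python
# Sketch of the self-evaluation gate, with the completion function injected
# so it runs offline. In practice get_completion would wrap the OpenAI chat
# API and only the model name would change between runs.
def self_evaluate(user_input, final_response, get_completion):
    delimiter = "####"
    user_message = (
        f"Customer message: {delimiter}{user_input}{delimiter}\n"
        f"Agent response: {delimiter}{final_response}{delimiter}\n\n"
        "Does the response sufficiently answer the question? Answer Y or N."
    )
    messages = [
        {"role": "system", "content": "You evaluate chatbot answers."},
        {"role": "user", "content": user_message},
    ]
    verdict = get_completion(messages).strip().upper()
    return verdict.startswith("Y")  # only a clear "Y" lets the reply through

# Illustrative stubs for the behavior described above
stub_gpt35 = lambda messages: "Y"       # GPT-3.5: passes the check
stub_gpt4_turbo = lambda messages: "N"  # GPT-4 Turbo: repeatedly fails it

print(self_evaluate("What are your hours?", "We are open 9-5.", stub_gpt35))       # True
print(self_evaluate("What are your hours?", "We are open 9-5.", stub_gpt4_turbo))  # False
```

Structuring it this way also makes the gate testable: you can swap in any model (or a stub) without touching the evaluation logic.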

So what did I end up using? I will continue with GPT-3.5 for now, until GPT-4 Turbo’s reasoning capability becomes better or more reliable.

So why am I sharing all of this?

Based on my limited real-life experience working with ChatGPT and the OpenAI API, I think there are still many opportunities to improve these models and functions. If you just watch the developer conference, you may feel that OpenAI is so far ahead of everyone else that there is no chance of catching up. But I think the race is still very much alive. Yes, OpenAI has a giant leg up since they “solved” the distribution problem, given their word-of-mouth growth and current scale (100M weekly active users). But if you have a truly better product, you still have a very good chance of reaching massive scale. According to the No Priors hosts, these are the areas to improve right now to 10x or 100x model performance:

1. Multi-modality

2. Long context window

3. Model customization

4. Memory: AI remembers what it was doing

5. Recursion

6. AI router: smaller/specialized models controlled/orchestrated by a main/larger model.
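The router idea can be illustrated with a toy dispatcher: something cheap inspects each query and decides which model should handle it. The model names and keyword rules below are illustrative assumptions only; a real router would itself be a small classifier model, not a keyword match.

```python
# Toy "AI router": a cheap rule-based dispatcher picks a specialized model
# per query. Model names and keyword rules are hypothetical placeholders.
SPECIALISTS = {
    "code": "small-code-model",
    "math": "small-math-model",
    "general": "large-general-model",
}

def route(query: str) -> str:
    q = query.lower()
    if any(k in q for k in ("function", "bug", "python", "compile")):
        return SPECIALISTS["code"]
    if any(k in q for k in ("sum", "calculate", "equation")):
        return SPECIALISTS["math"]
    return SPECIALISTS["general"]  # fall back to the main model

print(route("Fix this Python function"))    # small-code-model
print(route("Calculate the sum of 1..10"))  # small-math-model
print(route("Tell me a story"))             # large-general-model
```

The appeal is cost and latency: most queries never need to touch the biggest, most expensive model.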

Last but not least, while the tone of this blog post may come across as quite negative, I am still a ChatGPT Plus subscriber and I still use the OpenAI API for this blog’s chatbot. 🙂

I hope that over the next few weeks, as GPT-4 Turbo officially rolls out and OpenAI works through these issues, we get the same quality back. I also suspect they are experiencing this dip in performance because too many people are using, or trying to use, the API and web version.

That’s it from me. Let me know what you think in the comments below.
