Jeff's diary on LLM / generative AI


Jeff records headlines from the AI field and adds some personal comments on them.

4 May 2023 - Leaked Memo from Google "We Have No Moat, And Neither Does OpenAI"


The memo argues that despite significant investments by Google, Meta, and OpenAI in generative AI chatbots, open-source alternatives may rapidly outpace them all.

  • Open-source innovation on LLMs is very rapid (in both performance AND cost-effectiveness), outpacing the closed-source models from big tech
  • The rise of open-source AI raises concerns about responsible AI release and potential misuse for criminal purposes, challenging safety commitments made by AI companies
Jeff: This is baffling, because big tech has long benefited from open source, by offering value-added cloud services or enhanced versions of certain open-source technologies.

8 May 2023 - I received project queries asking whether an LLM can be connected to a knowledge base so that it answers questions with 99.9% factual accuracy.

Their requirements are simple: they are not interested in using ChatGPT. They want the LLM to read private company data or documents, and answer questions based on them.

  • Private law firms: the knowledge base is their past cases.
  • Property firms: past property transactions, and info on properties currently under their care.

Example hobby project: Your AI second brain.

Posted the question to Kuan Pern.

TKP replied:

  • There is already a product in the market doing that.
  • LangChain - an open-source library that can be used to do that.
  • Llama - there's a param you can set to make the model produce output that overlaps with the input context.
  • Use a LoRA adapter to fine-tune the model on the specified knowledge base.
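
For reference, a minimal sketch of what the LoRA route looks like, assuming the Hugging Face transformers + peft stack (an assumption on my part; the model name is just an example):

```python
# Sketch: attach a LoRA adapter to a base model so only the small
# adapter weights are trained during fine-tuning.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
)
model = get_peft_model(base, config)       # base weights stay frozen
model.print_trainable_parameters()         # tiny fraction of total params
```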

I don't have any clue about tuning or fine-tuning LLMs, so I explored the LangChain route.

9 May 2023 - How to build a Q&A bot with an LLM and a private knowledge base?

The way to do the project is to store the documents in a vector database or vector index. Before posting the question to the LLM, run a semantic query on that vector DB to retrieve the relevant documents. Then feed the question plus the relevant docs to the LLM, and hopefully get back factual answers.
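
A minimal sketch of that flow, assuming a sentence-transformers embedder and a FAISS index (both my assumptions; any embedding model and vector store would do), with ask_llm() as a stand-in for whichever LLM API you use:

```python
# Minimal RAG sketch: embed docs, retrieve by semantic similarity,
# then feed question + retrieved docs to the LLM.
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "Case 123: tenant dispute resolved via mediation in 2019.",
    "Case 456: contract breach, settled out of court.",
]  # toy stand-ins for the private documents

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs)              # (n_docs, dim) float32 array

index = faiss.IndexFlatL2(doc_vecs.shape[1])  # exact nearest-neighbour search
index.add(doc_vecs)

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API here")

def answer(question: str, k: int = 2) -> str:
    q_vec = embedder.encode([question])
    _, ids = index.search(q_vec, k)           # semantic query on the vector DB
    context = "\n\n".join(docs[i] for i in ids[0])
    return ask_llm(
        f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"
    )
```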

1 ~ 7 July 2023 - Explored LangChain.

It is a framework for making AI apps with LLMs, by building the correct prompts for an LLM, or for LLMs connected in a graph. It can connect to a vector DB of domain knowledge using LlamaIndex (another library).

GPT models are stateless, so the memory of past conversations has to be included in the prompt on every interaction. LangChain simplifies and automates that.
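
A toy illustration of what that means in practice; chat_completion() here is a placeholder for any chat-style LLM API:

```python
# Stateless model: the full history must be resent on every call.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat_completion(messages: list) -> str:
    raise NotImplementedError("plug in your chat LLM API here")

def chat(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    reply = chat_completion(history)   # ALL prior turns sent every time
    history.append({"role": "assistant", "content": reply})
    return reply
```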

You can build a graph of multiple LLMs, where one's output is fed into the next. The idea is that domain-specific models could be used to answer domain-specific questions, and a combination of many such models could be more powerful than any individual model, much like a random forest in classic ML.

Evaluation of the model could be automated with an LLM too. The most advanced use of LangChain is building agents.

An agent is an app that helps to answer a complex question by breaking it down into multiple questions, answering them one by one with the ordinary Q&A app, then combining the answers and applying reasoning to answer the original question.
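
A rough sketch of that decompose-answer-combine loop; llm() is a placeholder for a single-prompt LLM call and the prompts are illustrative only:

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API here")

def agent(question: str) -> str:
    # 1. break the complex question into sub-questions
    subs = llm(f"Break into simpler sub-questions, one per line:\n{question}")
    # 2. answer each sub-question with the ordinary Q&A app
    partials = [llm(s) for s in subs.splitlines() if s.strip()]
    # 3. combine the partial answers to reason about the original question
    return llm(
        f"Question: {question}\nPartial answers:\n" + "\n".join(partials)
        + "\nCombine the partial answers to answer the question."
    )
```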

10 July 2023 - I finished checking out LangChain. The abstractions look very similar to DAGs in Airflow. The prompts to the LLM are very important, but LangChain hides them.

The library doesn't add much value. You can build an AI app without using LangChain at all, and it could be easier to maintain.

Its abstractions are over-engineered, and the library was over-marketed in the AI hype. It may not be the best library; some newer library will come to replace it.

11 July 2023 - Roughly explored the cost of hosting an LLM:

Compared OpenAI pricing, Google Colab Pro+, AWS SageMaker, GPU + bare metal, and serverless options.

Rules of thumb for the cost of the different options:

  • 0 ~ 14,400 queries / month: serverless < OpenAI ChatGPT interface < Colab Pro+
  • 14,400 ~ 28,800 queries / month: serverless < OpenAI GPT-3.5
  • Until you hit US$1,000 / month: AWS SageMaker < GPU + bare metal
  • Until you hit US$5,000 / month: GPU + bare metal

This is a cost-flexibility trade-off: the more you spend, the more you can tune the infra and swap in the newest models.

18 July 2023 - Meta introduced Llama 2, "the next generation of our open-source large language model."

19 July 2023 - Meta was criticized for abusing the term "open source": they have not released the source code or training data, merely the model binary and weights, and their license restricts commercial use.

Jeff: Their intention was likely to replicate the accidental leak of Llama 1, which successfully triggered open-source innovations that were then back-ported into their mainline development. This could become the de facto standard for how commercial companies release AI models: benefit the community first, then benefit themselves.

The open-source status of LLM models.

13 August 2023 - Azure ChatGPT: Private and secure ChatGPT for internal enterprise use

This is a huge deal for all those enterprises looking to maintain control over their data.

But why not Llama2?

Llama 2 might by some measures be close to GPT 3.5, but it’s nowhere near GPT 4.

The closed source players have the best researchers - they are being paid millions a year with tons of upside - and it’s hard to keep pace with that. My sense is that the foundation model companies have an edge for now and will probably stay a few steps ahead of the open source realm simply for economic reasons.

Over the long run, open source will eventually overtake. Chances are this will happen once the researchers who are making magic happen get their liquidity and can start working for free again out in the open.

Jeff: Big tech making money with their closed-source models while they can.

14 August 2023

Jeff: Given the sensitivity of the data involved, I believe some companies prefer locally hosted solutions (e.g. Llama 2) even over cloud-based ones.

A comprehensive guide to running Llama 2 locally

But GPT4 still ranks top in benchmarks.

12 September 2023 - People are trying to fine-tune Llama 2 to make it beat GPT3.5 (on performance and cost).

It's freaking hard, because OpenAI priced GPT-3.5 aggressively cheap to make it a no-brainer to rely on them rather than on other vendors (even open-source models). And fine-tuning requires a high-quality dataset, which is not easy to construct.

25 September 2023 - Microsoft released AutoGen, a framework like LangChain that helps to build agent or multi-agent apps.

28 September 2023 - Amazon Bedrock is now generally available

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs (foundation models) from leading AI companies including AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon, along with a broad set of capabilities that you need to build generative AI applications, simplifying development while maintaining privacy and security.

31 October 2023 - Mozilla cautions against hasty regulation that might lead to concentrations of power. Joint Statement on AI Safety and Openness

3 November 2023 - In a rare moment of global unity, leaders from 28 countries and select technology companies came together recently to agree on the safe and responsible development of artificial intelligence (AI).

Cynical Jeff: Is this a legitimate concern? Or a move by the big brothers and the corporates to suppress open-source AI development and control the tech for their benefit? Contrast this with crypto: regulation came many years after its inception.

6 November 2023 - OpenAI released a bunch of new features to increase lock-in to their ecosystem, including custom GPTs, a new feature that allows users to create custom versions of its AI assistant that serve different roles or purposes (this will compete with many OpenAI "wrapper" startups).

Jeff: A business built around other people's API, priced at their prices, has no moat.

7 November 2023 - Met with Kelvin Quee. He talked about an AI use case in lead qualification: validation of phone numbers at scale.

  • Lead gen firm making phone calls to business contacts.
  • Using LLM + text-to-speech + speech-to-text to interpret the responses to the phone calls.
  • Detect automated answering machines
  • Add additional info to the lead, depending on the conversation
    • Whether the number belongs to the company
    • Whether the person works for the company
    • Whether the person owns the number

He also mentioned certain orgs like to host the LLM model on-premise, because of data privacy and security.

Met with Andrew Liew. He talked about an AI use case he has discussed with GIC's HR and one more unnamed company: vetting CVs/resumes at scale.

Groq's Language Processing Unit (LPU) is a significant development in the field of AI hardware that can process large language models much faster than current solutions.

The LPU was designed by Groq to specifically handle the computationally intensive tasks required for processing natural language with very large AI models containing billions of parameters. It aims to overcome limitations of GPUs and CPUs by providing greater compute density and memory bandwidth optimized for language tasks.

The LPU has demonstrated the ability to run extremely large language models with over 70 billion parameters at record speeds, significantly faster than what can be achieved using GPUs and CPUs from companies like NVIDIA, AMD, and Intel.

17 November 2023 ~ 23 November 2023 - Sam Altman's firing and return at OpenAI. This is more likely a peculiar phenomenon of SV tech startups, where egos get the better of some young tech leaders.

28 November 2023 - Andrew asked for advice on how to meet the customer's requirement of processing 200 CVs in 5 minutes. This is impossible given the OpenAI GPT3.5 API's limit of 40k tokens per minute (assuming, say, ~2k tokens per CV, 200 CVs is ~400k tokens, i.e. 80k tokens per minute over 5 minutes, double the limit). I asked him to try a foundation model, Llama 2, on Amazon Bedrock. I further explained that a self-hosted model is out of the question unless he has scaled up to US$1,000 / month revenue.

Hosting cost for Llama-2-13b-chat:

| Option | Price |
| --- | --- |
| Amazon SageMaker ml.g5.12xlarge (default) | $9.923175/hour, i.e. ~$7,145/month |
| Amazon Bedrock | $0.00075 per 1k input tokens (output omitted) |

How many tokens can I process on Bedrock for $7,145? $7,145 ÷ $0.00075 per 1k ≈ 9.5 billion tokens.

(Note: GPT3.5 accuracy is on par with Llama-2-34b after fine-tuning.)

1 December 2023 - AI tooling review

  1. Ollama: Get up and running with large language models locally.
  2. On-demand GPU cloud: Lambda, Runpod
  3. oobabooga / text-generation-webui: A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models. OpenAI-compatible API server with Chat and Completions endpoints. Training with LoRa
  4. TheBloke: some dude who produced many models on huggingface
  5. WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions
  6. Anyscale: Run, Fine Tune and Scale LLMs via production-ready APIs, serverless service somewhat like Amazon Bedrock.

3 December 2023 - Rough calculation of the cost if we self-host an LLM on Runpod for Andrew's use case.

Assume an A100 GPU instance at $1,500 / month. E.g. 1 customer with 20 job openings x 100 CVs each = 2,000 CVs, i.e. $0.75 / CV.

Sunk cost: there is no upfront cost per se; if you turn on the VM for 1 month, you pay for 1 month.

Time to build: the API should be OpenAI-compatible, so your code should work after switching the endpoint, but you need time to find out which model works best: ~7 days of trial and error.

Token limit: an A100 GPU instance can process ~380 tokens per second (Llama-ish), i.e. 22.8k / min, 984 mil / month.
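
The same back-of-envelope numbers as a quick script (all figures assumed, as above):

```python
# Back-of-envelope self-hosting numbers from the entry above.
gpu_monthly = 1500                 # US$ / month, A100 instance (assumed)
cvs = 20 * 100                     # 20 job openings x 100 CVs
print(gpu_monthly / cvs)           # 0.75 -> $0.75 per CV

tps = 380                          # tokens/sec, Llama-ish on one A100
print(tps * 60)                    # 22800 -> 22.8k tokens/min
print(tps * 60 * 60 * 24 * 30)     # 984960000 -> ~984 mil tokens/month
```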

3-way comparison:

|  | OpenAI GPT3.5 API | Amazon Bedrock serverless | Runpod self-hosted |
| --- | --- | --- | --- |
| Input token limit | 40k TPM (tier-0), 60k TPM (tier-1) | 300k TPM | 22.8k TPM |
| (a) $ per 1k tokens (assume max demand) | $0.001 (in addition to any monthly tier payment to lift limits) | $0.00075 | $0.00152 |
| (b) Heavy lifting (HA, scaling, deploy) | done | done | you do |
| (c) Go-to-market | fast | slightly slower | slower |
| Overall cost (a+b+c) | cheap | slightly cheaper | very expensive |
| Model choices | restrictive but state-of-the-art | restrictive, open-source | plenty, open-source |
| Privacy | none (data at vendor) | a bit (data in your cloud) | strong |
| Flexibility | custom GPT fine-tuning feature | a lot more allowed, including fine-tuning | full |

OpenAI -> Bedrock -> Self-hosted

The OpenAI API is probably a lot cheaper, a lot faster to go to market, less risky financially, and simply better in most cases. However, a local LLM gives you a lot more control over the data flow and privacy, which I think is the weakest aspect of any external API.

New open-source models to watch out for:

  • Qwen 72B (and 1.8B) - 32K context, trained on 3T tokens, <100M-MAU commercial license, strong benchmark performance
  • DeepSeek LLM 67B - 4K context, 2T tokens, Apache 2.0 license, strong on code (although DeepSeek Coder 33B benches better on code)
  • Also recently released: Yi 34B (with a 100B rumored soon), XVERSE-65B, Aquila2-70B, and Yuan 2.0-102B; interestingly, all coming out of China. Since ChatGPT is not allowed in China, there is a huge opportunity to build local LLMs.
  • OpenChat 3.5 is the first 7B model that achieves results comparable to the March 2023 version of ChatGPT [1]. Only an 8k context window, but personally I've been very impressed with it so far. On the Chatbot Arena leaderboard it ranks above Llama-2-70b-chat.

6 December 2023 -

  • Tool to fine-tune models:
  • Start with evaluation first: One lesson from ML of old that I don’t think has been adopted enough among the prompt engineering wizards: you should always build a good evaluation before starting your experiments.
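
A minimal sketch of "evaluation first": a fixed test set scored before any prompt tweaking, so experiments stay comparable. The cases are made up, and llm is whatever model call you are experimenting with:

```python
# Toy eval harness: score the model on fixed cases before experimenting.
test_cases = [
    ("What is the capital of France?", "Paris"),
    ("2 + 2 = ?", "4"),
]

def evaluate(llm) -> float:
    hits = sum(expected.lower() in llm(q).lower()
               for q, expected in test_cases)
    return hits / len(test_cases)   # fraction of answers hitting the target
```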

Google released Gemini AI (integrable on GCP) to compete with ChatGPT. They claimed Gemini Ultra beats GPT4 (proven false).

Jeff: Google is really late to the scene.

8 December 2023 - Attended AI barcamp in Nanyang Poly

  • A designer shared his workflow for delivering a logo design with AI within a week (it usually takes longer), powered by Midjourney. The customer-interaction part cannot be automated, but Midjourney can be used to generate lots of ideas. The designer still needs to drive the creative workflow.
  • The elderly-care industry has a lot of problems potentially solvable with AI
    • Wearable devices that track the elderly's health
    • An AI companion that chats with the elderly, takes care of their emotions, and reminds them of things to do
  • AI replacing human work. A person mentioned that an AI conversation agent capable of processing and replying to complex queries has already been developed (but I did not ask which one). They also mentioned the Alexa API, which can now engage in conversation.

15 December 2023 - A look into the tooling for RAG AI

  1. LangChain and LlamaIndex are believed to be bloated.
  2. Haystack is great for prototyping.
  3. Use pgvector or Milvus for the vector DB if latency is not a concern.
  4. Otherwise use FAISS or its derivatives.
  5. Llamafile seems like an alternative to text-generation-webui for deploying LLMs to production.

23 December 2023 - Apple released Ferret, a multimodal model optimized for Apple hardware. They've been consistently evolving their hardware + software AI stack.

27 December 2023 - The New York Times, a major news site, sues OpenAI and Microsoft for copyright infringement.

Jeff: The result of this suit will determine whether using copyrighted material for AI training constitutes copyright infringement.

29 January 2024 - Meta releases Code Llama 70B, a new, more performant version of its LLM for code generation.

AI coding tools:

14 February 2024 - According to benchmark comparisons, Gemini Ultra has a slight edge over GPT-4 in reasoning tasks, beating it in three out of four categories in the 2024 reasoning benchmarks; however, the lead is small.

Gemini Ultra was designed and optimized for multimodal tasks involving text, audio, images and video. GPT-4 can handle some multimodal inputs like text and images, but its strengths are more in language modeling. Gemini Ultra thus has an advantage when it comes to tasks involving multiple media types.

15 February 2024 - OpenAI releases Sora, an AI model that can create realistic and imaginative scenes from text instructions. It's touted as a milestone because generative models previously produced only text, images, and voice.

26 February 2024 - Microsoft partners with Mistral. It's the company's second such partnership, the first being with OpenAI.

27 February 2024 - Microsoft releases a 1-bit LLM paper. A 1-bit model allows for higher efficiency at a small cost in performance. Microsoft's 1-bit LLM technology is touted as groundbreaking and set to redefine the landscape of language models.
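
For intuition, a toy sketch of the core trick: quantizing weights to {-1, 0, +1} with an absmean scale, as the BitNet b1.58 paper describes (strictly ternary, so ~1.58 bits per weight). This is an illustration, not the paper's actual kernel:

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize weights to {-1, 0, +1} with a per-tensor absmean scale."""
    scale = np.abs(w).mean() + 1e-8          # absmean scale (eps for safety)
    q = np.clip(np.round(w / scale), -1, 1)  # ternary weight values
    return q, scale                          # w is approximated by q * scale

w = np.random.randn(4, 4) * 0.1
q, s = ternary_quantize(w)
print(q)                         # entries are only -1, 0, or 1
print(np.abs(w - q * s).mean())  # mean reconstruction error
```

With weights restricted to {-1, 0, +1}, matrix multiplication reduces to additions and subtractions, which is where the efficiency gain comes from.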