Home / Blog / Tutorial

From first click to prompt output in 1m38s - Running Llama2 in Codesphere

Published August 18, 2023

Learn how to get your very own ChatGPT clone up and running in under two minutes. In this tutorial we show you how to run Llama2 inside of Codesphere.


Simon Pfeiffer

Head of Product @ Codesphere

Simon’s responsibilities cover our product marketing efforts & the roadmap. He is a former green tech founder & IT consultant at KPMG. He shares trends & insights from building Codesphere.


It started out of pure curiosity: how difficult can it be to run your own LLM inside a Codesphere workspace? Will it even run, and if so, will inference on our CPUs be fast enough? The experiment turned into one of the most exciting projects I've worked on in recent weeks.

Unlike OpenAI with its GPT models, Meta has open-sourced its suite of large language models, Llama 1 and Llama 2, alongside pre-trained chat versions. In independent performance tests these models land somewhere between GPT-3.5 and GPT-4, and they actually respond a bit faster than GPT-4.

These models come in different sizes, based on their number of parameters. Inference for LLMs is typically run on GPUs rather than CPUs, because the computation is very memory-intensive and GPUs have a clear edge there. GPU servers, however, are still very expensive and not as widely available as CPU servers. Codesphere is planning to offer shared (cheaper) and dedicated GPU plans in the near future, but as of today we only offer them to customers who pre-order early access.
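To see why memory is the bottleneck, here is a rough back-of-envelope sketch: for every generated token, essentially the whole model has to be streamed from memory, so the token rate is bounded by memory bandwidth divided by model size. Both numbers below are illustrative assumptions for the sake of the example, not measurements:

```shell
# Illustrative upper bound on tokens/second for memory-bandwidth-bound inference.
MODEL_SIZE_GB=5        # assumed size of a quantized 7B model
MEM_BANDWIDTH_GBS=40   # assumed CPU memory bandwidth
echo "$(( MEM_BANDWIDTH_GBS / MODEL_SIZE_GB )) tokens/s upper bound"
```

A GPU with, say, ten times the memory bandwidth raises that ceiling by the same factor, which is the edge referred to above.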

So today we are going to test whether we can run the smaller model (7 billion parameters) on a CPU-based server inside Codesphere. Since we know it will be challenging to get a smooth response, we are going with our Pro plan, which provides 8 state-of-the-art vCPUs, 16 GB RAM, and 100 GB of storage.

Getting Llama 2 running on Codesphere is actually very easy, thanks to the amazing open-source community providing C++-based wrappers (llama.cpp) and to Hugging Face hosting pre-compiled, compressed model versions for download.

It is actually so easy that I decided to do a timed run: from clicking the create-new-workspace button to the first chat response took 1 minute and 38 seconds. It really blows my mind. Let's take a look at how it's done inside Codesphere. If you still need to create a Codesphere account, now is as good a time as any. If your machine is strong enough, this tutorial will also work locally (at least on Linux and macOS, with small adjustments).


Step 1: Create a workspace from the Llama.cpp repository

Sign in to your Codesphere account, click the Create new workspace button at the top right, and paste this into the repo search bar at the top:


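The pasted repository URL is not reproduced here; based on the section heading it is the llama.cpp project, which lives at the following GitHub address (paste it into the search bar, or clone it with git if you are following along locally):

```
https://github.com/ggerganov/llama.cpp
```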
Next, provide a workspace name, select the Pro plan, and hit the Start coding button. This plan costs $80/month in the always-on production deployment mode, or $8/month in the standby-when-unused deployment mode. We know $80/month sounds like a lot, but considering that renting a GPU server typically costs more than $1,000/month, it might even be a bargain.

Step 2: Compile the code

Open up a terminal and type:

make

This command compiles the C++ code into a Linux executable. The llama.cpp repository contains a Makefile that tells the compiler what to do.
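If the build feels slow, make can compile in parallel across all available cores; this is standard make behavior, not anything specific to llama.cpp (nproc reports the core count on Linux):

```shell
# Parallel build across all available CPU cores:
make -j"$(nproc)"
```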

Step 3: Download the model

First, type cd models in the terminal to navigate to the folder where llama.cpp expects to find the model binaries. A wide variety of versions is available via Hugging Face; as mentioned, we are picking the 7B-parameter size and opting for the pre-trained chat version.

Even for this specification there are roughly ten different flavours available; pick the one that suits your use case best. We found this repo to contain good explanations alongside the models: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML

In the models directory, run this command, replacing the model name with the flavour that suits you best:

wget https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q5_K_M.gguf

Step 4: Run your first query

Now we can ask our very own chatbot its first question. Navigate back to the main directory with cd .. and run this command in the terminal:

make -j && ./main -m ./models/codellama-7b-instruct.Q5_K_M.gguf -p "Why are GPUs faster for running inference of LLMs" -n 512

It's going to take a few seconds to load the ~4 GB model into memory, but then you should see your chatbot typing an answer to your query in the terminal.

Once completed, it also prints the timings. The initial load can take up to 30 seconds, but subsequent runs take less than a second to start producing a response. The generation speed is not quite as fast as interacting with ChatGPT in the browser, but it still returns around 4 words per second, which is pretty good.
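To set expectations for longer answers, a quick sanity check. Treating a word as roughly one token keeps the arithmetic simple (real tokenizers produce somewhat more tokens than words, so this is an approximation):

```shell
# Time for a full 512-token answer at roughly 4 tokens/second:
MAX_TOKENS=512
TOKENS_PER_SEC=4
echo "$(( MAX_TOKENS / TOKENS_PER_SEC )) seconds"
```

So a maximum-length response at the -n 512 setting used above takes about two minutes end to end.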

The images below show the timing of the initial run vs. subsequent runs.

[Image: initial run, taking ~30 seconds to start]
[Image: subsequent runs, starting immediately and answering at the same speed]

[Optional] Step 5: Getting the example chatbot web interface running on Codesphere

The llama.cpp repository comes with a simple web interface example. This provides an experience closer to what you might be used to from ChatGPT.

Navigate to the CI pipeline tab and click the Define CI Pipeline button. For the prepare stage, enter make as the command.

And for the run stage, enter this command, making sure the model name points to the binary of the version you downloaded:

./server -m ./models/codellama-7b-instruct.Q5_K_M.gguf -c 2048 --port 3000 --host 0.0.0.0

We need to set the port to 3000 and the host to 0.0.0.0 in order to expose the frontend in Codesphere.

Now run your prepare stage. It won't do anything if you already ran make via the terminal, but it might be needed after workspace restarts.

Next, run your run stage and click the open deployment icon in the top right corner. Now you, and anyone you share the URL with, can chat with your self-hosted ChatGPT clone 😎
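Besides the web UI, the server example also exposes an HTTP API, so you can script requests against your deployment. The endpoint path and field names below are taken from llama.cpp's server example and may change between versions, so treat this as a sketch rather than a guaranteed contract:

```json
{ "prompt": "Why are GPUs faster for running inference of LLMs?", "n_predict": 128 }
```

Saved as a (hypothetical) request.json, such a body could be sent with curl -s http://localhost:3000/completion -d @request.json, or against your workspace's deployment URL from outside.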

Let us know what you think about this! Also feel free to reach out to us if you are interested in getting early access to our GPU plans.

Happy Coding!

