By Göran Sandahl

Takeaways from AI Engineer World Fair, San Francisco 2024

We spent three days at the AI Engineer World Fair in San Francisco, between the 24th and 27th of June. We got to connect with hundreds of developers building with LLMs, talk about the what, how and why of their work, and of course present Opper, alongside meeting other vendors.

Simon Willison presenting at the AI Engineer World Fair

Main Takeaways

LLMs are Flawed

A common theme in many talks and discussions was the difficulty of building with LLMs. Of course, there is no generative AI conference without lots of talks about prompts, and that was the case here as well! It's clear that a major challenge for developers continues to be prompt engineering. Even the best prompts can't fully mitigate the long list of edge cases that arise once users start playing with a feature. The problem is that natural language allows for weak task descriptions that are open to interpretation by the models, resulting in varying outputs. Prompts can be expanded to be more detailed, but they remain subject to interpretation every time tokens are generated.

An interesting talk related to this topic came from the founder of Outlines, who described a method for controlling the token generation of open-source models at a low level (specifically at the level of logits) and making models follow an output schema. This not only helps the developer get the desired structured output but can also improve the model's effective intelligence by constraining it correctly. In fact, he showed data on how Mixtral 8x7B could beat GPT-4 on many tasks that involve reasoning. Interestingly, we have done our own tests on this approach with good results, which can be seen in our most recent blog post Controlling output of models with examples, where Opper/mistral-7b-instruct is in fact running with this library. Anthropic also had a workshop on Control Vectors as an approach to steer models at the level of token generation. Safe to say, quality and reliability remain an active research area.
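To make the mechanism concrete, here is a minimal, illustrative sketch of logit masking, the core idea behind constrained decoding. This is not the Outlines API; the `allowed_token_ids` set is assumed to come from whatever schema or grammar tracker decides which tokens are valid next.

```python
import math

def constrained_next_token(logits: list[float], allowed_token_ids: set[int]) -> int:
    """Pick the next token, but only among tokens the output schema allows.

    `logits` holds the model's raw score per vocabulary entry; in a real
    system `allowed_token_ids` would be derived from a JSON schema or grammar.
    """
    masked = [
        score if token_id in allowed_token_ids else -math.inf
        for token_id, score in enumerate(logits)
    ]
    # Greedy pick for simplicity; a real decoder would sample from the
    # renormalized distribution instead.
    return max(range(len(masked)), key=lambda i: masked[i])

# Toy 5-token vocabulary where only tokens 1 and 3 are valid at this point.
print(constrained_next_token([0.2, 1.1, 3.0, 0.9, -0.5], {1, 3}))  # prints 1
```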

Short, Simple Prompts

While we attended a couple of workshops on creative prompt engineering, the clear pattern from many of the success-related talks and hallway discussions was that the best prompts are small and not very creative. Longer prompts can end up working well, but they are also fragile, model-dependent, and hard to maintain.

The best talk on this topic was from Discord, where they described the steps they took in building their chatbot that today interacts with some 20M users. Their main lesson was to keep prompts simple and break them up into multiple steps to achieve quality. This talk contrasted with others, suggesting that there are divergent approaches to the problem.
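As a rough sketch of what "small prompts, multiple steps" can look like in practice (the `complete` helper and the prompts below are hypothetical placeholders, not anything shown in the Discord talk):

```python
def complete(prompt: str) -> str:
    """Hypothetical helper standing in for whichever LLM client you use."""
    raise NotImplementedError("wire this up to your model provider")

def answer_question(question: str, docs: str) -> str:
    # Step 1: a tiny prompt that only classifies the question.
    topic = complete(
        "Classify this question as 'billing', 'technical' or 'other'. "
        f"Answer with one word.\nQuestion: {question}"
    ).strip().lower()

    # Step 2: another small prompt that only answers, given the classification.
    return complete(
        f"You answer {topic} questions using only the context below.\n"
        f"Context:\n{docs}\n\nQuestion: {question}"
    )
```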

I would say that the spirit of keeping things short and simple aligns well with what we are seeing and what we are optimizing for at Opper. We advocate for building task-specific functions with small prompts, emphasizing structured input/output to help the model understand the task beyond what can be expressed in language alone.
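A generic sketch of what structured input/output can look like, using Pydantic for illustration; the schema, field names and prompt are made up, and this is not Opper's actual API:

```python
import json
from pydantic import BaseModel, ValidationError

class RoomFacts(BaseModel):
    """The output schema the model must conform to."""
    number_of_beds: int
    has_sea_view: bool

# The prompt stays tiny; the schema carries most of the task definition.
PROMPT = "Extract the room facts from the description as JSON."

def parse_model_output(raw: str) -> RoomFacts:
    """Validate the model's raw JSON against the schema, failing loudly."""
    try:
        return RoomFacts(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError) as exc:
        raise ValueError(f"Model output did not match the schema: {exc}") from exc

print(parse_model_output('{"number_of_beds": 2, "has_sea_view": true}'))
```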

Write Tests and Evals

While 2023 was the year of demos, 2024 is proving to be the year of putting features into production. There are largely two approaches: utilising more tooling or going back to basics. A clear trend is that LLM building is reverting to more classical software engineering practices, including having tests and observability in place across both experimentation and production. The recommendation was for tests to be very simple, such as regex pattern matching. Some tests are better than no tests.
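For illustration, a test in this spirit can be as small as the sketch below (the `generate_answer` function is a hypothetical stand-in for an LLM-backed call):

```python
import re

def generate_answer(question: str) -> str:
    """Hypothetical stand-in for an LLM-backed function."""
    return "Your order #4711 ships tomorrow."

def test_answer_mentions_order_id():
    # Cheap, deterministic check: the reply must echo a well-formed order id.
    answer = generate_answer("Where is my order #4711?")
    assert re.search(r"#\d{4,}", answer), f"no order id in: {answer!r}"

test_answer_mentions_order_id()  # or run under pytest
```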

GitHub also discussed evaluating the longer-term success of a task. Their example was to evaluate AI-generated code commits by checking whether the change survives multiple subsequent iterations of the codebase. This is a deep topic, and it reminded me of our earlier blog post Guiding output of models with examples, where we covered a simple regex-based test as part of getting a function to produce good output. We hope to expand on this topic in future posts.

Keep Humans in the Loop

Another heavily covered topic was keeping humans in the loop once features are in production. Two post-deployment practices stood out: having observability in place to look at data around model outputs, and integrating user feedback to quickly identify issues.

The runtime aspect of having LLM features in production is interesting. We feel we offer a good approach to this, with observability and tracing in place from the start, the ability to add metrics from user feedback, and the ability to curate datasets and examples for in-context learning, allowing humans to quickly implement optimizations.

Knowledge is a Graph

Neo4j had great workshops on the topic of knowledge graphs. They showed how traversing a knowledge graph aligns well with finding the best information for a given query. The benefit of a knowledge graph is that retrieval can easily be expanded to include related concepts, and the strength of those relationships is natively expressed in the number of connections between entities. Combining this with semantic search can be very powerful.
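To illustrate the expansion idea, here is a toy in-memory sketch (a real system would of course use a graph database such as Neo4j; the entities and relationships are made up):

```python
from collections import defaultdict

# Toy undirected knowledge graph: entity -> related entities.
EDGES: defaultdict[str, set[str]] = defaultdict(set)
for a, b in [("llm", "prompt"), ("llm", "embedding"),
             ("embedding", "semantic search"), ("prompt", "few-shot examples")]:
    EDGES[a].add(b)
    EDGES[b].add(a)

def expand(seed_entities: set[str], hops: int = 1) -> set[str]:
    """Grow the retrieved context by following relationships from the seeds."""
    frontier, seen = set(seed_entities), set(seed_entities)
    for _ in range(hops):
        frontier = {n for e in frontier for n in EDGES[e]} - seen
        seen |= frontier
    return seen

# Entities matched by semantic search get enriched with their 1-hop neighbours.
print(expand({"llm"}, hops=1))  # llm plus prompt and embedding
```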

As our early access users have experienced the need to implement things like PageRank and other methods to retrieve the best knowledge for answering questions, this is an interesting field that we will likely explore further.

Success is in UI/UX

A major pattern for LLM features is to package them as assistants that perform some form of knowledge retrieval. We saw great demos of various assistants, such as assistants for construction workers interacting with vast amounts of building documentation, research assistants for lab workers, and personal assistants integrating context from all of a user's apps. Many of these aren't limited to knowledge retrieval: they also render dynamic UIs and can trigger various actions (such as sending emails).

One of the takeaways here is that the key component of a great LLM-powered feature is not necessarily the LLM but the actual experience. We must not forget that features need to solve end-user problems, and that is the true magic. The challenge is much broader than the LLM.

Agents are the Next Frontier

While many successful features are some form of knowledge-powered assistant, the general belief is that agentic implementations are the breakthrough everyone is looking for. We are not quite at the stage where this is fully possible in a generalized manner, but with deliberate crafting and a well-defined task it is indeed doable today.

As we have built agents that do high-value tasks for us on a daily basis, we feel that with the right approach it is very doable. We hope to share more about this soon.
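As a hedged sketch of the kind of deliberately crafted loop we mean (the `complete` helper and the tool names are hypothetical placeholders, not our actual implementation):

```python
import json

# Hypothetical tool registry; names and implementations are placeholders.
TOOLS = {
    "search_docs": lambda query: f"(top documents for {query!r})",
    "send_email": lambda to, body: f"(email sent to {to})",
}

def complete(prompt: str) -> str:
    """Hypothetical helper standing in for whichever LLM client you use."""
    raise NotImplementedError("wire this up to your model provider")

def run_agent(task: str, max_steps: int = 5) -> str:
    """A small, explicit loop: ask the model for the next action as JSON,
    execute it, feed the result back, and stop when it declares it is done."""
    history = f"Task: {task}\n"
    for _ in range(max_steps):
        step = json.loads(complete(
            history + 'Reply as JSON: {"tool": ..., "args": {...}} '
                      'or {"final_answer": ...}'
        ))
        if "final_answer" in step:
            return step["final_answer"]
        result = TOOLS[step["tool"]](**step["args"])
        history += f"Used {step['tool']} -> {result}\n"
    return "Gave up after too many steps."
```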

Sparks of AGI

Last but not least, we saw live demos from both Anthropic and OpenAI. From OpenAI we got a first-hand live demo of their now famous voice assistant, and it was pretty mind-blowing. OpenAI is really spearheading consumer applications for generative AI.

For developers, however, we are unsure how this translates to the API and how better apps can be built with it. Time will tell!

Conclusion

All in all, we spent three great days in San Francisco where we got to hang out with a lot of other builders. And while San Francisco may be the birthplace of a lot of the underlying technology of LLMs, our conclusion is that European startups are doing very well on the actual application side of things. Many of the interesting projects and ambitious founders we got to connect with were from the UK, Sweden, Poland, and elsewhere.