Most recommendation systems run entirely on the backend. User interactions are sent to servers, models generate recommendations, and results are returned to the client.
Recently, I explored a different approach: running recommendation inference directly on an iPhone using a Two-Tower model and Core ML.
The goal wasn't to replace backend recommendation systems, but to understand the feasibility, trade-offs, and developer experience of on-device personalization.
Here's what I learned.
Why On-Device Inference?
A traditional recommendation flow looks like this:
User Action
↓
Backend
↓
ML Model
↓
Recommendations
↓
Mobile App
This approach works well, but comes with a few challenges:
- Network latency
- Backend inference costs
- Privacy concerns
- Offline limitations
With on-device inference:
User Features
↓
Core ML Model
↓
Recommendation Embedding
↓
Local Ranking
the recommendation logic executes directly on the device.
Benefits include:
- Near-instant inference
- Reduced backend dependency
- Better privacy
- Offline recommendation capabilities
Understanding the Two-Tower Architecture
Before implementation, I spent time understanding why Two-Tower architectures are widely used in recommendation systems.
The model consists of:
User Tower
Generates a user embedding from features such as:
- User ID
- Purchase history
- Preferences
- Behavioral signals
User Features
↓
User Tower
↓
User Embedding
Item Tower
Generates embeddings for products.
Product Features
↓
Item Tower
↓
Product Embedding
Matching
Recommendations are generated by comparing embeddings.
Similarity(
User Embedding,
Product Embedding
)
Products with higher similarity scores are ranked higher.
This architecture is particularly attractive because item embeddings can be precomputed and stored, leaving only user embedding generation to happen on-device.
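The matching step can be sketched in a few lines. This is a minimal illustration in Python, assuming dot-product similarity (a given model may have been trained with cosine similarity or another metric) and a small in-memory table of precomputed item embeddings. The names `rank_items` and `item_embeddings` and all values are hypothetical; in the app the same logic would be written in Swift.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_items(user_embedding, item_embeddings, k=3):
    """Return the k item IDs most similar to the user embedding."""
    scored = ((dot(user_embedding, emb), item_id)
              for item_id, emb in item_embeddings.items())
    # nlargest avoids sorting the whole catalog when only the top k are needed.
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Hypothetical precomputed item embeddings (bundled with the app or downloaded).
item_embeddings = {
    "shoes": [0.9, 0.1],
    "hat": [0.2, 0.8],
    "socks": [0.7, 0.3],
}
user_embedding = [1.0, 0.0]  # in practice, the output of the user tower

print(rank_items(user_embedding, item_embeddings, k=2))  # ['shoes', 'socks']
```

Only `user_embedding` changes per request; the item table can be refreshed on whatever schedule the catalog requires.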
Converting the Model to Core ML
One of the first challenges was model conversion.
I initially assumed that converting a PyTorch model into Core ML would be straightforward.
Reality was slightly different.
The first issue I ran into was that the downloaded .pt file contained only the model weights (a state_dict), not the architecture.
For example:
import torch
type(torch.load("model.pt"))
returned:
collections.OrderedDict
which meant the model architecture had to be reconstructed before conversion.
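A quick way to tell what a .pt file holds is to look at the type of the loaded object. The helper below is a rough sketch, not part of PyTorch: `describe_checkpoint` and the parameter name in the example are hypothetical, and the check deliberately avoids importing torch so it can be tried anywhere.

```python
from collections import OrderedDict
from collections.abc import Mapping

def describe_checkpoint(obj):
    """Classify the object returned by torch.load()."""
    if isinstance(obj, Mapping):
        # Some training scripts nest the weights under a "state_dict" key.
        if isinstance(obj.get("state_dict"), Mapping):
            return "checkpoint dict wrapping a state_dict"
        # A plain state_dict maps parameter names to tensors.
        return "state_dict (weights only)"
    if callable(getattr(obj, "forward", None)):
        return "full model object"
    return "unknown"

# With a real file (requires torch):
#   import torch
#   loaded = torch.load("model.pt", map_location="cpu")
#   print(describe_checkpoint(loaded))

# Stand-in for a loaded state_dict; the key name is made up.
fake_state_dict = OrderedDict([("user_embedding.weight", [[0.0, 0.0]])])
print(describe_checkpoint(fake_state_dict))  # state_dict (weights only)
```

When the result is a state_dict, printing each key with its tensor shape is usually enough to work out the layer sizes needed to rebuild the architecture.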
After rebuilding the model and exporting it through Core ML Tools, I obtained:
TwoTower.mlpackage
which could be integrated directly into an iOS project.
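For reference, the rebuild-and-export step might look roughly like the sketch below. The `UserTower` layers, their sizes, and the function name `convert_user_tower` are assumptions made for illustration; the real architecture has to match the keys and shapes in the state_dict, otherwise `load_state_dict` raises an error. The input name `userId` is chosen to match the Swift call used later. The imports sit inside the function only so the file can be loaded on a machine without torch or coremltools installed.

```python
def convert_user_tower(state_dict_path, out_path="TwoTower.mlpackage",
                       num_users=1000, dim=32):
    """Rebuild a (hypothetical) user tower, load weights, export to Core ML."""
    import numpy as np
    import torch
    import coremltools as ct

    class UserTower(torch.nn.Module):
        # Assumed architecture: an ID embedding followed by one linear layer.
        def __init__(self):
            super().__init__()
            self.embedding = torch.nn.Embedding(num_users, dim)
            self.fc = torch.nn.Linear(dim, dim)

        def forward(self, user_id):
            return self.fc(self.embedding(user_id))

    model = UserTower()
    # Fails loudly if the rebuilt layers do not match the saved weights.
    model.load_state_dict(torch.load(state_dict_path, map_location="cpu"))
    model.eval()

    # Trace with an example input so the converter sees a concrete graph.
    example = torch.zeros(1, dtype=torch.int64)
    traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="userId", shape=example.shape, dtype=np.int32)],
        convert_to="mlprogram",  # ML Program format, saved as .mlpackage
    )
    mlmodel.save(out_path)
    return out_path
```

Calling `convert_user_tower("model.pt")` would write the package next to the script, assuming the weights fit the layers defined here.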
Integrating with Xcode
Importing the model into Xcode automatically generated Swift bindings.
One lesson I learned was not to assume generated class names.
I initially tried:
let model = UserTower()
which failed because Xcode names the generated class after the model package, not after the towers inside it.
The generated class was actually:
let model = TwoTower()
A small detail, but one that cost more debugging time than I'd like to admit.
Running Inference
Once integrated, inference became surprisingly simple.
let prediction = try model.prediction(
    userId: userId
)
The Core ML runtime decided where to run the model (CPU, GPU, or Neural Engine), so I didn't have to manage hardware acceleration myself.
The developer experience was significantly smoother than expected.
Performance Observations
The most impressive part was latency.
Since inference happened locally:
- No network request
- No API dependency
- No server round-trip
The user experience felt immediate.
For recommendation systems, even a few hundred milliseconds can impact engagement.
On-device inference effectively removes an entire network hop from the critical path.
Challenges I Encountered
Model Conversion
This was by far the biggest challenge.
Questions that frequently came up:
- Is the model TorchScript?
- Is it only a state dictionary?
- What are the expected input shapes?
- Which outputs are generated?
The conversion pipeline often requires more ML engineering knowledge than mobile engineering knowledge.
Input Feature Engineering
A recommendation model is only as useful as the features provided.
Preparing inputs consistently between training and inference environments is critical.
Even small mismatches can significantly affect recommendation quality.
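One way to catch such mismatches early is to treat the feature vocabulary as a versioned artifact and compare a fingerprint of it on both sides. The sketch below is an illustration in Python with made-up category names; `FeatureEncoder` and its methods are hypothetical, and the app would need an equivalent check in Swift.

```python
import hashlib
import json

class FeatureEncoder:
    """Maps raw categorical values to integer indices using a fixed vocabulary."""

    UNKNOWN = 0  # index reserved for values not seen during training

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)
        self.index = {value: i + 1 for i, value in enumerate(self.vocabulary)}

    def encode(self, value):
        return self.index.get(value, self.UNKNOWN)

    def fingerprint(self):
        """Order-sensitive hash of the vocabulary, to compare across environments."""
        payload = json.dumps(self.vocabulary).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

training = FeatureEncoder(["books", "electronics", "toys"])
shipped = FeatureEncoder(["books", "toys", "electronics"])  # same values, new order

print(training.encode("toys"), shipped.encode("toys"))  # 3 2
print(training.fingerprint() == shipped.fingerprint())  # False
```

Both encoders accept the same inputs without error, yet "toys" maps to a different index in each, which is the kind of silent mismatch that degrades recommendations without crashing anything.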
Debugging Core ML Models
Debugging application code is relatively straightforward.
Debugging machine learning outputs is harder.
When recommendations look wrong, the issue could be:
- Data preprocessing
- Feature encoding
- Model conversion
- Training quality
Finding the root cause requires a systematic approach.
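A useful first step is to rule out conversion by feeding the same input to the original model and the converted one and comparing the embeddings. The sketch below shows only the comparison; the function names, the sample vectors, and the tolerance of 1e-3 are assumptions, and a model converted to 16-bit floats may need a looser threshold.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two vectors."""
    if len(reference) != len(candidate):
        raise ValueError("embeddings have different lengths")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)

def check_parity(reference, candidate, tolerance=1e-3):
    """Return (ok, diff): ok is True when the outputs agree within tolerance."""
    diff = max_abs_diff(reference, candidate)
    return diff <= tolerance, diff

# Made-up outputs for one user from the original and the converted model.
pytorch_out = [0.12, -0.40, 0.33]
coreml_out = [0.1201, -0.3998, 0.3302]

ok, diff = check_parity(pytorch_out, coreml_out)
print(ok)  # True
```

If the embeddings agree, the problem is more likely in preprocessing, feature encoding, or training; if they do not, the conversion step is the place to look.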
What Surprised Me Most
Before this project, I assumed recommendation systems were too heavy for mobile devices.
The reality is that modern phones are extremely capable.
For many personalization use cases, running lightweight recommendation models directly on-device is entirely practical.
The challenge isn't necessarily inference performance.
The challenge is building the surrounding ecosystem:
- Feature pipelines
- Embedding storage
- Ranking strategies
- Model updates
- Experimentation frameworks
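As one small example of that surrounding work, embedding storage and model updates both benefit from a versioned file format, so the app can tell whether its item embeddings match the model it is running. The layout below (a three-integer header followed by 32-bit floats) is an assumed format invented for this sketch, not something Core ML prescribes.

```python
import struct

HEADER = struct.Struct("<III")  # version, item count, embedding dimension

def pack_embeddings(version, embeddings):
    """Serialize equal-length float vectors into one compact binary blob."""
    dim = len(embeddings[0]) if embeddings else 0
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    body = b"".join(struct.pack(f"<{dim}f", *e) for e in embeddings)
    return HEADER.pack(version, len(embeddings), dim) + body

def unpack_embeddings(blob):
    """Inverse of pack_embeddings: returns (version, list of vectors)."""
    version, count, dim = HEADER.unpack_from(blob, 0)
    row = struct.Struct(f"<{dim}f")
    rows = [list(row.unpack_from(blob, HEADER.size + i * row.size))
            for i in range(count)]
    return version, rows

# Values chosen to be exactly representable as 32-bit floats.
blob = pack_embeddings(7, [[0.5, -1.0], [0.25, 2.0]])
version, rows = unpack_embeddings(blob)
print(version, rows)  # 7 [[0.5, -1.0], [0.25, 2.0]]
```

Checking the version before ranking gives a simple guard against pairing a new user tower with item embeddings produced by an older item tower.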
Key Takeaways
- Two-Tower architectures are well-suited for on-device recommendation systems.
- Core ML integration is easier than expected once the model is correctly converted.
- Model conversion is often the hardest part of the workflow.
- On-device inference removes the network round-trip from the latency path.
- Privacy and offline capabilities become significant advantages.
- The surrounding recommendation infrastructure is usually more complex than the inference itself.
Final Thoughts
This project started as an exploration into mobile machine learning and quickly became a deeper lesson in recommendation systems.
As mobile hardware continues to improve, I expect more personalization workloads to move closer to the user.
Running recommendation inference directly on-device won't replace backend recommendation systems entirely, but it opens interesting possibilities for low-latency, privacy-preserving, and offline-first user experiences.
For mobile engineers interested in machine learning, Two-Tower models and Core ML are an excellent place to start.