Deploy LLM using Sagemaker and Langchain

893 views · Asked by akshat garg · 2 answers

I am trying to deploy a generative AI solution built with LangChain (with an LLM at its core, obviously) on SageMaker. The code is not just an inference script but an inference pipeline, and the challenge is that the pipeline calls an LLM. How can I achieve this? I also want to add streaming.
LLMs are huge, running to hundreds of GB, so it is better to deploy the LLM separately. Since we are working in AWS, a SageMaker endpoint makes sense: your app (built with LangChain) should call that endpoint through LangChain's SageMaker endpoint integration and consume the predictions. The SageMaker endpoint cannot be a plain endpoint, though, because large LLMs need model-optimization strategies and a tight fit between hardware and software. This is what SageMaker's Large Model Inference containers provide: they bundle DJL Serving + model-optimization frameworks + the LLM (complete list here --> https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers). Don't deploy an LLM without optimization. But before taking this path, do check the SageMaker JumpStart model list and Bedrock (they will save you a lot of time).
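For the LangChain side, a minimal sketch of consuming such an endpoint via LangChain's SagemakerEndpoint integration could look like the following. The endpoint name and the JSON request/response shape are assumptions; adjust the content handler to whatever schema your serving container actually uses.

```python
# Minimal sketch: calling a SageMaker-hosted LLM from LangChain.
# Assumes an endpoint named "my-llm-endpoint" (hypothetical) that accepts and
# returns JSON in the shape shown below -- adapt the content handler to the
# actual container's request/response schema.
import json

from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Payload shape depends on the serving container (e.g. DJL/LMI).
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Response schema is container-specific; this assumes a list of generations.
        response = json.loads(output.read().decode("utf-8"))
        return response[0]["generated_text"]


llm = SagemakerEndpoint(
    endpoint_name="my-llm-endpoint",   # hypothetical endpoint name
    region_name="us-east-1",
    model_kwargs={"max_new_tokens": 256, "temperature": 0.7},
    content_handler=ContentHandler(),
)

print(llm("Summarize the benefits of hosting LLMs behind a SageMaker endpoint."))
```

With this split, the rest of your LangChain pipeline (chains, retrievers, prompts) stays in application code and only the generation step goes over the network to the endpoint.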
The usual architecture pattern is to separate the LLM from the client code (LangChain): the LLM is hosted on a SageMaker endpoint, and the client runs in EC2, a container, or a Lambda function.
The advantages are much faster deployments (you'll update the app more often than the LLM) and the ability to scale each component out individually.
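As an illustration of that split, a hypothetical Lambda handler acting as the client might look like this; the endpoint name and payload schema are assumptions:

```python
# Hypothetical Lambda handler for the client side of this pattern: the
# application code calls the SageMaker endpoint via the runtime API and
# returns the generation.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")


def handler(event, context):
    prompt = event.get("prompt", "")
    response = runtime.invoke_endpoint(
        EndpointName="my-llm-endpoint",   # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),
    )
    result = json.loads(response["Body"].read().decode("utf-8"))
    return {"statusCode": 200, "body": json.dumps(result)}
```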
So a much easier path would be to deploy one of the LLMs available today in SageMaker JumpStart (open-source or commercial), and deploy the application separately.
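A rough sketch of that easier path, using the SageMaker Python SDK's JumpStart support, is shown below; the model id and instance type are assumptions, so pick whatever fits your use case and license.

```python
# Sketch: deploying an open-source LLM from SageMaker JumpStart.
# The model_id is an assumption -- choose one from the JumpStart catalogue.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b-f")  # assumed model id
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # adjust to the model's size
    accept_eula=True,               # gated models such as Llama 2 require accepting the EULA
)

# The resulting endpoint can then be wired into LangChain exactly as in the
# SagemakerEndpoint example in the first answer.
print(predictor.endpoint_name)
```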
If you have good reasons to need full control of the LLM, then you can try to build on this Llama 2 / SageMaker example (container, etc.).
Finally, if you want total control, you can build it all on top of your own custom Docker image.
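If you do go the custom route, a sketch of deploying your own artifact with a Large Model Inference (DJL Serving) image via the SageMaker Python SDK might look like the following; the image URI, S3 model location, and instance type are placeholders, not working values.

```python
# Rough sketch: deploying an LLM with full control over the serving container,
# as an alternative to JumpStart. Image URI, model artifact, and instance type
# are placeholders -- substitute an LMI/DJL image from the deep-learning-containers
# list and your own model data.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # or pass an explicit IAM role ARN

model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/djl-inference:<tag>",  # placeholder
    model_data="s3://my-bucket/llm/model.tar.gz",  # hypothetical artifact with serving.properties
    role=role,
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",   # multi-GPU instance; size to the model
    endpoint_name="my-llm-endpoint",  # matches the name used by the LangChain client
)
```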