A Curious Case of Amazon Cognito HTTP 500 errors
Our team was assisting a large client with the migration of on-prem applications to AWS. We chose to place these applications behind an Amazon Application Load Balancer (ALB), which was integrated with an Amazon Cognito User Pool that used Azure Directory Services as a SAML Federated Identity Provider.
What made this architecture pattern different from the most popular use cases was the use of a private VPC without an Internet Gateway attached. As a result, all traffic to the Internet was routed through an Amazon VPC Transit Gateway and the Enterprise network fabric.
Also, Amazon Cognito does not make use of VPC endpoints at this time and communications between users' browsers and Cognito are routed over the Internet, while ALB access remains private.
After successful authentication through Azure Directory Services and redirection back to ALB, an HTTP 500 error would be returned without any meaningful details.
As part of the troubleshooting process, I opened a support case with AWS and they pointed out that this error typically indicates communication issues between ALB and public endpoints for Amazon Cognito.
Amazon VPC Flow Logs proved that there were successful TCP packet exchanges between ALB and Cognito endpoints, so the firewall ports were open.
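The flow log check described above can be sketched roughly as follows. This is a minimal illustration, assuming the logs use the default (version 2) space-separated record format; the sample records, addresses, and the helper names `parse_record` and `accepted_https` are made up for this post and do not come from the client's environment.

```python
from typing import Iterable, List, NamedTuple, Optional, Set

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


class FlowRecord(NamedTuple):
    srcaddr: str
    dstaddr: str
    dstport: int
    protocol: int
    action: str


def parse_record(line: str) -> Optional[FlowRecord]:
    """Parse one default-format record; return None if it carries no flow data."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    row = dict(zip(FIELDS, parts))
    if row["log_status"] != "OK":
        # NODATA and SKIPDATA records use "-" placeholders for the flow fields.
        return None
    return FlowRecord(
        row["srcaddr"], row["dstaddr"],
        int(row["dstport"]), int(row["protocol"]), row["action"],
    )


def accepted_https(lines: Iterable[str], source_ips: Set[str]) -> List[FlowRecord]:
    """Keep accepted TCP (protocol 6) flows to port 443 that start at the given IPs."""
    matches = []
    for line in lines:
        rec = parse_record(line)
        if (rec and rec.srcaddr in source_ips and rec.dstport == 443
                and rec.protocol == 6 and rec.action == "ACCEPT"):
            matches.append(rec)
    return matches


# Hypothetical records: 10.0.1.25 stands in for an ALB network interface,
# 203.0.113.x (a documentation range) for the Cognito endpoint addresses.
SAMPLE = [
    "2 123456789012 eni-0123456789abcdef0 10.0.1.25 203.0.113.10 49152 443 6 12 5120 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0123456789abcdef0 10.0.1.25 203.0.113.11 49153 443 6 1 60 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0123456789abcdef0 - - - - - - - 1700000000 1700000060 - NODATA",
]

if __name__ == "__main__":
    for rec in accepted_https(SAMPLE, {"10.0.1.25"}):
        print(rec)
```

Keep in mind that flow logs describe traffic at the IP and TCP level only. An ACCEPT record shows that packets were allowed, but it says nothing about which device actually terminated the TLS session.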
There were no errors when I ran curl against the Amazon Cognito endpoint from Amazon CloudShell or over the Internet.
Note that the screenshot above was taken in a lab environment as an illustration for this blog post and does not represent the client's environment.
Running the same curl command from a Linux instance in one of the VPC subnets that hosted the ALB network interfaces told a different story: curl now required the "-k" option to ignore certificate errors, and the "issuer" section of the server certificate showed a different issuer.
$ curl https://asail-pilot.auth.us-east-1.amazoncognito.com -vkI
To protect confidential information, I cannot provide screenshots from the client's environment or share any details of the Enterprise choking device they used.
So the issue was most likely caused by an Enterprise Web proxy on the network path from the VPC-bound ALB to the public Cognito endpoints.
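The curl comparison above can also be scripted with Python's standard library. The following is a rough sketch, not a definitive check: `probe_tls` and `issuer_field` are hypothetical helper names, and it assumes that the system trust store of the machine running it does not include the proxy's certificate authority, so that the script fails verification in the same way a strict client would.

```python
import socket
import ssl
from typing import Tuple


def issuer_field(cert: dict, field: str = "organizationName") -> str:
    """Read one issuer attribute from a dict shaped like SSLSocket.getpeercert()."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == field:
                return value
    return ""


def probe_tls(host: str, port: int = 443, timeout: float = 5.0) -> Tuple[bool, str]:
    """Attempt a fully verified TLS handshake.

    Returns (True, issuer organization) when the certificate chain verifies,
    or (False, verification error message) when it does not. Connection
    errors such as timeouts are left to propagate to the caller.
    """
    context = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, issuer_field(tls.getpeercert())
    except ssl.SSLCertVerificationError as exc:
        return False, exc.verify_message


if __name__ == "__main__":
    # The lab domain used in the curl command in this post.
    verified, detail = probe_tls("asail-pilot.auth.us-east-1.amazoncognito.com")
    if verified:
        print(f"Certificate verified, issuer organization: {detail}")
    else:
        print(f"Certificate verification failed: {detail}")
```

Running it once from CloudShell and once from an instance in the ALB subnets, then comparing the two results, mirrors the manual test. A verification failure, or an unexpected issuer organization, seen only from inside the VPC would point to something re-signing the traffic along the way.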
AWS support was able to temporarily enable HTTP packet capture for traffic between ALB and Cognito. A few lessons learned:
You have to explicitly authorize AWS Support to temporarily enable the network capture, which is limited to your troubleshooting session, specific destination and source addresses (ALB network interfaces and Cognito endpoint IP addresses assigned to your Cognito or custom domain), and captures only a portion of the packet.
The capture included distinct HTTP headers returned by the Enterprise Web Proxy, which was a clear giveaway.
Having access to a Business or Enterprise Support Plan with AWS saves you a lot of time, so make sure you have one!
Armed with the evidence above, I went back to the firewall team, and they finally found and fixed a rule that was incorrectly routing VPC requests to the same Web proxy that would normally be used for employees browsing the Internet. Employees had no issues because their browsers were configured to trust proxy-issued certificates, which of course is not possible for ALB.
Happy Cloud Sailing!