Thread: Intermittent Network Issue on some of our servers

  1. #1

    Intermittent Network Issue on some of our servers

    Hello All,


    We are facing network issues with some of our servers resulting in intermittent issues with websites and emails. Our network team is working on getting this fixed as soon as possible.

    We will keep this thread updated with the latest developments. We apologize for any inconvenience this has caused.
    Last edited by Gatorsonu; 09-10-2016 at 10:06 AM.
    Sonu Singh
    HostGator India

  2. #2
    All sites have been down for an hour now! We are waiting for them to come back up!
    ----------------------------
    WordPress Expert + PHP Programmer + Gamer.

  3. #3
    Hello Mayur,

    Our network team is treating this as a priority and is working to fix it. We will update this thread as more details become available.

    In the meantime, we appreciate your patience.
    Sonu Singh
    HostGator India

  4. #4
    Hello All,

    We have been able to restore the network connectivity and the servers seem to be working fine now.

    We are monitoring for any further issues and will keep you updated.
    Sonu Singh
    HostGator India

  5. #5
    Hello All,

    Our Networking Team has been able to correct the connectivity issues. Please reach out to our Customer Support Team if you experience any additional interruptions going forward.
    Sonu Singh
    HostGator India

  6. #6
    RCA: GPX Downtime

    HISTORY
    10 AUGUST 2016

    We upgraded Junos from 11.4R4.4 to 13.3R9 about 30 days before the incident, as version 11.4 had reached end of life (EOL).


    PRECAP
    09:00 GMT 09 SEPTEMBER 2016 - RTR2 Failure

    We received alerts showing that all the links had gone down. On investigation we found hardware alarms. We tried swapping and resetting the MIC (Modular Interface Card), but it didn't help. JTAC (the Juniper Technical Assistance Center) identified it as a hardware issue and placed an RMA (Return Material Authorization). There was no downtime associated with this hardware failure, as all the traffic was automatically switched over to RTR1 via VRRP.
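
    The VRRP failover described above can be illustrated with a small sketch. It is a simplified model, not the actual router configuration: the priority values and the `healthy` flag are hypothetical, it assumes RTR2 held the higher priority before it failed, and real VRRP also relies on advertisement timers and breaks priority ties by IP address. It only shows that the healthy router with the highest priority takes over as master.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    priority: int          # hypothetical value; the report gives no priorities
    healthy: bool = True   # hypothetical stand-in for "still sending advertisements"


def elect_master(routers: List[Router]) -> Optional[Router]:
    """Return the healthy router with the highest priority, or None if none is healthy."""
    candidates = [r for r in routers if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.priority)


rtr1 = Router("RTR1", priority=100)
rtr2 = Router("RTR2", priority=200)
routers = [rtr1, rtr2]

print(elect_master(routers).name)   # RTR2 (assumed master before the failure)

rtr2.healthy = False                # 9 September: RTR2 fails
print(elect_master(routers).name)   # RTR1 takes over, so no downtime

rtr1.healthy = False                # 10 September: RTR1 fails as well
print(elect_master(routers))        # None: no router left to carry traffic
```

    With only two routers in the group, the second failure leaves no candidate at all, which matches the outage on the 10th.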



    ISSUE
    03:30 GMT - 10th September 2016 - RTR1 Failure

    The next day, we received alerts confirming that RTR1 had also gone down, affecting all the services hosted out of the GPX India data center. On troubleshooting, we found that the failure was similar to the incident on 9th September 2016. Reseating the MIC and rebooting the router did not help.


    TEMPORARY SOLUTION
    05:30 GMT - 10th September 2016

    We started working on a temporary solution, as a hardware replacement would not be available until Tuesday 13th September 2016. We obtained a few advanced licenses from Juniper for two of the EX4200 switches (which act as an aggregation layer), moved the ISP links to those switches, and configured the BGP sessions to bring the DC online.

    The alerts gradually cleared, and services from the GPX India data center started working around 05:00 GMT.


    HARDWARE FAILURE RCA
    08:30 GMT - 10th September 2016

    Two routers failing back to back made us doubt that this was really a hardware failure, so we asked senior JTAC engineers to investigate. They confirmed that the issue was caused by the following:

    When the Juniper routers were delivered, they were delivered with the MIC cards in a restricted slot.

    After the upgrade, the license for these slots changed from an honour system to a restrictive one in the new Junos release.

    This caused Junos to disable those interfaces after 30 days.
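
    The 30-day window lines up with the dates in this report, which a quick date calculation shows. This is only a sketch: it assumes the upgrade took place on the 10 August 2016 date given under HISTORY and that the 30 days are counted in whole days from that date. The variable and function names are illustrative.

```python
from datetime import date, datetime, timedelta

UPGRADE_DATE = date(2016, 8, 10)    # assumed from the HISTORY section
GRACE_PERIOD = timedelta(days=30)   # "after 30 days", per the JTAC finding


def grace_period_end(upgrade: date, grace: timedelta = GRACE_PERIOD) -> date:
    """Return the date on which the restricted interfaces would be disabled."""
    return upgrade + grace


expiry = grace_period_end(UPGRADE_DATE)
print(expiry)                       # 2016-09-09, the day RTR2 failed

# Failure times as stated in the report (GMT).
rtr2_failure = datetime(2016, 9, 9, 9, 0)
rtr1_failure = datetime(2016, 9, 10, 3, 30)
gap = rtr1_failure - rtr2_failure
print(gap)                          # 18:30:00 between the two failures
```

    Both routers failing within a day of each other fits two licenses expiring on nearly the same schedule better than two independent hardware faults.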


    ROUTERS ONLINE
    01:30 GMT - 12th September 2016

    Since the root cause had been identified and the routers had no stability issues, we decided to move all the traffic back to the routers at around 01:30 GMT on 12th September 2016 (the time when we have the least traffic). This resulted in a 2-minute downtime due to route convergence in internet routers.


    CONCLUSION

    We are still waiting for answers from JTAC about some critical business assumptions and about actions that could have helped resolve this incident faster.

