Managing High Availability in PostgreSQL - Part III: Patroni

In our previous blog posts, we discussed the capabilities and functioning of PostgreSQL Automatic Failover (PAF) by ClusterLabs and Replication Manager (repmgr) by 2ndQuadrant. In this final post of the series, we review the last solution, Patroni by Zalando, and then compare all three so you can determine which high availability framework is best for your PostgreSQL hosting deployment.

Managing High Availability in PostgreSQL - Part I: PostgreSQL Automatic Failover
Managing High Availability in PostgreSQL - Part II: Replication Manager

Patroni for PostgreSQL

Patroni originated as a fork of Governor, a project from Compose. It is an open-source tool suite, written in Python, for managing high availability of PostgreSQL clusters. Instead of building its own consistency protocol, Patroni leverages the consistency model provided by a Distributed Configuration Store (DCS). It supports several DCS solutions: ZooKeeper, etcd, Consul, and Kubernetes.

Patroni handles the end-to-end setup of PostgreSQL HA clusters, including streaming replication. It supports various ways of creating a standby node, and works like a template that can be customized to your needs.

This feature-rich tool exposes its functionality through REST APIs and through a command-line utility called patronictl. It supports integration with HAProxy, which can use Patroni's health check APIs to handle load balancing.
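To make the health check idea concrete, here is a minimal Python sketch of what a load balancer probe does. It assumes the REST API answers `GET /` with HTTP 200 only on the node holding the leader lock and `GET /replica` with HTTP 200 only on a healthy replica, as described in the Patroni REST API documentation. The function names are our own and are not part of Patroni.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # A non-200 reply is still a valid answer: the node is simply
        # not in the role that was asked about.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def node_role(base_url, get_status=http_status):
    """Classify a node the way a load balancer health check would."""
    # Assumption: "/" returns 200 only on the primary.
    if get_status(base_url + "/") == 200:
        return "primary"
    # Assumption: "/replica" returns 200 only on a running replica.
    if get_status(base_url + "/replica") == 200:
        return "replica"
    return "unavailable"
```

Calling `node_role("http://10.0.0.1:8008")` would probe one node; the address is hypothetical and 8008 is assumed to be the REST API port. The `get_status` parameter exists only so the logic can be exercised without a live cluster.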

Patroni also supports event notification through callbacks, which are scripts triggered by certain actions. Its pause/resume functionality lets users perform maintenance actions on the cluster. Watchdog support makes the framework even more robust.

How it works

First, the PostgreSQL and Patroni binaries need to be installed. Once that is done, you also need to set up an HA DCS configuration. All the configuration needed to bootstrap the cluster is specified in a YAML configuration file, which Patroni uses for initialization. On the first node, Patroni initializes the database, obtains the leader lock from the DCS, and ensures the node runs as the master.

The next step is adding standby nodes, for which Patroni provides multiple options. By default, Patroni uses pg_basebackup to create a standby node; it also supports custom methods such as WAL-E, pgBackRest, and Barman. Patroni makes adding a standby node simple: it handles all the bootstrapping tasks and sets up streaming replication.

Once your cluster setup is complete, Patroni actively monitors the cluster and ensures it stays in a healthy state. The leader lock carries a ttl (default: 30 seconds), and the master node has to keep renewing the lock before it expires. When the master node fails to renew the leader lock, Patroni triggers an election, and the node that obtains the leader lock is elected as the new master.
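The lock-and-expiry behaviour can be illustrated with a toy model. The sketch below is not Patroni's implementation and does not talk to a real DCS: `ToyDCS` is a hypothetical in-memory class, and time is passed in explicitly so the sequence of events is easy to follow.

```python
class ToyDCS:
    """In-memory stand-in for a DCS leader key with a time-to-live."""

    def __init__(self, ttl=30):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def _expired(self, now):
        return self.holder is None or now >= self.expires_at

    def acquire(self, node, now):
        """Take the leader lock if it is free or has expired."""
        if self._expired(now):
            self.holder, self.expires_at = node, now + self.ttl
            return True
        return self.holder == node

    def renew(self, node, now):
        """Extend the lock; only the current, unexpired holder may do so."""
        if self.holder == node and not self._expired(now):
            self.expires_at = now + self.ttl
            return True
        return False


dcs = ToyDCS(ttl=30)
dcs.acquire("node1", now=0)    # node1 bootstraps and becomes the leader
dcs.renew("node1", now=10)     # lock is now valid until t=40
dcs.acquire("node2", now=20)   # refused: the lock is still held by node1
# node1 stops renewing (crash, reboot, ...), so the lock expires at t=40
dcs.acquire("node2", now=45)   # node2 obtains the lock and wins the election
```

A real DCS enforces the expiry and the atomic "take it only if free" step on the server side, which is exactly the part Patroni delegates instead of implementing its own consensus protocol.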

How does it handle the split-brain scenario?

In a distributed system, consensus plays an important role in determining consistency, and Patroni uses the DCS to attain consensus. Only the node that holds the leader lock can be the master, and the leader lock is obtained through the DCS. If the master node does not hold the leader lock, Patroni immediately demotes it to run as a standby. This way, at any point in time, only one master can be running in the system.
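The rule above boils down to a small decision function. The following Python sketch is our own simplification, not code taken from Patroni: `next_role` and its parameters are hypothetical names, and a node that cannot reach the DCS is treated as not holding the lock.

```python
def next_role(node, lock_holder, dcs_reachable=True):
    """Return the role a node should run in during the next cycle.

    lock_holder is the node name stored in the leader key, or None if
    the key is currently free.
    """
    if not dcs_reachable:
        # Without the DCS the node cannot prove it still holds the lock,
        # so it must not keep running as master.
        return "standby"
    if lock_holder == node:
        return "master"
    return "standby"


# Whoever the lock holder is, at most one node ends up as master.
nodes = ("node1", "node2", "node3")
roles = {n: next_role(n, lock_holder="node2") for n in nodes}
```

Because the decision depends only on who owns the single leader key, two nodes can never both evaluate to `"master"` for the same state of the DCS.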

Are there any setup requirements?

Patroni needs Python 2.7 or later.
The DCS and its specific Python module must be installed. For testing purposes, the DCS can be installed on the same nodes that run PostgreSQL. In production, however, the DCS must be installed on separate nodes.

The YAML configuration file must be present, with these high-level configuration sections:

Global/Universal
Includes configuration such as the name of the host (name), which must be unique within the cluster, the name of the cluster (scope), and the path for storing configuration in the DCS (namespace).

Log
Patroni-specific log settings, including level, format, file_num, file_size, etc.

Bootstrap configuration
This is the global configuration for the cluster that will be written to the DCS. These parameters can be changed through the Patroni APIs or directly in the DCS. The bootstrap configuration includes standby creation methods, initdb parameters, post-initialization scripts, etc. It also contains timeout configuration and parameters that decide the use of PostgreSQL features such as replication slots, synchronous mode, etc. This section is written into ///config of the given configuration store after the new cluster is initialized.

PostgreSQL
This section contains PostgreSQL-specific parameters such as authentication, directory paths for data, binaries, and configuration, the listen IP address, etc.

REST API
This section contains Patroni-specific configuration related to the REST APIs, such as the listen address, authentication, SSL, etc.

Consul
Settings specific to the Consul DCS.

Etcd
Settings specific to the Etcd DCS.

Exhibitor
Settings specific to the Exhibitor DCS.

Kubernetes
Settings specific to the Kubernetes DCS.

ZooKeeper
Settings specific to the ZooKeeper DCS.

Watchdog
Settings specific to the watchdog.
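To show how these sections nest, here is a rough skeleton written as a Python dictionary. The key names follow the Patroni configuration documentation as we understand it; every value (addresses, paths, user names, passwords) is a placeholder, and `missing_sections` is a hypothetical sanity check of our own, not Patroni's validation logic.

```python
# Every value below is a placeholder chosen for illustration only.
config = {
    "scope": "demo-cluster",   # cluster name
    "namespace": "/service/",  # path prefix for this cluster in the DCS
    "name": "node1",           # must be unique within the cluster
    "log": {"level": "INFO"},
    "restapi": {"listen": "0.0.0.0:8008", "connect_address": "10.0.0.1:8008"},
    "etcd": {"host": "10.0.0.10:2379"},
    "bootstrap": {
        # Cluster-wide settings that are written to the DCS at bootstrap.
        "dcs": {
            "ttl": 30,
            "loop_wait": 10,
            "retry_timeout": 10,
            "postgresql": {"use_pg_rewind": True},
        },
        "initdb": [{"encoding": "UTF8"}, "data-checksums"],
    },
    "postgresql": {
        "listen": "0.0.0.0:5432",
        "connect_address": "10.0.0.1:5432",
        "data_dir": "/var/lib/postgresql/data",
        "bin_dir": "/usr/lib/postgresql/bin",
        "authentication": {
            "replication": {"username": "replicator", "password": "change-me"},
            "superuser": {"username": "postgres", "password": "change-me"},
        },
    },
    "watchdog": {"mode": "automatic", "device": "/dev/watchdog"},
}

# Hypothetical check: which top-level sections does a config still lack?
REQUIRED = ("scope", "name", "restapi", "bootstrap", "postgresql")
DCS_SECTIONS = ("etcd", "consul", "zookeeper", "exhibitor", "kubernetes")


def missing_sections(cfg):
    """List required top-level keys that are absent from cfg."""
    missing = [key for key in REQUIRED if key not in cfg]
    if not any(key in cfg for key in DCS_SECTIONS):
        missing.append("dcs")  # no DCS section configured at all
    return missing
```

The real file is YAML; the dictionary only mirrors its nesting. Only one DCS section is normally needed, which is why the check accepts any of them.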

Patroni Pros

Patroni enables end-to-end setup of the cluster.
Supports REST APIs and HAProxy integration.
Supports event notifications through callback scripts triggered by certain actions.
Leverages the DCS for consensus.

Patroni Cons

Patroni will not detect a standby that is misconfigured with an unknown or non-existent node in its recovery configuration. The node is shown as a slave even if the standby is running without connecting to the master/cascading standby node.
The user needs to handle the setup, management, and upgrade of the DCS software.
Requires multiple ports to be open for component communication:
- REST API port for Patroni
- Minimum 2 ports for the DCS
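A quick way to verify that the required ports are reachable between nodes is a plain TCP connection test. The sketch below uses only the Python standard library. The port numbers are assumptions (8008 for the Patroni REST API, 2379 and 2380 for an etcd-based DCS) and should be replaced with whatever your configuration uses.

```python
import socket

# Assumed defaults; adjust to match your own configuration.
EXAMPLE_PORTS = {"patroni-restapi": 8008, "etcd-client": 2379, "etcd-peer": 2380}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(host, ports, timeout=1.0):
    """Return the labels of the ports on host that could not be reached."""
    return [label for label, port in ports.items()
            if not port_open(host, port, timeout)]
```

Running `unreachable("10.0.0.1", EXAMPLE_PORTS)` from another node (the address is hypothetical) returns an empty list when everything is reachable. A successful connection only shows that the port is open, not that the service behind it is healthy.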

High Availability Test Scenarios

We conducted a few tests on PostgreSQL HA management using Patroni. All of these tests were performed while the application was running and inserting data into the PostgreSQL database. The application was written using the PostgreSQL Java JDBC driver, leveraging its connection failover capability.

Standby Server Tests

Sl. No	Test Scenario	Observation
1	Kill the PostgreSQL process	Patroni brought the PostgreSQL process back to running state. There was no disruption of the writer application.
2	Stop the PostgreSQL process	Patroni brought the PostgreSQL process back to running state. There was no disruption of the writer application.
3	Reboot the server	Patroni needs to be started after reboot, unless configured to not start on reboot. Once Patroni was started, it started the PostgreSQL process and set up the standby configuration. There was no disruption of the writer application.
4	Stop the Patroni process	It did not stop the PostgreSQL process. patronictl list did not display this server. There was no disruption of the writer application. So, essentially, you need to monitor the health of the Patroni process, otherwise it will lead to issues down the line.

Master/Primary Server Tests

Sl. No	Test Scenario	Observation
1	Kill the PostgreSQL process	Patroni brought the PostgreSQL process back to running state. Patroni running on that node had the primary lock and hence an election was not triggered. There was downtime in the writer application.
2	Stop the PostgreSQL process and bring it back immediately after health check expiry	Patroni brought the PostgreSQL process back to running state. Patroni running on that node had the primary lock and hence an election was not triggered. There was downtime in the writer application.
3	Reboot the server	Failover happened and one of the standby servers was elected as the new master after obtaining the lock. When Patroni was started on the old master, it brought the old master back up, performed pg_rewind, and started following the new master. There was downtime in the writer application.
4	Stop/Kill the Patroni process	One of the standby servers acquired the DCS lock and became the master by promoting itself. The old master was still running, which led to a multi-master scenario. The application was still writing to the old master. Once Patroni was started on the old master, it rewound the old master (use_pg_rewind was set to true) to the new master timeline and LSN and started following the new master. As you can see above, it is very important to monitor the health of the Patroni process on the master. Failure to do so can lead to a multi-master scenario and potential data loss.

Network Isolation Tests

Sl. No	Test Scenario	Observation
1	Network-isolate the master server from other servers	DCS communication was blocked for the master node. PostgreSQL was demoted on the master server. A new master was elected in the majority partition. There was downtime in the writer application.
2	Network-isolate the standby server from other servers	DCS communication was blocked for the standby node. The PostgreSQL service was running, however, the node was not considered for elections. There was no disruption in the writer application.

What's the Best PostgreSQL HA Framework?

Patroni is a valuable tool for PostgreSQL database administrators (DBAs), as it performs end-to-end setup and monitoring of a PostgreSQL cluster. The flexibility of choosing the DCS and the standby creation method is an advantage to the end user, as they can choose the method they are comfortable with.

REST APIs, HAProxy integration, watchdog support, callbacks, and its feature-rich management make Patroni the best solution for PostgreSQL HA management.

PostgreSQL HA Framework Testing: PAF vs. repmgr vs. Patroni

Below is a comprehensive table detailing the results of all the tests we performed on all three frameworks: PostgreSQL Automatic Failover (PAF), Replication Manager (repmgr), and Patroni.

Standby Server Tests

Test Scenario	PostgreSQL Automatic Failover (PAF)	Replication Manager (repmgr)	Patroni
Kill the PostgreSQL process	Pacemaker brought the PostgreSQL process back to running state. There was no disruption of the writer application.	Standby server was marked as failed. Manual intervention was required to start the PostgreSQL process again. There was no disruption of the writer application.	Patroni brought the PostgreSQL process back to running state. There was no disruption of the writer application.
Stop the PostgreSQL process	Pacemaker brought the PostgreSQL process back to running state. There was no disruption of the writer application.	Standby server was marked as failed. Manual intervention was required to start the PostgreSQL process again. There was no disruption of the writer application.	Patroni brought the PostgreSQL process back to running state. There was no disruption of the writer application.
Reboot the server	Standby server was marked offline initially. Once the server came up after reboot, PostgreSQL was started by Pacemaker and the server was marked as online. If fencing had been enabled, the node would not have been added to the cluster automatically. There was no disruption of the writer application.	Standby server was marked as failed. Once the server came up after reboot, PostgreSQL was started manually and the server was marked as running. There was no disruption of the writer application.	Patroni needs to be started after reboot, unless configured to not start on reboot. Once Patroni was started, it started the PostgreSQL process and set up the standby configuration. There was no disruption of the writer application.
Stop the framework agent process	Agent: pacemaker. The PostgreSQL process was stopped and was marked offline. There was no disruption of the writer application.	Agent: repmgrd. The standby server will not be part of an automated failover situation. PostgreSQL service was found to be running. There was no disruption of the writer application.	Agent: patroni. It did not stop the PostgreSQL process. patronictl list did not display this server. There was no disruption of the writer application.

Master/Primary Server Tests

Test Scenario	PostgreSQL Automatic Failover (PAF)	Replication Manager (repmgr)	Patroni
Kill the PostgreSQL process	Pacemaker brought the PostgreSQL process back to running state. The primary recovered within the threshold time and hence an election was not triggered. There was downtime in the writer application.	repmgrd started health checks for the primary server connection on all standby servers for a fixed interval. When all retries failed, an election was triggered on all the standby servers. As a result of the election, the standby which had the latest received LSN got promoted. The standby servers which lost the election wait for the notification from the new master node and follow it once they receive the notification. Manual intervention was required to start the PostgreSQL process again. There was downtime in the writer application.	Patroni brought the PostgreSQL process back to running state. Patroni running on that node had the primary lock and hence an election was not triggered. There was downtime in the writer application.
Stop the PostgreSQL process and bring it back immediately after health check expiry	Pacemaker brought the PostgreSQL process back to running state. The primary recovered within the threshold time and hence an election was not triggered. There was downtime in the writer application.	repmgrd started health checks for the primary server connection on all standby servers for a fixed interval. When all the retries failed, an election was triggered on all the standby nodes. However, the newly elected master didn't notify the existing standby servers since the old master was back. The cluster was left in an indeterminate state and manual intervention was required. There was downtime in the writer application.	Patroni brought the PostgreSQL process back to running state. Patroni running on that node had the primary lock and hence an election was not triggered. There was downtime in the writer application.
Reboot the server	An election was triggered by Pacemaker after the threshold time for which the master was not available. The most eligible standby server was promoted as the new master. Once the old master came up after reboot, it was added back to the cluster as a standby. If fencing had been enabled, the node would not have been added to the cluster automatically. There was downtime in the writer application.	repmgrd started an election when the master connection health check failed on all standby servers. The eligible standby was promoted. When the old master came back, it didn't join the cluster and was marked failed. The repmgr node rejoin command was run to add the server back to the cluster. There was downtime in the writer application.	Failover happened and one of the standby servers was elected as the new master after obtaining the lock. When Patroni was started on the old master, it brought the old master back up, performed pg_rewind, and started following the new master. There was downtime in the writer application.
Stop the framework agent process	Agent: pacemaker. The PostgreSQL process was stopped and it was marked offline. An election was triggered and a new master was elected. There was downtime in the writer application.	Agent: repmgrd. The primary server will not be part of an automated failover situation. PostgreSQL service was found to be running. There was no disruption in the writer application.	Agent: patroni. One of the standby servers acquired the DCS lock and became the master by promoting itself. The old master was still running, which led to a multi-master scenario. The application was still writing to the old master. Once Patroni was started on the old master, it rewound the old master (use_pg_rewind was set to true) to the new master timeline and LSN and started following the new master.

Network Isolation Tests

Test Scenario	PostgreSQL Automatic Failover (PAF)	Replication Manager (repmgr)	Patroni
Network-isolate the master server from other servers (split-brain scenario)	Corosync traffic was blocked on the master server. PostgreSQL service was turned off and the master server was marked offline due to quorum policy. A new master was elected in the majority partition. There was downtime in the writer application.	All servers have the same value for location in the repmgr configuration: repmgrd started an election when the master connection health check failed on all standby servers. The eligible standby was promoted, but the PostgreSQL process was still running on the old master node. There were two nodes running as master. Manual intervention was required after the network isolation was corrected. The standby servers have the same value for location but the primary had a different value for location in the repmgr configuration: repmgrd started an election when the master connection health check failed on all standby servers. However, no new master was elected since the standby servers had a location different from that of the primary. repmgrd went into degraded monitoring mode. PostgreSQL was running on all the nodes and there was only one master in the cluster.	DCS communication was blocked for the master node. PostgreSQL was demoted on the master server. A new master was elected in the majority partition. There was downtime in the writer application.
Network-isolate the standby server from other servers	Corosync traffic was blocked on the standby server. The server was marked offline and PostgreSQL service was turned off due to quorum policy. There was no disruption in the writer application.	repmgrd went into degraded monitoring mode. The PostgreSQL process was still running on the standby node. Manual intervention was required after the network isolation was corrected.	DCS communication was blocked for the standby node. The PostgreSQL service was running, however, the node was not considered for elections. There was no disruption in the writer application.
