Issue Details (XML | Word | Printable)

Key: CORE-2917
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Critical Critical
Assignee: Alexander Peshkov
Reporter: Derryck welas
Votes: 0
Watchers: 4
Operations

If you were logged in you would be able to see more operations.
Firebird Core

Sever Hangs on I/O error during "open" operation for file "/tmp/firebird/fb_trace_ksVDoc"

Created: 10/Mar/10 11:18 AM   Updated: 25/Jan/19 10:28 AM
Component/s: Engine
Affects Version/s: 2.5 RC2
Fix Version/s: 2.5 RC3

File Attachments: None
Image Attachments:

1. error_sequence_of_events.PNG
(22 kB)
Environment:
Linux Redhat 64bits
FirebirdCS-2.5.0.25920-ReleaseCandidate2.amd64.rpm

QA Status: Cannot be tested

Sub-Tasks  All   Open   
No sub-tasks match this view.

 Description  « Hide
The server has been online for ~ 2 weeks then,

 hangs while nobody is using the system (4 AM) ( i had to kill all fb_inet_server instances, then connection was possible again)

this is written in the firebird.log (every second)

eusdlsdbb024.logistics.corp Mon Mar 8 04:05:04 2010
Error in isc_attach_database() API call when working with security database
I/O error during "open" operation for file "/tmp/firebird/fb_trace_ksVDoc"
Error while trying to open file
No such file or directory

/tmp seems to have enough space .



Client application jdbc stack error (no connection possible):

Caused by: org.firebirdsql.jdbc.FBSQLException: GDS Exception. 335544344. I/O error for file open "/tmp/firebird/fb_trace_ksVDoc"
Error while trying to open file
null, error code: HY000
Reason: I/O error for file open "/tmp/firebird/fb_trace_ksVDoc"
Error while trying to open file
null


i have ODS Version 11.2 and 11.1 databases running on this instance.


 All   Comments   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Derryck welas added a comment - 10/Mar/10 11:33 AM
b.t.w. trace is not enabled on this server. (default fbtrace.conf )

Vlad Khorsun added a comment - 10/Mar/10 12:38 PM
It is not clear, you wrote :

a) nobody is using the system
b) i had to kill all fb_inet_server instances

so how it can be at the same time ?

And a question : if there was file /tmp/firebird/fb_trace at that time ?

Derryck welas added a comment - 10/Mar/10 01:22 PM
See your point, but what i mean is there were no people actually testing on the server which means no "load". (It is a test server)

What was running are ~ 10 jboss instances connected to this server (10 databases) and there is some activity gooing on when these
things are running. (dooing occasional select statements etc..just basic dsql )

what was interesting is that all connections from multiple databases were affected.


based on the firebird.log this error appeared on Mar 8 ~ 04:05:04 (2 days ago, you can imagine firebird.log was very big)
the time i saw the error is today and that is the time i killed "fb_inet_server" instances.


When the server was unreachable (today) i did not see any " /tmp/firebird/fb_trace*** file


Derryck welas added a comment - 10/Mar/10 05:17 PM
other strange thing:

i have "TempDirectories =/local/apps/tmp" in my firebird.conf so dont know how the system comes up with "/tmp/firebird/"







Vlad Khorsun added a comment - 10/Mar/10 05:32 PM
Files, which backed shared memory (lock table, events table, monitoring snapshots, trace files, etc) are placed at /tmp/firebird folder independently of TempDirectories setting value

Vlad Khorsun added a comment - 11/Mar/10 09:31 AM
Are you sure you have RC2, not RC1 ? Any chance you have had embedded connection (for ex. gbak) using RC1 ?
I ask because there was similar (?) issue CORE-2807 resolved in RC2

Derryck welas added a comment - 11/Mar/10 09:26 PM
-its FirebirdCS-2.5.0.25920-ReleaseCandidate2.amd64.rpm
-connections are only over remote interface.
- gbak like this : gbak -B -V /local/apps/firebird/database.gdb -user sysdba -pass ****** /local/apps/dbbackup/database.bck

currently i dont use the trace api but i will start testing it from here on.

Derryck welas added a comment - 12/Mar/10 01:07 PM
How to recreate the error:

go to /tmp/firebird and delete the file fb_trace_****** file . (the one where the trace config is copied to)

i know it should not happen but when that file is not there your server is locked down ! (no new connections possible )

the moment you kill all instances of fb_inet_server , than you can make new connections to the server.

i think it would be better to just recreate an empty fb_trace_****** file in such situation ( it is empty in most situations anyway).

Alexander Peshkov added a comment - 12/Mar/10 01:28 PM
Manually removing file is not a way to reproduce a bug.
The problem is not 'what to do when file is deleted'.
The problem is 'why was a file deleted when it must exist'.

Removing file from /tmp is a known way to prevent normal OS operation.
I suppose you did not remove it manually when the bug happened first time?

Derryck welas added a comment - 17/Mar/10 03:24 PM
Hey will keep an close eye on this error (if happening on normal DSQL processing over jdbc),

but for now not able to reproduce bug.


the only things i encounter during load testing (something i already mentioned in CORE-2891)

eusdlsdbb024.logistics.corp Wed Mar 17 15:02:46 2010
        Fatal lock manager error: semop failed (release_shmem), errno: 0
eusdlsdbb024.logistics.corp Wed Mar 17 15:02:46 2010
        Shutting down the server with 1 active connection(s) to 1 database(s), 0 active service(s)
eusdlsdbb024.logistics.corp Wed Mar 17 15:02:51 2010
        Firebird shutdown is still in progress after the specified timeout
eusdlsdbb024.logistics.corp Wed Mar 17 15:02:51 2010
        Operating system call pthread_mutex_destroy failed. Error code 16


Yakushev.Y added a comment - 22/Mar/10 09:33 PM
OS:
Linux version 2.6.18-128.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Jan 21 10:41:14 EST 2009

RDBMS:
Firebird/linux Intel (access method), version "LI-V2.5.0.25920 Firebird 2.5 Release Candidate 2"
        on disk structure version 11.2

Firebird.log:

IS2.md.net Sat Mar 20 04:02:44 2010
Error in isc_attach_database() API call when working with security database I/O error during "open" operation for file "/tmp/firebird/fb_trace_XhBjYa"
Error while trying to open file
No Such file or directory

Firebird.conf

TempDirectories = /inf/fbtemp

The same situation.

Server is up for 11 days. Everything is fine till Mar 20. Server load is low at this time.
/tmp seems to have enough space.
I'm unable to reproduce the bug.


Vlad Khorsun added a comment - 22/Mar/10 09:45 PM
Yakushev.Y,

are you sure nobody deleted files in /tmp/firebird ?

Yakushev.Y added a comment - 22/Mar/10 09:56 PM - edited
Yes. 20 MArch is saturday. All IT specialists had a day off. This problem was detected on Monday.
Server is in the local network. No access from outside.

Vlad Khorsun added a comment - 22/Mar/10 10:23 PM
No cron tasks with disk cleanup ?

Yakushev.Y added a comment - 23/Mar/10 01:09 PM
This database is used with a special application in one of the Moscow companies. This company consists of central office (CO) and has about 30 units over Moscow. Each unit has it's own database. By now, most of the units have Firebird 2.0 installed. But the central database(CDB) was recently upgraded to FB 2.5.
They are connected via corporate network with no access from outside (internet), as a radical security solution.
All information from units transfers to central database with a replication program (AI, AU, AD triggers  «chagelog» table) for statistic purposes.
3 units work at night time. Most of the users application activity events are logged in the database. Sometimes, application in the unit get's information directly from the central database.
No one works in central office on Saturdays and Sundays. Server for the CDB is dedicated, there are no other services running on it, except Firebird and «standart».
Most of the users at CO don't shutdown their client applications when they go home.
This company has it's own IT department. I'm working on the application developer side.
Facts:
• None of companies IT department specialists worked on 20 March;
• Problem was detected at Monday morning, IT department had a talk off;
• It is a good rule, IT department never makes changes on the CDB server without asking me, I believe;
• There are no cron jobs running on the CDB server;
• There is one backup job on Windows 2003 server (backup and replication) with Firebird 2.5, running at 22.00. Backup time is not more then 1,5 hour;
• On 18 of March a reserve server (RS) was connected to the network (in case the main crashes).
A new backup CDB to RS- restore on RS job was installed on this server, backup - restore at 1.00;
• There was no user activity since 1.00 in units, according to the application log;
• No data was inserted(u,d) in CDB according to replication log since 3.00, only selecting from the «changelog» table.

There were no problems on the CDB server, exept 1. The RS DNS name was the same with the main server IS2.md.net. So since 18 March I I saw in the messages, and one of them was near the bug time:
Mar 20 04:03:06 IS2 nmbd[3299]: [2010/03/20 04:03:06, 0] nmbd/nmbd_packets.c:process_browse_packet(1067)
Mar 20 04:03:06 IS2 nmbd[3299]: process_browse_packet: Discarding datagram from IP 172.20.1.21. Source name IS2<00> is one of our names !
We have our application installed in several other companies with FB 2.5. There were no such problems detected. But Windows is used there.
This is the first bug in this company since FB 1.5 which is not solved before we saw it. Thanks to the Firebird team and all the bug reporters.
We have ibase support, so if the problem happens again, we will get them involved.
We'll try to get more information for you.
May be this is not FB problem. But this situation happened at least twice.

Derryck welas added a comment - 02/Apr/10 02:51 PM
Hey it happened again !
believe me nobody has the nerve to delete anything on this server :-)

I think the "fb_trace_*****" file is not deleted
but firebird is somehow losing the (active) name of the trace file .

firebird.log reports:

eusdlsdbb024.logistics.corp Fri Apr 2 04:03:01 2010
Error in isc_attach_database() API call when working with security database
I/O error during "open" operation for file "/tmp/firebird/fb_trace_VaqTGg"


While in the /tmp/firebird/ directory there is "fb_trace_kDlCCp" with date 03/17/2010

after i kill the fb_inet_server instances a new trace file is created "fb_trace_GamqsW" with date 04/02/2010

also see attached picture: "error_sequence_of_events.PNG"



Alexander Peshkov added a comment - 02/Apr/10 04:22 PM
Guys, I have collected here all known to me (not only from this ticket, fb-devel was also used) times when you had problems.

eusdlsdbb024.logistics.corp Mon Mar 8 04:05:04 2010
IS2.md.net Sat Mar 20 04:02:44 2010
eusdlsdbb024.logistics.corp Fri Apr 2 04:03:01 2010
DB2.gtk-sa.pl Sun Mar 28 04:19:16 2010

Pay attention - this is always soon after 4 o'clock in the night time. Be sure firebird DOES NOT have any scheduling to do something special a specific moment of time. But there is one program in unix, which does perform this activity - cron. And one of typical tasks - clear tmp. And the time when this activity is performed - night.

When you say 'we have no cron tasks' may be it means 'we did not add cron tasks'? Many linux distro already have /tmp cleanup task when installed. Please check files and dirs /tmp/cron* - I suppose you may find something interesting.

Derryck welas added a comment - 02/Apr/10 05:32 PM - edited
The 1e incident there was no crontab task on the server.

but now i have crontab calling a "java" program every minute for 9 different databases: ( one database even every 30 sec)
executed by the program is :

SELECT ATT.MON$SERVER_PID,ATT.MON$REMOTE_ADDRESS,10000*(CURRENT_TIMESTAMP-STMT.MON$TIMESTAMP) AS ELAPS_TIME,STMT.MON$SQL_TEXT FROM MON$ATTACHMENTS ATT JOIN MON$STATEMENTS STMT ON ATT.MON$ATTACHMENT_ID = STMT.MON$ATTACHMENT_ID WHERE STMT.MON$STATE = 1 ";

edit:
there are no /tmp/cron* directories/files
/tmp has 520M disk available ( 992M 421M 520M 45% /tmp)
also there are older files in /tmp/firebird/* wich are not removed.

edit2:

here is from "/etc/cron.daily/tmpwatch" ..
---->
flags=-umc
/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
-x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix 240 /tmp
/usr/sbin/tmpwatch "$flags" 720 /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
/usr/sbin/tmpwatch "$flags" -f 720 "$d"
    fi
done
<--

edit3:

the current "stat on fb_trace_GamqsW"
-->
File: `fb_trace_GamqsW'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: fd03h/64771d Inode: 131085 Links: 1
Access: (0660/-rw-rw----) Uid: ( 84/firebird) Gid: ( 84/firebird)
Access: 2010-04-02 16:22:30.000000000 +0200
Modify: 2010-04-02 16:22:30.000000000 +0200
Change: 2010-04-02 16:22:30.000000000 +0200
<---

based on this it looks like "tempwatch" will remove the file (after 240 hours of no access)
also i dont see the access time or modify time changing. (means if i dont do some tracing the file will be deleted in 10 days......?


Vlad Khorsun added a comment - 03/Apr/10 01:42 PM
Derrick,

looks like you've found a correct reason for a problem.
I think this will help much with fixing it - in RC3.

So far, you can try as a workaround to

- add cron task to touch files in /tmp/firebird/*, or

- tune tempwatch to not delete (opened) files in /tmp/firebird/*, if it is possible

It seems --fuser option could help : http://linux.about.com/library/cmd/blcmdl8_tmpwatch.htm
I guess you can change
  flags=-umc by
  flags=-umcs

Derryck welas added a comment - 06/Apr/10 05:19 PM
Looking forward to RC3 !

Alexander Peshkov added a comment - 21/Apr/10 12:03 PM
Made any running instance of firebird perodically touch mentioned file.

John Franck added a comment - 18/Jan/19 02:47 PM
Almost 9 years later, I'm experiencing this same behaviour.

Firebird 3.0.4 SS 64bit on CentOS 7.

As reported here (in russian): http://www.sql.ru/forum/1292505/sever-hangs-on-i-o-error-during-open-operation-for-file-tmp-firebird-fb-trace-ksvdoc
it seems that "something" (I don't know what and I don't know how to find it out - I'm not a linux guru) is cleaning the /tmp directory every now and then (again, I don'k know how often, maybe it's a week like in the above russian report), then the absence of the file /tmp/firebird/fb_trace_* prevents local connections to work raising the error:

I/O error during "open" operation for file "/tmp/firebird/fb_trace_1tdYvH" <<--- (my example)
-Error while trying to open file
-No such file or directory

Luckily enough, remote connections continue to work. As this happens on our production server, it would be disastrous if it would block remote connections.

On this server, tracking is disabled, so these fb_trace files seems to be just spurious. As others mentioned, it's quite strange that the server stops responding to local connections if a seemingly useless file is not where it pretends to be, and if this is really needed to exist, it could be recreated by the engine if it doesn't find it.

Side note: I know that on SuperServer it's not a good idea to connect to databases locally, but for some batch operations that we run overnight when our service is already offline, this doesn't represent a problem, as there are no other connections to the DBs.

Alexander Peshkov added a comment - 18/Jan/19 04:49 PM
If there are files /tmp/firebird/fb_trace* that means that there are connections to the server. And if some file is deleted that connection(s) remain active for a very long time (about a week). But may be not to database you are going to access locally.

On the other hand if there are connections and if it's not super (but classic) missing such file may become a problem. I.e. I see need to fix an issue. Please create new ticket in the tracker.


Alexander Peshkov added a comment - 18/Jan/19 04:53 PM
BTW - this is not actually a bug in FB but a bug in linux - it may delete opened file dduring regular /tmp cleanup. That actions are performed by script in /tmp/cron.* directory, unfotunately I can't say for sure what script to delete to quickly fix your issue (I use gentoo which by default miss that broken feature) but useally looking at scripts and appluing some common sense to them helps to find that script easily.

Dmitry Yemanov added a comment - 25/Jan/19 10:00 AM
Alex, I've just discovered that your commit for this issue (touchThreadFunc for trace) was never frontported into v3/v4. This may be the reason why the bug strikes back for v3 users. Or am I missing something?

Vlad Khorsun added a comment - 25/Jan/19 10:20 AM
IIRC, there was intention to move touch function to the new timer thread

Alexander Peshkov added a comment - 25/Jan/19 10:28 AM
yes, will take a look after beta1