Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server Hangs on I/O error during "open" operation for file "/tmp/firebird/fb_trace_ksVDoc" [CORE2917] #1210

Closed
firebird-automations opened this issue Mar 10, 2010 · 34 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: Derryck welas (welas)

Jira_subtask_outward CORE6009

Attachments:
error_sequence_of_events.PNG

The server has been online for ~ 2 weeks then,

hangs while nobody is using the system (4 AM) ( i had to kill all fb_inet_server instances, then connection was possible again)

this is written in the firebird.log (every second)

eusdlsdbb024.logistics.corp Mon Mar 8 04:05:04 2010
Error in isc_attach_database() API call when working with security database
I/O error during "open" operation for file "/tmp/firebird/fb_trace_ksVDoc"
Error while trying to open file
No such file or directory

/tmp seems to have enough space .

Client application jdbc stack error (no connection possible):

Caused by: org.firebirdsql.jdbc.FBSQLException: GDS Exception. 335544344. I/O error for file open "/tmp/firebird/fb_trace_ksVDoc"
Error while trying to open file
null, error code: HY000
Reason: I/O error for file open "/tmp/firebird/fb_trace_ksVDoc"
Error while trying to open file
null

i have ODS Version 11.2 and 11.1 databases running on this instance.

Commits: 0979b7c e8d1eea 6ea33a5 f7f1def

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

b.t.w. trace is not enabled on this server. (default fbtrace.conf )

@firebird-automations
Copy link
Collaborator Author

Modified by: Derryck welas (welas)

Component: Engine [ 10000 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

It is not clear, you wrote :

a) nobody is using the system
b) i had to kill all fb_inet_server instances

so how it can be at the same time ?

And a question : if there was file /tmp/firebird/fb_trace at that time ?

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

See your point, but what i mean is there were no people actually testing on the server which means no "load". (It is a test server)

What was running are ~ 10 jboss instances connected to this server (10 databases) and there is some activity gooing on when these
things are running. (dooing occasional select statements etc..just basic dsql )

what was interesting is that all connections from multiple databases were affected.

based on the firebird.log this error appeared on Mar 8 ~ 04:05:04 (2 days ago, you can imagine firebird.log was very big)
the time i saw the error is today and that is the time i killed "fb_inet_server" instances.

When the server was unreachable (today) i did not see any " /tmp/firebird/fb_trace*** file

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

other strange thing:

i have "TempDirectories =/local/apps/tmp" in my firebird.conf so dont know how the system comes up with "/tmp/firebird/"

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Files, which backed shared memory (lock table, events table, monitoring snapshots, trace files, etc) are placed at /tmp/firebird folder independently of TempDirectories setting value

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Are you sure you have RC2, not RC1 ? Any chance you have had embedded connection (for ex. gbak) using RC1 ?
I ask because there was similar (?) issue CORE2807 resolved in RC2

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

-its FirebirdCS-2.5.0.25920-ReleaseCandidate2.amd64.rpm
-connections are only over remote interface.
- gbak like this : gbak -B -V /local/apps/firebird/database.gdb -user sysdba -pass ****** /local/apps/dbbackup/database.bck

currently i dont use the trace api but i will start testing it from here on.

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

How to recreate the error:

go to /tmp/firebird and delete the file fb_trace_****** file . (the one where the trace config is copied to)

i know it should not happen but when that file is not there your server is locked down ! (no new connections possible )

the moment you kill all instances of fb_inet_server , than you can make new connections to the server.

i think it would be better to just recreate an empty fb_trace_****** file in such situation ( it is empty in most situations anyway).

@firebird-automations
Copy link
Collaborator Author

Commented by: @AlexPeshkoff

Manually removing file is not a way to reproduce a bug.
The problem is not 'what to do when file is deleted'.
The problem is 'why was a file deleted when it must exist'.

Removing file from /tmp is a known way to prevent normal OS operation.
I suppose you did not remove it manually when the bug happened first time?

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

Hey will keep an close eye on this error (if happening on normal DSQL processing over jdbc),

but for now not able to reproduce bug.

the only things i encounter during load testing (something i already mentioned in CORE2891)

eusdlsdbb024.logistics.corp Wed Mar 17 15:02:46 2010
Fatal lock manager error: semop failed (release_shmem), errno: 0
eusdlsdbb024.logistics.corp Wed Mar 17 15:02:46 2010
Shutting down the server with 1 active connection(s) to 1 database(s), 0 active service(s)
eusdlsdbb024.logistics.corp Wed Mar 17 15:02:51 2010
Firebird shutdown is still in progress after the specified timeout
eusdlsdbb024.logistics.corp Wed Mar 17 15:02:51 2010
Operating system call pthread_mutex_destroy failed. Error code 16

@firebird-automations
Copy link
Collaborator Author

Commented by: Yakushev.Y (yyy)

OS:
Linux version 2.6.18-128.el5 (mailto:mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #⁠1 SMP Wed Jan 21 10:41:14 EST 2009

RDBMS:
Firebird/linux Intel (access method), version "LI-V2.5.0.25920 Firebird 2.5 Release Candidate 2"
on disk structure version 11.2

Firebird.log:

http://IS2.md.net Sat Mar 20 04:02:44 2010
Error in isc_attach_database() API call when working with security database I/O error during "open" operation for file "/tmp/firebird/fb_trace_XhBjYa"
Error while trying to open file
No Such file or directory

Firebird.conf

TempDirectories = /inf/fbtemp

The same situation.

Server is up for 11 days. Everything is fine till Mar 20. Server load is low at this time.
/tmp seems to have enough space.
I'm unable to reproduce the bug.

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Yakushev.Y,

are you sure nobody deleted files in /tmp/firebird ?

@firebird-automations
Copy link
Collaborator Author

Commented by: Yakushev.Y (yyy)

Yes. 20 MArch is saturday. All IT specialists had a day off. This problem was detected on Monday.
Server is in the local network. No access from outside.

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

No cron tasks with disk cleanup ?

@firebird-automations
Copy link
Collaborator Author

Commented by: Yakushev.Y (yyy)

This database is used with a special application in one of the Moscow companies. This company consists of central office (CO) and has about 30 units over Moscow. Each unit has it's own database. By now, most of the units have Firebird 2.0 installed. But the central database(CDB) was recently upgraded to FB 2.5.
They are connected via corporate network with no access from outside (internet), as a radical security solution.
All information from units transfers to central database with a replication program (AI, AU, AD triggers  «chagelog» table) for statistic purposes.
3 units work at night time. Most of the users application activity events are logged in the database. Sometimes, application in the unit get's information directly from the central database.
No one works in central office on Saturdays and Sundays. Server for the CDB is dedicated, there are no other services running on it, except Firebird and «standart».
Most of the users at CO don't shutdown their client applications when they go home.
This company has it's own IT department. I'm working on the application developer side.
Facts:
• None of companies IT department specialists worked on 20 March;
• Problem was detected at Monday morning, IT department had a talk off;
• It is a good rule, IT department never makes changes on the CDB server without asking me, I believe;
• There are no cron jobs running on the CDB server;
• There is one backup job on Windows 2003 server (backup and replication) with Firebird 2.5, running at 22.00. Backup time is not more then 1,5 hour;
• On 18 of March a reserve server (RS) was connected to the network (in case the main crashes).
A new backup CDB to RS- restore on RS job was installed on this server, backup - restore at 1.00;
• There was no user activity since 1.00 in units, according to the application log;
• No data was inserted(u,d) in CDB according to replication log since 3.00, only selecting from the «changelog» table.

There were no problems on the CDB server, exept 1. The RS DNS name was the same with the main server http://IS2.md.net. So since 18 March I I saw in the messages, and one of them was near the bug time:
Mar 20 04:03:06 IS2 nmbd[3299]: [2010/03/20 04:03:06, 0] nmbd/nmbd_packets.c:process_browse_packet(1067)
Mar 20 04:03:06 IS2 nmbd[3299]: process_browse_packet: Discarding datagram from IP 172.20.1.21. Source name IS2<00> is one of our names !
We have our application installed in several other companies with FB 2.5. There were no such problems detected. But Windows is used there.
This is the first bug in this company since FB 1.5 which is not solved before we saw it. Thanks to the Firebird team and all the bug reporters.
We have ibase support, so if the problem happens again, we will get them involved.
We'll try to get more information for you.
May be this is not FB problem. But this situation happened at least twice.

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

Hey it happened again !
believe me nobody has the nerve to delete anything on this server :-)

I think the "fb_trace_*****" file is not deleted
but firebird is somehow losing the (active) name of the trace file .

firebird.log reports:

eusdlsdbb024.logistics.corp Fri Apr 2 04:03:01 2010
Error in isc_attach_database() API call when working with security database
I/O error during "open" operation for file "/tmp/firebird/fb_trace_VaqTGg"

While in the /tmp/firebird/ directory there is "fb_trace_kDlCCp" with date 03/17/2010

after i kill the fb_inet_server instances a new trace file is created "fb_trace_GamqsW" with date 04/02/2010

also see attached picture: "error_sequence_of_events.PNG"

@firebird-automations
Copy link
Collaborator Author

Modified by: Derryck welas (welas)

Attachment: error_sequence_of_events.PNG [ 11608 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @AlexPeshkoff

Guys, I have collected here all known to me (not only from this ticket, fb-devel was also used) times when you had problems.

eusdlsdbb024.logistics.corp Mon Mar 8 04:05:04 2010
http://IS2.md.net Sat Mar 20 04:02:44 2010
eusdlsdbb024.logistics.corp Fri Apr 2 04:03:01 2010
http://DB2.gtk-sa.pl Sun Mar 28 04:19:16 2010

Pay attention - this is always soon after 4 o'clock in the night time. Be sure firebird DOES NOT have any scheduling to do something special a specific moment of time. But there is one program in unix, which does perform this activity - cron. And one of typical tasks - clear tmp. And the time when this activity is performed - night.

When you say 'we have no cron tasks' may be it means 'we did not add cron tasks'? Many linux distro already have /tmp cleanup task when installed. Please check files and dirs /tmp/cron* - I suppose you may find something interesting.

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

The 1e incident there was no crontab task on the server.

but now i have crontab calling a "java" program every minute for 9 different databases: ( one database even every 30 sec)
executed by the program is :

SELECT ATT.MON$SERVER_PID,ATT.MON$REMOTE_ADDRESS,10000*(CURRENT_TIMESTAMP-STMT.MON$TIMESTAMP) AS ELAPS_TIME,STMT.MON$SQL_TEXT FROM MON$ATTACHMENTS ATT JOIN MON$STATEMENTS STMT ON ATT.MON$ATTACHMENT_ID = STMT.MON$ATTACHMENT_ID WHERE STMT.MON$STATE = 1 ";

edit:
there are no /tmp/cron* directories/files
/tmp has 520M disk available ( 992M 421M 520M 45% /tmp)
also there are older files in /tmp/firebird/* wich are not removed.

edit2:

here is from "/etc/cron.daily/tmpwatch" ..
---->
flags=-umc
/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
-x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix 240 /tmp
/usr/sbin/tmpwatch "$flags" 720 /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
if [ -d "$d" ]; then
/usr/sbin/tmpwatch "$flags" -f 720 "$d"
fi
done
<--

edit3:

the current "stat on fb_trace_GamqsW"
-->
File: `fb_trace_GamqsW'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: fd03h/64771d Inode: 131085 Links: 1
Access: (0660/-rw-rw----) Uid: ( 84/firebird) Gid: ( 84/firebird)
Access: 2010-04-02 16:22:30.000000000 +0200
Modify: 2010-04-02 16:22:30.000000000 +0200
Change: 2010-04-02 16:22:30.000000000 +0200
<---

based on this it looks like "tempwatch" will remove the file (after 240 hours of no access)
also i dont see the access time or modify time changing. (means if i dont do some tracing the file will be deleted in 10 days......?

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Derrick,

looks like you've found a correct reason for a problem.
I think this will help much with fixing it - in RC3.

So far, you can try as a workaround to

- add cron task to touch files in /tmp/firebird/*, or

- tune tempwatch to not delete (opened) files in /tmp/firebird/*, if it is possible

It seems --fuser option could help : http://linux.about.com/library/cmd/blcmdl8_tmpwatch.htm
I guess you can change
flags=-umc by
flags=-umcs

@firebird-automations
Copy link
Collaborator Author

Commented by: Derryck welas (welas)

Looking forward to RC3 !

@firebird-automations
Copy link
Collaborator Author

Commented by: @AlexPeshkoff

Made any running instance of firebird perodically touch mentioned file.

@firebird-automations
Copy link
Collaborator Author

Modified by: @AlexPeshkoff

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 2.5 RC3 [ 10381 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

QA Status: Cannot be tested

@firebird-automations
Copy link
Collaborator Author

Commented by: John Franck (bozzy)

Almost 9 years later, I'm experiencing this same behaviour.

Firebird 3.0.4 SS 64bit on CentOS 7.

As reported here (in russian): http://www.sql.ru/forum/1292505/sever-hangs-on-i-o-error-during-open-operation-for-file-tmp-firebird-fb-trace-ksvdoc
it seems that "something" (I don't know what and I don't know how to find it out - I'm not a linux guru) is cleaning the /tmp directory every now and then (again, I don'k know how often, maybe it's a week like in the above russian report), then the absence of the file /tmp/firebird/fb_trace_* prevents local connections to work raising the error:

I/O error during "open" operation for file "/tmp/firebird/fb_trace_1tdYvH" <<--- (my example)
-Error while trying to open file
-No such file or directory

Luckily enough, remote connections continue to work. As this happens on our production server, it would be disastrous if it would block remote connections.

On this server, tracking is disabled, so these fb_trace files seems to be just spurious. As others mentioned, it's quite strange that the server stops responding to local connections if a seemingly useless file is not where it pretends to be, and if this is really needed to exist, it could be recreated by the engine if it doesn't find it.

Side note: I know that on SuperServer it's not a good idea to connect to databases locally, but for some batch operations that we run overnight when our service is already offline, this doesn't represent a problem, as there are no other connections to the DBs.

@firebird-automations
Copy link
Collaborator Author

Commented by: @AlexPeshkoff

If there are files /tmp/firebird/fb_trace* that means that there are connections to the server. And if some file is deleted that connection(s) remain active for a very long time (about a week). But may be not to database you are going to access locally.

On the other hand if there are connections and if it's not super (but classic) missing such file may become a problem. I.e. I see need to fix an issue. Please create new ticket in the tracker.

@firebird-automations
Copy link
Collaborator Author

Commented by: @AlexPeshkoff

BTW - this is not actually a bug in FB but a bug in linux - it may delete opened file dduring regular /tmp cleanup. That actions are performed by script in /tmp/cron.* directory, unfotunately I can't say for sure what script to delete to quickly fix your issue (I use gentoo which by default miss that broken feature) but useally looking at scripts and appluing some common sense to them helps to find that script easily.

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

assignee: Alexander Peshkov [ alexpeshkoff ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

Alex, I've just discovered that your commit for this issue (touchThreadFunc for trace) was never frontported into v3/v4. This may be the reason why the bug strikes back for v3 users. Or am I missing something?

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

IIRC, there was intention to move touch function to the new timer thread

@firebird-automations
Copy link
Collaborator Author

Commented by: @AlexPeshkoff

yes, will take a look after beta1

@AlexPeshkoff
Copy link
Member

Fix from 2.5 got broken in 3.0 due to bug (integer overflow) in ITimer code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment